1. Introduction
Gaussian processes are powerful and flexible statistical models that have gained significant popularity in different fields: signal processing, medical imaging, data science, machine learning, econometrics, shape analysis, etc. [1,2,3,4,5]. They provide a nonparametric approach for modeling complex relationships and uncertainty estimation in data [6]. The core idea of Gaussian processes is the assumption that any finite set of data points can be jointly modeled as a multivariate Gaussian distribution [7]. Rather than explicit formulations, Gaussian processes allow for the incorporation of prior knowledge and inference of a nonparametric function $f$ that generates the Gaussian process for a set of training inputs $\{t_i\}_{i=1}^{N}$, with $t_i \in I$, and noisy measurements $\{y_i\}_{i=1}^{N}$. If $f$ is modeled with a Gaussian process prior, then it can be fully characterized by a mean function $\mu_f$ and a covariance function $k$, satisfying
$\mu_f(t) = \mathbb{E}[f(t)], \qquad k(t, s) = \mathbb{E}\big[(f(t) - \mu_f(t))(f(s) - \mu_f(s))\big].$
The mean function is usually assumed to be zero ($\mu_f(t) = 0$), whereas the covariance $k(t, s)$ provides the dependence between two inputs, $t$ and $s$. Gaussian processes can be applied for various tasks, including regression [8], classification [9], and time series analysis [10]. In regression, Gaussian processes can capture complex and non-linear patterns in data, while, in classification, they enable probabilistic predictions and can handle imbalanced datasets [11]. Additionally, Gaussian processes have been successfully employed in optimization, experimental design, reinforcement learning, and more [12].
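To make the finite-dimensional view concrete, the following minimal sketch (not from the paper) draws joint samples from a zero-mean GP prior with a squared exponential covariance evaluated on a grid; the kernel choice, length-scale, and jitter value are illustrative assumptions.

```python
import numpy as np

def se_cov(t, s, variance=1.0, lengthscale=0.2):
    """Squared exponential covariance k(t, s) between two sets of inputs."""
    d = t[:, None] - s[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)           # finite set of input locations
K = se_cov(t, t)                          # covariance matrix of the Gaussian vector
K += 1e-9 * np.eye(t.size)                # jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(t.size), K, size=3)  # joint Gaussian draws
```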
One of the key advantages of Gaussian processes is their ability to provide a rich characterization of uncertainty. This makes them particularly suitable for applications where robust uncertainty quantification is crucial, such as in decision-making processes or when dealing with limited or noisy data. Significant efforts have been dedicated to the development of asymptotically efficient or approximate computational methods for modeling with Gaussian processes. However, Gaussian processes may also suffer from some computational challenges. When the number of observations $N$ increases, the computational complexity for inference and learning grows significantly and incurs an $\mathcal{O}(N^3)$ computational cost, which is unfeasible for many modern problems [13]. Another limitation of Gaussian processes is the $\mathcal{O}(N^2)$ memory scaling in a direct implementation. To address these issues, various approximations and scalable algorithms, such as sparse Gaussian processes [14,15] and variational inference [16], have been developed to make Gaussian processes applicable to larger datasets.
Certain approximations, as demonstrated in [17,18], involve reduced-rank Gaussian processes that rely on approximating the covariance function. For example, Ref. [19] addressed the computational challenge of working with large-scale datasets by approximating the covariance matrix, which is often required for computations involving kernel methods. In addition, Ref. [20] proposed an FFT-based method for stationary covariances, a technique that leverages the Fast Fourier Transform (FFT) to efficiently compute and manipulate covariance functions in the frequency domain. The link between state space models (SSMs) and Gaussian processes was explored in [21]; this connection makes it possible to avoid the cubic complexity in time by using Kalman filtering inference methods [22]. Recently, Ref. [23] presented a novel method for approximating covariance functions as an eigenfunction expansion of the Laplace operator defined on a compact domain. More recently, Ref. [24] introduced a reduced-rank algorithm for Gaussian process regression with a numerical scheme.
In this paper, we consider a specific Karhunen–Loève expansion of a Gaussian process with many advantages over other low-rank compression techniques [25]. Initially, we express a Gaussian process as a series of basis functions and random coefficients. By selecting a subset of the most important basis functions based on the dominant eigenvalues, the rank of the Gaussian process can be reduced. This proves especially advantageous when working with extensive datasets as it can alleviate computational and storage demands. By scrutinizing the eigenvalues linked to the eigenfunctions, one can gauge each eigenfunction’s contribution to noise. This insight can be used for noise modeling, estimation, and separation. The Karhunen–Loève expansion offers a natural framework for model selection and regularization in Gaussian process modeling. By truncating the decomposition to a subset of significant eigenfunctions, we can prevent overfitting. This regularization has the potential to enhance the Gaussian process generalization capability and mitigate the influence of noise or irrelevant features.
In contrast to conventional Karhunen–Loève expansions where eigenfunctions are derived directly from the covariance function or the integral operator, our approach involves differential operators in which the corresponding orthogonal polynomials serve as eigenfunctions. This choice holds significant importance because polynomials are tailored to offer numerical stability and good conditioning, resulting in more precise and stable computations, particularly when dealing with rounding errors. Additionally, orthogonal polynomials frequently possess advantageous properties for integration and differentiation. These properties streamline efficient calculations involving interpolated functions, making them exceptionally valuable for applications necessitating complex computations. On the whole, decomposing Gaussian processes using orthogonal polynomials provides benefits such as numerical stability, quicker convergence, and accurate approximations [26]. However, their incorporation within the Gaussian process framework of the machine learning community has been virtually non-existent. The existing research predominantly revolves around the analysis of integral operators and numerical approximations to compute Karhunen–Loève expansions.
The paper is structured as follows. The first three sections review the necessary theoretical foundations: Section 2 introduces operators in Hilbert spaces, the expansion theorem, and the general Karhunen–Loève theorem; Section 3 focuses on the Gaussian process case; and Section 4 highlights the challenges of canonical Gaussian process regression. Section 5 explores low-complexity Gaussian processes and their computational advantages. Section 6 presents the proposed solutions for various differential operators using orthogonal polynomial bases. Finally, Section 7 discusses the experimental results, followed by a comprehensive discussion and conclusion in Section 8.
2. Operators on Hilbert Spaces
In this section, we recall and prove some useful results about linear compact symmetric operators on a particular Hilbert space of weighted square-integrable functions. More general results valid for an arbitrary Hilbert space $H$ are deferred to Appendix A.
Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $X$ and $Y$ be second-order real-valued random variables, meaning $\mathbb{E}[X^2] < \infty$ and $\mathbb{E}[Y^2] < \infty$. By the Cauchy–Schwarz inequality, $\mathbb{E}|XY| \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]} < \infty$, allowing us to center the variables, assuming without loss of generality that they have zero mean. Thus, $\mathbb{E}[X^2]$ is the variance of $X$ and $\mathbb{E}[XY]$ is the covariance of $X$ and $Y$. The set of these random variables forms a Hilbert space with inner product $\langle X, Y \rangle = \mathbb{E}[XY]$ and norm $\|X\| = (\mathbb{E}[X^2])^{1/2}$, leading to mean square convergence: $X_n$ converges in norm to $X$ if $\|X_n - X\| \to 0$; equivalently, $\mathbb{E}[(X_n - X)^2] \to 0$ as $n \to \infty$. $X$ and $Y$ are orthogonal, written $X \perp Y$, if $\mathbb{E}[XY] = 0$. If $X \perp Y$, then $\|X + Y\|^2 = \|X\|^2 + \|Y\|^2$. For mutually orthogonal variables $X_1, \dots, X_n$, we have $\big\|\sum_{i=1}^{n} X_i\big\|^2 = \sum_{i=1}^{n} \|X_i\|^2$.
Lemma 1. Let $\{f(t),\ t \in I\}$ be a real-valued second-order random process with zero mean and covariance function $k(t, s) = \mathbb{E}[f(t) f(s)]$. Then, the covariance $k$ is a symmetric non-negative definite function.
Proof. First, by definition of $k$, its symmetry is trivial. Next, it holds that, for all possible choices of points $t_1, \dots, t_n \in I$ and all possible real coefficients $a_1, \dots, a_n$, we have
$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(t_i, t_j) = \mathbb{E}\Big[\Big(\sum_{i=1}^{n} a_i f(t_i)\Big)^2\Big] \ge 0.$
Thus, $k$ is non-negative definite. □
In this section, we are interested in the Hilbert space $L^2_w(I)$, the space of all real-valued Borel-measurable functions $\phi$ on the interval $I$ such that $\int_I \phi^2(t)\, w(t)\, dt < \infty$, with a positive weight function $w$, which inherits all properties from Appendix A. Consider the Hilbert–Schmidt integral operator $\mathcal{T}_k : L^2_w(I) \to L^2_w(I)$, expressed as
$(\mathcal{T}_k \phi)(t) = \int_I k(t, s)\, \phi(s)\, w(s)\, ds. \qquad (3)$
Theorem 1 (Mercer’s Theorem). Let $k$ be continuous, symmetric, and non-negative definite, and let $\mathcal{T}_k$ be the corresponding Hilbert–Schmidt operator. Let $\{\varphi_j\}_{j \ge 1}$ be an orthonormal basis for the space spanned by the eigenvectors corresponding to the non-zero eigenvalues of $\mathcal{T}_k$. If $\varphi_j$ is the eigenvector corresponding to the eigenvalue $\lambda_j$, then
$k(t, s) = \sum_{j=1}^{\infty} \lambda_j\, \varphi_j(t)\, \varphi_j(s), \qquad (4)$
where
- (i) the series converges absolutely in both variables jointly;
- (ii) the series converges to $k(t, s)$ uniformly in both variables jointly;
- (iii) the series converges to $k$ in $L^2_w(I \times I)$.
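A numerical way to observe Mercer's theorem is to discretize the integral operator on a quadrature grid and compare the truncated eigen-expansion with the original kernel. The sketch below is a Nyström-type approximation under assumed settings (uniform grid, unit weight, SE kernel); it is not part of the paper's method.

```python
import numpy as np

def se_cov(t, s, lengthscale=0.3):
    return np.exp(-0.5 * ((t[:, None] - s[None, :]) / lengthscale) ** 2)

n = 400
t = np.linspace(0.0, 1.0, n)
h = 1.0 / n                               # quadrature weight (uniform grid, w(t) = 1)
K = se_cov(t, t)

# Eigenpairs of the discretized operator (T_k phi)(t) = int k(t, s) phi(s) ds
vals, vecs = np.linalg.eigh(K * h)
vals, vecs = vals[::-1], vecs[:, ::-1]     # decreasing eigenvalues
phi = vecs / np.sqrt(h)                    # rescale so that sum_i phi_j(t_i)^2 * h = 1

# Truncated Mercer series k_m(t, s) = sum_{j<=m} lambda_j phi_j(t) phi_j(s)
m = 20
K_m = (phi[:, :m] * vals[:m]) @ phi[:, :m].T
print("max abs error of the truncated Mercer series:", np.abs(K - K_m).max())
```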
Theorem 2 (Karhunen–Loève Theorem). For a real-valued second-order random process $\{f(t),\ t \in I\}$ with zero mean and continuous covariance function $k$ on $I \times I$, we can decompose each $f(t)$ as
$f(t) = \sum_{j=1}^{\infty} f_j\, \varphi_j(t),$
where
- (i) $\{\varphi_j\}$ are eigenfunctions of $\mathcal{T}_k$, which form an orthonormal basis, i.e., $\langle \varphi_i, \varphi_j \rangle = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta;
- (ii) $f_j = \langle f, \varphi_j \rangle$ is the coefficient given by the projection of $f$ onto the $j$-th deterministic element of the Karhunen–Loève basis in $L^2_w(I)$;
- (iii) $\{f_j\}$ are pairwise orthogonal random variables with zero mean and variance $\lambda_j$, corresponding to the eigenvalue of the eigenfunction $\varphi_j$.
Moreover, the series converges to $f$ in mean square, uniformly for all $t \in I$.
Proof. We have
$\mathbb{E}[f_j] = \big\langle \mathbb{E}[f], \varphi_j \big\rangle = 0$
and
$\mathbb{E}[f_i f_j] = \int_I \int_I k(t, s)\, \varphi_i(t)\, \varphi_j(s)\, w(t)\, w(s)\, dt\, ds = \lambda_j \langle \varphi_i, \varphi_j \rangle = \lambda_j\, \delta_{ij}$
due to the orthonormality of $\{\varphi_j\}$. Thus, the $f_j$ are pairwise orthogonal in $L^2(\Omega, P)$ and $\mathbb{E}[f_j^2] = \lambda_j$. To show the mean square convergence, let $f^{(m)}(t) = \sum_{j=1}^{m} f_j\, \varphi_j(t)$ and $e_m(t) = f(t) - f^{(m)}(t)$. Then,
$\mathbb{E}\big[e_m(t)^2\big] = k(t, t) - \sum_{j=1}^{m} \lambda_j\, \varphi_j^2(t) \longrightarrow 0$
uniformly in $t$ by Mercer’s theorem. □
Definition 1. The mean square error at order $m$ is defined as $\mathrm{MSE}_m(t) = \mathbb{E}\big[(f(t) - f^{(m)}(t))^2\big]$. The mean integrated square error (MISE), denoted by $\mathrm{MISE}_m$, is then given by $\mathrm{MISE}_m = \int_I \mathrm{MSE}_m(t)\, w(t)\, dt$, representing the mean square error of projecting $f$ onto the first $m$ elements of the basis, integrated over $I$.
Proposition 1. The MISE tends to 0 as $m \to \infty$.
Proof. The MISE of $f^{(m)}$ satisfies
$\mathrm{MISE}_m = \int_I \Big( k(t, t) - \sum_{j=1}^{m} \lambda_j\, \varphi_j^2(t) \Big)\, w(t)\, dt = \sum_{j > m} \lambda_j,$
which tends to 0 as $m \to \infty$ since the eigenvalues $\{\lambda_j\}$ are absolutely summable. □
Now, we highlight the crucial role of the Karhunen–Loève basis in minimizing the error incurred by truncating the expansion of $f$. By aligning the basis functions with the dominant modes of variation captured by the process, we achieve an optimal representation in terms of mean integrated square error.
Proposition 2. The MISE is minimized if and only if $\{\varphi_j\}$ constitutes an orthonormalization of the eigenfunctions of the Fredholm equation
$\int_I k(t, s)\, \varphi_j(s)\, w(s)\, ds = \lambda_j\, \varphi_j(t),$
with $\{\varphi_j\}$ arranged to correspond to the eigenvalues in decreasing magnitude: $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$.

3. Expansion and Convergence of Gaussian Processes
A stochastic process $\{f(t),\ t \in I\}$ is said to be a Gaussian process (GP) if, for all positive integers $n$ and all choices of $t_1, \dots, t_n \in I$, the random variables $f(t_1), \dots, f(t_n)$ form a Gaussian random vector, which means that they are jointly Gaussian. One of the main advantages of a GP is that it can be represented as a series expansion involving a complete set of deterministic basis functions with corresponding random Gaussian coefficients.
Theorem 3. If $\{f(t),\ t \in I\}$ is a zero-mean GP with covariance $k$, denoted $f \sim \mathcal{GP}(0, k)$, then the Karhunen–Loève expansion projections $f_j$ are independent Gaussian random variables: $f_j \sim \mathcal{N}(0, \lambda_j)$.
Now, we provide a result that establishes important convergence properties, connecting mean square convergence, convergence in probability, and convergence almost surely for sequences of random variables.
Lemma 2.
- 1. The mean square convergence of any sequence of real-valued random variables implies its convergence in probability.
- 2. If $\{X_j\}$ is a sequence of independent real-valued random variables, then the convergence of the series $\sum_j X_j$ in probability implies its convergence almost surely.
Corollary 1. For each $t \in I$, $\sum_{j=1}^{m} f_j\, \varphi_j(t)$ converges to $f(t)$ almost surely as $m \to \infty$.
Proof. Since the $f_j$ are independent, the terms $f_j \varphi_j(t)$ are also independent because the eigenfunctions are deterministic. Thus, Lemma 2 can be applied to the series of independent random variables $\sum_j f_j \varphi_j(t)$: its convergence in probability implies almost sure convergence. Therefore, we only need to prove convergence in probability, which is straightforward because, by Lemma 2, convergence in mean square implies convergence in probability, and the mean square convergence of the partial sums follows from the Karhunen–Loève theorem. □
Example 1.
- 1. The Karhunen–Loève expansion of the Brownian motion on $[0, 1]$, as a centered GP with covariance $k(t, s) = \min(t, s)$, is given by
$f(t) = \sum_{j=1}^{\infty} f_j\, \varphi_j(t),$
where $\varphi_j(t) = \sqrt{2}\, \sin\big((j - \tfrac{1}{2})\pi t\big)$ and $\lambda_j = \dfrac{1}{(j - \frac{1}{2})^2 \pi^2}$.
- 2. The Karhunen–Loève expansion of the Brownian bridge on $[0, 1]$, as a centered GP with covariance $k(t, s) = \min(t, s) - ts$, is given by $f(t) = \sum_{j=1}^{\infty} f_j\, \varphi_j(t)$, where $\varphi_j(t) = \sqrt{2}\, \sin(j \pi t)$ and $\lambda_j = \dfrac{1}{j^2 \pi^2}$.
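The Brownian motion expansion of Example 1 can be simulated directly by drawing the independent coefficients $f_j \sim \mathcal{N}(0, \lambda_j)$ and summing a truncated number of terms; the truncation order and grid below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 500)
m = 200                                   # truncation order

j = np.arange(1, m + 1)
lam = 1.0 / ((j - 0.5) ** 2 * np.pi ** 2)                              # eigenvalues
phi = np.sqrt(2.0) * np.sin((j[None, :] - 0.5) * np.pi * t[:, None])   # eigenfunctions

# Karhunen-Loeve coefficients f_j ~ N(0, lambda_j), independent
f_j = rng.normal(size=m) * np.sqrt(lam)
path = phi @ f_j                          # truncated expansion of one sample path

# The tail sum of eigenvalues gives the MISE of the truncation (total trace is 1/2)
print("MISE of truncation (sum of discarded eigenvalues):", 0.5 - lam.sum())
```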
4. Ill-Conditioned Canonical Gaussian Process Regression
In a regression task, a nonparametric function $f$ is assumed to be a realization of a stochastic GP prior, whereas the likelihood term holds from observations corrupted by a noise term according to the canonical form
$y_i = f(t_i) + \varepsilon_i, \quad i = 1, \dots, N, \qquad (7)$
where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian noise. Given a training dataset $\mathcal{D} = \{(t_i, y_i)\}_{i=1}^{N}$, the posterior distribution over $f = \big(f(t_1), \dots, f(t_N)\big)^\top$ is also Gaussian: $p(f \mid y) = \mathcal{N}(\mu, \Sigma)$. From Bayes’ rule, the posterior mean and covariance are expressed as
$\mu = K\, (K + \sigma^2 I_N)^{-1} y, \qquad \Sigma = K - K\, (K + \sigma^2 I_N)^{-1} K,$
where $K = [k(t_i, t_j)]_{i,j=1}^{N}$ is the prior covariance matrix and $I_N$ is the $N \times N$ identity matrix. The predictive distribution at any test input $t_*$ can be computed in closed form as $p\big(f(t_*) \mid y\big) = \mathcal{N}(\mu_*, \sigma_*^2)$, with
$\mu_* = k_*^\top (K + \sigma^2 I_N)^{-1} y, \qquad (10)$
$\sigma_*^2 = k(t_*, t_*) - k_*^\top (K + \sigma^2 I_N)^{-1} k_*, \qquad (11)$
where $k_* = \big(k(t_*, t_1), \dots, k(t_*, t_N)\big)^\top$.
The covariance function $k$ usually depends on a set of hyperparameters, denoted $\theta_k$, that need to be estimated from the training dataset. The log marginal likelihood for GP regression serves as an indicator of the degree to which the selected model accurately captures the observed patterns, and it is typically used for model selection and optimization. Let $\theta = \{\theta_k, \sigma^2\}$ denote the set of all model parameters; then the log marginal likelihood $\log p(y \mid \theta)$ is given by
$\log p(y \mid \theta) = -\tfrac{1}{2}\, y^\top (K + \sigma^2 I_N)^{-1} y - \tfrac{1}{2} \log\det(K + \sigma^2 I_N) - \tfrac{N}{2} \log(2\pi).$
Here, $\det(\cdot)$ denotes the determinant. The goal is to estimate $\hat{\theta}$ that maximizes the log marginal likelihood. This can be achieved using different methods, such as gradient-based algorithms [27]. The weakness of inferring the posterior mean or the mean prediction, or even learning the hyperparameters from the log marginal likelihood, is the need to invert the $N \times N$ Gram matrix $K + \sigma^2 I_N$. This operation costs $\mathcal{O}(N^3)$, which limits the applicability of standard GPs when the sample size $N$ increases significantly. Furthermore, the memory requirements for GP regression scale as $\mathcal{O}(N^2)$.
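For reference, here is a minimal sketch of exact GP regression implementing the predictive Equations (10) and (11) and the log marginal likelihood above with a Cholesky factorization instead of an explicit inverse; the SE kernel and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def se_cov(a, b, variance=1.0, lengthscale=0.3):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_regression(t, y, t_star, noise=0.1):
    """Exact GP posterior mean/variance and log marginal likelihood (O(N^3) time)."""
    N = t.size
    K = se_cov(t, t) + noise ** 2 * np.eye(N)
    c, low = cho_factor(K)
    alpha = cho_solve((c, low), y)                     # (K + sigma^2 I)^{-1} y
    k_star = se_cov(t_star, t)                         # cross-covariances
    mean = k_star @ alpha
    v = cho_solve((c, low), k_star.T)
    var = se_cov(t_star, t_star).diagonal() - np.einsum("ij,ji->i", k_star, v)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)
    return mean, var, lml

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size)
mean, var, lml = gp_regression(t, y, np.linspace(0, 1, 100))
```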
A covariance function $k(t, s)$ is said to be stationary (isotropic) if it is invariant to translation, i.e., a function of $r = |t - s|$ only. Two commonly used stationary covariance functions for GP regression are the Squared Exponential (SE) and Matérn-$\nu$ kernels, defined by
$k_{\mathrm{SE}}(r) = \sigma_f^2 \exp\!\Big(-\frac{r^2}{2\ell^2}\Big), \qquad k_{\nu}(r) = \sigma_f^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}\, r}{\ell}\Big)^{\nu} K_{\nu}\Big(\frac{\sqrt{2\nu}\, r}{\ell}\Big),$
respectively, where $\sigma_f^2$ is the variance parameter controlling the amplitude of the covariance, $\ell > 0$ is the shape parameter, and $\nu = p + \tfrac{1}{2}$, with $p \in \mathbb{N}$, is the half-integer smoothness parameter controlling its differentiability. Here, $\Gamma$ is the gamma function and $K_{\nu}$ is the modified Bessel function of the second kind. Both the SE and Matérn covariance functions have hyperparameters that need to be estimated from the data during the model training process. A GP with a Matérn-$\nu$ covariance, $\nu = p + \tfrac{1}{2}$, is $p$ times differentiable in the mean-square sense. The SE covariance is the limit of the Matérn-$\nu$ covariance as the smoothness parameter $\nu$ approaches infinity. When choosing between the SE and Matérn covariance functions, it is often a matter of balancing the trade-off between modeling flexibility and computational complexity. The SE covariance function is simpler and more computationally efficient but may not capture complex patterns in data as well as the Matérn covariance function with an appropriate choice of smoothness parameter.
In order to compute (10), we also need to invert the $N \times N$ Gram matrix $K + \sigma^2 I_N$. This task is impractical when the sample size $N$ is large because inverting the matrix leads to $\mathcal{O}(N^2)$ memory and $\mathcal{O}(N^3)$ time complexities [28]. There are several methods to overcome this difficulty. For instance, variational inference proceeds by introducing $n$ inducing points and corresponding $n$ inducing variables. The variational parameters of the inducing variables are learned by minimizing the Kullback–Leibler divergence [16]. Picking $n \ll N$, the complexity reduces to $\mathcal{O}(n^2)$ per test point in prediction and $\mathcal{O}(N n^2)$ per iteration in minimizing the Kullback–Leibler divergence. The computational complexity of conventional sparse Gaussian process (SGP) approximations typically scales as $\mathcal{O}(N n^2)$ in time for each step of evaluating the marginal likelihood [14]. The storage demand scales as $\mathcal{O}(N n)$. This arises from the unavoidable cost of re-evaluating all results involving the basis functions at each step and the need to store the matrices required for these calculations.
5. Low-Complexity Gaussian Process Regression
In order to avoid the inversion of the $N \times N$ Gram matrix $K + \sigma^2 I_N$, we use the approximation scheme presented in Section 3 and rewrite the GP with a truncated set of $m$ basis functions. Hence, the truncated $f$ at an arbitrary order $m$ is given by
$f^{(m)}(t) = \sum_{j=1}^{m} f_j\, \varphi_j(t),$
with an approximation error $e_m(t) = f(t) - f^{(m)}(t)$. The canonical GP regression model (7) becomes
$y_i = f^{(m)}(t_i) + \varepsilon_i, \quad i = 1, \dots, N,$
with a covariance function approximated by $k_m(t, s) = \sum_{j=1}^{m} \lambda_j\, \varphi_j(t)\, \varphi_j(s)$. The convergence degree of the Mercer series in (4), that is, the rate at which
$\sup_{t, s \in I} \big| k(t, s) - k_m(t, s) \big| \longrightarrow 0 \quad \text{as } m \to \infty,$
depends heavily on the eigenvalues and the differentiability of the covariance function. Ref. [29] showed that the speed of the uniform convergence varies in terms of the decay rate of the eigenvalues and demonstrated that, for a finitely differentiable covariance $k$, the truncated covariance $k_m$ approximates $k$ at a polynomial rate in $m$ governed by the degree of smoothness, while, for infinitely differentiable covariances, the convergence is faster than any polynomial rate. To summarize, smoother covariance functions tend to exhibit faster convergence, while less smooth or non-differentiable covariance functions may exhibit slower or no convergence.
The resulting covariance falls into the class of reduced-rank approximations based on approximating the covariance matrix $K$ with a matrix $\widetilde{K} = \Phi\, \Lambda\, \Phi^\top$, where $\Lambda$ is an $m \times m$ diagonal matrix with eigenvalues such that $\Lambda_{jj} = \lambda_j$ and $\Phi$ is an $N \times m$ matrix with eigenfunctions such that $\Phi_{ij} = \varphi_j(t_i)$. Note that the approximate covariance matrix $\widetilde{K}$ becomes ill-conditioned when the ratio $\lambda_1 / \lambda_m$ is large. This ill-conditioning occurs particularly when the observation points $t_i$ are too close to each other [30]. In practice, this can lead to significant numerical errors when inverting $\widetilde{K} + \sigma^2 I_N$, resulting in unstable solutions, amplified errors in parameter estimation, and unreliable model predictions.
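The ill-conditioning can be observed numerically: for closely spaced inputs and a smooth kernel, the eigenvalue ratio of the covariance matrix explodes, while adding the noise term tempers it. The grid, kernel, and truncation index below are arbitrary assumptions used for illustration.

```python
import numpy as np

def se_cov(t, s, lengthscale=0.1):
    return np.exp(-0.5 * ((t[:, None] - s[None, :]) / lengthscale) ** 2)

t = np.linspace(0.0, 1.0, 200)             # closely spaced observation points
K = se_cov(t, t)
eigvals = np.linalg.eigvalsh(K)[::-1]       # lambda_1 >= lambda_2 >= ...

m = 10
print("eigenvalue ratio lambda_1 / lambda_m:", eigvals[0] / eigvals[m - 1])
print("condition number of K:              ", np.linalg.cond(K))

sigma2 = 1e-2                               # the noise term regularizes the inversion
print("condition number of K + sigma^2 I:  ", np.linalg.cond(K + sigma2 * np.eye(t.size)))
```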
Now, we show how the approximated regression models make use of the GP decomposition to achieve low complexity. We write down the expressions needed for inference and discuss the computational requirements. Applying the matrix inversion lemma [31], we rewrite the predictive distribution (10) and (11) as
$\mu_* \approx \phi_*^\top \big(\Phi^\top \Phi + \sigma^2 \Lambda^{-1}\big)^{-1} \Phi^\top y,$
$\sigma_*^2 \approx \sigma^2\, \phi_*^\top \big(\Phi^\top \Phi + \sigma^2 \Lambda^{-1}\big)^{-1} \phi_*,$
where $\phi_*$ is an $m$-dimensional vector with the $j$-th entry being $\varphi_j(t_*)$. When the number of observations is much higher than the number of required basis functions ($m \ll N$), the use of this approximation is advantageous. Thus, any prediction mean evaluation is dominated by the cost of constructing $\Phi^\top \Phi$, which means that the method has an overall asymptotic computational complexity of $\mathcal{O}(N m^2)$.
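Below is a minimal sketch of these reduced-rank predictive equations, assuming $\Phi$, $\Lambda$, and the test-point vector $\phi_*$ are built from some chosen eigenbasis; here the Brownian-motion basis of Example 1 is used purely as a placeholder.

```python
import numpy as np

def kl_features(t, m):
    """Brownian-motion eigenpairs of Example 1, used as a placeholder basis."""
    j = np.arange(1, m + 1)
    phi = np.sqrt(2.0) * np.sin((j[None, :] - 0.5) * np.pi * t[:, None])   # N x m
    lam = 1.0 / ((j - 0.5) ** 2 * np.pi ** 2)                              # eigenvalues
    return phi, lam

def reduced_rank_predict(t, y, t_star, m=20, sigma2=0.01):
    """Predictive mean/variance with O(N m^2) cost via the matrix inversion lemma."""
    Phi, lam = kl_features(t, m)
    Phi_star, _ = kl_features(t_star, m)
    Z = Phi.T @ Phi + sigma2 * np.diag(1.0 / lam)       # m x m matrix
    Zinv_Phity = np.linalg.solve(Z, Phi.T @ y)
    mean = Phi_star @ Zinv_Phity
    var = sigma2 * np.einsum("ij,ji->i", Phi_star, np.linalg.solve(Z, Phi_star.T))
    return mean, var

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 1, 1000))
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size)
mean, var = reduced_rank_predict(t, y, np.linspace(0, 1, 200))
```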
The approximate log marginal likelihood satisfies
$\log p(y \mid \theta) \approx -\frac{1}{2\sigma^2} \Big( y^\top y - y^\top \Phi\, Z^{-1} \Phi^\top y \Big) - \frac{1}{2} \Big( (N - m) \log \sigma^2 + \log\det(\Lambda) + \log\det(Z) \Big) - \frac{N}{2} \log(2\pi),$
where $Z = \Phi^\top \Phi + \sigma^2 \Lambda^{-1}$. Consequently, evaluating the approximate log marginal likelihood has a complexity of $\mathcal{O}(m^3)$, needed to invert the $m \times m$ matrix $Z$. In practice, if the sample size $N$ is large, it is preferable to store the result of $\Phi^\top \Phi$ in memory, leading to a memory requirement that scales as $\mathcal{O}(m^2)$. For efficient implementation, matrix-to-matrix multiplications can often be avoided, and the inversion of $Z$ can be performed using Cholesky factorization for numerical stability. This factorization ($Z = L L^\top$) can also be used to compute the term $\log\det(Z) = 2 \sum_{j=1}^{m} \log L_{jj}$. Algorithm 1 outlines the main steps for estimating the hyperparameters of the low-complexity GP.
Algorithm 1 Gradient Ascent for Hyperparameter Learning.
- Require: Data $\{(t_i, y_i)\}_{i=1}^{N}$, initial hyperparameters $\theta^{(0)}$, learning rate $\eta$, tolerance $\epsilon$, maximum iterations $T$.
- Ensure: Optimized hyperparameters $\hat{\theta}$.
- 1: Initialize $\theta \leftarrow \theta^{(0)}$.
- 2: for $\tau = 1$ to $T$ do
- 3: Construct $\Phi$ and $\Lambda$ depending on $\theta$.
- 4: Compute $Z = \Phi^\top \Phi + \sigma^2 \Lambda^{-1}$.
- 5: Evaluate $\log p(y \mid \theta)$.
- 6: for each hyperparameter $\theta_i$ do
- 7: Compute gradient $\partial \log p(y \mid \theta) / \partial \theta_i$.
- 8: Update hyperparameter: $\theta_i \leftarrow \theta_i + \eta\, \partial \log p(y \mid \theta) / \partial \theta_i$.
- 9: end for
- 10: if $\big|\Delta \log p(y \mid \theta)\big| < \epsilon$ then
- 11: break
- 12: end if
- 13: end for
- 14: return optimized hyperparameters $\hat{\theta}$.
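In the spirit of Algorithm 1 and the Cholesky-based evaluation above, the sketch below evaluates the approximate log marginal likelihood through $Z = \Phi^\top \Phi + \sigma^2 \Lambda^{-1}$ and updates the hyperparameters by gradient ascent; the sine basis, the mapping from hyperparameters to eigenvalues, and the finite-difference gradients (standing in for the analytic gradients of step 7) are simplifying assumptions, not the paper's operator-specific construction.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def features(t, m):
    # Placeholder orthonormal sine basis on [0, 1]; the operator-specific
    # eigenfunctions of Section 6 would be used here instead.
    j = np.arange(1, m + 1)
    return np.sqrt(2.0) * np.sin(j[None, :] * np.pi * t[:, None])

def eigenvalues(theta, m):
    # Hypothetical mapping theta -> (lambda_1, ..., lambda_m): amplitude times a
    # polynomial decay. Not the paper's actual spectrum.
    amp, decay = np.exp(theta[:2])
    j = np.arange(1, m + 1)
    return amp * j ** (-2.0 * decay)

def approx_lml(theta, t, y, m):
    """Approximate log marginal likelihood using Z = Phi^T Phi + sigma^2 Lambda^{-1}."""
    sigma2 = np.exp(theta[2])
    Phi = features(t, m)
    lam = eigenvalues(theta, m)
    Z = Phi.T @ Phi + sigma2 * np.diag(1.0 / lam)
    c, low = cho_factor(Z)
    Phity = Phi.T @ y
    quad = (y @ y - Phity @ cho_solve((c, low), Phity)) / sigma2
    logdet_Z = 2.0 * np.sum(np.log(np.diag(c)))
    N = t.size
    return -0.5 * (quad + logdet_Z + np.sum(np.log(lam))
                   + (N - m) * np.log(sigma2) + N * np.log(2 * np.pi))

def gradient_ascent(t, y, m=20, lr=1e-4, tol=1e-6, T=500):
    theta = np.zeros(3)                      # log(amp), log(decay), log(sigma^2)
    prev = -np.inf
    for _ in range(T):
        val = approx_lml(theta, t, y, m)
        grad = np.array([(approx_lml(theta + 1e-5 * e, t, y, m) - val) / 1e-5
                         for e in np.eye(3)])        # finite-difference gradient
        theta += lr * grad                            # gradient ascent step
        if abs(val - prev) < tol:
            break
        prev = val
    return theta

rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 1, 500))
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size)
theta_hat = gradient_ascent(t, y)
```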
6. Closed Solutions from Differential Operators
In this section, we highlight the connection between the Hilbert–Schmidt integral operator in (3) and differential operators. Thus, we describe explicit solutions for the low-complexity GP (LCGP) with covariances derived from differential operators. Unlike previous works where the eigen-decomposition is determined from the Mercer series or the integral operator directly, this paper focuses on constructing covariance functions that incorporate orthogonal polynomials as eigenfunctions. It is worth noting that polynomials can approximate a wide range of functions with various degrees of complexity. They can be adjusted to predict different data patterns and capture both linear and non-linear relationships [32].
The connection between a differential operator, denoted $\mathcal{L}$, and the integral operator $\mathcal{T}_k$ has been widely used. We follow the same idea as in [33,34,35] and define the Green’s function $G$ of the differential operator $\mathcal{L}$ as its “right inverse”, i.e.,
$\mathcal{L}\, G(t, s) = \delta(t - s),$
where $\delta$ denotes the Dirac delta function.
Theorem 4. Let $k$ be continuous, symmetric, and non-negative definite, and let $\mathcal{T}_k$ be the corresponding Hilbert–Schmidt operator. Let $\{\varphi_j\}$ be an orthonormal basis for the space spanned by the eigenfunctions corresponding to the non-zero eigenvalues of $\mathcal{T}_k$. If $\varphi_j$ is the eigenfunction associated with the eigenvalue $\lambda_j$ and the covariance function $k$ acts as a Green’s function of a differential operator $\mathcal{L}$, then the eigenvalues of $\mathcal{L}$ correspond to the reciprocal eigenvalues of $\mathcal{T}_k$, while the corresponding eigenfunctions remain the same.
Proof. We have
$\varphi_j(t) = \frac{1}{\lambda_j}\, \big(\mathcal{T}_k \varphi_j\big)(t) = \frac{1}{\lambda_j} \int_I k(t, s)\, \varphi_j(s)\, w(s)\, ds.$
Applying $\mathcal{L}$ to both sides and using the Green’s function property of $k$ to invert the integral operator, we finally obtain
$\mathcal{L}\, \varphi_j(t) = \frac{1}{\lambda_j}\, \varphi_j(t),$
which completes the proof. □
At this stage, we compute the eigenvalues and eigenfunctions of $\mathcal{L}$, from which we deduce the Mercer decomposition of $k$ given in (4) by replacing the eigenvalues $\lambda_j$ of $\mathcal{T}_k$ with the reciprocals of the eigenvalues of $\mathcal{L}$. We select a list of some interesting and useful differential operators that act on $L^2_w(I)$, for example: Matérn, Legendre, Laguerre, Hermite, Chebyshev, and Jacobi, from which we explicitly find the corresponding eigen-decompositions. Table 1 summarizes each class of $\mathcal{L}$, the index set $I$, the weight function $w$, the eigenvalues, the eigenfunctions as polynomials, and the resulting MISE. Note that, for the Laguerre, Hermite, and Chebyshev polynomials, the eigenvalues of $\mathcal{L}$ are initially negative. Therefore, we consider the iterated operator $\mathcal{L}^2$ with squared eigenvalues and unchanged eigenfunctions. Further, for Legendre, Hermite, and Chebyshev, the polynomials do not have unit norm, which means that they should be normalized to form an orthonormal basis. For Jacobi, the differential operator is
$\mathcal{L} = (1 - t^2)\, \frac{d^2}{dt^2} + \big(\beta - \alpha - (\alpha + \beta + 2)\, t\big) \frac{d}{dt},$
with parameters $\alpha, \beta$ greater than $-1$. More details about the corresponding eigenfunctions and the $L^2_w$-norm are provided in [36].
Figure 1 illustrates the behavior of the eigenvalues $\lambda_j$ of the integral operator $\mathcal{T}_k$ as the index $j$ varies from 1 to 40; their rapid decay is attributed to the smoothness of the true covariance and explains the accuracy of the truncation as $m$ grows.
Figure 2 visually depicts several GP realizations across various differential operator settings.