Wavelet Multiresolution Analysis-Based Takagi–Sugeno–Kang Model, with a Projection Step and Surrogate Feature Selection for Spectral Wave Height Prediction

Korkidis, Panagiotis; Dounis, Anastasios

doi:10.3390/math13152517

Open AccessFeature PaperArticle

Wavelet Multiresolution Analysis-Based Takagi–Sugeno–Kang Model, with a Projection Step and Surrogate Feature Selection for Spectral Wave Height Prediction

by

Panagiotis Korkidis

^*

and

Anastasios Dounis

Department of Biomedical Engineering, University of West Attica, 12243 Athens, Greece

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(15), 2517; https://doi.org/10.3390/math13152517

Submission received: 6 July 2025 / Revised: 31 July 2025 / Accepted: 3 August 2025 / Published: 5 August 2025

(This article belongs to the Special Issue Applications of Mathematics in Neural Networks and Machine Learning)

Download

Browse Figures

Versions Notes

Abstract

The accurate prediction of significant wave height presents a complex yet vital challenge in the fields of ocean engineering. This capability is essential for disaster prevention, fostering sustainable development and deepening our understanding of various scientific phenomena. We explore the development of a comprehensive predictive methodology for wave height prediction by integrating novel Takagi–Sugeno–Kang fuzzy models within a multiresolution analysis framework. The multiresolution analysis emerges via wavelets, since they are prominent models characterised by their inherent multiresolution nature. The maximal overlap discrete wavelet transform is utilised to generate the detail and resolution components of the time series, resulting from this multiresolution analysis. The novelty of the proposed model lies on its hybrid training approach, which combines least squares with AdaBound, a gradient-based algorithm derived from the deep learning literature. Significant wave height prediction is studied as a time series problem, hence, the appropriate inputs to the model are selected by developing a surrogate-based wrapped algorithm. The developed wrapper-based algorithm, employs Bayesian optimisation to deliver a fast and accurate method for feature selection. In addition, we introduce a projection step, to further refine the approximation capabilities of the resulting predictive system. The proposed methodology is applied to a real-world time series pertaining to spectral wave height and obtained from the Poseidon operational oceanography system at the Institute of Oceanography, part of the Hellenic Center for Marine Research. Numerical studies showcase a high degree of approximation performance. The predictive scheme with the projection step yields a coefficient of determination of

0.9991

, indicating a high level of accuracy. Furthermore, it outperforms the second-best comparative model by approximately

49%

in terms of root mean squared error. Comparative evaluations against powerful artificial intelligence models, using regression metrics and hypothesis test, underscore the effectiveness of the proposed methodology.

Keywords:

wave height prediction; Takagi–Sugeno–Kang model; wavelets multiresolution analysis; Bayesian optimisation; feature selection

MSC:

68T05

1. Introduction

The ocean, which spans approximately

71%

of the Earth’s surface, plays a vital role in shaping weather patterns, sustaining ecosystems, and influencing human endeavours. Its deep impact cannot be overstated; therefore, studying various physical factors such as waves, ocean temperature, currents, and sea levels is essential. Wave height is a factor closely connected to human activities. Accurate wave height prediction is important for numerous reasons. From a navigation perspective, it enhances safety by preventing potential disasters. Additionally, it optimises marine operations, yielding improvements in vessel routing and achieving cost savings.

At the same time, the ocean is a significant source of abundant and renewable energy. Among various forms of ocean energy, such as tides, currents, temperature gradients, etc., waves are the primary source. As the global demand for consistently high levels of electricity from renewable sources continues to grow, leveraging ocean energy can make a meaningful contribution to our energy system. The estimated potential of global ocean wave energy is about 2 TW [1]. Furthermore, it has the potential to play a vital role in reducing greenhouse gas emissions and the reliance on fossil fuels, thereby supporting our commitment to the protection of the environment. In this context, the deployment of Wave Energy Converters (WECs) [2], which convert ocean wave energy to electricity, is essential for harnessing ocean potential. By combining this technological infrastructure and high-accuracy wave prediction methods, ocean energy could meet about

10%

of the EU’s power demand by 2050.

The accurate prediction of wave height is essential, not only for ensuring safety in marine activities but also for the prediction of wave energy flux at a given location, as given by Equation (1) [3].

P = \frac{1}{64 π} ρ g^{2} H_{s}^{2} T

(1)

where P denotes the wave energy (measured in kW/m),

ρ

the seawater density (∼

1028 {kg / m}^{3}

), g is the gravitational acceleration,

H_{s}

represents the significant wave height, and T denotes the wave energy period.

Due to the increasing interest on wave energy systems, as well as the challenging task of accurate wave height prediction, various approaches have been proposed. These approaches can be distinguished into physical-driven and data-driven [4]. The former refers to methods based on numerical models that rely on principles derived from spectral energy or dynamic spectral equilibrium equations. An example is the WAM wave model [5]—a third-generation wave model that explicitly solves the wave transport equation without making assumptions about the shape of the wave spectrum. It accurately represents the physics of wave evolution and utilises the complete set of degrees of freedom in a two-dimensional wave spectrum [3]. Another example of a numerical model utilising physical principles is Simulating Waves Nearshore (SWAN) [6], while numerical models are capable of predicting wave parameters across extensive domains, they entail significant computational costs, since they demand large amounts of oceanographic and meteorological data. Consequently, they might be impractical for engineering applications, e.g., in scenarios where only short-term predictions are needed for a specific location.

At an age where data are readily available and the interest toward artificial intelligence models increases, data-driven methods are absolutely relevant. Such methods have been adopted by numerous researchers in terms of generating high-accuracy wave height prediction in a time-efficient manner. Data-driven methods are based on statistical approaches, machine and deep learning, as well as hybrid models. Given the broad interpretation of the term hybrid, in this study, such models specifically refer to the systematic combination of methods from predictive modelling, optimisation, and decomposition approaches.

In most cases, the available data consist of time series corresponding to wave height, wind speed, and other meteorological variables, sourced from observations made by deployed buoys. Ikram et al. in [7], study metaheuristic regression models, such as multivariate adaptive regression splines, Gaussian processes, random forests, etc., for short-term significant wave height prediction. They follow an autoregressive approach by considering previous values of wave height to form the input space of the models. In [1], the authors develop nested neural networks by replacing the activation functions of the neurons with an evolutionary-based trained adaptive-network-based fuzzy inference system (ANFIS). Their optimisation process is based on particle swarm optimisation (PSO). Their primary task involves predicting significant wave height in the North Sea, by adopting two features; the wind speed and the wind direction. An enhanced PSO algorithm combined with principal component analysis (PCA) for dimensionality reduction is proposed by Yang et al. in [8], to upscale the prediction performance of support vector machines. In the light of evolutionary algorithms, Zanganeh [9] develops a PSO-based ANFIS, to study wind-driven waves in Lake Michigan. A thorough study evaluating the efficiency of several machine learning models, such as support vector machines (SVMs), ANFIS, etc., for wave height prediction, is studied in [10]. The authors argue that wave height is influenced by wind speed; therefore, the model input includes historical wind speed data. Machine learning models are also considered by Gracia et al. in [11] to further improve predictions generated from numerical models. Yet another comparative study is provided in [12], highlighting the prediction performance of adaptive network-based fuzzy inference system, which outperforms powerful machine learning models. In the latter, authors consider the model predictors to consist of wind direction, wind speed, and temperature, thus treating the problem in a classical function approximation framework. A hybrid model that combines singular value decomposition (SVD) with fuzzy modelling is proposed in [13]. In this study, the correlation coefficient of past significant wave height values is considered as a method for selecting the inputs. The authors in [14] propose a fuzzy-based cascade ensemble in a layer configuration and use evolutionary algorithms, in particular, a coral reef optimisation algorithm with substrate layers (CRO-SL) for effectively tuning antecedent parameters.

Deep learning techniques are also widely considered, when developing wave height predictive methodologies. It should be noted that predictive models are often combined within a time series decomposition framework, which enhances the accuracy of the predictions. This widely adopted approach is effective since natural phenomena, e.g., wave height, are influenced by both deterministic and stochastic factors [15]. The latter approach is studied by Zhou et al. [16] by developing a bidirectional LSTM (BiLSTM) network into an optimised variational mode decomposition (VMD) framework. The model’s feature selection is based on mutual information (MI), and an analysis of prediction uncertainty is further conducted. Both empirical (EMD) and variational mode decomposition methods have been combined with an LSTM in [17], considering wave height and wind speed as model inputs. A synergy of convolutional neural network and LSTM is utilised in [18]. Convolutional networks and LSTMs are combined in various research studies, such as [19]. In [20], a combination with a seasonal autoregressive integrated moving average model (SARIMA), on the basis of a fast fourier transform (FFT) decomposition, is proposed. An interesting application of a deep learning technique has been proposed in [21], where the problem of wave height prediction is approached as a spatiotemporal modelling task and a novel deep network Meme-Unet is developed.

One issue of vital importance in the development of predictive methodologies is feature selection, which entails identifying the most relevant inputs for the models. Feature selection process significantly influences the accuracy of the predictive outcomes and also reflects the model’s complexity. Most models scale with the input dimension, demanding greater computational effort during training [22]. Despite its importance, relatively few studies have addressed this task from a systematic perspective. For instance, Luo et al. in [4] explore the feature selection problem through the lens of Shapley values, thereby offering an interpretable approach. A grouping genetic algorithm for feature selection has been incorporated into the predictive scheme in [23]. Relevant input selection has also been systematically studied in [24], where the features are ranked according to their corresponding prediction error. Despite the existence of relevant studies, as far as the feature selection is concerned, there are still grounds for improvements. Small, e.g., models with low complexity, and fast-trained models are key for practical applications.

The contributions of this paper consist of several parts. First, we develop a novel wrapper surrogate-based algorithm for feature selection by utilising Bayesian optimisation. This approach overcomes the time-intensive nature of traditional sequential greedy methods. Second, we develop a Takagi–Sugeno–Kang model that employs a deep learning algorithm within a hybrid learning framework, combining least squares with AdaBound. The latter addresses the challenges posed by time-consuming evolutionary tuning methods and enhances error convergence. Many methodologies in the field are based on deep learning models to carry out the prediction task. While these models are known for their accuracy, they are often perceived as black boxes. This study proposes an alternative prediction approach through the use of novel fuzzy models. This approach not only provides a basis for analysing results generated by simpler models but also avoids the complexity of deep learning methods. Additionally, fuzzy systems offer inherent interpretability, expressed through fuzzy rules. In addition, we propose the integration of wavelet multiresolution analysis, generated by the maximal overlap discrete wavelet transform, for efficient decomposition of the wave height time series. Finally, we improve the prediction accuracy of the novel fuzzy predictive scheme by introducing a projection step.

To the best of our knowledge, a Bayesian optimisation approach has not been studied yet in the context of wrapper-based methods. In addition, this version of the Takagi–Sugeno–Kang model, using a hybrid scheme of AdaBound with least squares has not yet been applied in the context of significant wave height prediction. Finally, as far as we know, wrapper-based algorithms are rarely studied when developing wave height prediction methods.

To enhance understanding and support implementation in programming languages such as Matlab, Python, C++, algorithmic pseudocodes are presented throughout the paper.

2. Materials and Methods

This section provides a brief overview of the methods used in this paper. It includes a small reference to the notation, discusses the problem from a mathematical perspective, and provides a comprehensive description of the models and methods employed to develop the proposed methodology for significant wave height prediction.

2.1. Notation

Throughout this paper the following notation has been used:

N

denotes the set of natural numbers and

Z

the set of integers. The field of real numbers is denoted by

R

. Moreover, for

N \in N

, the set

{1, \dots, N}

is denoted as

[N]

. If

K

is an arbitrary set, then

K_{+}^{*}

denotes

{k \in K | k > 0}

, i.e., the non-negative members of

K

. The

l^{th}

fuzzy rule is denoted as

R^{l}

, and

A_{j}^{l}

denotes the fuzzy set of the

j^{th}

input, corresponding to the

l^{th}

rule. The set of model parameters is denoted as

P

, and

M (\cdot | ϑ)

corresponds to a model realisation for a given

ϑ \in P

. Moreover, the m-dimensional input/feature space is denoted as

X

, whereas

Y

denotes the associated output space. By

| | \cdot ‖_{p}

, where

p \in [1, \infty]

, we denote the

l^{p}

-norm. The one-dimensional space of the input

x_{j}

is denoted as

X_{j}

. The sets

C (A)

and

S_{p} (A)

correspond to the core and support of the fuzzy set

A

, respectively. The k-times continuously differentiable functions from

Ω \to R

are denoted as

C^{k} (Ω)

, with compact support

Ω

. The space of all square integrable functions defined over

R

is denoted as

L^{2} (R)

.

2.2. Problem Statement

Let the data set

D = {x^{i}, y^{i}}_{i \in [N]}

, with m-dimensional features

x^{i} \in X \subseteq R^{d}

,

d \in N

, and the corresponding targets

y^{i} \in Y \subseteq R

,

\forall i \in [N]

. Let N denote the total number of available observations in the time series. The goal in predictive modelling is to develop a learning algorithm that utilises training data to generate a model

M ({x^{i}, y^{i}} | ϑ)

from a hypothesis set

H

, consisting of all model realisations for the given training data and possible model parameters, i.e.,

H = {M ({x^{i}, y^{i}} | ϑ) : ϑ \in P}

. The model

M ({x^{i}, y^{i}} | ϑ) \in H

should exhibit high generalisation performance, which is measured via a loss function

L

on unseen data.

This paper addresses the prediction problem within the context of univariate time series forecasting in discrete time, since we are dealing with observed temporal instances of significant wave height. Given time series data comprising N observations, the objective focuses on the prediction of a future realisation of the variable under consideration, based on past time observations. Thus, the future realisation, for a given horizon h, can be expressed using Equation (2).

x^{i + h} = \hat{f} (x^{i}) + ϵ_{i}

(2)

where

x^{i + h}

represents the future value at horizon h,

\hat{f}

denotes the

M ({x^{i}, y^{i}} | ϑ)

model’s prediction given features

x^{i} = (x_{1}^{i}, \dots, x_{m}^{i}) = (x (t - m), \dots, x (t - 1))

, and

ϵ_{i}

represents the independent and identically distributed residuals. Furthermore, it should be noted that

y^{i} = x^{i + h} = x (t)

. Therefore, for

h = 1

, the model’s prediction of the variable’s future realisation is governed by Equation (3).

{\hat{x}}^{i + 1} = \hat{f} (x (t - m), \dots, x (t - 1)), m \in N

(3)

Equation (3) describes the mapping

(x (t - m), \dots, x (t - 1)) \in X \subseteq R^{d} \to x (t) \in Y \subseteq R

. The space

X

is generated by selecting an embedding dimension,

m \in N

, with

m \leq d

, forming the model’s necessary predictors

x_{j}^{i}

, with

j \in [m]

. The selection of the embedding dimension m will be discussed in the section associated with feature selection.

The methodology development follows the standard procedure within a supervised learning framework. The time series data are divided into a training set

D^{tr}

, for model training and optimisation, and a test set

D^{*}

, for assessing the model’s generalisation capability.

2.3. Predictive Model

2.3.1. Structure

A Takagi–Sugeno–Kang (TSK) model, a nuance of fuzzy systems, has been adopted as the predictive model, hence realising

M ({x^{i}, y^{i}} | ϑ)

. Emerging from the computational intelligence literature, TSK models are recognised for their effectiveness in addressing function approximation problems. The strength of these models lies in their capability to accurately approximate unknown maps between the m-dimensional input and the output space while also providing a level of interpretability. The decision to employ an approach based on this type of model stems from the extensive and successful application of fuzzy models across various scientific and engineering disciplines. Our decision is further supported by considering that Wu et al., in [25], proved that a Takagi–Sugeno–Kang fuzzy model is functional-equivalent with neural networks, a mixture of experts as well as stacking ensemble regression models, all of which are powerful machine learning techniques. Fuzzy systems perform the approximation of unknown nonlinear functions, through an IF-THEN rule-based inference framework. However, TSK models differ from Mamdani fuzzy systems in terms of the fuzzy rule consequents; while the latter utilises fuzzy sets, the former constructs the fuzzy rule’s consequent as a parametrised function of the model’s inputs.

In what follows the mathematical formalism of the Takagi–Sugeno–Kang model is provided, along with its main universal approximation theorems. Furthermore, its neural representation, which will enhance understanding on how training and optimisation are performed, is illustrated.

Consider an m-dimensional input to the fuzzy model,

x = (x_{1}, \dots, x_{m}) \in X \subseteq R^{d}

, consisting of N observations. The

l^{th}

rule of the TSK fuzzy model is given in Equation (4).

R^{l} : I F x_{1} is A_{1}^{l} \land \dots \land x_{m} is A_{m}^{l}, T H E N φ^{l} (x) = \sum_{i \in [m]} α_{l i} x_{i} + α_{l 0}

(4)

where

A_{j}^{l}

corresponds to the fuzzy set of

x_{j}

input’s fuzzy partition, associated with the

l^{th}

rule. In addition,

l \in [r]

, where r is the number of fuzzy rules. The symbol ∧ denotes an arbitrary fuzzy connective, i.e., a t-norm. The consequents of each rule, i.e.,

φ^{l} (x)

, are given by a linear combination of the model’s inputs and the parameters

{α_{l i}}_{i \in [m \cup {0}]}

.

The fuzzy sets are generated by Gaussian membership functions, given by Equation (5).

μ_{A_{j}^{l}} (x_{j}) = \exp (- 1 / 2 ({(x_{j} - c_{l j})}^{2} / σ_{l j}^{2}))

(5)

where

c_{l j} \in C (A_{j}^{l}) \subseteq S_{p} (A_{j}^{l})

, and

σ_{l j}

is the standard deviation. It should be noted that

C (A_{j}) = {x \in X_{j} \subseteq X : μ_{A_{j}} (x) = 1}

and

S_{p} (A_{j}) = {x \in X_{j} \subseteq X : μ_{A_{j}} (x) \geq 0}

.

The combined membership for the

l^{th}

rule’s consequent, computed using any type of t-norm, is given by Equation (6).

μ^{l} = μ_{A_{1}^{l}} \land μ_{A_{2}^{l}} \land \dots \land μ_{A_{m}^{l}}

(6)

If the normalised combined membership for the

l^{th}

rule’s consequent (Equation (7)), i.e., the fuzzy basis functions, is computed by the following:

{\tilde{μ}}^{l} = μ^{l} / \sum_{j \in [r]} μ^{j}

(7)

then the output of the Takagi–Sugeno–Kang fuzzy model,

f_{tsk}

, can be expressed as an expansion in terms of the fuzzy basis functions (Equation (8)).

f_{tsk} (x) = \sum_{l \in [r]} ({\tilde{μ}}^{l} (\sum_{i \in [m]} α_{l i} x_{i} + α_{l 0}))

(8)

The neural representation of a Takagi–Sugeno–Kang fuzzy model is illustrated in Figure 1. The figure provides a layer-based illustration of the TSK’s functionality. The input layer, referred to as

{Layer}_{F}

, is where the m-dimensional input is introduced to the model. The fuzzification, i.e., the generation of fuzzy partitions for each input

x_{j}

, occurs within

{Layer}_{A}

. This layer includes the antecedent parameters, which consist of the cores and standard deviations of each

A_{j}^{l}

, for

j \in [m]

and

l \in [r]

. Within

{Layer}_{B}

, Equation (6) is implemented, while the fuzzy basis functions are computed in

{Layer}_{C}

. An aggregation of the fuzzy basis functions with the consequent parameters for each fuzzy rule is carried out in

{Layer}_{D}

. Thus,

{Layer}_{D}

includes the parameters corresponding to each fuzzy rule’s functional. Finally, the local models, i.e., the weighted

φ^{l} (x)

, for all fuzzy rules, are appropriately aggregated to yield the fuzzy model’s output, denoted as

f_{tsk} (x)

, as illustrated in

{Layer}_{O}

. The inclusion of the two rectangles aims to indicate the locations in which the antecedent and consequent parameters manifest in the scheme.

Fuzzy systems are endowed with an inherent universal approximation property. This key feature allows them to model and approximate any multivariate nonlinear function on a compact set to an arbitrary precision, and it forms the foundation of their theoretical basis, as well as their success in practical applications. The fundamental approximation theorems corresponding to Takagi–Sugeno–Kang fuzzy models, that incorporate linear fuzzy rule functionals, have been proved by Ying in [26]. To provide a solid theoretical basis of the fuzzy predictive model, the main approximation theorems are included in concise manner.

For a Takagi–Sugeno–Kang model with linear fuzzy rule consequents, the following holds:

\forall ε > 0, \exists n^{*} \in Z_{+}^{*}

such that

n > n^{*}

,

‖ f_{tsk}^{n} (x) - P_{M} {(x) | |}_{X} = {max}_{x \in X} | f_{tsk}^{n} (x) - P_{M} (x) | < ε

, where

P_{M}

is a multivariate polynomial of M degree defined in

X

, and

X

denotes m-dimensional product space. Mathematically,

f_{tsk}^{n} (x)

is a function sequence—a mapping

f_{tsk}^{n} : X \to R

.

Theorem 1.

Universal approximation theorem ([26]). The general multi-input single-output Takagi–Sugeno–Kang fuzzy model with linear rule consequents can uniformly approximate any multivariate continuous function on a compact domain to any degree of accuracy.

The proof of Theorem 1 proof is constructed in terms of the Weierstrass approximation theorem, using the polynomial

P_{M}

as a connection bridge. Following the arguments provided in [26], and according to the Weierstrass approximation theorem, a multivariate continuous function

f \in C^{0} (X)

can always be uniformly approximated by a

P_{M} (x)

, with accuracy

ε

. Hence,

\forall ε_{1} > 0, ‖ P_{M} - f ‖ < ε_{1}

. Since,

\forall ε_{2} > 0, ‖ f_{tsk}^{n} - P_{M} ‖ < ε_{2}

holds, if

ε_{1}

and

ε_{2}

are chosen such that

ε_{1} + ε_{2} < ε

, then it follows that

‖ f_{tsk}^{n} - f ‖ \leq ‖ f_{tsk}^{n} - P_{M} ‖ + ‖ P_{M} - f ‖ = ε_{2} + ε_{1} < ε

, which concludes that the TSK fuzzy model is a universal approximator.

Further theoretical research, in terms on the mathematical approximation properties of Takagi–Sugeno–Kang fuzzy models, is provided in [27], where Zeng et al. study the sufficient conditions for linear and simplified Takagi–Sugeno–Kang models to be universal approximators. In addition to that, a quite interesting study in terms of necessary conditions for TSK models is provided in [28].

2.3.2. Learning

To date, we have only discussed how to compute the output of the model, given by Equation (8), without addressing the selection of model parameters

ϑ \in P

, and the training methodology. The number of TSK model parameters if linear rule consequents are considered is

2 m n_{A} + (m + 1) {(n_{A})}^{m}

, where

n_{A}

is the number of membership functions consisting of the fuzzy partitions on each of the m inputs.

Definition 1.

Error Measure. Given a finite set of observations in a form of a set

D

, the error measure

E_{D}

is defined as

E_{D} (\hat{f}) = \frac{1}{2} \sum_{i \in [N]} L ({\hat{f}}_{i}, {x^{i}, y^{i}})

(9)

where

{\hat{f}}_{i}

denotes the prediction of a model

M ({x^{i}, y^{i}} | ϑ)

given

x^{i}

, and

L

is a squared-loss function, e.g.,

L ({\hat{f}}_{i}, {x^{i}, y^{i}}) = {({\hat{f}}_{i} - y^{i})}^{2}

.

Since

\hat{f} \equiv f_{tsk}

, we seek the fuzzy Takagi–Sugeno–Kang model which minimises the error measure, i.e.,

M ({x^{i}, y^{i}} | ϑ)

over the hypothesis space

H

, thus,

M^{⋆} = \underset{M \in H}{arg min} E_{D} (\hat{f})

(10)

The model

M^{⋆}

is anticipated to generate predictions that will almost certainly lead to minimal training errors, and it is hoped that it will also demonstrate high accuracy on the test set

D^{*}

.

In general, the training of TSK fuzzy models can be classified into three main categories as follows: training based on evolutionary algorithms, training based on neuro-fuzzy methods, and hybrid training approaches. Evolutionary-based training, although leading to accurate results, faces significant challenges due to its time-intensive nature since it is population-based. In this paper, and from a terminology perspective, the terms neuro-fuzzy and hybrid training methods are differentiated according to fuzzy model consequent parameter computation. Thus, neuro-fuzzy training methods refer to methods that utilise gradient-descent for optimising the whole set of model parameters

P

, whilst hybrid training methods correspond to the well-known ANFIS algorithm [29]; the antecedent parameters are computed via back-propagation while the consequents with least squares.

Let us adopt the notation

f_{tsk} (x) = F (x, P)

to denote the output generated by the fuzzy model, as depicted in Figure 1, where

F

is the overall function applied to the inputs x, as described in Equation (8), and

P

is the model’s parameters set. If one could write

P = P_{1} \oplus P_{2}

and find a function g such that

g \circ F

is linear in the elements of

P_{2}

, if the parameters in

P_{1}

are known, then by using the training observations

D^{tr} \subset D

, the matrix Equation (11) is obtained.

A ϑ_{cons} = y

(11)

where

y \in Y^{tr} \subset Y

and

ϑ_{cons} \in P_{2}

are the parameters of the fuzzy rule consequents. Each row

A_{i}

of the matrix

A \in R^{N \times r (m + 1)}

reads as follows:

A_{i} = [\begin{matrix} \underset{m + 1}{\underset{︸}{\begin{matrix} {\tilde{μ}}_{1}^{1} x_{1}^{1} & {\tilde{μ}}_{1}^{1} x_{2}^{1} & \dots & {\tilde{μ}}_{1}^{1} x_{m}^{1} & {\tilde{μ}}_{1}^{1} \end{matrix}}} & \dots & \underset{m + 1}{\underset{︸}{\begin{matrix} {\tilde{μ}}_{1}^{r} x_{1}^{1} & {\tilde{μ}}_{1}^{r} x_{2}^{1} & \dots & {\tilde{μ}}_{1}^{r} x_{m}^{1} & {\tilde{μ}}_{1}^{r} \end{matrix}}} \end{matrix}]

for all

i \in [N]

, hence

A = {[A_{i}]}_{i \in [N]}

and the consequent parameter vector is

ϑ_{cons} = {[α_{11}, α_{12}, \dots, α_{1 m}, α_{10}, \dots, α_{r 1}, α_{r 2}, \dots, α_{r m}, α_{r 0}]}^{T}

. Moreover,

{\tilde{μ}}_{i}^{l}

denotes the normalised combined membership for the

l^{th}

rule, which pertains to the

i^{th}

observation. The minimiser of Equation (12), provides a closed and optimal solution with respect to the 2-norm of the error.

min_{ϑ_{cons} \in P_{2}} {‖ A ϑ_{cons} - {y | |}_{2}^{2} + λ ‖ ϑ_{cons} {| |}_{2}^{2}}

(12)

hence,

ϑ_{cons}^{⋆} = {(A^{T} A + λ I)}^{- 1} A^{T} y

. The latter solves the augmented least squares problem, which includes a regulariser parameter

λ \in R_{+}

for limiting overfitting and providing numerical stability on ill-posed problems.

To this end, one can predefine the antecedent parameter set, by generating for example uniform fuzzy partitions on each

X_{j}

, and then compute the fuzzy approximation on a given set of features

x^{i}

, by determining the consequent’s parameters. Nevertheless, in most practical applications, this approximation requires further refinement, which is where hybrid training proves beneficial for the training process.

We follow the ANFIS training framework for TSK fuzzy models, yet, instead of using a vanilla version of the gradient descent algorithm, we implement a modified alternative of the Adam optimiser, known as AdaBound [30]. In the context of fuzzy systems, the latter has been studied in [31], but under a neuro-fuzzy training fashion, where both antecedent and consequent parameters were tuned via gradient descent. In this paper, we present the integration of the AdaBound algorithm for training the antecedent parameters, complemented by least squares for the optimal computation of the consequent parameters. This approach introduces a deep learning-based algorithm within the context of ANFIS hybrid training. A significant advantage of employing hybrid training is the substantial reduction in the parameter search space, which enhances the speed of convergence.

In the typical setting, the gradient descent algorithm is employed to approximately minimise the measure

E_{D}

, given by Equation (13).

E_{D} = \frac{1}{2} \sum_{i \in [N]} {(f_{tsk} (x^{i}, {(c_{l j}, σ_{l j})}_{l \in [r], j \in [m]}) - y^{i})}^{2}

(13)

where N denotes the number of observations, l is the fuzzy rule index, and j the number of input. We append the TSK’s output notation with

{(c_{l j}, σ_{l j})}_{l \in [r], j \in [m]}

, to emphasise the dependence on the antecedent parameters. The adaptation of the antecedent parameters

ϑ_{ant}

, is performed progressively over a number of iterations. Consequently, the parameters of the input’s fuzzy partitions, follow the dynamics given by Equation (14).

ϑ_{ant}^{k + 1} \leftarrow ϑ_{ant}^{k} - η^{k} \nabla_{ϑ_{ant}} E_{D} (ϑ_{ant}^{k})

(14)

where

k \in [K]

is the iteration index,

K \in N

is the maximum number of iterations, and

η^{k} \in R

is the learning rate at the k-th iteration.

The gradients of the error measure with respect to each of the antecedent parameters are computed via Equations (15) and (16).

\nabla_{c_{l j}} E_{D} = \frac{1}{2} \sum_{i \in [N]} \sum_{l \in [r]} \frac{\partial E_{D_{}}}{\partial f_{tsk} (x^{i})} \frac{\partial f_{{tsk}_{}} (x^{i})}{\partial μ_{i}^{l}} \frac{\partial μ_{i_{}}^{l}}{\partial μ_{A_{j}^{l}} (x_{j}^{i})} \frac{\partial μ_{A_{j}^{l}} (x_{j}^{i})}{\partial c_{l j}}

(15)

\nabla_{σ_{l j}} E_{D} = \frac{1}{2} \sum_{i \in [N]} \sum_{l \in [r]} \frac{\partial E_{D_{}}}{\partial f_{tsk} (x^{i})} \frac{\partial f_{{tsk}_{}} (x^{i})}{\partial μ_{i}^{l}} \frac{\partial μ_{i_{}}^{l}}{\partial μ_{A_{j}^{l}} (x_{j}^{i})} \frac{\partial μ_{A_{j}^{l}} (x_{j}^{i})}{\partial σ_{l j}}

(16)

The choice of the learning rate

η

is crucial in gradient-based algorithms because it influences the model’s training performance, including convergence speed and stability. Vanilla gradient descent uses a fixed learning rate for all adjustable parameters. However, this approach can lead to large adjustments when gradients are large and smaller adjustments when gradients are small. In addition, since the parametric’s search space landscape can be much steeper in one direction than the other, an a priori choice of

η

that ensures good progress in all directions is quite difficult. The deep learning literature offers valuable solutions in improving gradient descent methods.

Adaptive moment estimation, or Adam, builds on the idea of normalising the gradients, so that the candidate solutions move a fixed distance in all directions along the search space. By incorporating momentum, as detailed in Equations (17) and (18), Adam effectively computes a weighted average of both the gradient and the squared gradient over time.

m^{k + 1} \leftarrow β m^{k} + (1 - β) \nabla_{ϑ_{ant}} E_{D} (ϑ_{ant}^{k})

(17)

v^{k + 1} \leftarrow γ v^{k} + (1 - γ) (\nabla_{ϑ_{ant}} E_{D} (ϑ_{ant}^{k}) ⊙ \nabla_{ϑ_{ant}} E_{D} (ϑ_{ant}^{k}))

(18)

where

β

and

γ

are the momentum parameters, both in

[0, 1)

, and ⊙ is the Hadamard product. Furthermore, the statistics are modified as illustrated in Equations (19) and (20).

{\tilde{m}}^{k + 1} \leftarrow \frac{m^{k + 1}}{1 - β^{k + 1}}

(19)

{\tilde{v}}^{k + 1} \leftarrow \frac{v^{k + 1}}{1 - γ^{k + 1}}

(20)

and the final Adam update is given by Equation (21).

ϑ_{ant}^{k + 1} \leftarrow ϑ_{ant}^{k} - η {\tilde{m}}^{k + 1} ⊘ (\sqrt{{\tilde{v}}^{k + 1}} + ϵ)

(21)

where the square root is applied pointwise, ⊘ denotes the componentwise division, and

ϵ

is a small number used to prevent the denominator from vanishing. The AdaBound algorithm further develops the principle basis of Adam by considering gradient clipping, and thus avoiding the existence of extreme learning rates. The AdaBound learning rate and the parameter update are given by Equations (22) and (23), respectively.

{\tilde{η}}^{k + 1} \leftarrow C L I P ([η_{l}^{k}, η_{u}^{k}], \frac{η^{k}}{\sqrt{{\tilde{v}}^{k}} + ϵ})

(22)

ϑ_{ant}^{k + 1} \leftarrow ϑ_{ant}^{k} - {\tilde{η}}^{k + 1} ⊙ {\tilde{m}}^{k + 1}

(23)

where

η_{l}^{k}

and

η_{i}^{k}

denote lower and upper bounds of the learning rate, and they can be considered as scalars or varying functions of iterations. Equation (22) states that clipping occurs within the boundaries of the fold defined by the set

[η_{l}^{k}, η_{u}^{k}]

. Hence, the vanilla gradient descent, as well as Adam, can be considered as subsets of AdaBound, if the upper and lower bounds are appropriately selected.

Algorithm 1 summarises the hybrid training of the Takagi–Sugeno–Kang fuzzy model.

Algorithm 1 Hybrid learning of TSK with AdaBound

Require: time series data x of N observations, number of inputs m, number of fuzzy sets

A_{j} \forall j \in [m]

, training

D^{tr}

and test sets

D^{*}

, ∧ operator, functional form of

μ_{A_{j}}

, training budget maxIters, type of gradient descent optimiser, the

l_{2}

regulariser

λ

, AdaBound parameters

β

and

γ

,

ϵ = 10^{- 8}

, fuzzy rule

\forall l \in [r]

1: for all inputs

j \in [m]

do

2: / Generate Uniform Fuzzy Partitions

3: end for

4:

k \leftarrow 1

5: while

k \leq maxIters

do

6: procedure ForwardPhase

7: for all observations on

D^{tr}

do

8: for all inputs

j \in [m]

do

9: for all rules

l \in [r]

do

10: / Compute Memberships

11: / Compute Combined Memberships

12: / Compute Normalised Combined Memberships

13: end for

14: end for

15: end for

16: / Generate matrix

A

17: / Compute

ϑ_{cons}^{⋆}

by solving

{min}_{ϑ_{CONS} \in P_{2}} {| | A ϑ_{CONS} - {y | |}_{2}^{2} + λ | | ϑ_{CONS} {| |}_{2}^{2}}

18: / Compute the output ot the model

f_{tsk} (x)

19: / Compute the model’s approximation error

E_{D}

20: end procedure

21: procedure BackwardPhase(type)

22: for all observations in

D^{tr}

do

23: for all rules

l \in [r]

do

24: / Compute the gradients

\nabla_{c_{l j}} E_{D}

and

\nabla_{σ_{l j}} E_{D}

25: end for

26: end for

27: / Update

ϑ_{ant}

using AdaBound ▹ or an optimiser defined by type

28: / Update Fuzzy Partitions

29: end procedure

30:

k \leftarrow k + 1

31: end while

32: / Extract

ϑ^{⋆} = ϑ_{ant}^{⋆} \cup ϑ_{cons}^{⋆}

▹ the optimised model parameters

ϑ^{⋆} \in P

33: procedure TestingPhase

34: / Given

ϑ^{⋆}

compute the output over the test set

D^{*}

35: / Compute the model’s generalisation error

36: end procedure

2.4. Surrogate-Based Feature Selection

We develop a surrogate-based feature selection algorithm which leverages the model’s prediction accuracy as a measure for identifying optimal features. This algorithm is categorised as a wrapper-based method, which integrates the model directly in its evaluation process.

Typically, feature selection can be defined as the process of a subset

S \subset [d]

determination. To determine the optimal subset among all features, all the possible subsets should be considered. Given that the training and inference processes of the predictive model can be time-intensive, it is clear that considering wrapper-based approach for all possible combinations is not practical. Wrapper-based methods typically employ heuristic algorithms, e.g., greedy and/or evolutionary optimisation methods, rather than relying on brute-force schemes. However, it is important to note that evolutionary algorithm-based wrapper methods can be particularly demanding in terms of computational resources, since they are population-based; all candidate solutions must be evaluated in terms of their measured fitness, and this process is repeated for a number of generation evolution.

In the present study, we follow a Bayesian scheme, employing a Gaussian process as the surrogate for the objective function reflecting the generalisation performance of the fuzzy model, given a number of inputs. The task involves searching the space of possible lags, for the best ones that will generate the input space

X

and yield a minimum error on a set of unseen data. A concise description of the Bayesian optimisation framework is provided below, according to [32]. Thorough presentations of Bayesian optimisation can be found in [33,34].

Bayesian optimisation is a sequential optimisation, designed for black-box global optimisation problems that incur significant evaluation costs. It relies on surrogate-based modelling, which enables the sequential selection of promising solutions through an acquisition function. The surrogate model, gives rise to a prediction for all the points in

S

, and by minimising an acquisition function, promising locations to be explored are identified. A trade-off between exploration and exploitation emerges through the incorporation of surrogate model’s predictions and uncertainties into the acquisition function minimisation. Throughout the iterative process, the surrogate model is continuously updated with new solutions and their corresponding objective values, enhancing its accuracy and performance.

The procedure starts by randomly sampling points

x_{j}

in the search space, thus generating an initial set of n locations in the parametric space. The objective function is evaluated upon these points by simulating the predictive model n times. Thus, a set

(X, Y)

is generated, allowing the construction of a probabilistic Bayesian model in a supervised learning manner, with model fit occurring in the parametric space. The pseudocode of Bayesian optimisation is given in Algorithm 2.

Algorithm 2 Bayesian optimisation

Require: objective function

E

, surrogate model, i.e., Gaussian process (

GP)

, acquisition function

α

, search space

S

1:

X \leftarrow {x_{1}, x_{2}, \dots, x_{n}} \subset S

▹ randomly sample the search space

S

, to generate n points

2:

Y \leftarrow {E (x_{1}), E (x_{2}), \dots, E (x_{n})}

▹ evaluate the objective on the X points

3: model

\leftarrow GP (X, Y)

▹ fit the surrogate on the objective

E

4:

x^{⋆} \leftarrow {arg min}_{x \in X} α (x, model)

▹ minimise the acquisition function to get

x^{'} \in S

5:

y^{⋆} \leftarrow E (x^{⋆})

6: model

\leftarrow GP (X \cup {x^{'}}, Y \cup {y^{'}})

▹ refit the surrogate on the objective

E

The feature selection problem is treated as a constrained integer-based optimisation task. Thus, the solution can be represented to be a subset of

{1, 2, \dots, 20}

. We aim to identify the subset

S \subset [20]

, which indicates the time lags that will be used as model inputs. For example, if a maximun number of two features/inputs is considered, then the feature selection algorithm should explore the integer search-space, i.e.,

{1, \dots, 20} \times {1, \dots, 20}

, to find the unique combination of lags that will define the model’s feature space. For instance, if

S = {1, 2}

, the corresponding input space is

{x (t - 1), x (t - 2)}

, whereas if say three inputs are considered, such as

S = {1, 3, 5} \subset [20]

, then the generated input space becomes

{x (t - 1), x (t - 3), x (t - 5)}

. The algorithm follows a sequential scheme, where possible search spaces are considered. The search space of the developed feature selection algorithm is sequentially augmented from dimension two, up to dimension five, i.e., the maximum number of features considered. For all considered number of inputs, the algorithm identifies the best combination of lags, which yields the miminum generalisation approximation error of the fuzzy model.

In Algorithm 3, we are given a predictive model, the time series data, and the maximum number of inputs considered. It is quite straightforward to follow the algorithm’s principle; according to k, i.e., the number of inputs, the search spaces

S_{k}

are generated. The main functionality of the algorithm is performed in the PerformBayesOpt function. For an increasing number of model inputs, the search space

{1, \dots, 20}

, corresponding to the possible lags of an input, is generated. The function PerformBayesOpt performs the Bayesian optimisation scheme according to the model, the time series data, and the possible search space, given the number of inputs. The function returns the minimum generalisation error of the model

E_{j}^{⋆}

, which is obtained by running the model with the set of lags indicated by

{ind}_{j}^{⋆}

. The variable

{ind}_{j}^{⋆}

, can be thought as being an index variable of the search space

S_{J}

, e.g., say that

{ind}_{j}^{⋆} = {1, 3}

, which means that the first and the third lag are picked from the candidate set

S_{J}

as a result of the Bayesian optimisation procedure. This means that the optimisation scheme, if considering two inputs, determines that the most effective model used

x (t - 1)

and

x (t - 3)

as inputs. Consequently, the best lag combination is given by the variable

{ind}^{⋆}

.

Algorithm 3 Surrogate-based feature selection for optimal lags

Require: predictive model (predModel), time series data x, maxNumFeatures, objective function

E

1:

k \leftarrow 2

and

j \leftarrow 1

2: while

k \leq maxNumFeatures

do

3:

S_{j} \leftarrow {[20]}^{k} : {1, \dots, 20} \times \dots \times {1, \dots, 20}

4:

{{ind}_{j}^{⋆}, E_{j}^{⋆}} \leftarrow P E R F O R M B A Y E S O P T (predModel, x, S_{j})

5:

k \leftarrow k + 1

and

j \leftarrow j + 1

6: end while

7:

{ind}^{⋆} \leftarrow A R G M I N (E_{j}^{⋆})

Before we continue and discuss the next topics of the methodology, it is important to highlight two key details regarding Bayesian optimisation. Since Gaussian processes are distributions over functions, they express a rich variety of functions. However, the form of functions generated by a GP depends solely on the kernel function, from which the GP’s covariance matrix emerges. In this study, the Matérn 5/2 kernel is used to generate surrogates of the objective, without being unrealistically smooth. Furthermore, it is proved that these kernels perform better [35]. The other key detail, associated with the way Bayesian optimisation is performed, corresponds to the acquisition function. In this study, the expected improvement acquisition function is incorporated, which is usually favoured for minimisation problems [33].

2.5. Wavelet Multiresolution Analysis

The core principle of the predictive methodology is based on a multiresolution analysis scheme, where the original time series

x (t)

is decomposed into a set of

J_{0} + 1

components, each of which is individually approximated by a TSK predictive model. The multiresolution analysis is generated by wavelets, which are mathematical objects described by a quite beautiful and rigorous theory. Based on the idea of averaging over different scales, wavelet decomposition offers a glimpse of the time series’s behaviour over various resolutions. The wavelet transform of a function (signal/time series) is the decomposition of the function into a set of basis functions, consisting of contractions, expansions, and translations of a mother wavelet. For a chosen wavelet

ψ (t)

and a scale a, the collection of variables

{W (a, t) : a > 0, t \in R}

, generated by Equation (24), defines the continuous wavelet transform (CWT) of

x (t)

, i.e., the time series.

W (a, t) = \int_{- \infty}^{\infty} x (u) ψ_{a, t} (u) d u, where ψ_{a, t} (u) \equiv \frac{1}{\sqrt{a}} ψ (\frac{u - t}{a})

(24)

Multiresolution analysis is the heart of the wavelets theory, and we therefore discuss its basic notions based on [36]. Multiresolution analysis offers a framework to express an arbitrary function

x (t)

on

{\tilde{V}}^{j}

and

{\tilde{W}}^{j}

, called the approximation and detail spaces, respectively. Each of these subspaces encode information about the function at different scales.

The multiresolution analysis is defined as a sequence of nested closed subspaces

{\tilde{V}}^{j} \in L^{2} (R)

, with

j \in Z

, as summarised in Equation (25).

\dots \subset {\tilde{V}}^{3} \subset {\tilde{V}}^{2} \subset {\tilde{V}}^{1} \subset {\tilde{V}}^{0} \subset {\tilde{V}}^{- 1} \subset \dots

(25)

Consider that a scaling function

ϕ \in L^{2} (R)

exists such that

{ϕ_{j, k} : j, k \in Z}

forms an orthonormal basis for

{\tilde{V}}^{j}

, where

ϕ_{j, k} (t) = \frac{1}{2^{j}} ϕ (\frac{t}{2^{j}} - k)

. Each approximation space

{\tilde{V}}^{j}

is associated with scale

2^{j}

. The projections of

x (t)

onto the

{\tilde{V}}^{j}

will give successive approximations, as described by Equation (26)

s_{j} (t) = \sum_{j, k \in Z} δ_{j, k} ϕ_{j, k} (t)

(26)

where

δ_{j, k} \equiv 〈 x (t), ϕ_{j, k} (t) 〉

denote the scaling coefficients of scale

2^{j}

.

In the context of multiresolution analysis, it holds that

{\tilde{V}}^{j} = {\tilde{V}}^{j + 1} \oplus {\tilde{W}}^{j + 1}

, for

{\tilde{V}}^{j} \subset {\tilde{V}}^{j - 1}

and

{\tilde{W}}^{j} \subset {\tilde{V}}^{j - 1}

, where

{\tilde{W}}^{j}

is the orthogonal complement of

{\tilde{V}}^{j}

in

{\tilde{V}}^{j - 1}

. The subspace

{\tilde{W}}^{j}

is the detail space and it is associated with scale

2^{j - 1}

. Any element of

{\tilde{V}}^{j}

can be expressed as the sum of two orthogonal elements; one from

{\tilde{V}}^{j + 1}

and the other from

{\tilde{W}}^{j + 1}

, hence, if

x (t) \in {\tilde{V}}^{0}

, one can write the additive decomposition, given by Equation (27).

x (t) = s_{1} (t) + d_{1} (t)

(27)

where

d_{1} (t) \in {\tilde{W}}^{1}

expresses the detail in

x (t)

that is missing from the coarse approximation

s_{1} (t) \in {\tilde{V}}^{1}

. The details

d_{j} (t)

are given by Equation (28).

d_{j} (t) = \sum_{j, k \in Z} w_{j, k} ψ_{j, k} (t)

(28)

where

w_{j, k} \equiv 〈 s_{j - 1} (t), ψ_{j, k} (t) 〉

denote the wavelets coefficients of scale

2^{j - 1}

and

ψ_{j, k} (t)

denotes the wavelet function. At this point it should be mentioned that the difference between

x (t)

and its approximation

s_{j} (t)

, can be expressed in terms of projections onto the detail subspaces

{\tilde{W}}^{j}, {\tilde{W}}^{j - 1}, \dots {\tilde{W}}^{1}

.

Generally speaking, since

{\tilde{V}}^{j_{0}} = {\tilde{V}}^{j} \oplus {\tilde{W}}^{j} \dots \oplus {\tilde{W}}^{j_{0} + 1}

, it holds that if

x (t) \in {\tilde{V}}^{j_{0}}

for some

j_{0} \in Z

, and a level of decomposition

j > j_{0}

, then Equation (29) illustrates the additive decomposition scheme, generated by the wavelet multiresolution analysis.

x (t) = \sum_{k \in Z} δ_{j, k} ϕ_{j, k} (t) + \sum_{l = j_{0} + 1}^{j} \sum_{k \in Z} w_{l, k} ψ_{l, k} (t)

(29)

where the first summation term corresponds to the coarse approximation of the signal, while the second to the projections onto the detail subspaces.

In this study, the maximal overlap discrete wavelet transform (MODWT) is considered to generate the multiresolution analysis. This transform is a linear filtering technique that transforms the temporal sequence

x (t)

into coefficients related to variations over a set of scales. The maximal overlap discrete wavelet transform can be thought as a subsampling of the continuous wavelet transform at dyadic scales

2^{j - 1}

with

j = 1, 2, 3, \dots

for all times in which the time series is defined [36].

The MODWT of level

J_{0}

of a time series, yields the vectors

W^{j}

, with

j = 1, 2, \dots, J_{0}

, containing the wavelet coefficients associated with changes of the time series over the scale

2^{j - 1}

. In addition, MODWT includes the vector

V^{j}

, containing the scaling coefficients associated with variations at scale

2^{J_{0}}

.

We consider the decomposition level to be

J_{0} = 8

, which lies on the fact that the resulting smooth MRA’s component, i.e., the coarsest level of approximation, accurately represents the trend of the time series. Furthermore, by choosing

J_{0} = 8

, the necessary condition

J_{0} \leq {log}_{2} (N)

, reflecting the maximum level of decomposition, is satisfied since the number of time series observations is

N = 9646

. The rationale behind our decision for considering the maximal overlap discrete wavelet tranform is that the latter is defined for all sample sizes, unlike the DWT which requires sample sizes to be an integer multiple of the decomposition level

J_{0}

. Furthermore, the MODWT’s multiresolution analysis components are time-aligned, thus facilitating meaningful interpretations of the time series behaviour.

2.6. Projection Step

At this point, predictions of each component of the wavelet-generated multiresolution analysis are available, since we place a TSK prediction model upon each component. To further enhance the approximation capacity of the fuzzy predictive model, we introduce a projection step.

Let

U = span {u_{1} (x), u_{2} (x), \dots, u_{J_{0} + 1} (x)}

be a finite dimensional space, generated by the columns of the matrix

K = {(u_{d} (x))}_{d \in [J_{0} + 1]} \in R^{N \times J_{0} + 1}

. The matrix

K

is constructed by considering the

J_{0} + 1

sub-predictions of the TSK fuzzy model, i.e.,

u_{d} (x) \equiv f_{tsk}^{d} (x)

, associated with the

d^{th}

component of the wavelet multiresolution analysis. A projection

π_{U}

onto the subspace

U

, can be written as the linear combination of its basis vectors, hence

π_{U} (x) = \sum_{d \in [J_{0} + 1]} w_{d} u_{d} (x) = K w

, with

w \in R^{J_{0} + 1}

.

Given

π_{U} (x) \in U

, we wish to find

λ \in R^{J_{0} + 1}

such that

y - π_{U} ⊥ C (K)

, where

C (K)

denotes the column space of

K

, and y denotes the ground truth of the prediction problem. We therefore need the following:

K^{T} (y - π_{U}) = 0 \Rightarrow K^{T} y = K^{T} K w

. Thus, the parameter vector w solves the latter normal equation, and it provides the best approximation to the subspace generated by the sub-predictions of each fuzzy model.

2.7. Complete Methodology

We can now outline the final predictive methodology, based on the sub-components that have been described earlier. The original time series data are gathered and missing values are replaced through interpolation. The data are subsequently rescaled into

[0, 1]

.

A multiresolution analysis based on the maximal overlap discrete wavelet transform is generated, thus decomposing the data into

J_{0} + 1

number of partial sub-components. Training

D_{d}^{tr}

and test sets

D_{d}^{*}

, with

d \in [J_{0} + 1]

, are created for each of the sub-series sequences. This generation is based on the best inputs, as determined by the Bayesian wrapper-based feature selection algorithm.

For the

d^{th}

multiresolution analysis component, a TSK fuzzy prediction model is trained in a supervised framework, using the corresponding set

D_{d}^{tr}

. The training of each, among the

J_{0} + 1

fuzzy models, incorporates the hybrid scheme; combining least squares with AdaBound, for consequent and antecedent tuning, respectively. The trained fuzzy models generate the predictions

f_{tsk}^{d} (x)

, which are aggregated in a weighted manner to yield the prediction of the final model. The aggregation results from the projection step, which provides the optimal approximation of the unknown function, i.e., the ground truth of the forecasting problem.

To assess the performance of the proposed methodology in terms of accurate spectral wave height prediction, regression metrics are utilised, such as root mean squared error (

E_{RMSE}

) and mean absolute percentage error (

E_{MAPE}

). Furthermore, comparative plots are included to enhance understanding and clarity.

3. Numerical Studies

In this section, we present the results from the numerical studies. We discuss the reasons why these particular data captured our interest. Moreover, technical details pertaining to the time series, are provided. To enhance clarity and facilitate better comprehension for the readers, visual representations of the results, such as temporal evolutions of predictions, error histograms, and regression plots are demonstrated. Comparison results with other models are also presented.

We focus on one-step-ahead predictions,

h = 1

, hence based on past observations, the models generate a prediction of the next significant wave height realisation. The numerical experiments are implemented on a MacBook Air, with an M3 Chip, 8 GB of RAM, and 256 GB SSD, using Matlab 2024a and macOS Sequoia Version 15.2.

3.1. Data Analysis

The data used in the current paper were provided by the Poseidon (https://poseidon.hcmr.gr/components/observing-components/buoys (accessed on 2 August 2025)) operational oceanography system at the Institute of Oceanography, part of the Hellenic Center for Marine Research, which operates a network of fixed measuring floats deployed at various locations in the Aegean and Ionian Seas. The observations correspond to temporal observations of spectral significant wave height (Hm0), measured in meters, and they are generated by the multidisciplinary Pylos observatory mooring at a latitude of 36.8288 and a longitude of 21.6068 (https://poseidon.hcmr.gr/services/ocean-data/situ-data (accessed on 2 August 2025)). The latter is located in the southeastern Ionian Sea; the crossroads of the Adriatic and Eastern Mediterranean basins. The fixed station includes a surface Wavescan buoy equipped with sensors that monitor meteorological conditions, wave characteristics, and surface oceanographic parameters. The latter type of platforms are designed for deep basins and can incorporate an inductive coupling mooring cable so that the instruments can support the data transfer through these cables in order to be available in real time even at depths greater than 1000 m.

Our interest to study data from the specific location arises due to several reasons. Located in close proximity to Oinousses Deep in the Hellenic Trench, the deepest point in the entire Mediterranean Sea, and about 60 km west from the closest shoreline in the Peloponnese (Greece) [37], it is where intermediate and deep water masses from the Adriatic and Aegean seas converge. The region is geologically active, frequently experiencing earthquakes and landslides, and it poses a potential tsunami risk that could affect the Eastern Mediterranean Sea.

The original format of the gathered data is NetCDF; designed to support the creation, access, and sharing of scientific data. A sampling period of three hours is observed, i.e., every three hours a single wave height observation is obtained, thus,

Δ t = 1 / 8

of the day.

Figure 2 illustrates the temporal evolution of the significant wave height observations. The histogram of the data, along with a fitted kernel distribution are depicted in Figure 3. As illustrated, the wave height data follow a right-skewed distribution, a fact that is verified by the mean being greater that the median, as well as the positive skewness. The statistical characteristics of the significant wave height data are presented in Table 1.

The multiresolution analysis of the time series is generated via the maximal overlap discrete wavelet transform. Since we consider the maximum level of decomposition to be

J_{0} = 8

, there are in total nine components, each pertaining to the approximation of the time series onto subspaces of different scales. Figure 4, Figure 5 and Figure 6 illustrate the nine components, which are generated via the MODWT-based MRA.

On the top plot of Figure 4 the smooth component is illustrated. This component reflects averages over physical scales of

2^{8} Δ t = 32

days. The components depicted in the middle and bottom rows of Figure 4 and the rows of Figure 5 and Figure 6 are approximations from projections onto the detail spaces

{\tilde{W}}^{j}

, each associated with variations over physical scales

2^{j - 1} Δ t

, where

j \in [8]

. Hence, from the middle row of Figure 4 to the bottom row of Figure 6, the associated physical scales are

16 days, 8 days, 4 days, 2 days, 1 day, 12 h, 6 h

, and

3 h

, respectively.

3.2. Performance Metrics

The model performance is reflected on the basis of three widely used regression metrics, RMSE and MAPE, described by Equations (30) and (31), respectively.

E_{RMSE} = \sqrt{[\frac{1}{N} \sum_{i \in [N]} {(y^{i} - y_{model}^{i})}^{2}]}

(30)

E_{MAPE} = \frac{1}{N} \sum_{i \in [N]} |\frac{y^{i} - y_{model}^{i}}{y^{i}}|

(31)

where

y^{i}

is the target of the dataset

D

, i.e., the ground truth, N is the number of observations, and

y_{model}^{i}

represents the output of a predictive model. Models with lower

E_{RMSE}

reflect higher prediction performance. The

E_{MAPE}

metric is used because it is independent of scale. Additionally, a percentage-based performance forecast provides readers with better insights into the models. The coefficient of determination

r^{2}

, given by Equation (32), is also considered.

r^{2} = 1 - \frac{\sum_{i \in [N]} (y^{i} - y_{model}^{i})}{\sum_{i \in [N]} (y^{i} - \bar{y})}

(32)

where

\bar{y}

denotes the mean value of the ground truth.

3.3. Practical Considerations

In this section, we provide all the details pertaining to the simulations considered. The fuzzy Takagi–Sugeno–Kang model with linear consequents is adopted as the predictive model. Each of such model is placed over a corresponding component, generated by the multiresolution analysis, to play the role of the predictive model. The surrogate-based wrapper algorithm selects two inputs

m = 2

, and in particular,

x (t - 1)

and

x (t - 2)

, which leads to the best generalisation performance of the overall model. The fuzzy connectives are implemented by product t-norms. Gaussian membership functions are chosen to generate fuzzy partitions on each input domain by including two membership functions

n_{A} = 2

. Since, we only consider grid partitioning, the fuzzy model is associated with four fuzzy rules, i.e.,

r = 4

. By choosing Gaussian membership function, the number of tunable parameters for each fuzzy set is minimised, and at the same time their gradients are effectively computed. The regulariser parameter

λ

is set to

1 \times 10^{- 5}

.

Based on these inputs, training and testing sets for each of the

d \in [J_{0}]

MRA components are generated. The training set

D_{d}^{tr}

corresponds to the

70%

of the observations, while the remaining

30%

reflects the testing set

D_{d}^{*}

. At this point, it should be mentioned that since time series data are considered, the temporal continuity is preserved, thus, the input space is generated by forming the appropriate lags

x (t - 1)

and

x (t - 2)

and moving time t forward.

The level of decomposition is

J_{0} = 8

, which lies on the fact that the resulting smooth MRA’s component accurately represents the trend of the time series. Furthermore, by choosing

J_{0} = 8

, the necessary condition

J_{0} \leq {log}_{2} (N)

is satisfied, since the number of observations is

N = 9646

. The multiresolution analysis is generated using MODWT, associated with a Symlet wavelet.

As far as the antecedents optimisation is concerned, i.e., the AdaBound gradient descent algorithm, the following hold: the parameters

β

and

γ

are

0.9

and

0.999

, respectively, while the initial learning rate

η

is

0.01

. The maximum number of the hybrid learning scheme iterations, is 100. Lower and upper bound functions, i.e.,

η_{l}^{k}

and

η_{u}^{l}

, respectively, are the same as in [30].

The surrogate-based feature selection algorithm emerges from the Bayesian optimisation framework. The kernel of the Gaussian process is the Matérn 5/2. The acquisition function is the expected improvement, which is usually proposed for minimisation problems. The maximum number of Bayesian optimisation iterations is 30. Furthermore, the Bayesian optimisation is implemented as an integer-based optimisation problem since the search space is on the integers. In addition, a proper constraint is incorporated to ensure that the feasible set of solutions, generated by the optimisation scheme, is a set of vectors with unique elements. For example, if three inputs were considered then the vector

{1, 2, 5}

, associated with the possible input lags, belongs to the feasible set whilst

{1, 2, 2}

does not.

3.4. Results

We conduct a comparative analysis using various models to evaluate the effectiveness of the proposed methodology. Among the models considered is ANFIS, which is a Takagi–Sugeno–Kang model trained with a vanilla gradient descent algorithm. This model is implemented in a direct prediction scheme, which does not involve time series decomposition. This choice is intended to illustrate the benefits of multiresolution analysis in our methodology. In addition, we consider a vanilla TSK model as a potential alternative to our predictive model within the context of the proposed multiresolution analysis. This model employs uniform fuzzy partitions along each input dimension, computing only the consequent parameters. Furthermore, we explore our novel predictive model—the novel TSK with hybrid training—using an alternative decomposition scheme generated by the method of variational mode decomposition [38]. To ensure a fair comparison between the wavelet-based MRA and the one generated via variational mode decomposition, we consider nine intrinsic mode functions in the latter, so that there are in total nine components pertaining to the decomposed time series. Finally, the simulations also involve the refinement of the approximation, due to the projection step.

The quantitative results corresponding to the predictive accuracy of the simulated models are presented in Table 2, Table 3 and Table 4. The symbol ^⋆ is used to distinguish the methods that do not incorporate multiresolution analysis. The method, which yields the best results is highlighted with light grey colour.

Figure 7, Figure 8 and Figure 9 illustrate the regression plots of all comparative models over the testing set

D^{*}

. The test error distributions are demonstrated in Figure 10.

Figure 10 illustrates that the proposed model, which combines wavelet-based multiresolution analysis with an integrated projection step, achieves the best performance by yielding the lowest test approximation error. The next best results are generated by the wavelet MRA-based Takagi–Sugeno–Kang model, which does not include a projection step. Following closely, the predictive model that employs a variational mode decomposition scheme ranks as the third best, when compared to the wavelet approach. As expected, the adaptive network-based fuzzy inference system generates the highest approximation error on unseen data due to the lack of multiresolution analysis in the prediction scheme. Finally, we observe that the vanilla TSK fuzzy model, despite being incorporated within an MRA framework, displays a bias in the error, a phenomenon that is also observable in the right plot of Figure 7.

The temporal evolution of the models’ prediction over the test data

D^{*}

is presented in Figure 11, Figure 12 and Figure 13.

At this point we proceed with the conduction of a hypothesis test. In a hypothesis test context, we evaluate the potential rejection of a null hypothesis. When the null hypothesis is true, any observed difference between two predictions is not statistically significant. Therefore, null hypothesis rejection provides evidence that the differences in prediction accuracy are significant and that this distinction is not a result of randomness.

The Diebold–Mariano test is considered to evaluate whether there is an actual statistical significance among the models’ generated predictions. Since we only focus on one-step-ahead predictions, the Diebold–Mariano statistic is computed by Equation (33).

D M = \frac{\bar{d}}{\sqrt{\frac{2 π {\hat{f}}_{d} (0)}{N}}}

(33)

where

\bar{d}

denotes the mean value of the difference between the squared errors of the two comparing predictions. Moreover,

{\hat{f}}_{d} (0) = \frac{1}{2 π} {\hat{ω}}_{d} (0)

, where

{\hat{ω}}_{d} (0) = \frac{1}{N} \sum_{i \in [N]} {(d^{i} - \bar{d})}^{2}

. The

D M

statistic, under the null hypothesis, is asymptotically normally distributed, thus, whenever

D M \notin [- 1.96, 1.96]

, the null hypothesis is rejected. That is to say that at the

5%

significance level, the errors’ difference is not zero mean.

Table 5 and Table 6 illustrate the results obtained from the Diebold–Mariano statistical test. This analysis focuses on the following two scenarios: the wavelet MRA-based Takagi–Sugeno–Kang model without the projection step, and the same model incorporating the projection step. The predictions generated by both of these models successfully reject the null hypothesis when compared to the predictions of all the other models considered in this analysis.

4. Discussion

The numerical analysis results provide strong indications that the incorporation of multiresolution analysis is quite essential for achieving accurate predictions, particularly when dealing with complex time series. This study focuses on developing a comprehensive methodology for precise wave height prediction. By integrating wavelet multiresolution analysis into the predictive framework, exceptional performance is achieved in both training and, most importantly, testing datasets. The latter is a key measure of generalisation capability of the final system. As illustrated via numerical analysis, the most effective predictive scheme is a novel hybrid-trained Takagi–Sugeno–Kang model within a wavelet multiresolution analysis framework, which also incorporates a projection step. The MODWT-based MRA is used to decompose the spectral wave height time series into components of various scales. We introduce a novel training scheme by considering hybrid training of the Takagi–Sugeno–Kang model, i.e., combining least squares for the optimal computation of the consequent parameters and AdaBound for tuning the antecedent parameters.

By applying this fuzzy predictive model to the components generated by the multiresolution analysis, we optimise the approximation performance. The wavelet-based multiresolution analysis Takagi–Sugeno–Kang model with the projection step demonstrates superior performance compared to all the other models considered. By visual inspection on the Figures depicting the predictions temporal evolution, it is clear that the latter scheme displays extremely high approximation performance on the unseen data. This is also supported by the regression plots, as well as the plots which illustrate the distributions of the models’ error.

As illustrated in Figure 2, wave heights considerably exceeding the mean value of 0.92 m frequently occur, reaching heights of up to seven meters. This observation indicates that this location experiences natural events capable of generating extreme waves that deviate dramatically from the mean. Therefore, it can be concluded that extreme waves are present in this region. Moreover, Figure 13 showcases the test area where the proposed methodology excels in approximating extreme wave height values, exceeding six meters. Consequently, the test dataset reveals a higher frequency of extreme wave heights compared to the training dataset, and the method’s performance is a strong indication that when data are available to train the models, then high generalisation is expected. This evidence demonstrates that our approach effectively captures the extraordinary waves that stand out from typical patterns.

Not only do the predictions of the proposed scheme outperform others in terms of regression metrics, but this difference is also statistically significant. By employing the Diebold–Mariano statistical test and reporting extremely low p-values, we provide strong evidence that the proposed model’s superior performance is statistically significant, a fact which yields from rejecting the null hypothesis compared to all the other models. Regarding performance metrics, it is noteworthy that the wavelet-based MRA Takagi–Sugeno–Kang model with the projection step exhibits a

36%

improvement over the same model without this step, and a

49%

increase compared to the model utilising a variational mode decomposition scheme with the projection step. The latter improvements are measured in terms of root mean squared error on the test set.

Limitations of the methodology need to be discussed. Fuzzy models face the curse of dimensionality, which restricts their effectiveness in relatively low-dimensional feature spaces. The latter manifests through the rule explosion, thus, caution should be taken when considering grid partitioning rule generation. When considering the interpretability of fuzzy models, particularly with regard to semantic integrity, it is important to pay attention to the optimisation of fuzzy partition of each input variable. However, it is worth noting that time series problems are rarely associated with high-dimensional feature spaces. Thus, we believe that the proposed methodology can be effectively applied to most time series prediction problems. Several factors can influence the final methodology’s prediction performance, including the cardinality of fuzzy partitions on each input dimension, the features selected, as well as the level of decomposition generated by the multiresolution analysis. These issues can easily be resolved in an application-wise manner, hence, there is no need to include such factors in an optimisation scheme—adding computational complexity. With respect to the feature selection, the surrogate-based Bayesian algorithm effectively determines the optimal number of time lags, which lead to the best performance of the predictive method in terms of generalisation. Consequently, the integration of the novel developed surrogate algorithm in the overall scheme resolves the feature selection issue.

5. Conclusions

In this paper, the problem of accurate significant wave height prediction is studied using a complete methodology. The proposed method utilises a novel Takagi–Sugeno–Kang model within a multiresolution analysis framework generated by wavelets. The TSK model serves as the base predictive model, integrated into a MODWT-based MRA decomposition framework, and it generates predictions for each component of the decomposed data.

A novel surrogate-based wrapper algorithm is developed to determine the significant lags, which generate the feature space of the predictive models. This algorithm utilises Bayesian optimisation, thus fitting a Gaussian process to the objective function, which measures the generalisation performance of the predictive scheme. This algorithm allows us to accurately determine the feature space, in a time-efficient and scalable manner. The training of the fuzzy Takagi–Sugeno–Kang model is performed using a hybrid technique, using least squares for consequent computation and AdaBound for tuning of the parameters associated with the fuzzy partitions on each input. The final prediction of the system is generated by introducing a projection step, hence providing an optimal aggregated prediction of the significant wave height.

The accurate and time-efficient nature of the overall method, together with the reported numerical results, suggest that the proposed scheme provides a promising candidate to artificial intelligence-based time series forecast applications.

The directions we wish to explore in future research involve the evaluation of uncertainty in predictions by exploring schemes which generate interval predictions. Furthermore, we wish to apply the novel developed methodology onto various complex time series, function approximation problems, as well as biomedical engineering applications. Lastly, a comprehensive comparison of the proposed methodology against deep/ensemble learning methods is to be investigated.

Author Contributions

Conceptualisation, P.K. and A.D.; methodology, P.K. and A.D.; software, P.K.; validation, P.K. and A.D.; formal analysis, P.K.; investigation, P.K. and A.D.; resources, P.K.; writing—original draft preparation, P.K.; writing—review and editing, P.K. and A.D.; visualisation, P.K.; supervision, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data in this study were kindly provided, upon the authors’ request, by the Poseidon operational oceanography system of the Institute of Oceanography, part of the Hellenic Centre for Marine Research. The authors do not have permission to share the data.

Acknowledgments

The authors would like to express their gratitude to the Hellenic Center of Marine Research and especially the Poseidon operational oceanography system of the Institute of Oceanography for the kind provision of the data used in the current study. Furthermore, the authors would like to express their sincere gratitude to the editor and the reviewers for their valuable comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ANFIS	Adaptive network-based fuzzy inference system
SVM	Support vector machine
PCA	Principal component analysis
PSO	Particle swarm optimisation
SWAN	Simulating waves nearshore
WEC	Wave energy converters
CNN	Convolutional neural network
RNN	Recurrent neural network
LSTM	Long short-term memory
TSK	Takagi–Sugeno–Kang
EMD	Empirical mode decomposition
VMD	Variational mode decomposition
SVD	Singular value decomposition
MRA	Multiresolution analysis
BiLSTM	Bidirectional LSTM
MI	Mutual information
CRO-SL	Coral reef optimisation algorithm with substrate layers
FFT	Fast Fourier transform
SARIMA	Seasonal autoregressive integrated moving average
GP	Gaussian process
DWT	Discrete wavelet transform
MODWT	Maximal overlap DWT
CWT	Continuous wavelet transform
MAPE	Mean absolute percentage error
RMSE	Root mean squared error

References

Mahdavi-Meymand, A.; Sulisz, W. Application of nested artificial neural network for the prediction of significant wave height. Renew. Energy 2023, 209, 157–168. [Google Scholar] [CrossRef]
Batsis, G.; Partsinevelos, P.; Stavrakakis, G. A Deep Learning and GIS Approach for the Optimal Positioning of Wave Energy Converters. Energies 2021, 14, 6773. [Google Scholar] [CrossRef]
Zodiatis, G.; Galanis, G.; Nikolaidis, A.; Kalogeri, C.; Hayes, D.; Georgiou, G.C.; Chu, P.C.; Kallos, G. Wave energy potential in the Eastern Mediterranean Levantine Basin. An integrated 10-year study. Renew. Energy 2014, 69, 311–323. [Google Scholar] [CrossRef]
Luo, Q.; Xu, H. Analysis of intrinsic factors in accurate wave height prediction based on model interpretability. Ocean Eng. 2024, 309, 118493. [Google Scholar] [CrossRef]
Group, T.W. The WAM Model—A Third Generation Ocean Wave Prediction Model. J. Phys. Oceanogr. 1988, 18, 1775–1810. [Google Scholar] [CrossRef]
Booij, N.; Ris, R.C.; Holthuijsen, L.H. A third-generation wave model for coastal regions: 1. Model description and validation. J. Geophys. Res. Ocean. 1999, 104, 7649–7666. [Google Scholar] [CrossRef]
Ikram, R.M.A.; Cao, X.; Parmar, K.S.; Kisi, O.; Shahid, S.; Zounemat-Kermani, M. Modeling Significant Wave Heights for Multiple Time Horizons Using Metaheuristic Regression Methods. Mathematics 2023, 11, 3141. [Google Scholar] [CrossRef]
Yang, C.; Kong, Q.; Su, Z.; Chen, H.; Johanning, L. A hybrid model based on chaos particle swarm optimization for significant wave height prediction. Ocean Model. 2025, 195, 102511. [Google Scholar] [CrossRef]
Zanganeh, M. Improvement of the ANFIS-based wave predictor models by the Particle Swarm Optimization. J. Ocean Eng. Sci. 2020, 5, 84–99. [Google Scholar] [CrossRef]
Malekmohamadi, I.; Bazargan-Lari, M.R.; Kerachian, R.; Nikoo, M.R.; Fallahnia, M. Evaluating the efficacy of SVMs, BNs, ANNs and ANFIS in wave height prediction. Ocean Eng. 2011, 38, 487–497. [Google Scholar] [CrossRef]
Gracia, S.; Olivito, J.; Resano, J.; del Brio, B.M.; de Alfonso, M.; Álvarez, E. Improving accuracy on wave height estimation through machine learning techniques. Ocean Eng. 2021, 236, 108699. [Google Scholar] [CrossRef]
Abbas, M.; Min, Z.; Liu, Z.; Zhang, D. Unravelling oceanic wave patterns: A comparative study of machine learning approaches for predicting significant wave height. Appl. Ocean Res. 2024, 145, 103919. [Google Scholar] [CrossRef]
Çelik, A. Improving prediction performance of significant wave height via hybrid SVD-Fuzzy model. Ocean Eng. 2022, 266, 113173. [Google Scholar] [CrossRef]
Peláez-Rodríguez, C.; Pérez-Aracil, J.; Gómez-Orellana, A.; Guijo-Rubio, D.; Vargas, V.; Gutiérrez, P.; Hervás-Martínez, C.; Salcedo-Sanz, S. Fuzzy-based ensemble methodology for accurate long-term prediction and interpretation of extreme significant wave height events. Appl. Ocean Res. 2024, 153, 104273. [Google Scholar] [CrossRef]
Duarte, F.S.; Rios, R.A.; Hruschka, E.R.; de Mello, R.F. Decomposing time series into deterministic and stochastic influences: A survey. Digit. Signal Process. 2019, 95, 102582. [Google Scholar] [CrossRef]
Zhou, J.; Zhou, L.; Zhao, Y.; Wu, K. Significant wave height prediction based on improved fuzzy C-means clustering and bivariate kernel density estimation. Renew. Energy 2025, 245, 122787. [Google Scholar] [CrossRef]
Zhang, S.; Zhao, Z.; Wu, J.; Jin, Y.; Jeng, D.S.; Li, S.; Li, G.; Ding, D. Solving the temporal lags in local significant wave height prediction with a new VMD-LSTM model. Ocean Eng. 2024, 313, 119385. [Google Scholar] [CrossRef]
Patanè, L.; Iuppa, C.; Faraci, C.; Xibilia, M.G. A deep hybrid network for significant wave height estimation. Ocean Model. 2024, 189, 102363. [Google Scholar] [CrossRef]
Zhang, J.; Luo, F.; Quan, X.; Wang, Y.; Shi, J.; Shen, C.; Zhang, C. Improving wave height prediction accuracy with deep learning. Ocean Model. 2024, 188, 102312. [Google Scholar] [CrossRef]
Sun, Y.; Yu, L.; Zhu, D. A Hybrid Deep Learning Model Based on FFT-STL Decomposition for Ocean Wave Height Prediction. Appl. Sci. 2025, 15, 5517. [Google Scholar] [CrossRef]
Fang, T.; Li, X.; Shi, C.; Zhang, X.; Xiao, W.; Kou, Y.; Mumtaz, I.; ao Huang, Z. Memo-UNet: Leveraging historical information for enhanced wave height prediction. Neurocomputing 2025, 634, 129840. [Google Scholar] [CrossRef]
Mücke, S.; Heese, R.; Müller, S.; Wolter, M.; Piatkowski, N. Feature selection on quantum computers. Quantum Mach. Intell. 2023, 5, 11. [Google Scholar] [CrossRef]
Cornejo-Bueno, L.; Garrido-Merchán, E.; Hernández-Lobato, D.; Salcedo-Sanz, S. Bayesian optimization of a hybrid system for robust ocean wave features prediction. Neurocomputing 2018, 275, 818–828. [Google Scholar] [CrossRef]
Hashim, R.; Roy, C.; Motamedi, S.; Shamshirband, S.; Petković, D. Selection of climatic parameters affecting wave height prediction using an enhanced Takagi-Sugeno-based fuzzy methodology. Renew. Sustain. Energy Rev. 2016, 60, 246–257. [Google Scholar] [CrossRef]
Wu, D.; Lin, C.T.; Huang, J.; Zeng, Z. On the Functional Equivalence of TSK Fuzzy Systems to Neural Networks, Mixture of Experts, CART, and Stacking Ensemble Regression. IEEE Trans. Fuzzy Syst. 2020, 28, 2570–2580. [Google Scholar] [CrossRef]
Ying, H. Sufficient conditions on uniform approximation of multivariate functions by general Takagi-Sugeno fuzzy systems with linear rule consequent. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 1998, 28, 515–520. [Google Scholar] [CrossRef]
Zeng, K.; Zhang, N.Y.; Xu, W.L. A comparative study on sufficient conditions for Takagi-Sugeno fuzzy systems as universal approximators. IEEE Trans. Fuzzy Syst. 2000, 8, 773–780. [Google Scholar] [CrossRef]
Ying, H.; Ding, Y.; Li, S.; Shao, S. Comparison of necessary conditions for typical Takagi-Sugeno and Mamdani fuzzy systems as universal approximators. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 1999, 29, 508–514. [Google Scholar] [CrossRef]
Jang, J.S. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685. [Google Scholar] [CrossRef]
Luo, L.; Xiong, Y.; Liu, Y.; Sun, X. Adaptive Gradient Methods with Dynamic Bound of Learning Rate. arXiv 2019, arXiv:1902.09843. [Google Scholar] [CrossRef]
Wu, D.; Yuan, Y.; Huang, J.; Tan, Y. Optimize TSK Fuzzy Systems for Regression Problems: Minibatch Gradient Descent With Regularization, DropRule, and AdaBound (MBGD-RDA). IEEE Trans. Fuzzy Syst. 2020, 28, 1003–1015. [Google Scholar] [CrossRef]
Wang, H.; Yang, K. Bayesian Optimization. In Many-Criteria Optimization and Decision Analysis: State-of-the-Art, Present Challenges, and Future Perspectives; Springer International Publishing: Cham, Switzerland, 2023; pp. 271–297. [Google Scholar] [CrossRef]
Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv 2012, arXiv:1206.2944. [Google Scholar] [CrossRef]
Garnett, R. Bayesian Optimization; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
Xu, Z.; Wang, H.; Phillips, J.M.; Zhe, S. Standard Gaussian Process Can Be Excellent for High-Dimensional Bayesian Optimization. arXiv 2024, arXiv:cs.LG/2402.02746. [Google Scholar]
Percival, D.B.; Walden, A.T. Wavelet Methods for Time Series Analysis; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
Hanke, G.; Canals, M.; Vescovo, V.; MacDonald, T.; Martini, E.; Ruiz-Orejón, L.F.; Galgani, F.; Palma, M.; Papatheodorou, G.; Ioakeimidis, C.; et al. Marine litter in the deepest site of the Mediterranean Sea. Mar. Pollut. Bull. 2025, 213, 117610. [Google Scholar] [CrossRef]
Dragomiretskiy, K.; Zosso, D. Variational Mode Decomposition. IEEE Trans. Signal Process. 2014, 62, 531–544. [Google Scholar] [CrossRef]

Figure 1. Neural representation of the TSK fuzzy predictive model.

Figure 2. Temporal evolution of significant wave height data from the Pylos buoy.

Figure 3. Significant wave height histogram along with a kernel distribution fit.

Figure 4. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{V}}^{8}

,

{\tilde{W}}^{8}

, and

{\tilde{W}}^{7}

subspaces, respectively.

Figure 4. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{V}}^{8}

,

{\tilde{W}}^{8}

, and

{\tilde{W}}^{7}

subspaces, respectively.

Figure 5. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{W}}^{6}

,

{\tilde{W}}^{5}

, and

{\tilde{W}}^{4}

detail subspaces, respectively.

Figure 5. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{W}}^{6}

,

{\tilde{W}}^{5}

, and

{\tilde{W}}^{4}

detail subspaces, respectively.

Figure 6. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{W}}^{3}

,

{\tilde{W}}^{2}

, and

{\tilde{W}}^{1}

detail subspaces, respectively.

Figure 6. MODWT-based MRA components. From top to bottom: The approximation of the time series onto

{\tilde{W}}^{3}

,

{\tilde{W}}^{2}

, and

{\tilde{W}}^{1}

detail subspaces, respectively.

Figure 7. (left) Regression plot for ANFIS prediction on

D^{*}

, (right) Regression plot for Vanilla TSK prediction on

D^{*}

.

Figure 7. (left) Regression plot for ANFIS prediction on

D^{*}

, (right) Regression plot for Vanilla TSK prediction on

D^{*}

.

Figure 8. (left) Regression plot for proposed model prediction with VMD on

D^{*}

. (right) Regression plot for proposed model prediction with VMD and projection step on

D^{*}

.

Figure 8. (left) Regression plot for proposed model prediction with VMD on

D^{*}

. (right) Regression plot for proposed model prediction with VMD and projection step on

D^{*}

.

Figure 9. (left) Regression plot for proposed model prediction with MRA on

D^{*}

. (right) Regression plot for proposed model prediction with MRA and projection step on

D^{*}

.

Figure 9. (left) Regression plot for proposed model prediction with MRA on

D^{*}

. (right) Regression plot for proposed model prediction with MRA and projection step on

D^{*}

.

Figure 10. Error distribution fit on

D^{*}

, for all comparison models. From upper left to bottom right: (top left) Normal distribution fit pertaining to MRA TSK with and without projection step. (bottom left) Normal distribution fit pertaining to VMD TSK with and without projection step. (bottom right) Normal distribution fit pertaining to Vanilla TSK and the ANFIS model.

Figure 10. Error distribution fit on

D^{*}

, for all comparison models. From upper left to bottom right: (top left) Normal distribution fit pertaining to MRA TSK with and without projection step. (bottom left) Normal distribution fit pertaining to VMD TSK with and without projection step. (bottom right) Normal distribution fit pertaining to Vanilla TSK and the ANFIS model.