2. Outline of the Method
A “good” denoiser should preserve important features while filtering out unnecessary noise. Our method can be considered risk-free as it disengages if it cannot separate the signal from white noise. This property has clear advantages, including redirecting the analyst to more suitable denoisers and minimizing the risk of producing distorted filtered series.
Our method processes a predefined set of wavelets individually, performing a meta-parameter grid search to optimize three parameters: decomposition levels, admissible vanishing moments, and threshold levels. The choice of the “best” wavelet involves two functions: a discriminating function that rules out waveforms incapable of producing a white noise sequence of residuals, and an information-theoretic function that selects the waveform maximizing the entropy of this residual sequence.
Our procedure has several strengths:
- It reduces uncertainty related to:
  - (a) Model building (model uncertainty), which is linked to the various steps required for constructing the final model;
  - (b) Parameter tuning: a poorly calibrated model can exhibit excessive sensitivity or fail to capture relevant information, resulting in a biased representation of the original series.
- It is agnostic. The method can handle any number and type of waveforms. If it cannot disentangle the signal from noise, it applies no filtering.
- It is fully automatic. Once the grid search strategy is chosen, no further interventions are needed, avoiding the manual selection of wavelets and tuning parameters and thereby lowering the overall uncertainty of the analysis (a minimal illustration of such a grid is sketched right after this list).
- It is fast. In all tested configurations, the computational time has proven reasonable, considering the typical time spans of macroeconomic time series. A model-based approach was discarded to maintain low levels of uncertainty, as building, estimating, and validating suitable models for numerous time series can be cumbersome, especially under time constraints, which is often the case for press releases or policy feedback.
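To fix ideas, the grid of tuning parameters over which the search operates can be laid out as in the following minimal sketch; the specific wavelet names, level range, and threshold values shown here are illustrative assumptions, not the settings used in the paper.

```python
# Illustrative only: one possible hyperparameter grid for the wavelet search.
# Wavelet names, level range, and thresholds are assumptions made for exposition.
from itertools import product

wavelets = ["haar", "db4", "sym8", "coif4"]   # candidate waveforms (PyWavelets naming)
levels = range(3, 7)                          # candidate numbers of decomposition levels
thresholds = [2.0, 3.0, 4.0, 5.0]             # candidate (hard) threshold values

# Every combination defines one filter configuration to be tested automatically.
grid = list(product(wavelets, levels, thresholds))
print(f"{len(grid)} filter configurations to evaluate")
```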
2.1. Statement of the Problem
Let $y_t$, $t = 1, \dots, T$, denote the time series of a given stochastic process, which is expected to possess the following characteristics: (i) it is real-valued, (ii) it has a continuous spectral density, (iii) the data are identically and independently distributed, and (iv) it may not necessarily be stationary. The lack of stationarity, as well as of other desirable properties, can arise from the nature of many real-life time series, such as the Business Confidence Indexes utilized in the empirical section of this paper. These indexes reflect the judgments made by economic actors, who are often irrational and influenced by biases stemming from personal involvement and a lack of complete information.
Formally, the underlying data generating process can be represented as follows:
$$y_t = s_t + c_t + \epsilon_t, \qquad t = 1, \dots, T. \tag{1}$$
Here, $s_t$ is the signal—i.e., the portion of the observed data containing the slow-varying, more consolidated components—expected to be (at least) locally stationary and linear. In a sense, we can think of $s_t$ as being generated by more rational interpretations of future economic scenarios, and thus it is less affected by contingent situations. On the other hand, the erratic, emotion-driven dynamics are captured by the term $c_t$, which, like $s_t$, cannot be observed except for its effects on $y_t$. However, the treatment of the former is more critical because, as previously pointed out, the information of interest—which should be preserved as much as possible—is primarily contained in it (and only secondarily in the latter).
In more detail, while $c_t$ is crucial for the real-time assessment of the economy (or one of its sectors), $s_t$ is typically employed in a backward-looking fashion, i.e., to study the evolution pattern of a given sentiment indicator over the years or to measure the impact of past initiatives, for instance through intervention analysis; see, for example, [3] or [4]. In general, the extraction of $s_t$ is a less complicated matter—it can be performed in various ways using simpler approaches (e.g., the Hodrick–Prescott or Baxter–King filters)—compared to that of $c_t$. Finally, $\epsilon_t$ is a residual sequence of the Gaussian White Noise (GWN) type, assumed to have a mean of 0 and constant (unknown) variance $\sigma^2_{\epsilon}$. As will be explained later, an approximately Gaussian distribution for the residuals is required for the discriminating function to work properly.
The proposed procedure aims to reduce $\epsilon_t$ by replacing $y_t$ in (1) with its “optimal” filtered counterpart, denoted as $\hat{y}^{*}_t$, according to a suitable cost function. Formally, a filter is an operation performed on a given time series—called the input series or simply input ($y_t$)—to obtain a new series—called the output series or simply output ($\hat{y}_t$)—which shows a reduced amplitude at one or more frequencies. In symbols, we have:
$$\hat{y}_t = F(y_t), \qquad \sigma^2_{\hat{y}} < \sigma^2_{y}. \tag{2}$$
As will be outlined in Section 2.2, there are no guarantees that the inequality in (2) holds with probability one. Therefore, if $\sigma^2_{\hat{y}} \approx \sigma^2_{y}$—that is, if no appreciable amount of noise can be removed—the procedure should be abandoned in favor of a more suitable one.
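For intuition, the following minimal sketch simulates a data generating process of the additive form (1); all numerical choices (trend shape, innovation variances, sample size) are arbitrary illustrations rather than properties of the series analyzed in the paper.

```python
# Illustrative simulation of y_t = s_t + c_t + eps_t (Equation (1)):
# a slow-varying signal, an erratic informative component, and Gaussian white noise.
import numpy as np

rng = np.random.default_rng(0)
T = 396                                          # e.g., monthly data over 33 years
t = np.arange(T)

s = 0.02 * t + 2 * np.sin(2 * np.pi * t / 120)   # slow-varying signal
c = np.cumsum(rng.normal(0, 0.15, T))            # erratic, persistent component
eps = rng.normal(0, 1.0, T)                      # Gaussian white noise (variance unknown in practice)

y = s + c + eps                                  # observed series
```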
2.2. The Discriminating Function
Equation (2) refers to the optimal filtered series $\hat{y}^{*}_t$—i.e., the one meeting predetermined optimality conditions—found iteratively by our procedure. Essentially, the procedure is designed to sequentially and exhaustively apply a predefined set of $N$ different wavelet filters to the original time series, according to a grid search strategy involving a small set of wavelet hyperparameters. As a result, a set of filtered series—i.e., $\{\hat{y}^{(1)}_t, \dots, \hat{y}^{(N)}_t\}$—is obtained. By computing the following $N$ differences (one for each denoised version of the original series):
$$\hat{\epsilon}^{(n)}_t = y_t - \hat{y}^{(n)}_t, \qquad n = 1, \dots, N, \tag{3}$$
or, more compactly, $\hat{\boldsymbol{\epsilon}}^{(n)} = \mathbf{y} - \hat{\mathbf{y}}^{(n)}$, the $T \times N$ matrix of residuals $\hat{E} = [\hat{\boldsymbol{\epsilon}}^{(1)}, \dots, \hat{\boldsymbol{\epsilon}}^{(N)}]$ is generated. Here, each column vector: (i) can be seen as a $T$-dimensional point in the space $\mathbb{R}^{T}$ and (ii) belongs to a set denoted by $\mathcal{E}$, with cardinality $\#\mathcal{E} = N$ (the symbol $\#$ denotes the cardinality function).
Even though one can legitimately expect that—out of the $N$ filtered series—a larger proportion of filter configurations generates residuals that are not white noise (for instance, those embedding underlying linear structures), it is also reasonable (see Equation (5) and the related discussion) to find a number $M \le N$ of them producing white noise residual sequences. These series are of particular interest and will be stored in a competition set, denoted by $\mathcal{C}$, from which the “best” one—already denoted as $\hat{y}^{*}_t$—will be extracted, i.e., $\hat{y}^{*}_t \in \mathcal{C}$.
Let $\mathcal{E} = \{\hat{\boldsymbol{\epsilon}}^{(1)}, \dots, \hat{\boldsymbol{\epsilon}}^{(N)}\}$ be the set of $N$ vectors of residuals computed in (3), and let $\alpha$ be a predetermined significance level associated with the discriminating function, which in our case is the Ljung–Box test (through its $p$-value, denoted by $p_{LB}(\cdot)$). The filtered series generating residual sequences for which $p_{LB}(\hat{\boldsymbol{\epsilon}}^{(n)}) > \alpha$—denoted by $\hat{y}^{(m)}_t$—are stored in a competition set called $\mathcal{C}$. Formally:
$$\hat{y}^{(n)}_t \in \mathcal{C} \;\Longleftrightarrow\; p_{LB}\big(\hat{\boldsymbol{\epsilon}}^{(n)}\big) > \alpha, \tag{4}$$
where the symbol ⇔ denotes the statement “if and only if”. It is worth noting that, by design, the search strategy is performed using a wide range of filter configurations, which may vary in adequacy for the time series under investigation. As a result, in general, only a small number of residual sequences satisfy the condition of Equation (4), which is why, in practice, we usually observe that
$$\#\mathcal{C} \ll N. \tag{5}$$
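As an illustration, the competition set could be assembled along the lines of the following sketch, which uses the Ljung–Box test available in statsmodels; the lag choice and the use of the minimum $p$-value across lags are assumptions made here for concreteness, not the paper's exact implementation.

```python
# Build the competition set C: keep only the filtered series whose residuals
# pass the Ljung-Box whiteness test at level alpha.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

def ljung_box_pvalue(resid, lags=12):
    """Smallest Ljung-Box p-value over the first `lags` lags (a conservative summary)."""
    out = acorr_ljungbox(resid, lags=lags, return_df=True)
    return out["lb_pvalue"].min()

def competition_set(y, filtered_series, alpha=0.05):
    """Return the indices of the denoised candidates whose residuals look like white noise."""
    keep = []
    for n, y_hat in enumerate(filtered_series):
        resid = np.asarray(y) - np.asarray(y_hat)     # Equation (3)
        if ljung_box_pvalue(resid) > alpha:           # Equation (4)
            keep.append(n)
    return keep
```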
At this point, one might consider the construction of the competition set $\mathcal{C}$ an unnecessary step, given that maximizing the Ljung–Box test's $p$-value would seem to be a necessary and sufficient condition:
$$\hat{y}^{*}_t = \arg\max_{n} \; p_{LB}\big(\hat{\boldsymbol{\epsilon}}^{(n)}\big). \tag{6}$$
In principle, such an objection is correct, and (6) provides a valid approach. However, we want the residuals generated by the winning filtered series $\hat{y}^{*}_t$ not only to be maximally uninformative, but also to exhibit the highest relative weight with respect to the original time series. Specifically, even if the condition represented in Equation (6) is considered satisfactory from a theoretical perspective, it may not represent an operational solution, as the extracted time series can be practically indistinguishable from the original one. This occurs when the wavelet filter manages to eliminate only a tiny portion of noise—which does satisfy Equation (6)—while leaving the data largely unaltered. Formally, we have that
$$\hat{y}^{*}_t = y_t - \delta_t, \tag{7}$$
where $\delta_t$ is a sequence of very “small” numbers.
As already pointed out, $\mathcal{C}$ is not a set whose cardinality is necessarily smaller than or equal to 1; rather, two possible situations can arise, namely
$$\#\mathcal{C} \ge 1 \tag{8}$$
and
$$\#\mathcal{C} = 0. \tag{9}$$
Under these conditions, if $\#\mathcal{C} = 1$ (Equation (8) holding with equality), one is left with no option but to use the single filtered series stored in $\mathcal{C}$, which, by construction, satisfies the condition in Equation (4). The least favorable situation is expressed by Equation (9), i.e., when none of the considered wavelet filters is able to verify Equation (4). In such a circumstance, one might either accept progressively lower statistical significance levels ($\alpha$) or decide to abandon our procedure.
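A minimal sketch of this fallback logic is given below; the function names and the sequence of significance levels are hypothetical, introduced only to illustrate the “relax or disengage” behaviour.

```python
# Fallback logic: try progressively lower significance levels; if the competition
# set remains empty, disengage and leave the series unfiltered (risk-free behaviour).
from typing import Callable, Sequence

def select_or_disengage(y, candidates: Sequence, passes_whiteness: Callable,
                        alphas=(0.05, 0.01, 0.001)):
    """passes_whiteness(y, y_hat, alpha) -> bool, e.g., a Ljung-Box check on y - y_hat."""
    for alpha in alphas:
        kept = [n for n, y_hat in enumerate(candidates) if passes_whiteness(y, y_hat, alpha)]
        if kept:
            return kept, alpha
    return [], None            # no configuration produces white-noise residuals
```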
In Section 3, a target function $I$ will be presented. Its objective is to find the maximizer of the ratio between the variance of the extracted residuals and that of the original series, and thus to solve or alleviate—when possible—the problem described in the context of Equation (6).
3. The Target Function
The solution devised to avoid—when possible—the situation described by Equation (6) is to apply a suitable target function $I$ to the subset $\mathcal{C}$. Under the conditions of Equation (6), and provided that Equation (8) is satisfied, this target function is employed to select a better filtered version of the given time series. In practice, we need to pick a new filtered series that is not too similar to the original one, but is still able to pass the Portmanteau test. In this setup, a more meaningful filtered series is one that shows the maximum level of discrepancy compared to the original series, even if this comes at the expense of the whiteness of the residual vector $\hat{\boldsymbol{\epsilon}}$. The function adopted for this purpose is the Kullback–Leibler (KL) divergence (also called relative entropy), which can be roughly defined as a measure of how one probability distribution differs from a second, reference probability distribution. KL divergence is pivotal in information theory, as it is related to fundamental quantities such as self-information, mutual information, Shannon entropy, conditional entropy, and cross-entropy. Unlike similar measures (e.g., the variation of information), it is not a metric, since it is not symmetric and does not satisfy the triangle inequality. KL divergence is applied in many contexts, such as characterizing the relative Shannon entropy in information systems, assessing randomness in continuous time series, or evaluating the information gain achieved by different statistical models.
Roughly speaking, the KL divergence measures how one probability density function (PDF) differs from a second one used as a benchmark. Let $\mathbf{x}$ be the vector of realizations of $y_t$ (our time series); we define a function $f$ identifying the “reference” (true) distribution and a set of $m$ $f$-approximating functions $g_1, \dots, g_m$.
The KL information number between a given model $f$ and an approximating one $g$ can be expressed, for continuous distributions, as
$$I(f, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx. \tag{10}$$
For discrete probability distributions, say $P$ and $Q$, defined on the same probability space, the Kullback–Leibler divergence between $P$ and $Q$ is defined to be
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}. \tag{11}$$
The relations expressed in (10) and (11) quantify the amount of information lost due to the application of the approximating functions under the condition of iid observations. However, in the present context, such a condition is violated, and therefore the KL discrepancy, as expressed above, is not applicable. Nevertheless, we can still use this theoretical framework by replacing the density functions in (10) with the spectral densities computed on the “true” DGP and on its observed realizations, here respectively denoted by the symbols $f(\omega)$ and $\hat{f}(\omega)$, i.e.,
$$I\big(f(\omega), \hat{f}(\omega)\big) = \int_{-\pi}^{\pi} f(\omega) \log \frac{f(\omega)}{\hat{f}(\omega)}\, d\omega. \tag{12}$$
Usually, we want $I(\cdot,\cdot)$ to be as low as possible. For example, the well-known Akaike Information Criterion is aimed at selecting the model with the smallest Kullback–Leibler divergence from the “true” spectral density $f(\omega)$ generating the data $\mathbf{x}$. In this regard, $f(\omega)$ (i.e., the “truth”) is something we want the spectral density computed on our approximating model, $\hat{f}(\omega)$, to be as close as possible to, i.e.,
$$\min_{\hat{f}} \; I\big(f(\omega), \hat{f}(\omega)\big). \tag{13}$$
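In practice, a KL-type discrepancy between two spectral densities can be estimated by comparing plug-in estimates such as normalized periodograms, as in the following sketch; the use of SciPy's periodogram and the normalization step are assumptions made for illustration, not the paper's implementation.

```python
# Estimate a KL-type discrepancy between two series by comparing their
# normalized periodograms (plug-in estimates of the spectral densities).
import numpy as np
from scipy.signal import periodogram
from scipy.stats import entropy   # entropy(p, q) = sum(p * log(p / q)) = KL(p || q)

def spectral_kl(x, y_hat, eps=1e-12):
    """KL divergence between the normalized periodograms of x and y_hat."""
    _, px = periodogram(x, detrend="linear")
    _, py = periodogram(y_hat, detrend="linear")
    px = (px + eps) / (px + eps).sum()   # normalize to sum to one, avoid zeros
    py = (py + eps) / (py + eps).sum()
    return entropy(px, py)
```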
3.1. The Target Function I
In our particular context, the Kullback–Leibler divergence will be employed to solve the problem of the selected filtered time series passing, on the one hand, the Ljung–Box test—i.e., $p_{LB}(\hat{\boldsymbol{\epsilon}}) > \alpha$ (see Equation (4))—but showing, on the other hand, very “small” differences compared to the original time series. In this sense, the use of the function $I$ is diametrically opposed to the purpose for which it was originally designed. In fact, here we are not interested in maximizing the closeness of an approximating function to the true, unknown DGP (as in the case of Equation (13)), but in finding, among the time series stored in $\mathcal{C}$, the one that generates the greatest amount of discrepancy with respect to the original time series $y_t$. In symbols:
$$\hat{y}^{*}_t = \arg\max_{\hat{y}^{(m)}_t \in \mathcal{C}} \; I\big(f_{y}(\omega), f_{\hat{y}^{(m)}}(\omega)\big). \tag{14}$$
In this equation, the “truth” is what we actually observe, i.e., the time series $y_t$; therefore, the reference function is simply the identity function. On the other hand, the approximating function is the wavelet filter, defined through the 3-dimensional vector of tuning parameters $\boldsymbol{\theta} = (J, v, \lambda)$. As will be illustrated in Section 3.2, those parameters are: the number of resolution levels ($J$), the number of vanishing moments ($v$), and the threshold ($\lambda$).
In our particular theoretical framework, the maximization of the target function (Equation (14)) is approximately equivalent to the maximization of the residual variance $\sigma^2_{\hat{\epsilon}}$ and, consequently, of the magnitude of the discrepancy between the original and the filtered series (see the discussion of Equation (6)). To see this, we recall that signal and noise are uncorrelated—provided that the condition of Equation (4) is verified—and therefore the spectral density of the original series, say $f_{y}(\omega)$, can be expressed as the sum of a constant quantity, namely the spectral density of the noise $f_{\hat{\epsilon}}(\omega) = \sigma^2_{\hat{\epsilon}}/2\pi$, plus the spectral density of the filtered series $f_{\hat{y}}(\omega)$. By plugging both of the last two densities into Equation (12), we can re-express it as follows:
$$I\big(f_{y}(\omega), f_{\hat{y}}(\omega)\big) = \int_{-\pi}^{\pi} \Big(f_{\hat{y}}(\omega) + \tfrac{\sigma^2_{\hat{\epsilon}}}{2\pi}\Big) \log\!\Big(1 + \tfrac{\sigma^2_{\hat{\epsilon}}}{2\pi f_{\hat{y}}(\omega)}\Big)\, d\omega. \tag{15}$$
Now, given that $\sigma^2_{\hat{\epsilon}}/2\pi$ is “small” compared to $f_{\hat{y}}(\omega)$, we can use the quasi-equivalence $\log(1+x) \approx x$ to rewrite (15) as follows:
$$I\big(f_{y}(\omega), f_{\hat{y}}(\omega)\big) \approx \int_{-\pi}^{\pi} \Big(f_{\hat{y}}(\omega) + \tfrac{\sigma^2_{\hat{\epsilon}}}{2\pi}\Big) \frac{\sigma^2_{\hat{\epsilon}}}{2\pi f_{\hat{y}}(\omega)}\, d\omega. \tag{16}$$
By virtue of this approximation, and noting that the right-hand side of (16) is an increasing function of $\sigma^2_{\hat{\epsilon}}$, the following is true:
$$\arg\max_{\hat{y}^{(m)}_t \in \mathcal{C}} I\big(f_{y}(\omega), f_{\hat{y}^{(m)}}(\omega)\big) \;\approx\; \arg\max_{\hat{y}^{(m)}_t \in \mathcal{C}} \sigma^2_{\hat{\epsilon}^{(m)}}. \tag{17}$$
In practice, recalling that the subset $\mathcal{C}$ contains all the filtered time series $\hat{y}^{(m)}_t$ whose residuals yield a Ljung–Box $p$-value greater than 0.05, under strict inequality of Equation (8)—i.e., when $\#\mathcal{C} > 1$—the function $I$ is pairwise applied to all the elements of the set $\mathcal{C}$, as is now explained. Let $\hat{f}_{\hat{y}^{(m)}}(\omega)$ and $\hat{f}_{y}(\omega)$ be the frequency distributions of the filtered and original data, respectively. Then the series $\hat{y}^{*}_t$ satisfying
$$\hat{y}^{*}_t = \arg\max_{\hat{y}^{(m)}_t \in \mathcal{C}} \; I\big(\hat{f}_{y}(\omega), \hat{f}_{\hat{y}^{(m)}}(\omega)\big) \tag{18}$$
will be the selected denoised version of the original series.
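This final selection step can be sketched as follows, again using normalized periodograms as stand-ins for the frequency distributions; the function names and estimator choices are illustrative assumptions.

```python
# Pick, among the candidates already admitted to the competition set, the filtered
# series whose estimated spectral density diverges most from that of the original data.
import numpy as np
from scipy.signal import periodogram
from scipy.stats import entropy

def _norm_psd(x, eps=1e-12):
    _, p = periodogram(x, detrend="linear")
    p = p + eps
    return p / p.sum()

def select_best(y, candidates):
    """candidates: list of denoised series belonging to the competition set C."""
    scores = [entropy(_norm_psd(y), _norm_psd(y_hat)) for y_hat in candidates]  # Equation (18)
    best = int(np.argmax(scores))                                               # max KL discrepancy
    return best, scores[best]
```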
3.2. The Adopted Wavelet Filters
It is assumed that the white noise component translates into the wavelet domain and, as such, is captured by a set of coefficients across the different bands into which the signal is decomposed.
Because the wavelet function acts as a band-pass filter, covering the entire spectrum would, in principle, require an infinite number of decomposition levels; the scaling function filters the lowest-frequency portion of the transform and ensures that the entire spectrum is covered. For a detailed explanation, see [5]. The wavelet function is effectively a multi-level band-pass filter that progressively halves its bandwidth as the decomposition level increases, $j = 1, \dots, J$, with $J$ typically determined arbitrarily. Our denoising method is of the wavelet shrinkage type. As such, it is particularly suitable for disentangling information from noise with minimum computational complexity. More specifically, in the wavelet domain, the information has a coherent structure whose energy is captured and “stored” by a limited number of high-magnitude coefficients. The remaining coefficients exhibit small magnitudes and account for incoherent, low-energy structures (the noise). The sparse structure of the coefficients reflecting the signal is exploited by a threshold-driven shrinkage mechanism to separate the coefficients carrying useful information from those accounting for noisy structures. In our setup, the noise threshold is of the hard type, i.e.,
$$\tilde{d}_{j,k} = d_{j,k}\, \mathbb{1}\!\left(|d_{j,k}| > \lambda_j\right), \tag{19}$$
with $d_{j,k}$ and $\tilde{d}_{j,k}$ being the noisy and denoised wavelet coefficients, respectively, pertaining to the decomposition level $j$. Even if (19) allows the threshold $\lambda_j$ to vary across the $j$ decomposition levels, for the sake of readability—and without loss of generality—in what follows we will restrict our attention to the case of a single threshold $\lambda$. Therefore, (19) reads as
$$\tilde{d}_{j,k} = d_{j,k}\, \mathbb{1}\!\left(|d_{j,k}| > \lambda\right). \tag{20}$$
As will be clarified later, our method envisions the threshold to be selected out of a set of candidates, i.e., $\Lambda = \{\lambda_1, \dots, \lambda_L\}$, according to a grid search procedure.
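A minimal sketch of hard thresholding applied to the detail coefficients of an undecimated wavelet transform is shown below; PyWavelets' stationary wavelet transform is used here as a stand-in for the MODWT, and the wavelet name, number of levels, and threshold value are illustrative assumptions.

```python
# Hard-threshold the detail coefficients of an undecimated wavelet transform
# and reconstruct the denoised series (Equations (19)-(20), single threshold).
import numpy as np
import pywt

def hard_threshold_denoise(y, wavelet="coif4", level=5, lam=4.0):
    y = np.asarray(y, dtype=float)
    # pywt.swt requires the length to be a multiple of 2**level; pad if necessary.
    pad = (-len(y)) % (2 ** level)
    y_pad = np.pad(y, (0, pad), mode="edge")
    coeffs = pywt.swt(y_pad, wavelet, level=level)            # list of (cA_j, cD_j) pairs
    # Note: the threshold is applied directly to the coefficients here;
    # scaling the coefficients by a noise estimate first is a common alternative.
    den = [(cA, pywt.threshold(cD, lam, mode="hard")) for cA, cD in coeffs]
    y_hat = pywt.iswt(den, wavelet)                           # inverse transform
    return y_hat[: len(y)]
```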
It is evident that the coefficients exhibit dependence on the selected filter. For the purposes of this study, a comprehensive selection of filters has been considered, including: “haar”, “d2”, “d4”, “d6”, “d8”, “d10”, “d12”, “d14”, “d16”, “d18”, “d20”, “s2”, “s4”, “s6”, “s8”, “s10”, “s12”, “s14”, “s16”, “s18”, “s20”, “l2”, “l4”, “l6”, “l14”, “l18”, “l20”, “c6”, “c12”, “c18”, “c24”, and “c30”. We assume that each filter is represented as $w_i$ and that, collectively, these filters are organized within a set denoted by $\mathcal{W}$.
The rationale underpinning the proposed method is derived from the assertion that traditional wavelet denoising approaches—characterized by a series of “a priori” selections—can introduce substantial uncertainty into the analytical framework. The uncertainties under consideration are primarily linked to the selection of the waveform, as well as the associated number of vanishing moments, the number of decomposition levels, and the choice of threshold level. The proposed algorithm, as summarized below, deviates from conventional methodologies by assuming no prior knowledge of the structure of the denoiser. Instead, it is specifically designed to systematically evaluate the various denoising parameters with respect to the discriminating function of Section 2.2 and the target function $I$ of Section 3.
The Algorithm
1. A waveform $w_i \in \mathcal{W}$ is selected;
2. A threshold $\lambda_l \in \Lambda$ is chosen;
3. The set $\mathcal{J}$, composed of the different numbers of decomposition levels to be tested, is selected. It is given that $\mathcal{J} = \{J_{\min}, \dots, J_{\max}\}$, where $J_{\min}$ and $J_{\max}$ denote, respectively, the smallest and largest number of levels considered;
4. Without loss of generality, we assume the algorithm starts with $w_1$, $\lambda_1$, and $J_{\min}$;
5. The MODWT is applied to the original signal for all the levels into which it is decomposed. As a result, the detail components and the approximation component become available, i.e., $\mathbf{d}_1, \dots, \mathbf{d}_{J}$ and $\mathbf{a}_{J}$;
6. The thresholding rule (20) is applied to the detail coefficients of the original signal to obtain their denoised counterparts, i.e., $\tilde{\mathbf{d}}_1, \dots, \tilde{\mathbf{d}}_{J}$;
7. The inverse MODWT is taken on $\{\tilde{\mathbf{d}}_1, \dots, \tilde{\mathbf{d}}_{J}, \mathbf{a}_{J}\}$; the first denoised series, $\hat{y}^{(1)}_t$, is obtained;
8. Steps 5 to 7 are repeated until all remaining combinations of the elements belonging to $\mathcal{W} \times \Lambda \times \mathcal{J}$ are sequentially applied, resulting in the respective denoised versions of the original signal, i.e., $\hat{y}^{(2)}_t, \dots, \hat{y}^{(N)}_t$;
9. By conditioning the set of all denoised series $\{\hat{y}^{(1)}_t, \dots, \hat{y}^{(N)}_t\}$ to the whiteness condition of Equation (4) and to the target function $I$ (Section 2.2 and Section 3.1), the final series $\hat{y}^{*}_t$ is selected (an illustrative end-to-end sketch is provided below).
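Putting the pieces together, one possible end-to-end implementation of the grid search is sketched below. PyWavelets' stationary wavelet transform again stands in for the MODWT, and the candidate wavelets, levels, thresholds, lag choice, and significance level are illustrative assumptions rather than the paper's actual settings.

```python
# End-to-end sketch of the grid search: denoise with every filter configuration,
# keep the candidates whose residuals pass the Ljung-Box test, then pick the one
# whose periodogram diverges most (in the KL sense) from that of the original series.
from itertools import product

import numpy as np
import pywt
from scipy.signal import periodogram
from scipy.stats import entropy
from statsmodels.stats.diagnostic import acorr_ljungbox


def swt_denoise(y, wavelet, level, lam):
    """Undecimated wavelet transform + hard threshold + inverse transform."""
    y = np.asarray(y, dtype=float)
    pad = (-len(y)) % (2 ** level)                      # swt needs a multiple of 2**level
    y_pad = np.pad(y, (0, pad), mode="edge")
    coeffs = pywt.swt(y_pad, wavelet, level=level)
    den = [(cA, pywt.threshold(cD, lam, mode="hard")) for cA, cD in coeffs]
    return pywt.iswt(den, wavelet)[: len(y)]


def norm_psd(x, eps=1e-12):
    _, p = periodogram(x, detrend="linear")
    p = p + eps
    return p / p.sum()


def automatic_denoise(y, wavelets=("haar", "db4", "sym8", "coif4"),
                      levels=range(3, 7), thresholds=(2.0, 3.0, 4.0, 5.0),
                      alpha=0.05, lags=12):
    """Return (best_series, best_config), or (None, None) if the method disengages."""
    candidates = []
    for w, J, lam in product(wavelets, levels, thresholds):
        try:
            y_hat = swt_denoise(y, w, J, lam)
        except ValueError:                               # e.g., series too short for J levels
            continue
        resid = np.asarray(y) - y_hat
        pval = acorr_ljungbox(resid, lags=lags, return_df=True)["lb_pvalue"].min()
        if pval > alpha:                                 # admission to the competition set
            candidates.append(((w, J, lam), y_hat))
    if not candidates:
        return None, None                                # disengage: no filtering applied
    scores = [entropy(norm_psd(y), norm_psd(y_hat)) for _, y_hat in candidates]
    cfg, best = candidates[int(np.argmax(scores))]
    return best, cfg
```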
3.3. Empirical Experiment
In this study, we analyze a comprehensive dataset consisting of 236 time series belonging to the category of Business Confidence Indexes, produced by the Italian National Institute of Statistics (Istat) and spanning from January 1986 to December 2018 (396 observations per series). In this section, we present two specific outcomes derived from this broader empirical experiment. By focusing on these distinct outcomes, we seek to illustrate the efficacy of our denoising procedure and the insights it offers regarding the underlying economic conditions reflected in the data.
In Figure 1, we present the optimal waveform applied to the Istat series titled “Liquidity vs. Operational Needs”. This waveform is classified as “c24”, which denotes a Coiflet filter characterized by 24 coefficients and 12 vanishing moments. The analysis employs five levels of resolution and implements a threshold value of 3.9. The residuals generated from this filtering procedure successfully pass the Portmanteau test, thus reinforcing our confidence in the method's efficacy for effectively distinguishing and eliminating uninformative (noise) components from the signal. A similar favorable outcome is observed for the time series entitled “Assessment of Finished Goods Inventories”, as depicted in Figure 2. In this case, the optimal filter employed is of the type “c18”, indicating a Coiflet filter with 18 coefficients and 12 vanishing moments, also utilizing five levels of resolution, but with an adjusted threshold set at 4.8. This consistent performance across both analyses underscores the robustness of our denoising methodology and its applicability in extracting meaningful insights from complex economic indicators.