Joint Markov Blankets in Feature Sets Extracted from Wavelet Packet Decompositions

Over the past two decades, wavelet packet decompositions have proven effective as a generic approach to extracting features from time series and images for the prediction of a target variable. Redundancies exist between the wavelet coefficients and between the energy features that are derived from the wavelet coefficients. We assess these redundancies in wavelet packet decompositions by means of Markov blanket filtering theory. We introduce the concept of joint Markov blankets and show that they are a natural extension of Markov blankets, which are defined for single features, to sets of features. We show that joint Markov blankets exist in feature sets consisting of the wavelet coefficients. Furthermore, we prove that the wavelet energy features from the highest frequency resolution level form a joint Markov blanket for all other wavelet energy features. The joint Markov blanket theory indicates that one can expect an increase in classification accuracy as the frequency resolution level of the energy features increases.


Introduction
Raw input variables, such as the single samples from time series or the single pixels from images, are often meaningless to the targeted audience, e.g., an industrial expert or a clinician. The ease of interpretation can be enhanced by first constructing meaningful features.
A basic approach to constructing features from time series and images consists of computing general statistical parameters such as the median, the mean, the standard deviation and higher-order moments. A more thorough approach consists of using basis functions, sometimes called templates, to construct features. The prior information about the classes to be predicted is then related to the choice of the templates. However, generic approaches that generate a library of templates, such as wavelet packets, have been proposed by Coifman and Meyer [1]. Wavelet packet decompositions (WPD's) offer a library of templates with many desirable properties. First of all, WPD's are founded on the mathematical theory of multiresolution analysis [2,3], which allows signals and images to be represented in new bases. The decomposition in a new wavelet packet basis guarantees that no "information" is lost, as the original signals can always be reconstructed from the new basis. Secondly, the templates in a wavelet packet decomposition are easily interpreted in terms of frequencies and bandwidths [4]. Thirdly, wavelet packet decompositions are more flexible than the discrete wavelet transform and the Fourier transform: the basis functions that are used in a discrete wavelet transform (DWT) are also available in the wavelet packet decomposition [3,4].
We refer to the selection of wavelet coefficients, or of features derived from the wavelet coefficients, to predict a target variable C (e.g., a class label) as feature subset selection. A basis selection algorithm specifically tuned to wavelet packet decompositions was first proposed in [5]. This algorithm did not take a target variable, such as a class label, into account, but chose one basis using minimal entropy as the selection criterion. Algorithms that take the target variable into account were proposed in [6][7][8]. It was shown in [9,10] that dependencies between wavelet features were not taken into account in these earlier algorithms. Such dependencies were taken into account more recently in, e.g., [10][11][12][13]. However, a systematic analysis of the redundancies between wavelet packet features by means of Markov blankets, as a solid theoretical framework for assessing redundancies, has been lacking so far. The dependencies between wavelet features will allow us to obtain analytical results on the existence of Markov blankets regardless of the underlying probability distribution of the signals and images. In this article, we infer the redundancies between the wavelet coefficients and between the energy features computed from a wavelet packet decomposition by means of the joint Markov blanket theory. These energy features are regularly computed from wavelet coefficients to scale down the number of features to select from, as, e.g., in [12][13][14]. Other features, such as the variance of the wavelet coefficients, have been used in the literature as well, see, e.g., [15]. The joint Markov blankets proposed in this article are shown to be a natural result of iteratively applying Markov blanket filtering.

Feature Extraction from Wavelet Packet Decomposition
This section introduces the background for feature construction from wavelet packet decompositions. We use the terms template and basis function interchangeably. Strictly speaking, a template is the more general term, because it does not need to be part of a basis. We use time series to develop the theory, as this allows for simpler notation; the results can easily be extended to images. We represent a single time series by means of a sequence of observations x(t): x(0), x(1), ..., x(N - 1), where t refers to the time index and N is the number of samples. The time series x(t) can be considered as being sampled from an N-dimensional distribution defined over an N-dimensional variable X(t): X(0), X(1), ..., X(N - 1). We write this N-dimensional variable in shorthand notation as X_{0:N-1} and use capitals to denote variables.

Wavelet Coefficient Features
Features are computed from a wavelet packet decomposition by taking the inner product between the templates and the time series (using a continuous notation, for ease of exposition):

γ_{i,j,k} = ∫ x(t) ψ_i^j(t - 2^i k) dt   (1)

A feature, in this case a wavelet coefficient, in the wavelet packet decomposition is specified by the scale index i, the frequency index j and the time index k. The coefficient γ_{i,j,k} can be considered as quantifying the similarity, by means of the inner product, between the time series x(t) and the wavelet function ψ_i^j(t - 2^i k) at position 2^i k in time. The parameter i is the scale index and causes a dilation (commonly called a "stretching") of the wavelet function ψ^j(t) by a factor 2^i:

ψ_i^j(t) = 2^{-i/2} ψ^j(2^{-i} t)   (2)

The wavelet functions ψ_i^j(t) are recursively defined by means of the low-pass filter h[k] and high-pass filter g[k]:

ψ_{i+1}^{2j}(t) = Σ_m h[m] ψ_i^j(t - 2^i m)

and

ψ_{i+1}^{2j+1}(t) = Σ_m g[m] ψ_i^j(t - 2^i m)

In order to form an orthonormal system, the filters h[k] and g[k] need to satisfy the conjugate mirror filter condition [3]:

Σ_m h[m] h[m - 2k] = δ[k],  with g[k] = (-1)^{1-k} h[1 - k]

It is the parameter j in (2) that determines the shape of the template. If we choose the 12-tap Coiflet filter [16] (see pp. 258-261), we obtain the first 8 templates ψ^0(t), ψ^1(t), ψ^2(t), ..., ψ^7(t) shown in Figure 1. The construction of these basis functions can be found in textbooks [16].
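As a small numerical sanity check (a sketch added for illustration, not part of the original experiments), the conjugate mirror filter condition can be verified in the time domain for the 2-tap Haar filter pair; the same check applies to the 12-tap Coiflet filter used later in the article. The helper name `inner` is ours.

```python
import math

# Haar conjugate mirror filter pair (illustrative; the article uses a 12-tap Coiflet).
s = 1 / math.sqrt(2)
h = [s, s]    # low-pass filter
g = [s, -s]   # high-pass filter, the alternating flip of h

def inner(a, b, shift=0):
    """Inner product sum_m a[m] * b[m - 2*shift] over the overlapping support."""
    return sum(a[m] * b[m - 2 * shift] for m in range(len(a))
               if 0 <= m - 2 * shift < len(b))

# Orthonormality of the even translates: sum_m h[m] h[m - 2k] = delta[k]
assert abs(inner(h, h, 0) - 1.0) < 1e-12
# Orthogonality between the low-pass and high-pass branches
assert abs(inner(h, g, 0)) < 1e-12
print("conjugate mirror filter conditions hold for the Haar pair")
```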
In Figure 2, we show a graphical representation of the different subspaces that are obtained in a wavelet packet decomposition. In the discrete wavelet transform, the only nodes in the tree that are considered are the detail subspaces W_i^1 at each scale (the gray shaded nodes in Figure 2), together with the deepest approximation subspace; the detail subspaces are spanned by dilated and time-translated versions of ψ_0^1(t) (this function is called the mother wavelet function and is shown as the second template in the top row of Figure 1). The division in subspaces in Figure 2 also corresponds to a tiling of frequency space [4]. In Figure 2, only two bases are shown: the gray shaded basis corresponds to the discrete wavelet transform; the basis marked with diagonals is chosen arbitrarily and is one of the possible bases in the wavelet packet decomposition. The basis marked with diagonals puts more emphasis on a finer analysis of the higher frequency part of the signals.
Figure 2. Library of wavelet packet functions. Different subspaces are represented by W_i^j. Index i is the scale index, index j is the frequency index. The depth I of this tree is equal to 4. Every tree within this tree where each node has either 0 or 2 children is called an admissible tree. Two admissible trees are emphasized, one shaded in gray and one marked with diagonals. A particular node in the tree can be indexed by (i,j).

Retaining any binary tree in Figure 2 where each node has either 0 or 2 children leads to an orthonormal basis for finite energy functions x(t) ∈ L²(R). Such a tree is called an admissible tree. If the leaves of this tree are denoted by {i_l, j_l}_{1≤l≤L}, the orthonormal system can be written as:

{ψ_{i_l}^{j_l}(t - 2^{i_l} k)}_{k, 1≤l≤L}

This means that the space W_0^0, which is able to represent the input space of the time series, can be decomposed into the orthonormal subspaces W_{i_l}^{j_l}. It should be noted that a full wavelet packet decomposition yields many features: it leads to N(log₂N + 1) features. This can be seen as follows. From Figure 2, it can be noted that the number of subspaces at scale i is equal to 2^i. Therefore the frequency index j at scale i is an integer from [0, 2^i - 1], indicating the position of the subspace at scale i. As can be seen from Equation (1), at scale i the inner products are computed at the discrete time positions 2^i k. Therefore at scale 0 we obtain N (the length of the signals) coefficients: γ_{0,0,0}, ..., γ_{0,0,N-1}. At the next scale, i = 1, we obtain N/2 coefficients in each subspace, i.e., γ_{1,0,0}, ..., γ_{1,0,N/2-1} and γ_{1,1,0}, ..., γ_{1,1,N/2-1}. At the highest frequency resolution, i = log₂N, we obtain the coefficients γ_{log₂N,0,0}, ..., γ_{log₂N,N-1,0}. Hence at each scale there are N coefficients, and in total there are log₂N + 1 different scale levels. This leads overall to N(log₂N + 1) different coefficients to select from. When we want to emphasize the variable that can be associated with the coefficient γ_{i,j,k}, we use the capital Γ_{i,j,k}.
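The coefficient count N(log₂N + 1) can be reproduced with a minimal wavelet packet decomposition. The sketch below uses the 2-tap Haar filter for brevity (the article's examples use a 12-tap Coiflet); the helper names `split` and `full_wpd` are ours.

```python
import math

# Minimal full wavelet packet decomposition with the Haar filter
# (a sketch; the article's experiments use a 12-tap Coiflet instead).
s = 1 / math.sqrt(2)

def split(coeffs):
    """One WPD step: return the (low-pass, high-pass) halves, each of length len/2."""
    low  = [s * (coeffs[2*k] + coeffs[2*k + 1]) for k in range(len(coeffs) // 2)]
    high = [s * (coeffs[2*k] - coeffs[2*k + 1]) for k in range(len(coeffs) // 2)]
    return low, high

def full_wpd(x):
    """Return a list of levels; level i holds 2**i nodes of coefficients."""
    levels = [[list(x)]]                      # level 0: the initialization gamma_{0,0}[k] = x(k)
    for _ in range(int(math.log2(len(x)))):
        levels.append([half for node in levels[-1] for half in split(node)])
    return levels

x = [float(t) for t in range(16)]             # N = 16
levels = full_wpd(x)
n_coeffs = sum(len(node) for level in levels for node in level)
N = len(x)
assert n_coeffs == N * (int(math.log2(N)) + 1)   # N(log2 N + 1) = 16 * 5 = 80
print(n_coeffs)
```

Each level contributes exactly N coefficients (2^i nodes of N/2^i coefficients), and there are log₂N + 1 levels, matching the count derived above.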

Wavelet Energy Features
In cases where one can assume that the exact time location k of the template is of no importance, one can, e.g., consider the energy of the wavelet coefficients over time for each possible combination of the scale index i and the frequency index j:

E_i^j = Σ_k (γ_{i,j,k})²

Then each node in Figure 2 corresponds to one energy feature E_i^j. In total there are (1 - 2^{log₂N + 1})/(1 - 2) = 2N - 1 nodes, and hence 2N - 1 energy features. Such energy features have been previously used in [8,[12][13][14]].

Dependencies between Wavelet Features
Analytical results on dependencies between wavelet coefficients for specific classes of stochastic signals have been obtained in [17,18] in the case of fractional Brownian motion, and for autoregressive models in [12]. Dependencies between wavelet packet features also exist regardless of the underlying distribution of the signals.
Further on, we use the notations γ_{i,j,k} and γ_{i,j}[k] interchangeably. The first notation emphasizes the notion of a characteristic or a feature, while the latter emphasizes the time index k.
Although the above definition of the wavelet coefficients γ_{i,j,k} in Equation (1) allows for an intuitive interpretation as a degree of similarity, it was proven in [3] (see Proposition 8.4, p. 334) that these coefficients can also be computed at the decomposition as:

γ_{i+1,2j}[k] = Σ_m h[m - 2k] γ_{i,j}[m]   (9)

γ_{i+1,2j+1}[k] = Σ_m g[m - 2k] γ_{i,j}[m]   (10)

starting from the initialization γ_{0,0}[k] = x(k). Intuitively, the wavelet coefficients γ_{i+1,2j}[k] can be obtained from a convolution of γ_{i,j}[m] with h[-m], followed by a factor 2 subsampling. Along the same lines, γ_{i+1,2j+1}[k] can be obtained from a convolution of γ_{i,j}[m] with g[-m], followed by a factor 2 subsampling. From Equations (9) and (10) it is clear that level i+1 coefficients can be computed from level i coefficients.
On the other hand, level i coefficients can also be computed from level i+1 coefficients. At the reconstruction, the coefficients can be computed as:

γ_{i,j}[m] = Σ_k h[m - 2k] γ_{i+1,2j}[k] + Σ_k g[m - 2k] γ_{i+1,2j+1}[k]   (11)

Because wavelet packet decompositions are orthonormal transformations, the energy is preserved and it holds that:

E_i^j = E_{i+1}^{2j} + E_{i+1}^{2j+1}   (12)

Hence, energy features at level i can be expressed as a sum of energy features from level i+1.
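Both directions (decomposition and reconstruction) and the energy additivity can be checked numerically for one level of a wavelet packet split. This is an illustrative sketch with the 2-tap Haar filter, not the article's 12-tap Coiflet; the helper names `analyze` and `synthesize` are ours.

```python
import math

# Sketch: one Haar analysis/synthesis step, checking perfect reconstruction
# and the energy additivity E_i^j = E_{i+1}^{2j} + E_{i+1}^{2j+1} (Equation (12)).
s = 1 / math.sqrt(2)

def analyze(parent):
    """Level-(i+1) children from a level-i node (decomposition step)."""
    low  = [s * (parent[2*k] + parent[2*k + 1]) for k in range(len(parent) // 2)]
    high = [s * (parent[2*k] - parent[2*k + 1]) for k in range(len(parent) // 2)]
    return low, high

def synthesize(low, high):
    """Level-i node from its level-(i+1) children (reconstruction step)."""
    parent = []
    for lo, hi in zip(low, high):
        parent += [s * (lo + hi), s * (lo - hi)]   # inverse of the Haar step
    return parent

parent = [3.0, 1.0, -2.0, 4.0, 0.5, 0.5, 7.0, -1.0]
low, high = analyze(parent)

# Level-i coefficients are recovered exactly from their level-(i+1) children,
# so the children leave zero uncertainty about the parent node.
rec = synthesize(low, high)
assert all(abs(a - b) < 1e-12 for a, b in zip(parent, rec))

# Energy is preserved across the split (orthonormal transform).
energy = lambda c: sum(v * v for v in c)
assert abs(energy(parent) - (energy(low) + energy(high))) < 1e-12
print("reconstruction and energy additivity verified")
```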
In order to take into account only the wavelet coefficients at scale i that affect the wavelet coefficient γ_{i+1,2j,k} at the next scale i+1 in Equations (9) and (10), we introduce the following definition.

Definition 2.1. The level i parent coefficients of a wavelet coefficient γ_{i+1,2j,k} are the wavelet coefficients γ_{i,j,m} in its parent node for which the filter coefficients h[m - 2k] in Equation (9) are different from 0. We denote these level i parent features/coefficients as parent_i(γ_{i+1,2j,k}).

Similarly, the level i parent coefficients of a wavelet coefficient γ_{i+1,2j+1,k} are the wavelet coefficients γ_{i,j,m} in its parent node for which the filter coefficients g[m - 2k] in Equation (10) are different from 0. These parent features are denoted as parent_i(γ_{i+1,2j+1,k}). Knowing either h[m] or g[m], these parent relationships can be derived for each level i.

Here we used the notations parent_i(γ_{i+1,2j,k}) and parent_i(γ_{i+1,2j+1,k}) to emphasize that the parent coefficients of the even frequencies γ_{i+1,2j,k} and the parent coefficients of the odd frequencies γ_{i+1,2j+1,k} may differ, as can be seen from Figures 3 and 4. More generally (without emphasizing the differences between odd and even frequency components), we write the parents of γ_{i,j,k} as parent_{i-1}(γ_{i,j,k}).
Similarly, we introduce the child coefficients of γ_{i,j,k} as the coefficients at the next resolution level i+1 that affect γ_{i,j,k}.

Note that we use the terminology of parent and child nodes as used in wavelet packet trees; these should not be confused with the terminology used in directed acyclic graphs (DAGs).

In Figure 5, we used the notation γ_{i,j,2k} to indicate that each child node γ_{i+1,2j,m} and γ_{i+1,2j+1,m} contains only half the number of coefficients. More generally, we write the child coefficients of γ_{i,j,k} as child_{i+1}(γ_{i,j,k}).

Figure 5. Child coefficient relationships for γ_{i,j,2k}. The child coefficients for γ_{i,j,2k+1} are the same coefficients in the case of L-tap Coiflet filters. The top row coefficients are the odd frequency child coefficients, the bottom row the even frequency child coefficients.

Markov Blanket Filtering: A Link with Information-Theoretic Approaches
Markov blanket filtering as an approach to feature elimination was established by [19] and inspired others in the design of new feature subset selection algorithms, such as [20][21][22]. The most recent research aims at finding the Markov boundary (the minimal Markov blanket) of the target variable in feature sets containing more than ten thousand variables, while still remaining theoretically correct under the faithfulness condition [23][24][25]. A seemingly different approach to feature subset selection is that by means of mutual information, as used in [10,[26][27][28][29][30][31][32]]. As opposed to Markov blanket filtering, which is due to [19], the origin of the use of mutual information as a feature subset selection criterion is less clear. We believe that the first use of mutual information as a feature subset selection criterion can be traced back to Lewis [33]. However, at that time Lewis did not call the functional used in [33] "mutual information". A connection between Markov blanket filtering and the mutual information feature subset selection criterion was shown independently in [11] and [34].
Previous work using mutual information in [29] used heuristic concepts of information relevance and redundancy in feature subset selection, as opposed to the statistical concepts of relevance in [35] and redundancy in [21] that can be used to obtain optimal subsets. If one states that "a feature is redundant for a feature set with respect to the target variable", we want to be sure that really all information about the target variable is covered by that feature set and that the considered feature can be removed without information loss. This is exactly what Markov blanket filtering offers, and the reason we extend it here to joint Markov blankets for the inference of redundancies between features extracted from wavelet packet decompositions.
Let F_G be the current feature set, i.e., the feature set obtained after removal of some other features from the full feature set F, and F_i a feature to be removed from the current feature set F_G.

Definition 3.1. A feature subset M_i ⊆ F_G, with F_i ∉ M_i, is a Markov blanket for F_i if:

p(C, F_G \ {M_i ∪ F_i} | M_i, F_i) = p(C, F_G \ {M_i ∪ F_i} | M_i)

Hence, a Markov blanket M_i is a feature subset, not including F_i, that makes F_i independent of all other features F_G \ {M_i ∪ F_i} and the target variable C. The connection with the mutual information functional [36] is given in the following lemma, see also [11,34,37]. Read MI(X; Y | Z) as the mutual information between X and Y conditioned on Z, where X, Y and Z may be single variables or sets of variables.

Lemma 3.2. M_i is a Markov blanket for F_i in F_G if and only if:

MI(F_i; C, F_G \ {M_i ∪ F_i} | M_i) = 0
Proof. The comparison of the probability functions p(C, F_G \ {M_i ∪ F_i} | M_i, F_i) and p(C, F_G \ {M_i ∪ F_i} | M_i) is performed in an information-theoretic sense by means of the Kullback-Leibler distance between them, averaged over M_i and F_i. Using conditional probabilities, this distance can be written as the expectation of the logarithm of the ratio of the two probability functions; using the definition of conditional mutual information [36], this is equivalent to MI(F_i; C, F_G \ {M_i ∪ F_i} | M_i). Using a corollary of the information inequality (Theorem 2.6.3 in [36]), it is known that this conditional mutual information is larger than or equal to 0, with equality if and only if the two probability functions coincide, i.e., if and only if M_i is a Markov blanket for F_i.

This result can be related to Theorem 8 in [37]. There it was shown for discrete features F_i that if MI(M_i; F_i) = H(F_i), then M_i is a Markov blanket for F_i. This can also easily be shown from Lemma 3.2. Starting from MI(F_i; C, F_G \ {M_i ∪ F_i} | M_i), this can be written as:

MI(F_i; C, F_G \ {M_i ∪ F_i} | M_i) = H(F_i | M_i) - H(F_i | C, F_G \ {M_i ∪ F_i}, M_i)

Using the condition MI(M_i; F_i) = H(F_i), we obtain H(F_i | M_i) = H(F_i) - MI(M_i; F_i) = 0. Furthermore, because conditioning reduces entropy, it holds that H(F_i | C, F_G \ {M_i ∪ F_i}, M_i) ≤ H(F_i | M_i) = 0. Because Theorem 8 in [37] assumes discrete features, entropy must be ≥ 0, from which it follows that H(F_i | C, F_G \ {M_i ∪ F_i}, M_i) = 0 and hence MI(F_i; C, F_G \ {M_i ∪ F_i} | M_i) = 0. This proves the Markov blanket condition. The main difference between Lemma 3.2 and Theorem 8 of [37] is that we do not need to assume discrete features. It should be remarked that, when dealing with small sample sizes, it has been shown [38] that Markov blanket filtering may favor the removal of the features that are most correlated with the target variable. Of course, this is the opposite of what one wants to achieve with Markov blanket filtering. In [38] this behavior was observed when discretizing the features. It remains to be explored whether such behavior is also observed when one uses the continuous features instead.
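The content of Lemma 3.2 can be illustrated numerically on a toy discrete distribution (our own sketch, not from the article): when a feature F is a deterministic function of a candidate blanket M, the conditional mutual information MI(F; C | M) estimated from samples is exactly zero, whatever the dependence of C on M. The helper name `cond_mi` is ours.

```python
import math
import random
from collections import Counter

# Sketch illustrating Lemma 3.2 on a toy discrete distribution:
# if F is a deterministic function of a candidate blanket M, then
# MI(F; C | M) = 0, so M satisfies the Markov blanket condition for F.
random.seed(0)

# Joint samples: M uniform on {0,1,2}, F = M mod 2 (a function of M),
# C depends noisily on M (and hence on F only through M).
samples = []
for _ in range(20000):
    m = random.randrange(3)
    f = m % 2
    c = (m + (1 if random.random() < 0.2 else 0)) % 3
    samples.append((m, f, c))

def cond_mi(samples):
    """Plug-in estimate of MI(F; C | M) from empirical counts."""
    n = len(samples)
    p_mfc = Counter(samples)
    p_m   = Counter(m for m, f, c in samples)
    p_mf  = Counter((m, f) for m, f, c in samples)
    p_mc  = Counter((m, c) for m, f, c in samples)
    mi = 0.0
    for (m, f, c), k in p_mfc.items():
        mi += (k / n) * math.log2(k * p_m[m] / (p_mf[m, f] * p_mc[m, c]))
    return mi

print(round(cond_mi(samples), 6))   # 0.0: f is determined by m, so every log ratio is log2(1)
```

Because each m fixes f, the counts satisfy count(m,f,c)·count(m) = count(m,f)·count(m,c) exactly, so the estimate is zero without any sampling error.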
Markov blanket filtering leads naturally to the definition of a "joint" Markov blanket M_{S_{1:n-1}} of a set of features S_{1:n-1} = {F_1, ..., F_{n-1}}.

Definition 3.3. A feature subset M_{S_{1:n-1}} ⊆ F, disjoint from S_{1:n-1}, is a joint Markov blanket for S_{1:n-1} if:

MI(F_{1:n-1}; C, F \ {M_{S_{1:n-1}} ∪ F_{1:n-1}} | M_{S_{1:n-1}}) = 0

In the remainder of this article we use the shorthand notation M_{1:n-1} for M_{S_{1:n-1}}, so that the condition reads MI(F_{1:n-1}; C, F \ {M_{1:n-1} ∪ F_{1:n-1}} | M_{1:n-1}) = 0.
We now show that joint Markov blankets are obtained from performing Markov blanket filtering iteratively.

Theorem 3.4. Let M_{1:n-1} be a joint Markov blanket for the features F_{1:n-1} in F, and let M_n, with F_n ∉ M_{1:n-1}, be a Markov blanket for F_n. Then M_{1:n} = M_{1:n-1} ∪ M_n is a joint Markov blanket for F_{1:n}.

Proof. We need to show that it follows from MI(F_{1:n-1}; C, F \ {M_{1:n-1} ∪ F_{1:n-1}} | M_{1:n-1}) = 0 and MI(F_n; C, F \ {M_n ∪ F_n} | M_n) = 0 that:

MI(F_{1:n}; C, F \ {M_{1:n} ∪ F_{1:n}} | M_{1:n}) = 0   (17)

Using the chain rule for information [36] (Theorem 2.5.2), we can write the left-hand side of (17) as the sum of:

MI(F_{1:n-1}; C, F \ {M_{1:n} ∪ F_{1:n}} | M_{1:n})   (18)

and

MI(F_n; C, F \ {M_{1:n} ∪ F_{1:n}} | M_{1:n}, F_{1:n-1})   (19)

Now we show that both (18) and (19) are equal to 0. For (18): because M_{1:n-1} is a joint Markov blanket for F_{1:n-1}, we have MI(F_{1:n-1}; C, F \ {M_{1:n-1} ∪ F_{1:n-1}} | M_{1:n-1}) = 0. All variables appearing in (18), including the feature F_n and the blanket M_n that are moved to the conditioning side, are contained in {C} ∪ F \ {M_{1:n-1} ∪ F_{1:n-1}}; this does not change the dependencies because conditioning is also on F_n and M_n. Applying the chain rule for information, this zero mutual information decomposes into a sum of nonnegative conditional mutual informations, one of which is (18), and any possible subset thereof is therefore also 0, so that (18) = 0. For (19): because M_n is a Markov blanket for F_n, MI(F_n; C, F \ {M_n ∪ F_n} | M_n) = 0. The variables F_{1:n-1} and M_{1:n-1} that appear in the conditioning of (19) are contained in F \ {M_n ∪ F_n}; applying the chain rule for information to this zero mutual information, both resulting terms are equal to 0, so that (19) = 0. Hence, Equation (17) is equal to 0 and the condition of a "joint" Markov blanket is fulfilled.
The proof was provided for the case where the feature to be removed, F_n, was not part of the joint Markov blanket found so far, M_{1:n-1}. More generally, one may choose F_n ∈ M_{1:n-1}. The proof in this case becomes more elaborate but, in a similar way as above, it can be shown that the joint Markov blanket then becomes {M_{1:n-1} \ F_n} ∪ M_n, with M_n a Markov blanket for F_n. For details on the extended proof when F_n ∈ M_{1:n-1}, the reader is referred to Theorem 2.4 in [11].
In Markov blanket filtering [19], one starts by removing a single feature based on a Markov blanket found for that feature. Hence, in order to show that iteratively performing Markov blanket filtering leads to a "joint" Markov blanket for the removed features, we need to show, according to Theorem 3.4, that the first Markov blanket found in Markov blanket filtering is a "joint" Markov blanket. Suppose that one finds a Markov blanket M_1 for F_1, i.e., for this feature MI(F_1; C, F \ {M_1 ∪ F_1} | M_1) = 0. In order for M_1 to be a joint Markov blanket it must satisfy: MI(F_{1:n-1}; C, F \ {M_{1:n-1} ∪ F_{1:n-1}} | M_{1:n-1}) = 0. If we set n = 2 in the last condition, we obtain: MI(F_1; C, F \ {M_1 ∪ F_1} | M_1) = 0. This condition is satisfied, and hence the first Markov blanket is a special case of a joint Markov blanket. Therefore, iteratively performing Markov blanket filtering leads to "joint" Markov blankets.

Joint Markov Blankets in Wavelet Feature Sets
We show the existence of Markov blankets in feature sets extracted from wavelet packet decompositions. In Section 4.1 the set of all features F consists of the wavelet coefficient variables Γ_{i,j,k}; in Section 4.2 the set consists of all energy features E_i^j.

Parent or Child Nodes are Joint Markov Blankets
Let us denote by F the set of all wavelet coefficient features obtained from a full wavelet packet decomposition: F = {Γ_{i,j,k}}.

Proposition 4.1. The parent coefficients parent_i(Γ_{i+1,2j,k}) form a Markov blanket for Γ_{i+1,2j,k}.

Proof. In order to prove that parent_i(Γ_{i+1,2j,k}) is a Markov blanket for Γ_{i+1,2j,k}, we have to show:

MI(Γ_{i+1,2j,k}; C, F \ {parent_i(Γ_{i+1,2j,k}) ∪ Γ_{i+1,2j,k}} | parent_i(Γ_{i+1,2j,k})) = 0

The proof is obtained by expanding the mutual information in its entropy terms:

H(Γ_{i+1,2j,k} | parent_i(Γ_{i+1,2j,k})) - H(Γ_{i+1,2j,k} | C, F \ {parent_i(Γ_{i+1,2j,k}) ∪ Γ_{i+1,2j,k}}, parent_i(Γ_{i+1,2j,k}))   (24)

The first entropy term in Equation (24), H(Γ_{i+1,2j,k} | parent_i(Γ_{i+1,2j,k})), is equal to 0. This is due to the fact that Γ_{i+1,2j,k} is a function of parent_i(Γ_{i+1,2j,k}), according to Equation (9) and Definition 2.1. Hence the uncertainty left about Γ_{i+1,2j,k} after observing parent_i(Γ_{i+1,2j,k}) is 0. The second term in Equation (24) must also be equal to 0 for the same reason, as its conditioning set again contains parent_i(Γ_{i+1,2j,k}). From both terms in Equation (24) being equal to 0, we can conclude that the Markov blanket condition is satisfied.

Corollary 4.2. Similarly, parent_i(Γ_{i+1,2j+1,k}) forms a Markov blanket for Γ_{i+1,2j+1,k}, by Equation (10) and Definition 2.1.

Corollary 4.3. The child coefficients child_{i+1}(Γ_{i,j,k}) form a Markov blanket for Γ_{i,j,k}.

Proof. In order to prove that child_{i+1}(Γ_{i,j,k}) is a Markov blanket for Γ_{i,j,k}, we have to show:

MI(Γ_{i,j,k}; C, F \ {child_{i+1}(Γ_{i,j,k}) ∪ Γ_{i,j,k}} | child_{i+1}(Γ_{i,j,k})) = 0

Expansion of the mutual information in entropy terms leads to:

H(Γ_{i,j,k} | child_{i+1}(Γ_{i,j,k})) - H(Γ_{i,j,k} | C, F \ {child_{i+1}(Γ_{i,j,k}) ∪ Γ_{i,j,k}}, child_{i+1}(Γ_{i,j,k}))   (26)

Similarly as in the proof of Proposition 4.1, the first entropy term H(Γ_{i,j,k} | child_{i+1}(Γ_{i,j,k})) = 0, due to the functional dependence of Γ_{i,j,k} on child_{i+1}(Γ_{i,j,k}) at the reconstruction. The second term in Equation (26) is equal to zero for the same reason.
Using Theorem 3.4 iteratively on all wavelet coefficients in a node, we can show that the child nodes (or the parent node) form joint Markov blankets.

Proposition 4.4. The set of all wavelet coefficient features in the child nodes (i+1,2j) and (i+1,2j+1) forms a joint Markov blanket for all wavelet coefficient features in node (i,j).

Proposition 4.5. The set of all wavelet coefficient features in the parent node (i-1,j), {Γ_{i-1,j,m}}, forms a joint Markov blanket for all wavelet coefficient features in the nodes (i,2j) and (i,2j+1).

Proof. The proof is similar to that of Proposition 4.4. Applying Markov blanket filtering iteratively to the coefficients Γ_{i,2j,k} and Γ_{i,2j+1,k} in nodes (i,2j) and (i,2j+1) (applying Theorem 3.4, Proposition 4.1 and Corollary 4.2 iteratively, similarly as in Proposition 4.4), one finds that the set of parent coefficients of all coefficients in nodes (i,2j) and (i,2j+1) is a joint Markov blanket for these coefficients. This is equal to the set of all coefficients of node (i-1,j): {Γ_{i-1,j,m}}.

Summarizing the results of Propositions 4.4 and 4.5, we see that all coefficients in a node (i,2j) can be removed either by the existence of its child nodes (i+1,2·2j) and (i+1,2·2j+1) or by the existence of its parent node (i-1,j). Both node (i-1,j) and nodes (i+1,2·2j), (i+1,2·2j+1) are guaranteed to form a joint Markov blanket. It is interesting to note that node (i-1,j) contains N/2^{i-1} coefficients, while nodes (i+1,2·2j) and (i+1,2·2j+1) jointly contain N/2^i coefficients, which forms a smaller blanket. However, if one selects node (i-1,j) as a joint Markov blanket for the removal of node (i,2j), it will also be a joint Markov blanket for node (i,2j+1).

Child Nodes are Joint Markov Blankets for Energy Features
Here, the set of all features F consists of all energy features obtained from a full wavelet packet decomposition: F = {E_i^j}. In the case of the energy features, the analysis of the dependencies between features is somewhat simpler. As shown in Equation (12), energy features at level i (E_i^j) depend functionally on E_{i+1}^{2j} and E_{i+1}^{2j+1}. Hence, in this case only child features determine level i features. This leads to Corollary 4.6 (similar to Corollary 4.3).

Corollary 4.6. The energy features E_{i+1}^{2j} and E_{i+1}^{2j+1} form a Markov blanket for E_i^j.

Proof. In order for E_{i+1}^{2j} and E_{i+1}^{2j+1} to form a Markov blanket for E_i^j, it needs to be shown that:

MI(E_i^j; C, F \ {E_{i+1}^{2j} ∪ E_{i+1}^{2j+1} ∪ E_i^j} | E_{i+1}^{2j}, E_{i+1}^{2j+1}) = 0

Expanding the mutual information in its entropy terms yields:

H(E_i^j | E_{i+1}^{2j}, E_{i+1}^{2j+1}) - H(E_i^j | C, F \ {E_{i+1}^{2j} ∪ E_{i+1}^{2j+1} ∪ E_i^j}, E_{i+1}^{2j}, E_{i+1}^{2j+1})   (27)

The first term in Equation (27), H(E_i^j | E_{i+1}^{2j}, E_{i+1}^{2j+1}), is equal to 0 due to the functional dependence in Equation (12); the second term is equal to 0 for the same reason (see also the proof of Proposition 4.1).
For the set of energy features, we obtain the following result on which energy features form a "joint" Markov blanket for all other energy features.

Proposition 4.7. The energy features at the highest frequency resolution level, {E_{log₂(N)}^j}_{0≤j≤N-1}, form a joint Markov blanket for all other energy features.

Proof. Iterative elimination of features based on Markov blankets, starting from the top of the wavelet tree (as in Figure 2), proceeds as follows. Starting from the top, feature E_0^0 can be removed because E_1^0 ∪ E_1^1 is a Markov blanket for it (Corollary 4.6) and, by Theorem 3.4, this first Markov blanket is also a joint Markov blanket. Next, the features at level 1 can be removed because their child features at level 2 form Markov blankets, and so on. Hence, iterating this procedure until arriving at {E_{log₂(N)}^j}_{0≤j≤N-1}, these features form a joint Markov blanket for all other energy features.
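The intuition behind Corollary 4.6 and Proposition 4.7 can be checked numerically: in an orthonormal WPD, every node energy is the sum of its two children's energies, so the deepest-level energies determine all 2N - 1 node energies. The sketch below uses the 2-tap Haar filter (the article uses a 12-tap Coiflet); the helper name `node_energies` is ours.

```python
import math

# Sketch: with an orthonormal (here Haar) WPD, every node energy E_i^j is the
# sum of its two children's energies, so the energies at the deepest level
# determine all 2N - 1 node energies -- the intuition behind the
# joint-Markov-blanket result for energy features.
s = 1 / math.sqrt(2)

def split(c):
    return ([s * (c[2*k] + c[2*k+1]) for k in range(len(c) // 2)],
            [s * (c[2*k] - c[2*k+1]) for k in range(len(c) // 2)])

def node_energies(x):
    """Return per-level lists of node energies for a full Haar WPD of x."""
    levels = [[list(x)]]
    for _ in range(int(math.log2(len(x)))):
        levels.append([half for node in levels[-1] for half in split(node)])
    return [[sum(v * v for v in node) for node in level] for level in levels]

x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, 4.0]          # N = 8
E = node_energies(x)

# Every parent energy equals the sum of its two children's energies.
for i in range(len(E) - 1):
    for j, e in enumerate(E[i]):
        assert abs(e - (E[i+1][2*j] + E[i+1][2*j+1])) < 1e-9

# 2N - 1 energy features in total, all recoverable from the deepest level.
assert sum(len(level) for level in E) == 2 * len(x) - 1
print("leaf energies determine all node energies")
```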

Experiments with Energy Features of Wavelet Packet Decomposition
As shown in the proof of Proposition 4.7, the energy features at level i+1, i.e., E_{i+1}^0, ..., E_{i+1}^{2^{i+1}-1}, form a joint Markov blanket for the features at level i (as well as for those at levels i-1, ..., 0). This implies that the set of level i+1 energy features carries at least as much information about the target variable as the set of level i energy features: MI(E_{i+1}^0, ..., E_{i+1}^{2^{i+1}-1}; C) ≥ MI(E_i^0, ..., E_i^{2^i-1}; C). Furthermore, we know there is a close relationship between the mutual information and the probability of error (Pe) in predicting a target variable [30]. In particular, the Kovalevsky upper bound [39] is known to be a tight upper bound [40] on the probability of error as a function of the mutual information. With increasing mutual information, the upper bound on the probability of error becomes smaller and smaller, see, e.g., [30]. The consequence is that the probability of error is expected to decrease with increasing level of the energy features. This behavior may be observed as an increasing testing accuracy when a classifier is trained with energy features of increasing levels. However, this behavior can be expected only at the lower levels: 0, 1, 2, 3, ... Indeed, the number of energy features at level i increases as 2^i, and hence the curse of dimensionality [41] may become dominant at higher levels, which implies that the testing performance decreases again. This behavior depends on the particular classifier being used, as well as on the ratio of the number of training patterns "N" to the dimensionality "d" of the patterns: N/d [42][43][44]. Next, we illustrate the increasing classification accuracy with increasing level of the energy features, as expected from the joint Markov blanket theory explained in the previous paragraph. We consider six different time series classification problems.

The corrosion data set consists of 4 classes: absence of corrosion (197 signals), uniform corrosion (194 signals), pitting (214 signals) and stress corrosion cracking (205 signals). The signals are acoustic emission signals that were obtained during each of the corrosion processes. A trained classifier can be used to predict which corrosion process is active based on the emitted acoustic signals. For the background on the origin of the acoustic activity and the details of the experiments, the reader is referred to [9,10]. We applied the C-SVC (C-Support Vector Classifier) [45] using the LIBSVM software [46]. For more background information on SVM's (Support Vector Machines) see, e.g., [47][48][49][50]. We used a linear kernel and a grid search within the training set, see also [46], to find the best cost parameter C. In the grid search, we performed a 5-fold cross-validation and varied the cost parameter over C = 2^-5, 2^-4, ..., 2^15. The testing accuracy was obtained by means of a 10-fold cross-validation. We used the 12-tap Coiflet filter to compute the energy features. The same settings were used for the other time series classification problems, unless mentioned otherwise, with the exception that for those problems we dispose of separate training and test sets. The evolution of the testing accuracy as a function of the level of the energy features is shown in Figure 6. As predicted by the joint Markov blanket theory, the classification accuracy increases at the lower levels, but starting at level 6 (with 2^6 energy features) the classification accuracy starts to fluctuate, which can partly be attributed to the curse of dimensionality. In order to deal with the curse of dimensionality, one could further apply a feature subset selection algorithm to the energy features extracted from the highest frequency resolution.

The second time series classification problem is the cylinder-bell-funnel problem defined by Saito and Coifman [6]. The cylinder, bell and funnel classes are defined respectively as [6]:

c(i) = (6 + η)·χ_[a,b](i) + ε(i)
b(i) = (6 + η)·χ_[a,b](i)·(i - a)/(b - a) + ε(i)
f(i) = (6 + η)·χ_[a,b](i)·(b - i)/(b - a) + ε(i)

where i = 1, ..., 128, a is an integer-valued uniform random variable on the interval [16,32], similarly (b - a) follows an integer-valued uniform distribution on the interval [32,96], η and ε(i) are standard normal random variables and χ_[a,b] is the characteristic function on the interval [a,b]. We generated 100 training time series for each class and 1000 testing time series for each class. The tendency of increasing performance with increasing level of the energy features is largely confirmed in Figure 7.
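The cylinder-bell-funnel generator can be sketched directly from the definitions above; this is our own reimplementation under those definitions, not the authors' code, and the function name `cbf` is ours.

```python
import random

# Sketch of the cylinder-bell-funnel generators of Saito and Coifman,
# following the definitions quoted above (not the authors' original code).
def cbf(klass, n=128, rng=random):
    a = rng.randint(16, 32)                  # start of the plateau
    b = a + rng.randint(32, 96)              # b - a uniform on [32, 96]
    eta = rng.gauss(0.0, 1.0)                # one N(0,1) draw per series
    x = []
    for i in range(1, n + 1):
        chi = 1.0 if a <= i <= b else 0.0    # characteristic function on [a, b]
        if klass == "cylinder":
            shape = chi
        elif klass == "bell":
            shape = chi * (i - a) / (b - a)  # rising ramp
        else:                                # "funnel"
            shape = chi * (b - i) / (b - a)  # falling ramp
        x.append((6 + eta) * shape + rng.gauss(0.0, 1.0))
    return x

series = cbf("bell")
assert len(series) == 128
```

Generating, e.g., 100 training and 1000 testing series per class as in the experiment then amounts to repeated calls of `cbf` for each class label.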
The face (all) data set consists of 14 different subjects (classes), with 560 training examples and 1690 testing examples. There are 131 time series points for each subject; we restricted this to the first 128 points in order to have a power-of-two number of samples before applying the WPD. The increasing performance with increasing energy levels is confirmed in Figure 8. Note that the performance (75.2%) at the highest energy level (7) is higher than that obtained with the 1-NN Euclidean distance classifier (71.4%, see [51]), but lower than that obtained with time warping (80.2%) reported in [51]. This is a typical data set for testing time warping algorithms. The gun-point data set consists of 2 classes, with 50 time series in the training set and 150 time series in the testing set [51]. The time series contain 150 samples; these have been zero-padded to 256 samples before applying the WPD. We used the radial basis function (RBF) kernel in the SVM. In the grid search, we performed a 5-fold cross-validation in which we varied the cost parameter over C = 2^-5, 2^-4, ..., 2^15 and the kernel parameter over γ = 2^-15, 2^-14, ..., 2^3. At the highest energy level we achieved a performance of 92.0%, see Figure 9, which is higher than that obtained with the 1-NN Euclidean distance classifier (91.3%) and with time warping (91.3%) [51]. The Swedish leaf data set consists of 15 classes and contains 500 time series in the training set and 625 time series in the testing set.
Figure 10 shows the increasing classification accuracy with increasing energy levels. The performance at level 8, 88.6%, is higher than the 78.7% of the 1-NN Euclidean distance classifier and higher than the 84.3% for time warping reported in [51]. The adiac data set consists of 37 classes, 390 training time series and 391 testing time series [51]. The time series contain 176 samples; these have been zero-padded to 256 samples before applying the WPD. The increasing accuracy with increasing energy levels, as supported by the joint Markov blanket theory, is again confirmed. The result obtained at level 8 (75.4%) in Figure 11 is higher than that obtained with the 1-NN Euclidean distance classifier (61.1%) and with time warping (60.9%) [51].

Conclusions
We have argued that, within feature subset selection research, wavelet packet decompositions deserve special attention due to the existence of many dependencies between features. We extended Markov blanket filtering to the theory of joint Markov blankets in Theorem 3.4 by exploiting the link between the information-theoretic mutual information selection criterion and Markov blanket filtering. Analytical results on the existence of joint Markov blankets were established in several propositions.
For the energy features, it was proven in Proposition 4.7 that the highest resolution features {E_{log₂(N)}^j}_{0≤j≤N-1} form a joint Markov blanket for all other energy features. In six experiments it was confirmed that the classification accuracy tends to increase with increasing level of the energy features, as explained by the joint Markov blanket theory. However, in the corrosion data set this behavior was observed only for the lower levels. At higher levels, the curse of dimensionality may reduce the classification accuracy due to the increasing number of energy features.

Figure 6. Evolution of the classification accuracy as a function of the level of the energy features for the corrosion data set.

Figure 7. Evolution of the classification accuracy as a function of the level of the energy features for the cylinder-bell-funnel data set.

Figure 8. Evolution of the classification accuracy as a function of the level of the energy features for the face (all) data set.

Figure 9. Evolution of the classification accuracy as a function of the level of the energy features for the gun-point data set. Training and testing data sets are available [51].

Figure 10. Evolution of the classification accuracy as a function of the level of the energy features for the Swedish leaf data set. Training and testing data sets are available [51].

Figure 11. Evolution of the classification accuracy as a function of the level of the energy features for the adiac data set. Training and testing data sets are available [51].