Exploiting the Cone of Influence for Improving the Performance of Wavelet Transform-Based Models for ERP/EEG Classification

Features extracted from the wavelet transform coefficient matrix are widely used in the design of machine learning models to classify event-related potential (ERP) and electroencephalography (EEG) signals in a wide range of brain activity research and clinical studies. This novel study is aimed at dramatically improving the performance of such wavelet-based classifiers by exploiting information offered by the cone of influence (COI) of the continuous wavelet transform (CWT). The COI is a boundary that is superimposed on the wavelet scalogram to delineate the coefficients that are accurate from those that are inaccurate due to edge effects. The features derived from the inaccurate coefficients are, therefore, unreliable. In this study, it is hypothesized that the classifier performance would improve if unreliable features, which are outside the COI, are zeroed out, and the performance would improve even further if those features are cropped out completely. The entire, zeroed out, and cropped scalograms are referred to as the “same” (S)-scalogram, “zeroed out” (Z)-scalogram, and the “valid” (V)-scalogram, respectively. The strategy to validate the hypotheses is to formulate three classification approaches in which the feature vectors are extracted from the (a) S-scalogram in the standard manner, (b) Z-scalogram, and (c) V-scalogram. A subsampling strategy is developed to generate small-sample ERP ensembles to enable customized classifier design for single subjects, and a strategy is developed to select a subset of channels from multiple ERP channels. The three scalogram approaches are implemented using support vector machines, random forests, k-nearest neighbor, multilayer perceptron neural networks, and deep learning convolutional neural networks. In order to validate the performance hypotheses, experiments are designed to classify the multi-channel ERPs of five subjects engaged in distinguishing between synonymous and non-synonymous word pairs.
The results confirm that the classifiers using the Z-scalogram features outperform those using the S-scalogram features, and the classifiers using the V-scalogram features outperform those using the Z-scalogram features. Most importantly, the relative improvement of the V-scalogram classifiers over the standard S-scalogram classifiers is dramatic. Additionally, enabling the design of customized classifiers for individual subjects is an important contribution to ERP/EEG-based studies and diagnoses of patient-specific disorders.


Introduction
The continuous wavelet transform (CWT) and the discrete wavelet transform (DWT) are indispensable signal processing tools that are extensively used to analyze and develop classifiers for event-related potential (ERP) and electroencephalography (EEG) signals. Like most practical signals, ERPs and EEG epochs have finite lengths. The CWT and DWT of finite-length signals suffer from border distortions, which are called cone of influence (COI) effects and border effects, respectively. This study focuses primarily on the COI effects of the CWT. In order to validate the hypotheses and to also show that the performance trends are consistent across classifiers, the three approaches are implemented using a diverse set of classifiers which are trained and tested on the 64-channel ERPs of 5 subjects engaged in a binary semantic task which involved distinguishing between synonymous and non-synonymous word pairs. The problems with developing customized ERP classifiers for individual subjects are identified, and a subsampling strategy is developed to generate small-sample average ensembles to facilitate the design. A rank-of-rank-sum channel selection strategy which uses an interclass separation measure is developed to select a subset of channels that carry the most useful discriminatory information for classifying the synonymous and non-synonymous ERPs.
Classification experiments are designed using the (5 subjects) × (8 channels) = 40 data sets, and it is shown that: (a) the Z-scalogram classifiers outperform the standard S-scalogram classifiers; (b) the V-scalogram classifiers significantly outperform the S-scalogram and Z-scalogram classifiers; (c) the relative improvement of the V-scalogram classifiers over the standard S-scalogram classifiers is dramatic; (d) the improvement trends across the three scalogram approaches are remarkably consistent across the classifiers, ERP channels, and subjects; (e) the region outside the COI does not carry useful discriminatory information; (f) the subsampling strategy to generate small-sample ERP ensembles enables customized classifier design for single subjects.
To the best of our knowledge, no other study systematically demonstrates how the performance of standard scalogram-based classifiers can be improved through the V-scalogram approach, which excludes the artifact coefficients outside the COI. Furthermore, the subsampling strategy, which enables the design of customized classifiers for individual subjects, is unique.

Materials and Methods
This section presents (a) the definitions of the COI, S-scalogram, Z-scalogram, V-scalogram, and V-complement scalogram, (b) methods to reduce effects of edge artifacts, (c) measures to indicate the quality of scalograms, (d) the development of the S-scalogram, Z-scalogram, and V-scalogram approaches, (e) the subsampling strategy to generate small-sample ERPs to enable the development of ERP classifiers for single subjects, (f) the development of the algorithm to rank and select channels for classification problems, (g) the diverse set of classifiers selected to implement the 3 scalogram approaches for ERP classification, (h) the 64-channel dichotomous EEG data of 5 subjects that is used to implement the various classifiers for the three scalogram approaches, (i) the experiments designed to evaluate the performances of the various implementations of the three classification approaches, and (j) details of the wavelet choice, averaging parameter selection, feature vector generation, classifier architectures and hyperparameters, and cross-validation.

COI
The primary focus of this study is on the development of classifiers that modify the scalogram for feature extraction based on the COI. It is, therefore, important to understand how the COI is generated in the scalogram, which, in turn, requires an understanding of how the scalogram is generated from the CWT. Given a wavelet basis function Ψ(t), the CWT of a signal x(t) is given by

W(a, b) = (1/√a) ∫ x(t) Ψ*[(t − b)/a] dt,

where a and b are the scale and translation parameters, respectively, and Ψ* denotes the complex conjugate of Ψ. That is, the CWT is the convolution of x(t) and scaled versions of Ψ(t). Applying the CWT to x(t) results in a set of wavelet coefficients which are a function of scale and position. A plot of the coefficients provides a time-scale representation of x(t). In order to implement the CWT on a computer, it is discretized; consequently, W(a, b) is discrete both in scale and in time. The discretized CWT is often referred to as the DCWT to distinguish it from the discrete wavelet transform (DWT). Using s and n to denote the discrete-scale and discrete-translation variables, respectively, W(s, n) plotted as a function of s and n is called the CWT coefficient matrix. The scalogram is the amplitude |W(s, n)| or power |W(s, n)|², which can be represented by a matrix G(s, n), s = 0, 1, . . ., (J − 1); n = 0, 1, . . ., (N − 1), where J is the number of scale bands and N is the signal duration. In this zero-indexed matrix representation, row 0 corresponds to scale S_0, row 1 corresponds to scale S_1, . . ., row (J − 1) corresponds to scale S_{J−1}, with S_j < S_{j+1}. The COI is best explained by first understanding the convolution operations that are used to generate the coefficient matrix and to identify the COI. Each row of the coefficient matrix W(s, n) is obtained from the linear convolution of x(n) with a wavelet of a particular scale. There are several different forms of linear convolution, which include "full", "same", and "valid", which are useful for identifying the COI in the coefficient matrix.
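As a concrete illustration, the discretized coefficient matrix W(s, n) can be sketched in NumPy by convolving the signal with scaled wavelets, one scale per row. This is a minimal sketch, assuming a unit sampling interval and an illustrative Morlet center frequency; it is not the exact discretization used in the paper.

```python
import numpy as np

def morlet_wavelet(scale, f0=1.0, dt=1.0):
    # Sample an analytic Morlet wavelet at the given scale; truncation at
    # four standard deviations on each side is a common choice.
    t = np.arange(-4 * scale, 4 * scale + dt, dt)
    u = t / scale
    return np.exp(1j * 2 * np.pi * f0 * u) * np.exp(-u**2 / 2) / np.sqrt(scale)

def dcwt_same(x, scales):
    # W(s, n): one row per scale, obtained by same-convolution of x with the
    # time-reversed conjugate wavelet (the correlation form of the CWT).
    rows = []
    for a in scales:
        psi = morlet_wavelet(a)
        rows.append(np.convolve(x, np.conj(psi[::-1]), mode="same"))
    return np.vstack(rows)

# Example: an N-sample signal and J scale bands yield a (J x N) matrix.
x = np.cos(2 * np.pi * 0.05 * np.arange(200))
W = dcwt_same(x, scales=[2, 4, 8, 16])
scalogram = np.abs(W) ** 2          # power scalogram G(s, n)
```

The scales are kept small enough that each sampled wavelet is shorter than the signal, so every row of the same-convolution has the same length N.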
The three types of convolutions, which are described in detail in Appendix A, differ in the manner in which the start and end points are computed. The standard scalogram, which we refer to as the S-scalogram, is generated by same-convolution. The S-scalogram contains artifact coefficients because it suffers from edge effects. The valid convolution operation yields coefficients that are accurate and is used to identify the COI in the coefficient matrix. The boundary that separates the artifact coefficients from the accurate coefficients defines the COI in the S-scalogram. We refer to the part of the S-scalogram that is inside the COI as the V-scalogram because it corresponds to the coefficients associated with valid convolutions. Furthermore, the entire S-scalogram with zeroed-out coefficients outside the COI is referred to as the Z-scalogram. Figure 1 shows an ERP and the 3 different types of scalograms generated from the Morlet wavelet transform of the ERP.
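The relationship between the full, same, and valid convolutions, and hence the location of the COI in a same-convolution row, can be seen directly with NumPy's convolution modes (toy signal and kernel lengths for illustration):

```python
import numpy as np

x = np.arange(12.0)           # finite-length "signal", N = 12
h = np.ones(5) / 5            # length-L kernel standing in for a scaled wavelet, L = 5

full  = np.convolve(x, h, mode="full")    # N + L - 1 = 16 samples
same  = np.convolve(x, h, mode="same")    # N = 12 samples (edges contaminated)
valid = np.convolve(x, h, mode="valid")   # N - L + 1 = 8 samples (no edge effects)

# In the same-convolution row, the (L - 1) // 2 samples at each end overlap the
# zero-padded extension; only the central "valid" span lies inside the COI.
edge = (len(h) - 1) // 2
assert np.allclose(same[edge:len(x) - edge], valid)
```

Longer kernels (coarser scales) contaminate more edge samples, which is why the COI narrows toward the low-frequency rows of the scalogram.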


Edge Artifacts
It is inevitable that the CWT applied to a finite-duration signal will lead to edge artifacts in the standard S-scalogram due to the need for extending the signal through padding. The inaccuracies of the artifact coefficients are the highest at the edges and decrease towards the COI. Various notable methods have been proposed to reduce, but not remove, the effects of the artifacts by smoothing the extensions [27–29], forecasting the extensions [30], and modifying the wavelet transform instead of extending the signal [31,32]. The effects of several extension methods have also been compared in [33,34]. For least-squares wavelet analysis (LSWA), which is used to analyze non-stationary signals that are not sampled evenly, the stochastic confidence level surface acts like the COI but is based on degrees of freedom [35].
Methods have also been proposed to approximate the COI, which include (a) using the time constant 1/e to delineate the borders of the cone of influence at each scale [2,36,37], (b) defining the extent of the COI at each scale as the point where the wavelet transform magnitude decays to 2% of its peak value [38,39], (c) delineating the COI by adding and subtracting (1/2) the wavelet footprint at the beginning and end of the observation interval at each scale [37,40,41], and (d) using the wavelet time interval that encompasses 95% of the wavelet's energy for Morse wavelets [42–44].

Scalogram Quality
In order to analyze the oscillatory behavior in most physiological signals, the scalogram is represented in terms of the discrete frequency variable f, instead of the scale variable. The discrete-time and discrete-frequency S-scalogram can be represented by the matrix G_S(f, n), f = 0, 1, . . ., (J − 1), where J in this case is the number of discrete-frequency bands. The COI depends not only on the wavelet choice and frequencies but also on the signal duration. For a given scalogram, the ratio ρ of the number of V-scalogram coefficients to the total number of S-scalogram coefficients can be used as a measure of the quality of the scalogram. For a (J × N) scalogram, the ratio is given by

ρ = [Σ_{f=0}^{J−1} N_f] / (J N),

where N_f is the number of coefficients in row f that are inside the COI. For a given scalogram, a quality ratio β can also be defined in terms of the power enclosed by the COI to the total power in the following manner:

β = [Σ Σ G_S(f, n)] / [Σ_{f=0}^{J−1} Σ_{n=0}^{N−1} G_S(f, n)],

where the summations in the numerator are taken for values of (f, n) that fall inside the COI. Higher values of ρ and β indicate better scalogram quality.
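Both quality ratios reduce to simple sums over a boolean COI mask. The following sketch uses a toy scalogram and a hypothetical cone geometry (the actual COI shape depends on the wavelet and signal duration):

```python
import numpy as np

def quality_ratios(G, coi_mask):
    # rho: fraction of coefficients inside the COI;
    # beta: fraction of the total power enclosed by the COI.
    J, N = G.shape
    rho = coi_mask.sum() / (J * N)
    beta = G[coi_mask].sum() / G.sum()
    return rho, beta

# Toy (4 x 10) power scalogram with a cone-like mask: row f keeps N_f = 10 - 2f
# coefficients (an illustrative geometry, not a computed COI).
G = np.ones((4, 10))
coi = np.zeros((4, 10), dtype=bool)
for f in range(4):
    coi[f, f:10 - f] = True
rho, beta = quality_ratios(G, coi)
```

With a constant-power scalogram the two ratios coincide; for real scalograms β weights each coefficient by its power.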

The Three Scalogram Approaches
Our goal is to compare the performances of classifiers that employ S-scalogram, Z-scalogram, and V-scalogram features; therefore, the most straightforward way to demonstrate this goal is to design classifiers for each ERP channel. Clearly, the same goal can be demonstrated by designing classifiers that combine information from multiple ERP channels [45,46]. However, the performance will then also depend on the combination methods which could mislead the performance comparisons. Therefore, the assumption hereafter is that the classifiers are single-channel classifiers.
The strategy to demonstrate the influence of the COI on classification performance is to develop three classification approaches in which the feature vectors are derived from (a) the S-scalogram G_S(f, n), (b) the Z-scalogram G_Z(f, n), and (c) the V-scalogram G_V(f, n). The scalogram G_S(f, n) consists of the signal power at each time-frequency location. For classification problems, transforming a signal x(n) into G(f, n) is regarded as a feature extraction process. Clearly, additional operations can be performed on the scalogram feature matrix to generate other features such as the mean band power, ratios of power in select bands, and those which are typically used in image classification, such as texture, entropies, higher-order statistical features, and PCA-derived features. However, the classification results would then depend on the choice of the features, making it difficult to draw direct conclusions about the influence of the COI on classifier performance. Therefore, in order to avoid this problem, no further modifications of the scalogram features are employed in the development of the classifiers. That is, the features are simply the scalogram coefficients.
The most direct way to demonstrate the performance trends across the three classification approaches is to concatenate the rows of the scalogram into a feature vector. In the formulation of the three classification approaches, the region outside the COI is represented by G_O(f, n).

S-Scalogram Approach
The feature vector in this approach is formed by concatenating the rows of the (J × N) S-scalogram. If ∆ and T represent the row concatenation and transpose operations, the [(J)(N) × 1] feature vector G_S is given by

G_S = {∆[G_S(f, n)]}^T.

The feature vector G_S includes a combination of accurately computed and artifact features. It can, therefore, be expected that the performance will be impacted negatively by the inclusion of the artifact features.
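The row-concatenation-and-transpose operation amounts to flattening the scalogram matrix into a column vector. A minimal NumPy sketch with toy dimensions:

```python
import numpy as np

J, N = 6, 20
G_S = np.random.default_rng(0).random((J, N))   # toy (J x N) S-scalogram

# Row concatenation (C-order flatten) followed by transpose gives the
# [(J)(N) x 1] feature vector used by the S-scalogram approach.
g_s = G_S.reshape(1, -1).T
```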

Z-Scalogram Approach
In this method, the S-scalogram is modified by multiplying it with a mask M(f, n) such that the resulting scalogram has elements set to zero outside the COI and left unchanged inside the COI. That is, the zeroed-out S-scalogram is given by

G_Z(f, n) = G_S(f, n) M(f, n),

where M(f, n) = 1 if (f, n) is inside the COI and M(f, n) = 0 otherwise. The feature vector G_Z, which has the same dimension [(J)(N) × 1] as G_S, is given by

G_Z = {∆[G_Z(f, n)]}^T.

Although the dimension of the feature vector G_Z is the same as that of G_S, it can be hypothesized that the performance would improve by zeroing out the unreliable features that are outside the COI. However, features with zero values across all classes are completely uninformative and provide no discriminatory value. Furthermore, the unnecessarily large feature vector can lead to overfitting, which can negatively impact the performance.
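The masking step is an element-wise product with a binary COI mask; the sketch below uses the same hypothetical cone geometry as earlier (toy dimensions, illustrative mask):

```python
import numpy as np

J, N = 6, 20
rng = np.random.default_rng(1)
G_S = rng.random((J, N))

# Binary mask M(f, n): 1 inside the COI, 0 outside (hypothetical cone shape).
M = np.zeros((J, N))
for f in range(J):
    M[f, f:N - f] = 1.0

G_Z = G_S * M                     # Z-scalogram: coefficients zeroed outside the COI
g_z = G_Z.reshape(1, -1).T        # same [(J)(N) x 1] dimension as the S vector
```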

V-Scalogram Approach
In this method, the S-scalogram is cropped with a COI-shaped mask V(f, n) which selects the features inside the COI and discards the features outside the COI. The resulting COI-shaped 2-dimensional V-scalogram G_V(f, n) is given by

G_V(f, n) = G_S(f, n) ⊙ V(f, n),

where ⊙ is used to denote the cropping operation, and the mask V(f, n) selects the feature if (f, n) is inside the COI and discards it if (f, n) is outside the COI. The [(J)(N_V) × 1] feature vector G_V is given by

G_V = {∆[G_V(f, n)]}^T, with (J)(N_V) = Σ_{f=0}^{J−1} N_f,

where N_f is the duration of the frequency band f inside the COI. Unlike the two previous cases, the durations of the frequency bands are not equal; consequently, the feature vector G_V does not have the same dimension as G_S. By excluding the features outside the COI, the dimension (J)(N_V) of G_V will be smaller than the dimensions of G_S and G_Z. Through the simultaneous exclusion of the unreliable features included in the S-scalogram approach and the cropping out of the irrelevant features included in the Z-scalogram approach, it can be hypothesized that classifiers using G_V for the feature vector should outperform the classifiers using the feature vectors of the S-scalogram and Z-scalogram approaches.
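Because each frequency band retains a different number of in-COI coefficients, the V-scalogram vector is built by concatenating variable-length row segments rather than flattening a rectangular matrix. A sketch with the same hypothetical cone geometry:

```python
import numpy as np

J, N = 6, 20
rng = np.random.default_rng(2)
G_S = rng.random((J, N))

coi = np.zeros((J, N), dtype=bool)
for f in range(J):
    coi[f, f:N - f] = True        # hypothetical COI: N_f = N - 2f samples in band f

# Concatenate only the in-COI coefficients of each band; the bands have unequal
# durations, so the V vector is shorter than the (J*N x 1) S and Z vectors.
g_v = np.concatenate([G_S[f, coi[f]] for f in range(J)]).reshape(-1, 1)
```

Here the vector length is Σ N_f = 20 + 18 + 16 + 14 + 12 + 10 = 90, versus 120 for the S- and Z-scalogram vectors.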

V-Complement Scalogram Classification
In order to erase any doubts about the importance of excluding the features outside the COI, classifiers were also developed using only the features outside the COI, that is, the features contained in G_O(f, n). This approach is referred to as the V-complement scalogram approach because the feature set is the complement of the V-scalogram feature set. If these features are truly unreliable, as claimed throughout this study, the classification accuracies of the V-complement scalogram classifiers should be worse than those of the baseline S-scalogram classifiers. The V-complement approach is not considered to be 1 of the 3 scalogram classification approaches but is simply developed to demonstrate that the features outside the COI do not carry any useful discriminatory information.

Subsample ERPs for Classifier Design
In statistical analyses, an ERP is formed by averaging all the time-locked single trials collected from a channel. The averaging operation is performed independently on the single trials of each channel. That is, 1 ERP is obtained for each channel. A different approach is required for ERP classification because classifiers are designed using the data split into two mutually exclusive subsets: the training set needed to train the model and the test set needed to evaluate the model. The term "design" includes classifier training and testing. Designing customized classifiers for individual subjects, whether it is through direct ERP-based classification or feature-based ERP classification, is impossible if each ERP category is represented by a single ERP. Group-based classifiers would require a very large number of subjects to generate enough ERPs for training and testing, which, in many practical cases, may be impossible. Also, it is important to note that the performance of a classifier is highly dependent on the number of ERPs and the quality of the training set. A small training set would not be representative of the variations in the ERPs, which results in poor generalization. Furthermore, large training sets are required to overcome the "curse of dimensionality" often encountered in the design of parametric classifiers [47]. Poor-quality (poor SNR) training sets lead to poor learning, which, in turn, results in poor performance.
The most obvious approach to increase the size of the training and test sets is to simply use single trials without averaging. However, high classification accuracies cannot be expected with single-trial signals due to the poor SNR. Because the SNR improves through averaging [48–50], it can be expected that the classification accuracy will improve by averaging over a small number of single trials. The classification accuracy can be expected to increase even further by increasing the number of trials averaged. For convenience, the signal formed by averaging m single trials will be referred to as an m-ERP, and m will be referred to as the "averaging parameter". Furthermore, the full-sample ERP formed by averaging all Ω single trials in an ensemble will be referred to as the Ω-ERP, and, for convenience, a single trial will be referred to as a 1-ERP. In order to form m-ERPs, the 1-ERP ensemble can be partitioned sequentially into blocks of size m, and the 1-ERPs in each block can be averaged. Alternatively, subsamples of m single trials can be drawn randomly without replacement and averaged to form distinct m-ERPs. However, in both cases, the sizes of the m-ERP training and test sets, which are equal to the number of 1-ERPs in the respective sets divided by m, "shrink" as m increases, which leads to overfitting. The obvious solution to this problem is to increase the size of the single-trial ensembles. However, collecting large ensembles of single trials to form a large ensemble of m-ERPs for training and testing is difficult in practice because subjects tend to lose focus on the task due to fatigue and lapses in concentration over long data collection sessions. Another alternative is to average m single trials drawn with replacement to generate large ensembles of m-ERPs. However, there is the risk of generating identical m-ERPs, which can result in singular covariance matrices.
This could be problematic for parametric classifiers such as those with Gaussian discriminant functions, which require the computation of the inverse of the covariance matrix [47].
The real challenge, therefore, in designing ERP classifiers for many applications is to avoid having to repeat the stimulus presentation numerous times to be able to form high SNR m-ERPs, especially during real-time testing. To be practical, most applications require accurate classification of ERPs averaged over a small number of single trials. The goal, therefore, is to enable the design of practical ERP classifiers that can accurately classify ERPs averaged over a small number of single trials without having to collect a prohibitively large ensemble of 1-ERPs.

Generation of m-ERPs through "m-Subsample Averaging"
In order to overcome the shrinking problem that accompanies sequential averaging and averaging random subsamples without replacement, we employ a subsampling and averaging strategy introduced and implemented in [45,51,52]. We will refer to this method of generating subsample ERPs as "m-Subsample Averaging (m-SA)". For a given 1-ERP ensemble of size Ω_0, the m-SA strategy can be summarized as follows: (a) draw a random subsample of m 1-ERPs, m < Ω_0, without replacement; (b) average the m 1-ERPs in the subsample to obtain an m-ERP; (c) return the m 1-ERPs of the subsample to the single-trial ensemble; (d) repeat steps (a)–(c) until an ensemble of Ω_q m-ERPs is generated, where each ensemble generation is referred to as a "run"; (e) repeat steps (a)–(d) Q times to yield Q runs, with each run containing an m-ERP ensemble of size Ω_q.
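The m-SA steps can be sketched as follows. The ensemble sizes below are illustrative, and the 1-ERP ensemble is partitioned into training and test pools before any m-ERPs are generated so that the two sets share no single trials:

```python
import numpy as np

def m_subsample_average(trials, m, n_out, rng):
    # trials: (num_trials x num_samples) ensemble of 1-ERPs.
    # Each m-ERP averages a fresh subsample of m trials drawn without
    # replacement; the trials are "returned" before the next draw.
    num_trials = trials.shape[0]
    out = np.empty((n_out, trials.shape[1]))
    for i in range(n_out):
        idx = rng.choice(num_trials, size=m, replace=False)
        out[i] = trials[idx].mean(axis=0)
    return out

rng = np.random.default_rng(0)
one_erps = rng.standard_normal((195, 200))        # toy 1-ERP ensemble

# Partition 1-ERPs first; then generate m-ERPs separately from each pool.
train_1, test_1 = one_erps[:156], one_erps[156:]
train_m = m_subsample_average(train_1, m=8, n_out=500, rng=rng)
test_m = m_subsample_average(test_1, m=8, n_out=100, rng=rng)
```

Note that n_out can exceed the pool size divided by m, which is precisely how m-SA avoids the "shrinking" of the training and test sets as m grows.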
In order to maintain mutual exclusivity between the training and test sets, the 1-ERP ensemble is first partitioned into training and test sets, and subsequently, m-ERPs are generated from each of these sets. Each run, therefore, begins with randomly partitioned 1-ERP training and test sets. The specified numbers of m-ERPs for the training and test sets are generated from the 1-ERPs of the training and test sets, respectively. Consequently, none of the 1-ERPs used to form the m-ERPs of the training set are included in forming the m-ERPs of the test set.
The key point to note is that m-SA offers the flexibility of generating large ensembles of m-ERPs for classifier design. Most notably, the m-ERP ensembles can be generated from the 1-ERP ensembles of individual subjects, thus enabling the design of customized classifiers for individual subjects. Clearly, m-SA is equally applicable for generating large ensembles for multi-subject group-based classifier design. Increasing the amount of data for training in this fashion is a form of data augmentation that can be especially beneficial for improving the performance of data-driven classifiers such as deep learning CNNs. Also noteworthy is that the training set can be made up of m-ERPs with different values of m to increase the amount of data and the variability in the training set. An additional point to note is that the averaging parameter does not have to be the same for the training and test sets. That is, the classifier could be trained with m-ERPs and tested on n-ERPs. In practice, it is desirable to keep both m and n small in order to avoid having to collect a large number of single trials for training and testing, respectively. It is especially important to obtain high accuracies for small values of n so that trials do not have to be repeated many times during testing.

Rank-of-Rank-Sum (RRS) Channel Ranking
The primary reasons for channel selection are to extract a subset of channels that carry useful discriminatory information and to reduce the computational burden. For ERP analysis, temporospatial PCA can be used to identify channels and time windows that best capture effects elicited by stimuli [53,54]. Therefore, the classification performance can be expected to improve by using only the identified channels and time windows. However, the drawback of using such prior knowledge for channel selection with respect to multi-channel ERP classifier design is that the information that a particular classifier uses to discriminate between the ERP categories may differ from the information identified by PCA, which is aimed at seeking projections that best represent the data in a least-squares sense. A different approach, which is aimed at selecting channels that best discriminate between the ERP classes, is required for classification problems. A common approach is to use a suitable criterion with backward elimination to develop greedy algorithms which start with the full set of channels and decrease the number of channels, one channel at a time, to determine a subset of channels that yield the best criterion score. References [55–57] review several notable channel selection methods that have been developed for EEG channel selection.
The rank-of-rank-sum (RRS) algorithm developed in this study is feature-blind and quite general because it is formulated to rank channels for classification problems involving multiple classes (polychotomous) and multiple subjects. Binary classification and single-subject cases are treated as special cases of the RRS algorithm. The channels are ranked according to the interclass separations of the ERPs generated at the channels. The interclass separation measure selected is a scatter-normalized measure of separation between two cluster means. In general, higher classification accuracies can be expected if clusters have higher interclass separations. However, the interclass separation, which is a pairwise measure, is not directly applicable for multi-channel and polychotomous ranking because it provides a separation measure only for a pair of channels and a pair of ERP classes. Moreover, it is not directly applicable to multiple subjects. To solve the first problem, channel ranking is obtained via the ranking of the sum of the pairwise rankings of the U(U − 1)/2 pairs of ERP classes, where U is the number of ERP classes. One approach to solving the second problem is to combine the B subjects into a single group. This approach, however, is sensitive to the inter-subject variabilities of the ERPs. Instead, the RRS algorithm solves the second problem by determining the rankings of the sum of the channel rankings across the B subjects. The RRS algorithm is described in detail in Appendix A.
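For the binary, multi-subject case, the rank-of-rank-sum idea can be sketched as follows. The separation measure below is an illustrative scatter-normalized distance between class means; the paper's exact formulation is given in its Appendix A.

```python
import numpy as np

def separation(a, b):
    # Illustrative scatter-normalized separation between two ERP-class
    # clusters at one channel (larger = more separable).
    return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-12)

def rrs(scores):
    # scores[s, c]: interclass separation for subject s at channel c.
    # Rank channels within each subject (1 = most separable), sum the ranks
    # across subjects, then rank the sums: smallest rank sum = best channel.
    per_subject_rank = scores.argsort(axis=1)[:, ::-1].argsort(axis=1) + 1
    rank_sums = per_subject_rank.sum(axis=0)
    return rank_sums.argsort()          # channel indices ordered best-first

d = separation(np.array([1.0, 1.2, 0.8]), np.array([3.0, 3.3, 2.7]))

scores = np.array([[0.2, 0.9, 0.5],
                   [0.1, 0.8, 0.6]])    # 2 subjects x 3 channels (toy values)
order = rrs(scores)
```

Summing per-subject ranks rather than pooling subjects keeps the ranking robust to inter-subject amplitude differences, which is the motivation stated above.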

Choice of Classifiers
In order to test the performance-related hypotheses and to determine whether the performance trends are consistent across classifiers, channels, and subjects, a range of diverse classifiers is selected to implement the three scalogram approaches for ERP classification. The following well-known classifiers are selected: SVMs, RFs, k-NN, MLPs, and deep learning CNNs. Each of these classifiers has its advantages and limitations. Details of these well-known classifiers can be found in excellent books on machine learning [58–62].
Two variants of CNN classifiers are designed: vector input CNN-1 and matrix input CNN-2. The CNN-1 classifier accepts vector inputs and can, therefore, be implemented for all 3 approaches through the row concatenation operations described in Section 2.2. As mentioned in the introduction, the matrix scalogram is an ideal format for CNN inputs. Implementing matrix input CNN classifiers for the first 2 approaches is straightforward because the S-scalogram and the Z-scalogram have matrix formats. However, implementing CNN classifiers capable of accepting the unequal row V-scalogram matrix is problematic because the CNN matrix convolutions require equal row matrices. Such an implementation is beyond the scope of this study.

Data Sets
The EEG data used to demonstrate the application and evaluate the performance of the 3 classification approaches was downloaded from: https://eeglab.org/tutorials/10_Group_analysis/study_creation.html#description-of-the-5-subject-experiment-tutorial-data (accessed on 1 October 2022).
This data set was selected because it was freely available and ideal for demonstrating the goals of this study. However, the methods formulated in this study are quite general and are not tied to this particular data set or any other data set. According to the description of the study on the website, the EEG data were collected from 5 subjects performing an auditory binary semantic task in which they were asked to distinguish between synonymous and non-synonymous word pairs. The second word was presented 1 s after the presentation of the first word. Data epochs were extracted 2 s before the second-word onset to 2 s after the second-word onset. EEGs were collected from 64 channels using a sampling rate of 200 Hz. After decomposing each subject's data by ICA, 2 EEG data sets were extracted: the synonymous data set comprising trials in which the second word was synonymous with the first one and the non-synonymous data set in which the second word was not a synonym of the first. Thus, the study included 5 binary data sets, or (5 subjects) × (2 conditions) = 10 data sets in total, with 2 data sets for each of the 5 subjects. Both data sets of each subject were recorded in a single session. The number of epochs for each condition varied from 196 to 235 across the 10 data sets.
For the purpose of this study, the single trials (1-ERPs) for the classification experiments consisted of the 1 s segment extracted after the second-word onset in each epoch. Consequently, the total number of samples in each 1-ERP was 200. In order to facilitate 5-fold cross-validation and to have an equal number of epochs across the 10 data sets, the first 195 epochs were selected from each data set. The top 8 ranked channels for each subject determined by the RRS algorithm were selected for the experiments. The rankings are listed in Table 1. The rankings of the channels differ across the 5 subjects, which is not unexpected. Whether these channels differ from those extracted from temporospatial PCA is not of concern because the focus is on channel selection for classification and not for analyses. Furthermore, whether the top-ranked channels are correlated proximal channels is not taken into account because the goal is to simply select 1-ERPs from a subset of channels, instead of all channels, that can be used as data sets for the experiments. If needed, correlation analyses could be conducted on the ranked channels to systematically include low-correlated channels starting with the top-ranked channel.
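The segment extraction described above is a simple slicing operation; the sketch below uses synthetic epochs in place of the real recordings, with the stated timing parameters:

```python
import numpy as np

fs = 200                              # sampling rate (Hz)
onset = 2 * fs                        # epochs span 2 s before to 2 s after onset

# Toy stand-in for one subject/condition: 235 epochs of 4 s each.
epochs = np.random.default_rng(3).standard_normal((235, 4 * fs))

# Keep the 1 s segment after the second-word onset (200 samples) and the first
# 195 epochs so that all 10 data sets have equal size for 5-fold cross-validation.
one_erps = epochs[:195, onset:onset + fs]
```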

Table 1. Top 8 ranked channels for each subject.

Each classifier was designed on (5 subjects) (8 channels) = 40 binary data sets. The 40 binary data sets are large enough to adequately observe the performance trends across the three approaches; adding more channels would simply enlarge the already large results tables without contributing to the trend analyses. Figure 2 shows the full-sample Ω-ERPs of subject b1 formed by averaging the 1-ERPs in the binary data sets. Observe that the Ω-ERPs of the two classes appear to be fairly well separated in all 8 channels.

Classification Experiments
This section describes various aspects of the design of experiments to test the implementations of the three classification approaches using the 5 different types of classifiers and the 5-subject 8-channel ERP data sets.

Wavelet Choice
Over the years, many different CWTs have been proposed, including, among others, the Morlet [63,64], Morse [44], Gaussian [2], and Mexican Hat [65,66] wavelets. The choice of wavelet is application dependent; in this study, the analytic Morlet wavelet is selected because it is frequently used to analyze the oscillatory behavior of ERPs and EEGs. The analytic Morlet mother wavelet, the product of a complex exponential of frequency f0 and a zero-mean Gaussian window with variance σ², is given by

ψ(t) = A exp(j2πf0t) exp(−t²/(2σ²)),

where the constant A ensures that the wavelet energy is equal to one. The analytic wavelet coefficients are complex, and the power spectrum is zero for negative frequencies. Note that the wavelet is defined in terms of a frequency parameter, which is preferred in applications involving oscillatory analyses.
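A minimal numerical sketch of the analytic Morlet wavelet defined above. The normalization A = (σ√π)^(−1/2) follows from the Gaussian integral and yields unit energy; the sampling grid and default parameters are arbitrary choices for illustration.

```python
import numpy as np

def morlet(t, f0=1.0, sigma=1.0):
    """Analytic Morlet wavelet: a complex exponential of frequency f0
    times a zero-mean Gaussian window with variance sigma**2.
    A is chosen so that the wavelet energy integrates to one."""
    A = (sigma * np.sqrt(np.pi)) ** -0.5
    return A * np.exp(1j * 2 * np.pi * f0 * t) * np.exp(-t**2 / (2 * sigma**2))

# numerical check that the wavelet energy is (approximately) one
dt = 0.004
t = np.arange(-8, 8, dt)
energy = np.sum(np.abs(morlet(t))**2) * dt
```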

Selection of the Averaging Parameter
The aim of this study is not focused on obtaining the highest classification accuracies but on analyzing the performance trends across the 3 classification approaches using the data sets. In general, classifiers trained and tested on high SNR data sets yield good accuracies across the classifiers, making it difficult to compare their performances. Furthermore, it may not be practical to repeat trials a large number of times to generate high-SNR m-ERPs during real-time testing. It is, therefore, important to be able to obtain high accuracies for small values of m. For these reasons, the smallest value, m = 4, which enabled the analyses of the performance trends across the 3 approaches, was selected for generating the m-ERP ensembles. In the experiments to follow, a classifier was developed to test the 4-ERPs of each channel collected from each subject independently.
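The 4-ERP generation can be sketched as below. The subsampling draw (m random trials averaged per m-ERP, with replacement across draws) is an assumption standing in for the subsampling strategy of Section 2.3.1; function and variable names are illustrative.

```python
import numpy as np

def make_m_erps(trials, m=4, n_out=None, rng=None):
    """Form m-ERPs by averaging m randomly subsampled single trials.
    `trials` has shape (n_trials, n_samples); one m-ERP per draw."""
    if rng is None:
        rng = np.random.default_rng(0)
    if n_out is None:
        n_out = len(trials)
    idx = rng.integers(0, len(trials), size=(n_out, m))
    return trials[idx].mean(axis=1)          # shape (n_out, n_samples)

# toy ensemble: 156 single trials of 200 samples each
trials = np.random.default_rng(1).normal(size=(156, 200))
m_erps = make_m_erps(trials, m=4)            # 156 4-ERPs
```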

Feature Vector Generation
The feature vectors for each approach were generated from the Morlet wavelet power scalogram through row concatenation, as described in Section 2.2. The question as to whether the scalogram is computed from 1-ERPs or m-ERPs has to be considered. In general, the scalogram of the average of the single trials is used for evoked power analysis, whereas the average of the scalograms of each single trial, which gives the total power, is used for induced power analysis. The total power is the sum of the evoked and induced power. In this study, for the purpose of feature extraction, a single scalogram is computed for each m-ERP, where m is equal to 4. The duration of each 4-ERP (the same as the single trial duration) was N = 200 samples, and the number of frequency bands J was 108. The dimensions of the scalograms were, therefore, 108 × 200. Hence, the dimension of the feature vectors G_S and G_Z was 21,600 × 1. Using the process outlined in Section 2.2 for selecting the features inside the COI, the dimension of the feature vector G_V was 17,056 × 1. The ratio ρ of the number of V-scalogram coefficients to the total number of S-scalogram coefficients for this case is (17,056/21,600) (100) = 78.96%. That is, approximately 21% of the scalogram coefficients were artifact coefficients.
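The dimension bookkeeping above reduces to simple arithmetic; the V-scalogram length of 17,056 is taken from the text rather than derived from the COI geometry.

```python
J, N = 108, 200                        # frequency bands x samples per 4-ERP
s_features = J * N                     # S- and Z-scalogram feature length (21,600)
v_features = 17056                     # V-scalogram length reported in the text
rho = 100.0 * v_features / s_features  # percentage of coefficients inside the COI
```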

Classifier Architectures and Parameters
This subsection describes the architectures and hyperparameters of the 6 different classifiers that were implemented. No serious attempt was made to optimize the architectures and hyperparameters because the goal is to compare the relative performances of a given classifier type across the three approaches.
SVM Classifiers: The SVM classifiers were implemented using the Gaussian radial basis function kernel. The best combination of the regularization parameter C and influence parameter γ was selected using an exhaustive grid search.
RF Classifiers: The following hyperparameters were selected: number of trees = 10, maximum depth of a tree = 475, minimum number of samples required to split an internal node = 3, minimum number of samples required to be at a leaf node = 1, splitting quality criterion = entropy.
k-NN Classifiers: The Euclidean distance metric was used to measure proximity. Through trial and error, it was found that k = 20 yielded good performance across all classifiers.
MLP Classifiers: The MLP classifiers consisted of four layers of neurons: the first layer had 1024 neurons, the second layer had 512 neurons, the third layer had 256 neurons, and the output layer had 2 neurons. The sigmoidal activation function was used in all 4 layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.
CNN-1 Classifier: The CNN-1 classifiers consisted of three convolution layers followed by a max pooling layer and terminated with a fully connected network (FCN) consisting of four layers of neurons. The convolution layers used the ReLU activation functions, the first three layers in the FCN used sigmoidal activations, and the last layer used softmax activations. The "same" operation was used in the convolution layers. The number of filters in the convolution layers was 32. The filter dimensions in the convolution layers were (9 × 1). A (2 × 1) max pooling filter with a stride of 2 was used in the pooling layer. The FCN had 1024, 512, 256, and 2 neurons in the four layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.
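As a rough bookkeeping check of the CNN-1 layer dimensions, the following sketch assumes the flattened 21,600-sample scalogram feature vector is the input and that the "same" convolutions preserve its length. The exact input layout is an assumption rather than something stated explicitly above.

```python
# Assumed reading: CNN-1 convolves (9x1) filters over the flattened
# 21,600-sample feature vector. 'Same' convolutions preserve the length,
# and the (2x1) max pool with stride 2 halves it before the FCN.
n_in, n_filters = 21600, 32
length = n_in                # after three 'same' conv layers: still 21,600
length //= 2                 # after the (2x1) max pool, stride 2: 10,800
flat = length * n_filters    # features entering the first FCN layer
```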
CNN-2 Classifiers: The CNN-2 classifiers consisted of three convolution layers, each followed by a max pooling layer, and terminated with a fully connected network (FCN) consisting of four layers of neurons. The convolution filters had dimensions (3 × 3), and the max pooling filters had dimensions (2 × 2) with a stride equal to 2. The "same" operation was used in all three convolution layers, and the number of filters in each convolution layer was 32. The convolution layers used ReLU activation functions, the first three layers in the FCN used sigmoidal activations, and the last layer used softmax activations. The FCN had 1024, 512, 256, and 2 neurons in the four layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.
In summary, the SVM, RF, k-NN, MLP, and CNN-1 classifiers were developed for all three scalogram approaches, and the CNN-2 classifiers were developed for the first 2 approaches to give a total of 17 classifiers. Each classifier was trained and tested independently on the 40 binary data sets. The classifiers were implemented using the PyTorch library.

Cross-Validation
The steps outlined in Section 2.3.1 were used to generate runs consisting of the m-ERPs of the training and test sets. For each run, which started with randomly ordered 1-ERPs, the 195 1-ERPs of each class were partitioned into 5 folds, each containing 39 1-ERPs. The training set consisted of 4 folds (156 1-ERPs), and the test set was the remaining fold. From the 1-ERP training set, an equal number (156) of 4-ERPs were generated for the training set. Similarly, 39 4-ERPs were generated for the test set. This method of generating 4-ERP training and test sets was conducted for each channel. The channel classifiers were trained and tested using 5-fold cross-validation for each run. For each run, the total number of 4-ERPs for training and testing each channel classifier consisted of (195 4-ERPs/class) (2 classes) = 390 4-ERPs. The classification accuracy for each run, expressed as a percentage, was estimated as the ratio of the correctly classified 4-ERPs to the total number (390) of 4-ERPs of the 2 classes. The procedure was repeated over 50 runs, and the final classification accuracy of each classifier was estimated by averaging the accuracies of the 50 runs. That is, the final classification accuracy of each classifier was averaged across (390) (50) = 19,500 4-ERPs belonging to the 2 classes. Importantly, all the classifiers were implemented using identical training and test sets.
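One run of the fold construction described above can be sketched as follows for a single class and channel. The random 4-trial draws are an illustrative stand-in for the m-ERP generation procedure of Section 2.3.1, and the names are hypothetical.

```python
import numpy as np

n_trials, n_folds = 195, 5
rng = np.random.default_rng(0)
order = rng.permutation(n_trials)           # randomly ordered 1-ERP indices
folds = np.array_split(order, n_folds)      # 5 folds of 39 1-ERPs each

test_fold = folds[0]                        # 39 test 1-ERPs (one fold)
train_idx = np.concatenate(folds[1:])       # 156 training 1-ERPs (4 folds)

# one 4-ERP per training/test 1-ERP, with averaging draws confined
# to the same set so that no test information leaks into training
train_draws = rng.choice(train_idx, size=(len(train_idx), 4))
test_draws = rng.choice(test_fold, size=(len(test_fold), 4))
```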

Results
This section presents the results of the various experiments designed to observe the performance trends of the 3 classification approaches across classifiers, channels, and individual subjects.

m-ERP Classification Results
In order to observe the trends across the 3 approaches closely, the final classification accuracies obtained from the classifiers across the three approaches are presented separately in five "full" tables, which contain the results for each subject. Due to their large size, the five full tables are presented in Appendix B (Tables A1-A5). Each entry in the tables is the average classification accuracy obtained from testing (390 4-ERPs × 50 runs) = 19,500 4-ERPs. The best result for each channel is highlighted in bold font. The entries are empty for the CNN-2 classifiers for the V-scalogram approach because they were not implemented. Any inconsistent result (expected improvement is not observed) is highlighted in red. Out of the total of 680 results presented in the five full tables, only 10 (<1.5%) were mildly inconsistent in the sense that the expected improvements of the Z-scalogram classifiers over the S-scalogram classifiers were not observed. Table 2 summarizes the overall average classification accuracies across the five subjects and eight channels for each approach. Except in the row labeled Global Average, each entry in Table 2 is the average classification accuracy obtained from testing (five subjects × eight channels × 19,500 4-ERPs) = 780,000 4-ERPs. The row labeled Global Average is the average of the classifiers across each approach (column averages). An asterisk on CNN-2 indicates that the CNN-2 results were not included in the computation of the Global Average. The CNN-2 averages are included below the Global Averages to emphasize this point. Figure 3 shows the 95% confidence interval error bars for the Global Averages in Table 2. The V-scalogram error bar does not overlap with the S-scalogram and Z-scalogram error bars. Therefore, the improvements of the V-scalogram approach over the S-scalogram and Z-scalogram approaches are statistically significant, with p-values less than 0.05.
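A sketch of the error-bar computation behind Figure 3, assuming a normal-approximation 95% interval over the per-run (or per-classifier) accuracies; the exact CI method is not stated above, and `ci95` and `overlap` are hypothetical helper names.

```python
import numpy as np

def ci95(accs):
    """Mean accuracy with a normal-approximation 95% confidence interval,
    returned as the (lower, upper) endpoints of the interval."""
    accs = np.asarray(accs, dtype=float)
    m = accs.mean()
    h = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
    return m - h, m + h

def overlap(a, b):
    """True if two (lo, hi) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Non-overlapping intervals, as observed between the V-scalogram and the other two approaches, correspond to the p < 0.05 significance claim in the text.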

Single Trial Results
Up to this point, we have assumed that single trial classification will yield poor classification accuracies. In order to support this assumption, the classifiers were trained and tested on the 1-ERPs of Subject b3. The results are shown in Table A6 in Appendix B. For brevity, the results of the 4 other subjects are not included because the trends are the same as those observed for Subject b3. As expected, the results for the 1-ERP classification are inferior to the 4-ERP classification. Interestingly, the CNN-1 results using V-scalogram features are quite good for single trial classification. The significant jumps in the CNN results from the S-scalogram accuracies to the V-scalogram accuracies are noteworthy.

V-Complement Scalogram Results
Classifiers were developed using only the (21,600 − 17,056) = 4544 features outside the COI. Table 3 summarizes the classification accuracies of the classifiers trained using this V-complement scalogram approach. The S-scalogram results are also included in the table to facilitate comparison. The V-complement average classification accuracy of 51.95% is close to the 50% accuracy expected for random binary classification. Clearly, the artifact coefficients do not carry useful discriminatory information. Therefore, the performance should improve by zeroing out the artifact coefficients and improve even further by removing them completely, as confirmed by the experimental results in Appendix B.
Figure 3. The 95% CIs for the global mean accuracies of the 3 approaches.


Discussion
This section discusses the findings presented in Section 3.

Performance Trends
The following general trends in the 6 full tables in Appendix B and the summary table are important to note: for each classifier, there is an improvement in the performance of the Z-scalogram classifiers over the S-scalogram classifiers (including CNN-2) and a consistent improvement of the V-scalogram classifiers over the Z-scalogram classifiers. This is also reflected across the overall averages for the three approaches. Importantly, the CNN-1 results in Tables A1-A5 also show that the two-class semantic ERPs can be classified with accuracies exceeding 94% for all 5 subjects using a value as small as m = 4 for subsample averaging with the V-scalogram. As noted in the subsampling section, even higher accuracies can be obtained by increasing the averaging parameter m.
For the CNN-2 classifier, the improvement of the Z-scalogram approach over the S-scalogram approach is also notable. Quite interestingly, even the Z-scalogram approach can yield accuracies greater than 93% for all five subjects. This impressive performance can be attributed to the fact that the convolution filters are trained to extract discriminatory features from input during the training phase. That is, the zeroed-out region outside the COI is ignored to some degree by the filters. The filters are influenced by the zeroes to a greater extent when they overlap with scalogram regions that are both inside and outside the COI boundary. Based on the observed trends of the other classifiers across the three approaches and the trends between the two CNN-2 implementations across the first two approaches, we can expect a dramatic improvement in the CNN-2 implementation of the V-scalogram approach over the standard S-scalogram approach if the convolution mechanisms in the CNNs can be modified to accommodate unequal row matrices. Furthermore, through an understanding of how the filter weights are trained by convolving over entire matrices for S and Z-scalograms containing artifact and zeroed-out coefficients, respectively, it can be expected that the convolution filters trained only over the correct V-scalogram information are bound to lead to better performance. It is important to note that modifying wavelet-based CNN classifiers to restrict the matrix convolutions within the COI would be a major breakthrough, given the huge attention focused on developing and applying wavelet-based CNN classifiers to numerous problems in various fields. Future efforts will focus on developing direct/indirect methods to restrict matrix convolutions within the COI.

Relative Improvements
The consistent improvements of the Z-scalogram and V-scalogram approaches over the baseline S-scalogram approach are quite remarkable because they are observed across the m-ERPs of all 5 subjects and all 8 channels (all 40 data sets). The relative improvement in the performance of an approach B over an approach A, denoted by Γ(B, A), can be measured using

Γ(B, A) = [α(B) − α(A)] / α(A) × 100%,

where α(A) and α(B) are the classification accuracies of approaches A and B, respectively. Table 4 summarizes the relative improvements for the average accuracies listed in Table 2.
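The relative-improvement measure can be written directly as code; the function name is illustrative.

```python
def relative_improvement(alpha_b, alpha_a):
    """Percentage improvement of approach B over approach A,
    given their classification accuracies alpha_b and alpha_a."""
    return 100.0 * (alpha_b - alpha_a) / alpha_a
```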
These results clearly confirm that the Z-scalogram approach outperforms the S-scalogram approach and the V-scalogram approach outperforms the Z-scalogram approach. Most importantly, the improvements of the V-scalogram approach over the standard S-scalogram approach are dramatic.

Comparison of Classifiers
Although the primary goal of this study is to show that, for any given classifier, the performance using the standard S-scalogram features can be improved by using Z-scalogram features, and improved further by using V-scalogram features, the results in the tables can also be used to rank the 6 different classifiers and pick the best performing classifier type. From the averages in Table 2, it can be concluded that the best results for the V-scalogram approach across classifiers, channels, and individual subjects were obtained with the CNN-1 classifiers. The superiority of the CNN classifier models can be attributed to the fact that the filters in the convolution layers have the ability to extract complex features that are relevant to classification. Most importantly, the scalograms can be input directly to the CNNs without having to extract a set of hand-engineered features. Consequently, the performance is not dependent on a hand-engineered feature set, which is often determined through trial and error. The size and the shape of the convolution filters can couple information in different ways to extract different types of features from the same input presented to the CNN. For example, row filters couple information from individual frequency bands in the scalogram, whereas a matrix filter can couple information across frequency bands. If needed, the complexity of the features can be increased by increasing the number of convolution layers. In general, CNNs are highly scalable to larger classification problems by increasing the network depth and with additional training. Therefore, CNNs should be a good choice for multi-subject group-based classification. Furthermore, there are several elegant ways of fusing "information" from multiple channels with CNNs to improve performance. The information can be the data/features of the channels or the decisions of the channel classifiers.
The study in [67] involving the classification of multi-axial (multi-channel) signals using CNNs shows that, in general, multi-axial fusion classification models outperform the best uniaxial (single-channel) classification model.

Conclusions
The goal of this study was to demonstrate the importance of exploiting information offered by the COI in designing CWT-based classifiers. It was hypothesized that the performance would improve by zeroing out the unreliable scalogram features outside the COI and that the performance would improve even further by completely excluding the unreliable features outside the COI. A subsampling strategy was developed to generate small-sample averaged ensembles to enable customized design for individual subjects/patients, and a rank-of-rank-sum channel selection strategy was developed to select a subset of channels from multiple channels. The S-, Z-, and V-scalogram classification approaches developed to validate the performance hypotheses were implemented with a set of diverse classifiers and tested on the semantic task ERPs of 5 subjects. The results from 40 data sets confirmed the expected relative improvements, and it was noted that the improvements were remarkably consistent across channels, classifiers, and subjects. Most importantly, the relative improvements of the V-scalogram classifiers over the standard S-scalogram classifiers were dramatic. Among the 5 classifier types implemented, the CNN classifiers yielded the best results consistently.
In conclusion, this study demonstrated unequivocally that excluding the region outside the COI for feature extraction results in a significant improvement in classifier performance. The V-scalogram approach is not restricted to ERP/EEG classification and is applicable to improving the performance of wavelet classifiers designed to classify other physiological signals as well as signals in many other fields, including engineering, science, and economics. Undeniably, this is a significant contribution, considering the vast number of wavelet-based classifiers that have been reported but have not exploited the COI for feature extraction in the manner described in this study. Furthermore, the ability to design customized classifiers for individual subjects which yield high classification accuracies using subsample averaging is an important contribution for studying and diagnosing patient-specific disorders using ERPs and EEGs.
Author Contributions: Conceptualization, L.G., X.C. and R.S.G.; methodology, X.C., L.G. and R.S.G.; software, X.C.; validation, L.G., X.C. and R.S.G.; formal analysis, X.C. and L.G.; investigation, X.C. and L.G.; writing-original draft preparation, X.C. and L.G.; writing-review and editing, L.G., X.C. and R.S.G. All authors have read and agreed to the published version of the manuscript.

Acknowledgments: The authors would like to thank the anonymous reviewers for their helpful suggestions to improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The convolution operations that lead to the definitions of the S-scalogram, Z-scalogram, V-scalogram, and COI are described in detail in this Appendix. The formulation of the generalized RRS channel selection algorithm applicable to single/multiple subjects and dichotomous/polychotomous classification is also presented in this Appendix.

Appendix A.1. Types of Convolution
The question that is addressed in this section is: what type of convolution causes the edge artifacts to occur in the scalogram coefficient matrix? In order to answer this question, three convolution mechanisms are described by assuming the durations of the signal x(n) and the zero-centered mother wavelet filter h(n) are N and (2M + 1), respectively. That is, the compact support of h(n) is [−M, M]. It is also assumed that M < N.
In general, the discrete convolution sum is given by

y(n) = Σ_λ x(λ) h(n − λ),

where λ is a dummy variable. That is, the convolution operation is performed by shifting a folded version of h(λ) along x(λ) and finding the dot product at each shift n. It is assumed that the shift occurs in unit steps. In order to simplify the discussion, it will be assumed that h(n) is symmetric. In full convolution, the rightmost sample of h(λ) is aligned with the leftmost sample of x(λ) at the start of convolution, and the leftmost sample of h(λ) is aligned with the rightmost sample of x(λ) at the end of convolution. By convention, the filtered output is at the filter center; therefore, the duration of the filtered output is (N + 2M). In order to perform the dot product at the start and end, x(λ) has to be padded by 2M samples on the left and on the right. A common practice is to pad the ends with zeros; however, other methods, such as padding with the mean value of the signal, linear padding, symmetric padding, periodic extension padding, and fitting polynomial functions, can also be employed [27,68,69]. For the given signal and filter durations, the full-convolution sum can be expressed as

y_F(n) = Σ_{λ=−M}^{M} h(λ) x̃(n − λ), n = −M, −M + 1, ..., N + M − 1,

where x̃(λ) is x(λ) padded with 2M samples on both ends. It is important to observe that the first 2M and the last 2M samples of the output are computed from partial overlaps between x(λ) and h(λ), which are the artifact coefficients in this case. Irrespective of the padding method employed, the outputs computed from the partial overlaps are inaccurate artifact coefficients.
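The artifact count for full convolution can be illustrated with NumPy's `full` mode, which zero-pads exactly as described; the toy signal and averaging filter are arbitrary.

```python
import numpy as np

N, M = 16, 3                             # signal length, filter half-support
x = np.random.default_rng(0).normal(size=N)
h = np.ones(2 * M + 1) / (2 * M + 1)     # toy symmetric filter of length 2M+1

y_full = np.convolve(x, h, mode='full')  # N + 2M = 22 output samples
# the first 2M and last 2M outputs come from partial overlaps (artifacts)
n_artifacts = 2 * (2 * M)
```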
In practice, full convolution is performed very efficiently in the frequency domain by multiplying the DFTs of x(n) and h(n). The filtered output is given by the inverse DFT of the product. The DFTs are computed by augmenting both sequences with zeros to have durations equal to or greater than (N + 2M) so that the resulting circular convolution is identical to the desired linear convolution. The DFTs are computed efficiently using fast Fourier transform (FFT) algorithms [70].
Same Convolution
This form of convolution is the one used to generate the standard CWT coefficient matrix to ensure that the rows have the same duration as the input. Each row of the coefficient matrix is obtained by performing the same-convolution operation between x(n) and h(n). The convolution operation starts by aligning the filter center with the leftmost sample of x(λ) and ends with the alignment of the filter center with the rightmost sample of x(λ). The signal x(λ) is padded with M samples at both ends; therefore, the first M and the last M samples of the output are computed from partial overlaps, which result in artifact coefficients. For the given signal and filter durations, the same-convolution sum can be written as

y_S(n) = Σ_{λ=−M}^{M} h(λ) x̃(n − λ), n = 0, 1, ..., N − 1,

where x̃(λ) is x(λ) now padded with M samples on both ends. We refer to the resulting scalogram, which has rows with the same duration as the input signal, as the same (S)-scalogram. The shorter-duration same-convolution result can be obtained from full convolution by cropping out the first M and last M samples. In general, row j of the standard scalogram is generated by same-convolving x(n) with a wavelet having an effective support [−(S_j)(M), (S_j)(M)]; therefore, the first and last (S_j)(M) coefficients in the row are artifact coefficients, and the coefficients in between are the accurate coefficients.

Valid Convolution
The valid convolution operation yields coefficients that are accurate and is used to identify the COI in the coefficient matrix. In this form of convolution, the leftmost sample of the centered filter is aligned with the leftmost sample of x(λ) at the start, and the rightmost sample of the centered filter is aligned with the rightmost sample of x(λ) at the end of the convolution. Consequently, the duration of the output is (N − 2M). Valid convolution requires no padding at the ends of x(λ); thus, none of the output samples are computed from partial overlaps, and there are no artifact coefficients. That is, all valid coefficients are computed accurately. Valid convolution can be obtained from same-convolution by cropping the first and last M samples, and from full convolution by cropping the first and last 2M samples. For the given signal and filter durations, the valid convolution sum can be written as

y_V(n) = Σ_{λ=−M}^{M} h(λ) x(n − λ), n = M, M + 1, ..., N − M − 1.

As noted in Section 2.1, the COI is the boundary that separates the artifact coefficients from the accurate coefficients in the S-scalogram, the part of the S-scalogram that is inside the COI is the V-scalogram, and the entire S-scalogram with zeroed-out coefficients outside the COI is referred to as the Z-scalogram.
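The cropping relationships among the three convolution modes, which underlie the S-, Z-, and V-scalogram definitions, can be verified directly with NumPy; the toy signal and filter are arbitrary.

```python
import numpy as np

N, M = 16, 3
x = np.random.default_rng(0).normal(size=N)
h = np.ones(2 * M + 1) / (2 * M + 1)       # symmetric filter of length 2M+1

y_full = np.convolve(x, h, mode='full')    # N + 2M samples
y_same = np.convolve(x, h, mode='same')    # N samples (first/last M partial)
y_valid = np.convolve(x, h, mode='valid')  # N - 2M samples, all accurate
# cropping reproduces the shorter modes, mirroring the COI construction
assert np.allclose(y_full[M:-M], y_same)
assert np.allclose(y_same[M:-M], y_valid)
```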

Appendix A.2. RRS Algorithm
The channel rankings can be computed from the 1-ERPs or the m-ERPs. The RRS algorithm is described in terms of m-ERPs and is applicable to 1-ERPs by setting m = 1. Given C channels, the ranking R_m(c), c = 1, 2, ..., C, is such that the channel with the highest multi-class separation is assigned the highest rank (= 1) and the channel with the smallest separation is assigned the lowest rank (= C). Based on this ranking, channels can be selected by evaluating the performance, starting with the best channel and adding the next best channels until the performance starts to drop. Alternatively, a subset of channels with the highest ranks can be selected to decrease the computational requirements, especially for dense-electrode arrays. Let L^q_mxcb_i and L^q_mycb_i denote the qth m-ERPs of subject b_i, i = 1, 2, ..., B, elicited by stimuli I_x and I_y at channel c, respectively; let L̄_mxcb_i and L̄_mycb_i denote the ensemble means; let N_x and N_y denote the numbers of m-ERPs in the ω_x and ω_y m-ERP ensembles; and let ||L_1 − L_2|| denote the Euclidean distance between L_1 and L_2. The interclass separation between the m-ERPs belonging to a pair of classes, ω_x and ω_y, at channel c is then given by the between-class distance normalized by the within-class spread:

ρ_(x,y)mb_i(c) = ||L̄_mxcb_i − L̄_mycb_i||² / [(1/N_x) Σ_{q=1}^{N_x} ||L^q_mxcb_i − L̄_mxcb_i||² + (1/N_y) Σ_{q=1}^{N_y} ||L^q_mycb_i − L̄_mycb_i||²].

Note that the pair (x, y) takes on the values (1, 2), (1, 3), ..., (U − 1, U). The rank R_(x,y)mb_i(c) of ρ_(x,y)mb_i(c) for each subject b_i is determined for all possible pairs of x and y. Then, the rank-sum (RS) of channel c for subject b_i across all (U)(U − 1)/2 pairs of ERP classes is given by

(RS)_mb_i(c) = Σ_(x,y) R_(x,y)mb_i(c).

The ranking of the channels for subject b_i is obtained from the rank of the rank-sum:

R_mb_i(c) = Rank{(RS)_mb_i(c)}.

The ranking of the channels across all B subjects is determined by first summing the per-subject ranks to obtain the rank-sum

(RS)_m(c) = Σ_{i=1}^{B} R_mb_i(c),

and the final ranking of the channels across the U ERP classes and B subjects is determined from the rank of the rank-sum R_m(c) = Rank{(RS)_m(c)}.
The entire sequence of steps can be summarized as follows:

1. Compute the interclass separations ρ_(x,y)mb_i(c) for every channel c, class pair (x, y), and subject b_i.
2. Rank the separations across channels for each class pair.
3. Sum the ranks across class pairs and rank the rank-sums to obtain the per-subject channel ranking R_mb_i(c).
4. Sum the per-subject rankings across subjects and rank the rank-sums to obtain the final ranking R_m(c).

For the special dichotomous (2-class) case, the ranking of the channels for each subject is determined directly from R_mb_i(c), which is the ranking of ρ_mb_i(c), the interclass separation of the two classes. The ranking across all subjects is obtained from the rank-sum of channel c,

(RS)_m(c) = Σ_{i=1}^{B} R_mb_i(c),

and the final ranking of channel c is given by the rank of the rank-sum, R_m(c) = Rank{(RS)_m(c)}. A final comment regarding the application of the RRS algorithm is that the channel rankings are determined by applying the RRS algorithm to the ERPs of only the training set.
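A compact sketch of the dichotomous RRS steps above. Ties are broken by order of appearance rather than by averaged ranks, which is an implementation choice not specified in the text; the function names are illustrative.

```python
import numpy as np

def rank_channels(values, best='largest'):
    """Assign rank 1 to the best value; ties broken by order of appearance."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(-v, kind='stable') if best == 'largest' \
        else np.argsort(v, kind='stable')
    ranks = np.empty(len(v), dtype=int)
    ranks[order] = np.arange(1, len(v) + 1)
    return ranks

def rrs_two_class(sep):
    """sep[i, c]: interclass separation of channel c for subject b_i.
    Rank per subject (1 = most separated), sum the ranks across subjects,
    then rank the rank-sums (smallest sum = best channel overall)."""
    per_subject = np.vstack([rank_channels(row, 'largest') for row in sep])
    rank_sum = per_subject.sum(axis=0)
    return rank_channels(rank_sum, 'smallest')
```

Applied to the training-set separations of the 64 channels, the rank-1 through rank-8 channels would form the per-subject subsets listed in Table 1.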