Dimension Reduction Using New Bond Graph Algorithm and Deep Learning Pooling on EEG Signals for BCI

One of the main challenges in studying brain signals is the large size of the data, which results from the use of many electrodes and time-consuming sampling. Choosing the right dimensionality reduction method can reduce the data processing time. Evolutionary algorithms are among the methods used to reduce the dimensions of EEG brain signals and have shown better performance than other common methods. In this article, (1) a new Bond Graph algorithm (BGA) is introduced that outperforms the genetic algorithm and particle swarm optimization on eight benchmark functions. Our algorithm converges quickly and does not become stuck in local optima. (2) Reductions of features, electrodes, and the frequency range are evaluated simultaneously for brain signals (left-hand and right-hand motor imagery), with BGA and other algorithms used to reduce features. (3) Feature extraction and feature selection (with the algorithms) for the time domain, frequency domain, wavelet coefficients, and autoregression are studied, as well as electrode reduction and frequency-interval reduction. (4) First, the features (with the algorithms), electrodes, and frequency range are reduced, followed by the construction of new signals based on the proposed formulas; a common spatial pattern is then used to remove noise and extract features, and the result is classified. (5) A separate study implements a deep sampling method as feature selection in several layers with different functions and window sizes; this part also involves feature reduction and frequency-range reduction. All of the above were evaluated on data set IIa from BCI competition IV (the left hand and right hand) using one to three channels, with better results than comparable studies. Our method increased accuracy by 5 to 8% and kappa by 5%.


Introduction
Today, EEG is used as a non-invasive system that records brain activity through electrodes placed on the scalp. Brain signals are used to control systems for both sick and healthy people and play an essential role in a variety of applications [1][2][3][4][5][6][7]. The primary advantage of processing brain signals is discovering information for prediction and classification. Processing brain signals generally involves these steps: filtering, feature extraction, feature selection, and classification. Most researchers have focused on feature selection, seeking new suitable methods and algorithms for improvement [8][9][10][11][12][13][14].
Feature selection can be applied in three possible ways.

•
In the first method, feature selection (one-dimensional or more) is performed after data preprocessing (e.g., filtering) and before feature extraction (if feature extraction is used). If feature extraction is not used, the selected features are sent directly to the classifiers for classification after the dimensions are reduced.

•
In the second method, feature selection is performed after feature extraction on the preprocessed data and is followed by classification.

•
In the third method, after reducing the dimensions of the features (i.e., feature selection), the following steps are performed in order: (1) feature extraction, (2) feature selection, and (3) classification.
In some studies using the first and third methods, dimensionality reduction is applied to reduce features, electrodes, or both.
In all of the methods mentioned above, a larger feature set (possibly spanning several channels) is converted into a smaller, better feature set by removing the less effective features. In most cases, a specific frequency range of the brain signal (e.g., 8 to 30 Hz) is processed for specific purposes (such as left- and right-hand motor imagery). In most studies, all the features along with specific channels or all channels have been considered [15][16][17][18][19][20]. In addition, some studies perform one- or two-dimensional reduction to reduce the features and the electrodes. However, none of the previous studies examined reducing the frequency range while using only one to three electrodes.
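Restricting a signal to a specific frequency range, as described above, can be illustrated with a minimal sketch. The FFT-masking approach below is purely illustrative (a real BCI pipeline would normally use a band-pass filter bank); the sampling rate and band edges are example values, not the paper's configuration.

```python
import numpy as np

def band_limit(signal, fs, f_lo, f_hi):
    """Keep only the [f_lo, f_hi] Hz band of a 1-D signal via FFT masking.

    Illustrative only: hard spectral masking stands in for the band-pass
    filter banks used in practice."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return np.fft.irfft(spectrum * mask, n=len(signal))

# A toy "EEG" trace: a 10 Hz mu-band component (kept) plus 50 Hz mains
# interference (outside 8-30 Hz, so removed).
fs = 250                      # Hz, a typical EEG sampling rate
t = np.arange(fs) / fs        # one second of samples
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
x_band = band_limit(x, fs, 8.0, 30.0)
```

After masking, only spectral content between 8 and 30 Hz survives, which is the sense in which frequency-range reduction discards part of the data before any feature extraction.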
In some articles, certain frequency ranges have been used; in other words, feature reduction is achieved by reducing the frequency range with a single electrode. Given that human brain signals are generally a combination of different domains, examining them as a single domain is challenging, and many previous studies have addressed this problem [20][21][22][23][24][25][26][27][28][29][30][31][32]. In this study, we propose a new optimization algorithm based on the concept of the Bond Graph, a modeling and simulation language for multi-domain systems, and we examine and analyze the reduction of the frequency intervals of brain signals.
The main contributions of this paper are as follows:
1.
Introduction of the Bond Graph algorithm (BGA), which demonstrated good performance compared to the genetic algorithm (GA) and particle swarm optimization (PSO) on the benchmark functions. The assessment was performed by testing on eight benchmarks and on EEG brain signals of right- and left-hand motor imagery (two classes). Important features of the proposed BGA include fast convergence and no trapping in local optima.

2.
The reduction of features, reduction of electrodes, and reduction of the frequency range have been evaluated simultaneously.

3.
Feature extraction and feature selection for the time domain, frequency domain, wavelet coefficients, and autoregression have been investigated together with feature reduction, electrode reduction, and frequency-range reduction.

4.
Initially, the features, electrodes, and frequency are reduced; then, new signals are built from them using special formulas for each row. A common spatial pattern (CSP) is then applied to remove noise and extract features, which is followed by classification.

5.
A separate study with a deep learning sampling method has been implemented as feature selection in several layers with different functions and different window sizes. This part also includes feature and frequency reduction.
In cases 1 to 4, the proposed BGA, GA, PSO, and the Quantum Genetic Algorithm (QGA) are used for evaluation. In case 5, no algorithm was required; only simple computational functions were used.
The main objective of this study is dimension reduction in single- and multi-domain brain signals. For this purpose, five general models (seven model scenarios) have been implemented for testing:
1.
One filter bank and a single channel, drawn from a set of two filter banks (i.e., 2 and 5) and two channels (i.e., the right and left hemispheres of the brain), have been investigated with the four main algorithms, i.e., BGA, GA, PSO, and QGA. The two channels' features are selected from the right and left hemispheres, and the reduction of features is performed separately for each channel.

2.
A combination of filter banks with a single channel, drawn from a set of four filter banks (i.e., 2, 5, 6, and 9) and two channels (i.e., the right and left hemispheres of the brain), has been investigated with the same algorithms. The reduction of features is again performed separately for each channel.

3.
Four general methods, including the time domain, frequency domain, wavelet coefficients, and autoregression, have been used for feature extraction from a combination of two filter banks (i.e., 2 and 5) treated as a new signal, with a single channel (the right or left hemisphere of the brain), two channels (the right and left hemispheres), or three channels (the right and left hemispheres and the center of the brain). Feature selection is then performed by BGA, GA, PSO, and QGA separately.

4.
Feature selection was performed by the four main algorithms on a combination of filter banks (from filter banks 2, 5, and 6) and two channels (the right and left hemispheres) or three channels (the right and left hemispheres and the center of the brain). Then, using special formulas, new signals were formed and used as input for CSP. Finally, an ELM classifier classified the features extracted by CSP.

5.
Feature selection was performed by deep learning sampling with five general functions, applied to all filter banks together or to each filter bank individually, with all channels, in three sampling models. In the first sampling model, each filter bank used the same functions in all layers. In the second sampling model, the same functions were selected for all layers and all filter banks. In the third sampling model, the functions were selected randomly for each layer.
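The layered sampling described in model 5 can be sketched as simple window pooling. The sketch below is a minimal illustration: `max`, `min`, and `mean` stand in for the study's five functions (which are not specified here), and the window sizes are example values.

```python
import random
from statistics import mean

# Candidate pooling functions; stand-ins for the study's five functions.
POOL_FUNCS = {"max": max, "min": min, "mean": mean}

def pool_layer(signal, window, func):
    """One sampling layer: apply `func` to consecutive windows of the signal."""
    return [func(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]

def deep_sample(signal, layers, rand=None):
    """Stack several pooling layers, shrinking the feature vector each time.

    `layers` is a list of (window, func_name) pairs; pass func_name=None with
    a `rand` source to mimic the third sampling model (random function per
    layer)."""
    out = signal
    for window, name in layers:
        if name is None:                      # third model: random function
            name = rand.choice(list(POOL_FUNCS))
        out = pool_layer(out, window, POOL_FUNCS[name])
    return out

x = [1.0, 3.0, 2.0, 8.0, 5.0, 4.0, 7.0, 6.0]
# Two fixed layers (as in the first/second models): 8 -> 4 -> 2 features.
reduced = deep_sample(x, [(2, "max"), (2, "mean")])
```

Each layer both reduces the feature count and, because pooling is applied within windows, acts as a crude frequency reduction, which matches the dual role described for model 5.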
The rest of the article is organized as follows: Section 2 summarizes previous work on dimensionality reduction for brain signals. Section 3 reviews related work, defines the Bond Graph, and describes how the proposed algorithm is implemented and used to reduce dimensions in brain signals, including dimension reduction by sampling. The configuration of the experiments and the data set used in this article are described in Section 4. Finally, Section 5 analyzes and examines the results of the experiments. The last section presents the conclusion.

Previous Work
In [10], Adham et al. implemented individual-specific masks working with the CAR method, standard source methods, and ELM. The method was a two-dimensional reduction of features and electrodes, which achieved accuracy in the range of 15 to 32%. In [33], Jing Luo et al. performed feature extraction by analyzing wavelet components for two channels (each channel separately) in two modes: first with a dynamic frequency feature selection method and second with the full frequency range (without frequency-range selection). They obtained accuracies of 68% and 67% for the two modes, respectively, so the feature selection improved their accuracy. In their work, feature selection was performed after feature extraction, which reduced the channels from 22 to two. Liu et al. [34] used the firefly algorithm to select the features. First, feature extraction is performed by CSP for all channels and for three important specific channels individually (left hemisphere, right hemisphere, and center of the brain). The GA, PSO, and the firefly algorithm obtained accuracies of 59.85%, 60%, and 70.20%, respectively. In [35], Bashar Awwad used a sixth-order autoregression method with a shifting sample window. They reduced the features to 20 with PCA and then classified them with LDA. They obtained accuracies between 46.8% and 59% for one channel and between 48% and 62% for two channels, and reported the average of the best accuracies over different channels as 59.67%. In [36], Adham et al. designed an individual-specific mask shared between individuals, so that part of the data is used to train on some subjects and test on the untrained subjects. Feature and channel reduction were up to 90%, and the average accuracy was between 73.5% and 74.5%. In [37], Nakisa et al.
used frequency-temporal and temporal-frequency feature extraction. They implemented five channels with the ACO, DE, GA, SA, and PSO algorithms along with the PNN classifier to select the features. The best result in this study was 65% for four classes. Peterson [38] used the MNE and Infomax preprocessing methods along with spectral-power feature extraction, feature selection by GA, and an SVM classifier. In this paper, for two classes, the number of channels was reduced from 32 to two, and the per-subject accuracy ranged between 55% and 67%. By changing the CSP method used with an SVM classifier, Mahnaz Arvaneh [39] obtained 70.90% accuracy for three channels, 79.07% for four to 14 channels, and 81.63% for nine to 19 channels, while the total accuracy for standard CSP is 79.23%.
Yang et al. [40] used the time-domain parameter method to extract features and a time-space optimizer to select the optimal channels, along with a Fisher analysis classifier. When three channels were selected from 118 channels with BP and TDPs, average results of 71% and 72% were obtained, respectively. When using all the channels, the accuracy was 76%, while their proposed model increased the accuracy up to 78%. Chen et al. [41] extracted features from nine filter banks (bands) using CSP for each filter bank. After selecting features from those collected from all the filter banks, an NBW classifier is used for classification. In their model, the channels were reduced to three or 13 out of 22; the average accuracy was 75% for three specific channels and 87% for 13 channels. In the article [42] by Izabela Rejer, the FSS and PCA methods were used to select features; three channels and 12 filter banks (bands) were used, although only one person was examined. For the different modes, the accuracy was between 55% and 80% for FSS and between 52% and 87% for PCA. Eslahi et al. [43] implemented a modified feature extraction method with wavelet subbands and feature selection by GA with four classifiers. One person with three channels was used for the experiment, and results between 68% and 84% were obtained for the different classifiers. Wang [44] used the Warp Laser space group method; feature extraction was based on the statistical time domain, spectral power, autoregression, and wavelet coefficients, and the Warp Laser space group with a strategy autoregression model was used to select channels and features. The overall result on the dataset was 83.37%; reducing the channels to 17 and 18 resulted in an accuracy of 84.7%.
Kuman [45] used noise elimination, cross-correlation normalization by selecting effective channels (from the left and right sides of the scalp), and the calculation of data statistics along with ANN and SVM classifiers. Data were collected by a sensor headset device while subjects imagined left and right finger movements. With 14 channels and 420 features, they obtained an average accuracy of 95%; with 10 features and 14 channels, 96.69%; and with three features and 14 channels, 97.34%. In the article by Kasun Amarasinghe [46], where a sensing headset collected the data, feature selection was used in addition to the SVM, ANN, and NB classifiers. With 14 channels and 11 features selected for three classes and one person, the accuracy was 82.97% for NB, 83.07% for ANN, and 83.26% for SVM.
Chin [47] used a channel analysis method to select channels for the filter banks. Three to 14 channels were selected, and feature extraction was done by CSP. Channel reduction was based on cross-validation accuracy, and feature reduction was not performed. The overall average for 13 to 14 channels is 84.51%, and for three channels, it is 75%. Ang et al. [48] used feature selection from the filter bank common spatial pattern (FBCSP) method with the NBPW classifier. With the original filter bank CSP (oFBCSP), kappa was 60.7%; with the S filter bank CSP (sFBCSP), kappa was 61.9%; and with the E filter bank CSP (eFBCSP), kappa was 63.5%.
We summarize the advantages and disadvantages of the previous methods presented above. Advantages: • Most of them use heuristic algorithms, which achieved good performance in their respective works.

•
Heuristic algorithms are suitable for selecting features, with or without feature extraction using conventional methods.

•
All of them use one of two general models: (1) feature selection on preprocessed data before the final classification, or (2) feature selection after feature extraction on preprocessed data, followed by the final classification. It has been demonstrated that the second model is more efficient. However, combining small and large frequency domains to reduce noise and increase the distinctive features between classes has not been studied.

Bond Graph Introduction
A Bond Graph is a graphical representation of the dynamic behavior of physical systems that is independent of the physical domain [51][52][53]. A Bond Graph looks similar to a flow chart but has a different meaning for analyzing systems: it represents the state space as the interaction inside, outside, and between systems. The Bond Graph uses a graphical model to show the details of a system and the relationships among its subsystems and elements; these relationships define how the system's computation is organized for solving problems. Similar to a block diagram, it uses a single-flow graph and represents one-way information. The Bond Graph can integrate multiple domains in the best possible way.
The basis of the Bond Graph is its bonds. Bonds connect one-port, two-port, and multi-port elements. A bond is a line connecting two or more elements, and a port is the connection point between an element and a bond. A bond carries energy and power in real time. Power variables consist of pairs of variables carried by the bond (power is calculated from flow and effort): the flow and effort variables. For example, in electrical systems, the flow and effort variables are electric current and electric voltage, respectively; in mechanical systems, they are velocity and force. Figure 1 shows an example of a Bond Graph architecture for the mechanical domain. The connection model of the sub-models determines the order of computation over the bonds, from which the values of the port variables are determined. The structure of the formula calculation follows from the connection model, which contributes to solving the problem. Bond Graphs can be combined with block-diagram ports, and Bond Graph models can expose power ports, signal ports, and output signals. In the physical domain, the concept of a bond (effort (e.g., electric potential) and flow (e.g., electric current)) supports the modeling process.
For example, in electrical networks, the port variables move through the Bond Graph elements: the electric voltage appears across the element port, and the electric current flows through the port. A port is the interface from one element to another (the connecting point of bonds). Power is the product of effort and flow and is continually changing; it is exchanged through the system's ports. A power bond indicates that energy is exchanged between the elements. A bond is drawn as an edge between elements, with a half-arrow marking the positive direction of energy flow. A source delivers power to the system, and the other elements absorb it.
The Bond Graph has two kinds of junctions. 1-junction: the flow (current) through all connected bonds is the same, and the algebraic sum of the effort (voltage) differences around the junction is zero, analogous to Kirchhoff's voltage law.
0-junction: the effort (voltage) across all connected bonds is the same, and the algebraic sum of the flows (currents) is zero, analogous to Kirchhoff's current law.
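The two junction laws can be checked numerically. The sketch below encodes them directly, using toy electrical values of our own choosing (a 5 V node and a series loop) purely for illustration.

```python
def check_zero_junction(efforts, signed_flows, tol=1e-9):
    """0-junction: all bond efforts are equal, and the signed flows sum to
    zero (the analogue of Kirchhoff's current law)."""
    same_effort = max(efforts) - min(efforts) < tol
    flow_balance = abs(sum(signed_flows)) < tol
    return same_effort and flow_balance

def check_one_junction(flows, signed_efforts, tol=1e-9):
    """1-junction: all bond flows are equal, and the signed efforts sum to
    zero (the analogue of Kirchhoff's voltage law)."""
    same_flow = max(flows) - min(flows) < tol
    effort_balance = abs(sum(signed_efforts)) < tol
    return same_flow and effort_balance

# Toy electrical examples (values are illustrative assumptions):
# a node at 5 V with 2 A flowing in and two 1 A branches flowing out,
node_ok = check_zero_junction([5.0, 5.0, 5.0], [2.0, -1.0, -1.0])
# and a series loop at 2 A: a 12 V source dropped over 7 V and 5 V elements.
loop_ok = check_one_junction([2.0, 2.0, 2.0], [12.0, -7.0, -5.0])
```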
Table 1 lists several domains covered by the Bond Graph methodology and shows the flow and effort variables associated with each. The most basic variables of each domain are introduced as flow and effort variables so that they can be understood across domains. We briefly describe the graph structure of the Bond Graph below.
To produce a bond graph model, we start from an ideal physical model and follow a systematic procedure. The procedure generally consists of identifying the basic domains and elements, generating the junction structure, placing the elements, and possibly simplifying the diagram. The method differs slightly for mechanical domains compared to other domains; these differences are noted in parentheses below. The reason is that elements must be linked through the appropriate global variables: effort variables in non-mechanical domains and velocities (flow variables) in mechanical domains.
Building the Bond Graph includes the following steps. Steps 1 and 2 identify the domains and elements, and steps 3 to 5 produce the junction structure.

1.
Determine which physical domains exist in the system and identify all the basic elements: C-elements (storage elements such as a capacitor or spring), I-elements (storage elements such as an inductor or mass), R-elements (elements that dissipate free energy, such as resistors), SE (effort sources, such as a voltage source), SF (flow sources, such as a current source), TF (transformer, within the same domain (toothed wheels) or between different domains (an electromotor)), and GY (gyrator, such as an electromotor, a pump, or a turbine). Each element is assigned a unique name to distinguish it from the others.

2.
Specify a reference effort in the ideal physical model for each domain (in the mechanical domains, a reference velocity with a positive direction).

3.
Identify all the other effort variables (mechanical domains: velocities) and assign unique names to them.

4.
Graphically represent these effort variables (mechanical: velocities), other than the references, with 0-junctions (mechanical: 1-junctions).

5.
Identify all the effort differences (mechanical domains: velocity differences (= flows)) required for the ports of all the elements listed in step 1, and connect them to the junction structure.
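The steps above can be illustrated with a classic example: in a series RLC circuit, the voltage source (SE), resistor (R), inductor (I), and capacitor (C) all carry the same current, so they attach to a single 1-junction. The dictionary layout and element names below are our own convention, not a standard Bond Graph file format.

```python
# Junction structure of a series RLC circuit built with the steps above:
# all four elements share one flow (current), hence one 1-junction.
bond_graph = {
    "junctions": {"j1": "1"},                        # one 1-junction
    "elements": {
        "SE_source": {"type": "SE", "port": "j1"},   # effort (voltage) source
        "R_resistor": {"type": "R", "port": "j1"},   # dissipates free energy
        "I_inductor": {"type": "I", "port": "j1"},   # flow storage
        "C_capacitor": {"type": "C", "port": "j1"},  # effort storage
    },
}

def ports_on(graph, junction):
    """List the elements bonded to a given junction, in name order."""
    return sorted(name for name, el in graph["elements"].items()
                  if el["port"] == junction)

attached = ports_on(bond_graph, "j1")
```

A parallel RLC circuit would instead place the elements on a 0-junction, since there the elements share one voltage (effort) rather than one current.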

Bond Graph Optimization and Algorithm
The optimization algorithm proposed in this paper is based on the Bond Graph methodology and achieves good convergence and performance. The main structure of this optimizer model and the steps taken to design our algorithm are presented below:
1.
Determining the number and type of the physical domains (models): In our model, four specific domains, each with its own formulas, are introduced.

2.
Combining subdomain details: The subdomains are small linear algebraic equations that are added together to form larger algebraic formulas for calculating the changes in the features. Since the small formulas are linear, their combinations are also linear. The variables of the small algebraic formulas are the attributes of the effective elements.

3.
Identifying the basic elements: In our method, the global best or local best element, the global worst or any local element, and the global average or local average element are introduced as basic elements.

4.
Identifying all the influencers and naming them: Different combination formulas, specific to each domain (model), are connected through ports, and the most influential combinations are determined. Each linear algebraic addition forms a port in our model.

5.
Determining the algebraic sum of the coefficients of each domain: the algebraic sum of the coefficients of each domain is set to zero or one.

6.
Calculating the power based on flow or energy: Finally, the power is calculated from the obtained values multiplied by random numbers. First, the values obtained from the formula(s) are multiplied by a random number. Second, the new element values are calculated by adding the power of each element to its old value.
In our algorithm, the Bond Graph concepts are used partially and with some modifications. The algorithm can be used like other algorithms to optimize problems in any area. However, unlike other algorithms, the same or a different formula can be used for each dimension. Our algorithm can be customized for other optimization problems by modifying the base formulas, the domains, and the effective elements.
Our four domains are given below with their formulas and combinations. All coefficients are calculated as follows,
where α1, α2, α3, and α4 are the coefficients used in the four models. The values of the coefficients are obtained from the distances of the identified elements from the current element. Here, x, x_gb, x_glb, x_gw, and x_nlb are the variables for the current element, the global best, the global-local best, the global worst element, and the normal local best, respectively. These variables are calculated similarly to the PSO algorithm, except that for the normal local best (nlb), the element nearest to the average is selected.
In this section, four domains are considered, and one formula or a set of formulas is introduced for updating each domain. For each domain, the combination of models is different. If a domain has more than one calculation formula, the calculation of the new elements is divided among the formulas. The update rules, stated in Equations (4), (5), (6), and (7), share the form

x_New_Elements = x_Old_Elements + ΔE_Model_x.
In each domain, the number of formulas depends on the design and the model implemented. In this case, one formula is used for the first domain, three for the second domain, and four each for the third and fourth domains to change the values of the elements. When more than one formula is used, the elements are distributed among the formulas randomly; it is possible for all of the elements to be assigned to one formula in an iteration, but in practice, the elements are spread across the formulas. Different modes of this model can be used: (1) using the same formulas (same domain) for all dimensions at runtime; (2) using different formulas (different domains) across dimensions at runtime; and (3) using all different formulas (different domains) across dimensions at runtime. Due to the high complexity of this algorithm, only the first mode is evaluated in this article.
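The per-domain update above can be sketched in a few lines. Note that the paper's Equations (4)-(7) are not reproduced in the text, so the three difference formulas below are illustrative stand-ins; only the random distribution of elements among formulas, the power = value × random step, and x_new = x_old + ΔE come from the description.

```python
import random

def bga_domain_step(elements, x_gb, x_gw, formulas, rng):
    """One BGA-style domain update: each element is randomly assigned to one
    of the domain's formulas, whose value times a random number gives the
    power (delta) added to the old value: x_new = x_old + dE."""
    updated = []
    for x in elements:
        formula = rng.choice(formulas)                 # random distribution
        delta = formula(x, x_gb, x_gw) * rng.random()  # power = value * rand
        updated.append(x + delta)
    return updated

# Three illustrative formulas for a multi-formula domain (stand-ins for the
# paper's equations, not the actual ones).
formulas = [
    lambda x, gb, gw: gb - x,                # pull toward the global best
    lambda x, gb, gw: (gb - x) - (gw - x),   # best-versus-worst difference
    lambda x, gb, gw: 0.5 * (gb - x),        # damped pull toward the best
]

rng = random.Random(0)
population = [4.0, -2.0, 7.5]
new_population = bga_domain_step(population, x_gb=1.0, x_gw=9.0,
                                 formulas=formulas, rng=rng)
```

Iterating this step, and re-deriving the coefficients each iteration from element distances, is what distinguishes the update from PSO's fixed random coefficients.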
The pseudo-code of the Bond Graph algorithm is defined as follows (Figure 2): (1) Define the algorithm parameters (algorithm variables, elements, element properties, etc.).
(2) Initialize the defined parameters (initialize the properties of the elements and some fixed parameters, such as selecting the domain as a constant). (9) Check whether the global best element is better than the predicted value. (10) If it is better, go to step 11; if not, go to step 5. (11) Execution is completed, because the global best element is the best answer to the problem.
We explain the differences between BGA and PSO as follows: • Structure of the main update formula: In PSO, only the differences between the current particle and the local best and global best particles are calculated in each iteration, and the coefficients are two fixed random numbers. In BGA, by contrast, model three, for example, uses three formulas for updating and improving the feature values. The first formula calculates the difference between the current element and the global best and global-local best, with the coefficients α1 and α2 recalculated in each iteration. The second formula calculates the difference between the current element and the global best and global worst, with the coefficients α1 and α4 recalculated in each iteration. The third formula calculates the difference between the current element and the normal local best and global worst, with the coefficients α3 and α4 recalculated in each iteration.

Common Spatial Pattern (CSP)
The common spatial pattern (CSP) algorithm [41,47,48] is known as an efficient and effective analyzer of EEG signal classes. It is a feature extraction method that projects multi-channel signals into a subspace that maximizes the difference between classes and minimizes their similarities. This is accomplished by maximizing the variance of one class while minimizing the variance of the other.
The CSP calculation is done as follows. The normalized spatial covariance of the input data E is

C = (E E')/trace(E E'),

where E is an N × T matrix of raw data from a single imagery trial, N is the number of electrodes or channels, T is the number of samples per channel, trace(·) is the sum of the elements on the main diagonal of a matrix, and the apostrophe represents the transpose operator.
The spatial covariance matrices of the two classes, C_1 and C_2, are calculated by averaging over several imagery trials of the EEG data, and the composite spatial covariance is

C_c = C_1 + C_2.

Since C_c is real and symmetric, it can be factored as

C_c = U_c λ_c U_c',

where U_c is the matrix of eigenvectors and λ_c is the diagonal matrix of eigenvalues. The whitening transformation

P = λ_c^(-1/2) U_c'

equalizes the variances in the space spanned by U_c, so that all eigenvalues of P C_c P' are equal to 1. The whitened class covariances are

S_L = P C_L P', S_R = P C_R P'. (29)

S_L and S_R share common eigenvectors: if S_L = B λ_L B' and S_R = B λ_R B', then λ_L + λ_R = I, so the eigenvector with the largest eigenvalue for S_L has the smallest eigenvalue for S_R, and vice versa. If the eigenvalues are arranged in descending order, the projection matrix is defined as W = B'P, and the projection of each trial E is Z = W E. For each trial, the first and last N rows of Z are kept, giving components Z_p (p = 1, 2, . . . , 2N), and the feature vector uses the normalized log-variance:

f_p = log( var(Z_p) / Σ_{i=1}^{2N} var(Z_i) ). (34)
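The CSP computation can be sketched with NumPy. This is a minimal sketch of the standard formulation (normalized covariances, whitening, joint eigendecomposition, log-variance features); the synthetic two-channel data and seed are our own illustrative choices, not the BCI competition data.

```python
import numpy as np

def csp_filters(trials_l, trials_r):
    """Compute the CSP projection W from two lists of (channels x samples)
    trials, following the standard formulation."""
    def avg_cov(trials):
        covs = [(E @ E.T) / np.trace(E @ E.T) for E in trials]
        return np.mean(covs, axis=0)

    C_l, C_r = avg_cov(trials_l), avg_cov(trials_r)
    lam, U = np.linalg.eigh(C_l + C_r)        # C_c = U diag(lam) U'
    P = np.diag(lam ** -0.5) @ U.T            # whitening transform
    S_l = P @ C_l @ P.T                       # whitened class-L covariance
    mu, B = np.linalg.eigh(S_l)               # shared eigenvectors
    order = np.argsort(mu)[::-1]              # descending eigenvalues
    return (B[:, order]).T @ P                # W

def csp_features(W, E, n_pairs=1):
    """Normalized log-variance features from the first/last rows of Z = W E."""
    Z = W @ E
    keep = np.r_[Z[:n_pairs], Z[-n_pairs:]]
    var = keep.var(axis=1)
    return np.log(var / var.sum())

# Tiny synthetic example: class L has high variance on channel 0,
# class R on channel 1.
rng = np.random.default_rng(0)
trials_l = [np.diag([3.0, 0.5]) @ rng.standard_normal((2, 200)) for _ in range(20)]
trials_r = [np.diag([0.5, 3.0]) @ rng.standard_normal((2, 200)) for _ in range(20)]
W = csp_filters(trials_l, trials_r)
f_l = csp_features(W, trials_l[0])
f_r = csp_features(W, trials_r[0])
```

For a class-L trial, the first feature (the direction maximizing class-L variance) dominates, and for a class-R trial the last one does, which is exactly the class separation the method is designed to produce.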

Spatial and Temporal Frequency Domains
This section only briefly covers the formulas for the time domain, frequency domain, autoregression, and wavelet coefficients, because most articles explain these methods in detail. In our paper, the formulas are taken from [28,44]. We summarize them below.
The feature extraction formulas 36 to 53 belong to the time domain and 54 to 57 to the frequency domain; the remaining formulas belong to the other methods mentioned above. For the wavelet coefficients, the MATLAB function is used. The algorithms then select features from among the different formulas: the features of some formulas are selected, and the rest are discarded.
The time-domain features include the modified mean absolute value type 1, modified mean absolute value type 2, simple square integral (SSI), the absolute values of the 3rd, 4th, and 5th temporal moments, root mean square, and the difference absolute standard deviation value; r_xx(t) and r_xx(n) are autocorrelation functions.
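A few of the listed time-domain features can be sketched directly. The definitions below follow the formulations commonly used in the EMG/EEG feature-extraction literature; they are illustrative and not copied from the paper's numbered equations.

```python
import math

def mav(x):
    """Mean absolute value: average of |x_i|."""
    return sum(abs(v) for v in x) / len(x)

def ssi(x):
    """Simple square integral: the total signal energy, sum of x_i^2."""
    return sum(v * v for v in x)

def rms(x):
    """Root mean square: sqrt of the mean of x_i^2."""
    return math.sqrt(ssi(x) / len(x))

def dasdv(x):
    """Difference absolute standard deviation value: RMS of successive
    differences, per the common definition."""
    diffs = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

x = [1.0, -2.0, 2.0, -1.0]
features = (mav(x), ssi(x), rms(x), dasdv(x))
```

In the pipeline described above, such scalar features are computed per channel and per filter bank, and the selection algorithms then decide which of them to keep.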

Reduce and Extract Features of Brain Signals with Filter Banks and Bands (FBs)
In some studies, feature and electrode reduction methods have been used, combined or separately, to reduce the data volume. GA and PSO or other algorithms have been applied per person to reduce the dimensions before feature extraction, to select features after feature extraction, or inside the classifiers. Our proposed method is inspired by these approaches to dimension reduction, but it differs somewhat from the methods presented in previous articles while retaining partial similarities to the items mentioned above. First, in all our models, bandwidth reduction (filter banks) is used before feature extraction. Second, in some cases, feature and electrode reduction are also used. Third, feature selection is done after feature extraction. For this purpose, we implemented different models that cover all of these items. The five general models are as follows: (1) Features are selected from one filter bank with a specific channel; two filter banks and two channels are used for this purpose (separately for each individual, by selecting two filters and two specific channels). (2) Features are selected from a hybrid filter bank with a specific channel; a new signal built from two filter banks along the channel is used for this purpose
(Separately for a person with a combined signal and two specific channels). (3) Reduction of features for two channels with two filter banks (in the range of 8 Hz) is intended for frequency reduction. We have a 63% reduction in the frequency range; i.e., from the whole frequency domain, which is 8 to 30 Hz, we have used a combination of smaller frequency ranges. For example, 8 to 12 Hz and 20 to 24 Hz represent a combination of filter banks 2 and 5. Then, we create a signal matrix with dimensions of 10 × 100 for each imaging period (approximately 1.33% of the value of imaging period characteristics for a channel) that is used as input to CSP to filter and extract features. This is followed by a classification of features. (4) Features reduction for three channels and three bank or band filters (12 Hz interval) has been used to reduce the frequency. From the 22 Hz range (8-30 Hz), the 12 Hz range is selected (45% reduction in the frequency range). Then, a signal matrix of 18 × 100 is created for each imaging period (generally generated for all imaging periods), which is 2.4% of one channel. The total features are generally slightly more than two channels. Then, the extracted properties are used as two-dimensional matrices as input to the CSP to filter and extract the properties. After extracting the features, classification is done on them. (5) Feature extraction from 1 to 3 channels is done separately (for each channel, feature extraction and feature selection are made separately). Time domain, frequency domain, wavelet coefficients, and autoregression coefficients were used to extract the features. From each feature extraction method, a certain amount is considered for feature selection. In other words, four parts of features (features related to each proposed method) are used for each channel separately. Feature extraction and feature selection have been implemented for two channels, 8 parts, and for 3 channels, 12 parts. 
The amount of feature selection is determined separately from each method and channel, and all features from all channels and methods are used for classification.
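The filter-bank front end shared by these models can be sketched as follows. The 250 Hz sampling rate is that of BCI competition IV dataset IIa, and the band edges follow the filter-bank numbering used in the text (FB2 = 8-12 Hz, FB5 = 20-24 Hz, FB6 = 24-28 Hz); the ideal FFT-mask filter is an assumption, since the paper does not specify the filter type, and the summation rule in `hybrid_signal` is likewise only one plausible combination rule:

```python
import numpy as np

FS = 250  # sampling rate of BCI competition IV dataset IIa
BANDS = {2: (8, 12), 5: (20, 24), 6: (24, 28)}  # Hz, per the filter-bank numbering in the text

def bandpass(signal, lo, hi, fs=FS):
    """Ideal (FFT-mask) band-pass filter: zero out all bins outside [lo, hi] Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def hybrid_signal(signal, fb_ids=(2, 5)):
    """Model scenario 2: combine two filter banks into one new signal (summation assumed)."""
    return sum(bandpass(signal, *BANDS[f]) for f in fb_ids)
```

A 10 Hz component survives the FB2 filter while a 22 Hz component is removed, which is the band separation the models above rely on.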
In all the stated cases, 100 features per channel are selected by the algorithms; in other words, the dimension is reduced from 875 to 100 features. To evaluate the five models above, we used the proposed BGA, PSO, GA, and QGA with an ELM classifier.
When more than one channel is used, 100 features are selected in the same way on every channel. The coefficients determine the locations of the selected features. The coefficient values, carried by the elements in our algorithm, by the genes of the chromosomes in GA and QGA, and by the particle attributes in PSO, vary between zero and one during the execution of the algorithm. However, to improve accuracy in the fifth model, the coefficients related to the time domain and autoregression are restricted to binary values.
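A minimal sketch of this coefficient-based selection, assuming the natural top-k reading (the 100 features with the largest evolved coefficients are kept; the paper does not spell out the decoding rule):

```python
import numpy as np

N_FEATURES, N_SELECT = 875, 100  # per channel, as stated in the text

def select_features(feature_matrix, coeffs, k=N_SELECT):
    """Keep the k features whose coefficients are largest.
    coeffs: one value in [0, 1] per feature, evolved by BGA/GA/PSO/QGA.
    (Top-k interpretation of the coefficients is an assumption.)"""
    idx = np.sort(np.argsort(coeffs)[-k:])  # indices of the k largest coefficients
    return feature_matrix[:, idx], idx

rng = np.random.default_rng(0)
X = rng.normal(size=(50, N_FEATURES))  # 50 trials x 875 features (dummy data)
c = rng.random(N_FEATURES)             # one candidate solution of an algorithm
X_sel, kept = select_features(X, c)    # X_sel: 50 trials x 100 selected features
```

The optimizer's job is then to evolve `c` so that the selected 100-column subset maximizes classification accuracy.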
For models one and two, filter banks 2 and 5 and channels C3 and C4 have been used, and the evaluation covers four algorithms: the proposed BGA, PSO, GA, and QGA. For the remaining models, the ELM classifier and the proposed algorithm have been evaluated over the filter banks, subjects, etc. Figure 3 shows an overview of the model, which is described in more detail in the experimental section. In this figure, BGA is used as the main algorithm; the other three algorithms (GA, PSO, and QGA) are used in the same way.

Reduce Features Using Pooling for Channels
The key idea is to perform pooling (sampling) stochastically at each layer of a deep model. Common forms of pooling in deep learning are the mean and maximum functions, the latter selecting the strongest activation in each pooling region. In stochastic pooling, the activation is instead drawn from a multinomial distribution formed by the activations in the pooling region. An alternative view is that stochastic pooling behaves like standard max pooling applied to many copies of the input image, each with small local variations; this resembles explicit elastic deformation of the inputs, which yields excellent performance on some datasets. Moreover, using stochastic pooling in a multilayer model produces a large number of such variations at the higher layers. In stochastic pooling, we select the pooled response by sampling from a multinomial distribution over the activities of each pooling region, whereas in max pooling only the strongest activation in each region is kept and the contribution of the remaining activations is ignored. Stochastic pooling is therefore useful when activations other than the maximum also carry information. An example of this pooling model is shown in Figure 4 [54][55][56]. In some papers, convolutional layers accept arbitrary input sizes but produce variable output sizes, while the classifier and fully connected layers require fixed-length vectors. Spatial pyramid pooling addresses this by pooling in local spatial bins whose sizes are proportional to the image, thereby preserving spatial information.
In this paper, we introduce stochastic pooling to select features randomly and statistically, applying pooling in all layers; each layer reduces the number of features, and after the last layer, classification is performed. Most previous methods use GA or particle swarm optimization to reduce the dimension; in our study, the feature-reduction model is instead based on deep-learning pooling. This pooling can be applied to samples from any channel, i.e., hybrid channels or individual ones. In this article, pooling is performed in two ways: (1) over all channels and features for each filter bank individually, and (2) over all channels and features for all filter banks together.
In addition to the functions used in older models, i.e., maximum, minimum, and average, two new functions are introduced in this article. The first is an intelligent selection function that chooses the maximum, the minimum, or the average based on a specific rule: we first compute the mean of all features in the window, then the absolute difference between each feature and that mean, and select the feature with the largest difference. The second function computes the average of the window without the participation of the maximum value.
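The two new functions can be sketched as follows. The reading of the "intelligent selection" rule as picking the value farthest from the window mean (which always reduces to either the maximum or the minimum) is our interpretation of the description above:

```python
import numpy as np

def smart_select(window):
    """Intelligent selection: return the value whose absolute distance from the
    window mean is largest (this is always either the max or the min)."""
    w = np.asarray(window, dtype=float)
    return w[np.argmax(np.abs(w - w.mean()))]

def mean_without_max(window):
    """Average of the window computed without the maximum value."""
    w = np.sort(np.asarray(window, dtype=float))
    return w[:-1].mean()  # drop the largest element, average the rest
```

For `[1, 2, 10]` the mean is about 4.33, so `smart_select` returns 10; for `[-9, 1, 2]` the mean is -2, so it returns -9, i.e., the minimum.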
In the following, we briefly present the structure of the pooling-layer model with classification: (1) No preprocessing is applied to the filter-bank input; after recording the subject's data, the specified frequency bands are filtered. (2) A window size (from 2 to 6) is chosen for each pooling layer (fixed per layer); three layers are active for window sizes 2 to 4 and five layers for window sizes 5 to 6. (3) A pooling function is selected for each layer (maximum, minimum, average, smart maximum/minimum selection, or average without the maximum). All pooling functions are supported in all layers, and two schemes are used: either all layers use the same function, or the layers use different functions. (4) The output of each layer is the input of the next layer: depending on the combination of channels and features, the window slides over a vector and the output is a single map; the dimensional reduction is applied to the features separately. (5) Steps 2 to 4 are repeated for all layers. (6) The outputs of the last layers are concatenated into one vector (over all channels); if the output is two-dimensional (more than one channel), it is flattened to one dimension before being sent to the classifiers. (7) Training and testing follow a repeated 10-fold scheme on the whole data set: the data are divided into ten parts, one part is used for testing and the rest for training, and the whole procedure is repeated ten times. (8) Classification is performed over two classes; the results differ per subject, and the results of the first stochastic classification are discussed later in this paper.
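Steps (2) to (5) amount to repeated windowed pooling of a feature vector. A minimal sketch follows, assuming non-overlapping windows (stride equal to the window size) and a whole-remainder fallback when fewer values than one window remain; the paper does not state its stride or edge handling:

```python
import numpy as np

POOLS = {"max": np.max, "min": np.min, "mean": np.mean}

def pool_layer(features, window, fn):
    """One pooling layer: apply fn to consecutive non-overlapping windows."""
    n = (len(features) // window) * window
    if n == 0:                              # shorter than one window: pool the rest
        return np.asarray([fn(features)])
    return np.asarray([fn(features[i:i + window]) for i in range(0, n, window)])

def deep_pool(features, window, fn_name="mean", n_layers=3):
    """Fixed window per layer, same function in every layer,
    each layer's output feeding the next (steps 2-5 above)."""
    x = np.asarray(features, dtype=float)
    for _ in range(n_layers):
        x = pool_layer(x, window, POOLS[fn_name])
    return x
```

For example, 27 input features with window size 3 and three mean-pooling layers collapse to a single value, the overall mean.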
An overview of our proposed model for reducing the channel dimension is presented in Figures 5 and 6.

Case Study and Numerical Results on Proposed Algorithm
The parameters of the benchmark functions are shown in Table 2. These are 8 general functions for testing optimization algorithms. Our algorithm with all of its models (domains), along with GA and PSO [57,58], is run in MATLAB 2016 on a PC, and a comparison is performed. All details of the test scenarios for the algorithms are briefly described in Table 3 [59].
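For reference, two functions of the kind used in such benchmark suites are sketched below. Sphere is named later in the text as the first benchmark; Rastrigin is included only as a typical multimodal example with many local optima (the exact Table 2 functions may differ):

```python
import numpy as np

def sphere(x):
    """Sphere: unimodal, global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def rastrigin(x):
    """Rastrigin: highly multimodal; used to test whether an optimizer
    escapes local traps. Global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return float(10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))
```

An optimizer that reaches values near zero on Rastrigin in 35 dimensions has, in practice, avoided its grid of local minima.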

Bond Graph Algorithm (BGA)
Compared to the other algorithms, the BGA converges rapidly on the benchmark functions and approaches the global optimum; in other words, it avoids falling into local traps (the main targets being domains 3 and 4). Its performance is better than that of PSO and GA. Figure 7 shows the convergence behavior of the BGA compared to the other two algorithms; the convergence on two benchmark functions over time is illustrated for better understanding. Domains 3 and 4 produced excellent results near the global optimum, whereas the first two domains in our definition had the worst convergence and need improvement (Figure 7, bottom). Table 4 compares the 35-dimensional results of GA, PSO, and all models of the BGA. From this table, we conclude that the design of the formulas and variables used is very important and influential for convergence.
The BGA optimization algorithm, similar to PSO and GA, is linear with complexity O(n); only one loop is used for the computation. BGA is six times slower than PSO and twice as fast as GA. For example, the average execution time over 400 iterations and 30 runs on Sphere (the first benchmark) was 0.83 s for GA, 0.42 s for BGA, and 0.07 s for PSO.

Experiments and Scenarios
In this study, data set IIa from BCI competition IV [60] is used for our experiments. This database contains the following details: (1) Nine subjects participated in the recordings. (2) The brain signals were recorded from 22 EEG channels (processing is performed on these channels); further details are given in Table 5 [60]. In our experiments, the new optimizer paradigm described in Section 3 is used to produce the models. As in most heuristic-based algorithms, solutions are found by reducing the dimension for the different modes and evaluating each case individually. In this approach, a set of coefficients in the elements represents the locations of the selected features in the channels.
In the following, we present the details of the models (seven model scenarios) used in our experiments for dimension reduction before feature extraction (three-dimensional reduction) and the models used for dimension reduction after feature extraction:
1. The first model scenario: a filter bank with one channel per participant is examined by the four algorithms with ELM [61,62] classification. Two filter banks, i.e., 2 and 5, with two channels, i.e., 8 and 12, representing C3 and C4, respectively, are used.
2. The second model scenario: a new hybrid filter (a combination of two, three, or four filter banks) with one channel is examined per participant by the four algorithms with ELM classification. A hybrid filter bank with two channels, i.e., 8 and 12, representing C3 and C4, respectively, is used.
3. The third model scenario: we first extract the features from channel 8 (C3) or 12 (C4) according to Table 6. Then, from the extracted features, feature selection is performed with the Bond Graph, genetic, particle swarm, and quantum genetic algorithms (Table 6), followed by classification.
4. The fourth model scenario: first, feature extraction from the two channels 8 (C3) and 12 (C4) is performed according to Table 6. Then, from the extracted features, feature selection (i.e., dimension reduction) is done by the specified algorithms. The features for each method and each channel are selected separately (Table 6), and their union is sent to the classifier for classification.
5. The fifth model scenario: the same as the previous scenario except that three channels are used. First, feature extraction is done from channels 8 (C3), 10 (CZ), and 12 (C4) according to Table 6. Then, from the extracted features, feature selection (i.e., dimension reduction) is done by the specified algorithms. With four methods and three channels, there are 12 groups, and the best features of each group are selected for classification according to Table 6. We used the ELM classifier for classification.
6. The sixth model scenario: first, the features of channels 8 and 12 are reduced to 100 features each. These features are selected from the 8-12 Hz, 20-24 Hz, and 24-28 Hz bands within the general 8-30 Hz range. Then, for both channels, using specific formulas, a 10 × 100 matrix is prepared for each imagery period as input to CSP. The extracted CSP components correspond to m = 5 (10 features). After feature extraction for all imagery periods, classification is performed. In the formulas for the rows of the CSP input matrix, R_i denotes row i: rows R_1 to R_4 take their values from a specific electrode and frequency band (one filter bank), while each remaining row is computed as the sum of two selected electrode and filter-bank combinations.
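The row construction of scenario 6 can be sketched as follows. The shape (four single rows plus six summed rows, giving 10 × 100) follows the text, but the specific electrode and filter-bank pairings below are illustrative, not the paper's actual assignments:

```python
import numpy as np

def csp_input_matrix(feat):
    """Build the 10 x 100 CSP input for model scenario 6.
    feat[(channel, fb)] is a length-100 selected-feature vector.
    Rows 1-4 use a single (electrode, filter bank); rows 5-10 sum two such
    vectors. The pairings chosen here are an assumption for illustration."""
    singles = [(8, 2), (8, 5), (12, 2), (12, 5)]
    pairs = [((8, 2), (12, 2)), ((8, 5), (12, 5)), ((8, 2), (8, 5)),
             ((12, 2), (12, 5)), ((8, 2), (12, 5)), ((8, 5), (12, 2))]
    rows = [feat[s] for s in singles]
    rows += [feat[a] + feat[b] for a, b in pairs]
    return np.vstack(rows)  # shape (10, 100), one matrix per imagery period
```

Scenario 7 extends the same idea with channel 10 (CZ) and eight additional summed rows, giving an 18 × 100 matrix.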

7. The seventh model scenario: similar to scenario 6, but here the features are selected from three channels, i.e., 8, 10, and 12, each reduced to 100 features. These features are selected from the 8-12 Hz, 20-24 Hz, and 24-28 Hz bands within the general 8-30 Hz range. Then, for all three channels, using specific formulas, an 18 × 100 matrix is prepared for each imagery period as input to CSP. The extracted CSP components correspond to m = 5 (10 features). After feature extraction for all imagery periods, classification is performed. The row formulas are the same as the ten formulas of scenario 6, to which eight new formulas have been added.
The following are the sampling models used to reduce the dimension based on deep-learning pooling:
1. The first sampling model: dimensional reduction based on deep-learning pooling is performed for each filter bank and participant separately, with different window sizes and different functions; the same function is used in all layers.
2. The second sampling model: dimensional reduction based on deep-learning pooling is performed for all filter banks together and for each participant separately, with different window sizes and different functions; the same function is used in all layers.
3. The third sampling model: dimensional reduction based on deep-learning pooling is performed for all filter banks together and for each participant separately, with different window sizes and different functions; in each layer, a function is randomly selected with a non-uniform distribution.
Two diagnostic measures, i.e., accuracy and kappa, are considered for the analysis of each mental task. Table 7 shows the results of a reduction to 100 features performed by the four algorithms on filter bank 2 (frequency range 8-12 Hz). The average accuracies for channel 8 were 61.1%, 61.0%, 61.3%, and 60.0%, and for channel 12 they were 61.3%, 61.7%, 62.3%, and 61.4%, for PSO, GA, BGA (with the third model), and QGA [63,64], respectively. The overall best accuracy for channels 8 and 12 is highlighted for each algorithm. The average accuracy of our proposed algorithm is better than the others: on channel 8, our algorithm was slightly more accurate than PSO and GA (by 0.1%) and outperformed QGA by about 1%; on channel 12, it outperformed the other algorithms by about 1-2%. However, for some test subjects, the highest accuracy was not obtained by our proposed algorithm. For example, on channel 8, the highest accuracies were obtained by GA for the first, seventh, and ninth subjects, i.e., 62.2%, 61.4%, and 60.8%, respectively, while our algorithm had better accuracy for the second, third, fifth, and eighth subjects, i.e., 61.7%, 61.9%, 60.8%, and 60.3%, respectively. On channel 12, our proposed algorithm had the highest accuracy for the first, second, third, seventh, eighth, and ninth subjects, i.e., 60.5%, 60.3%, 63.0%, 66.6%, 60.4%, and 62.5%, respectively. Table 8 shows the results of a reduction to 100 features performed by the four algorithms on filter bank 5 (frequency range 20-24 Hz). The average accuracies for channel 8 were 60.8%, 61.2%, 61.4%, and 60.1%, and for channel 12 they were 61.5%, 61.9%, 63.2%, and 61.1%, for PSO, GA, BGA (with the third model), and QGA, respectively.
In general, for each channel and subject, the highest accuracies obtained by our algorithm are about 0.2 to 1% higher than those of the other algorithms, and our proposed algorithm performed better in most cases. However, for some test subjects, the highest accuracy was not obtained by our proposed algorithm. For example, on channel 8, the highest accuracies were obtained by GA for the first, third, and sixth subjects, i.e., 61.6%, 62.0%, and 63.2%, respectively, and for the ninth subject, the highest accuracy, i.e., 61.8%, was obtained by QGA. On channel 12, the highest accuracy, i.e., 65.2%, was obtained for the fourth subject by the PSO algorithm. The rest of the highest accuracies were obtained by our proposed algorithm on both channels for the remaining subjects.

Results of Second Model Scenario
In the following, we examine the results for the hybrid signals, i.e., combinations of filter banks (a) 2 and 5 in Table 9, (b) 2, 5, and 6 in Table 10, and (c) 2, 5, 6, and 9 in Table 11. Such signals do not exist in nature and no system can record them directly, so we generate them offline. As in the previous two tables, the highest obtained accuracies are highlighted. For these cases, our proposed algorithm mostly achieved the highest accuracies on channel 12 and demonstrated higher average accuracies in all cases. In Table 11, as in Table 9, the majority of the best accuracies on channel 12 were obtained by our algorithm. According to the results in Tables 7-11, the algorithms generally perform better on channel 12 than on channel 8 (1-3% higher accuracy), and reducing the dimensions of the brain signals is effective in avoiding local minima. Table 12 presents the results of feature extraction and selection, covering the time domain, frequency domain, autoregression, and wavelet coefficients, on a single channel (C3 = channel 8 or C4 = channel 12) by combining two filter banks, i.e., 2 and 5 (making a new signal with eight new frequency ranges). The best average accuracy was obtained by GA on both channels, i.e., 61.90% and 60.97%, respectively. On channel 8, the average accuracies were 59.84%, 59.78%, and 59.75% for PSO, BGA, and QGA, respectively; on channel 12, they were 61.24%, 61.14%, and 60.89% for PSO, QGA, and BGA, respectively. On channel 12, all the other algorithms performed better than our proposed algorithm; in general, our proposed algorithm did not perform well in this model.
Tables 13 and 14 present the results related to the extraction and selection of features, including time domain, frequency domain, autoregression, and wavelet coefficients, on two channels (C3 = 8 and C4 = 12) and three channels (C3 = 8 and CZ = 10 and C4 = 12), respectively. In both cases, two filter banks, i.e., two and five, were combined, making a new signal with eight new frequency ranges.
The GA obtained the best average accuracies in both models: 66.16% and 65.45% for two channels, and 66.59% and 66.57% for three channels. Table 15 presents the accuracy of the two models with CSP (the sixth and seventh models defined earlier). Here, a reduction to 100 features was performed, followed by the creation of new signals using filter banks 2, 5, and 6; as a result, matrices M1 and M2 are created as input to CSP based on two and three primary channels, respectively. Our proposed algorithm shows the best performance, i.e., 73.12% and 79.53% average accuracy for M1 and M2, respectively, with kappa of 46.24% and 59.05%. The average accuracy for M2 was also relatively good for the other three algorithms, i.e., 75.07%, 73.99%, and 71.18% for PSO, GA, and QGA, respectively, while in M1 the accuracy drops by 4 to 5% when two channels and three filter banks are used.
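The reported kappa values track accuracy exactly, which is consistent with Cohen's kappa on a balanced two-class task: with a chance level of 0.5, kappa reduces to 2·accuracy − 1, so every accuracy gain appears doubled in kappa. A one-line check:

```python
def kappa_two_class(accuracy):
    """Cohen's kappa for a balanced two-class task:
    kappa = (p_o - p_e) / (1 - p_e) with chance level p_e = 0.5,
    which simplifies to 2 * accuracy - 1."""
    return 2.0 * accuracy - 1.0

# e.g. kappa_two_class(0.7312) == 0.4624, matching the M1 figures above
```

This also explains why a 6.41% accuracy improvement from M1 to M2 corresponds to a 12.82% kappa improvement.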
Our proposed algorithm's average accuracy improves by 6.41% from M1 to M2, and its average kappa improves by 12.82%. For PSO, the improvement is 5.53% in average accuracy and 11.06% in kappa; for GA, 4.7% and 9.43%; for QGA, 5.22% and 10.45%. This shows that adding channels and filter banks has a significant impact on improving the accuracy.
We investigate three well-known functions, i.e., the maximum, minimum, and average functions, for different participants.
For example, for subject one, the accuracies obtained with the maximum, minimum, and average functions were 58.70%, 58.00%, and 71.90%, respectively, while for subject two, these values differed by about 10%.
The accuracy of the first function, i.e., the maximum function, is close to 50% for most subjects. The accuracy of the second function, i.e., the minimum function, is in some cases 4 or 5% better than the first. The average function exceeds the maximum function by 13% to 31% for some subjects, which indicates the effectiveness of the average function.
The average accuracies for the average function over all subjects on filter bank 5 are 61.45%, 54.30%, 62.30%, 73.17%, 62.53%, 59.27%, 62.09%, and 63.07%, respectively. It can therefore be concluded that filter bank 5 (FB5) carries the most information related to the imagery of the left and right hand; according to our experiment, within the 8-30 Hz range, the band represented by filter bank 5 contains the most important information. Figures 10-13 show the results of two filter banks, i.e., 5 and 6, with all the functions and different sampling window sizes for the participants. In practice, we use five sampling window sizes in the range 3-7. Figure 10 presents the average accuracy over all subjects with varying window sizes for all functions on filter bank 5. (Reduction based on the pooling model is also shown for some subjects (2, 3, 5, 6, 8, and 9) and filter banks with different window sizes for FB 6, using RF over 10 iterations.) The highest accuracies for the first to fifth functions on filter bank 5 were obtained with window sizes 3, 6, 4, 4, and 4, i.e., 56.79%, 57.43%, 64.74%, 63.72%, and 73.17%, respectively (Figure 12). Similarly, the results for filter bank 6 are presented in Figures 12 and 13. Table 16 presents the results of two different approaches: first, all filter banks with the same function in all layers, and second, all filter banks with a randomly selected function per layer; both are applied with different sampling window sizes for the participants. The best obtained accuracies are highlighted in the table. According to this experiment, we conclude that the window size and the selected functions significantly impact the accuracy. The best accuracies in this scenario, i.e., using all filter banks, are in the range of 55-62%, while the range is 60-73% when only one filter bank is used, i.e., filter bank 5 (Figure 11).
This demonstrates that effective features are selected in filter bank 5, contributing to better accuracies. Table 17 shows the best results of the four algorithms for different settings, i.e., different combinations of filter banks and channels. Our proposed algorithm (BGA) on filter bank 5 and channel 12 obtained the best accuracies for most subjects, with an average best accuracy of 63.2%. The second-best result belongs to GA with filter bank 5 and channel 12 (61.9%). Next, BGA and PSO achieved their best results of 61.6% and 61.7%, respectively, when the combination of filter banks 2 and 5 is used. This indicates that the combination of filter banks 2 and 5 provides valuable information, although the results are still 1.5% lower than when only filter bank 5 is used.

Discussion of Algorithm on Brain Signals
In Table 18 and Figure 14, the accuracies of GA, BGA, and PSO with two channels and our two proposed models (i.e., FM and CSP) are compared with the methods of other articles [33,65]. BGA with CSP obtained the best accuracies for most subjects, with the best average accuracy, i.e., 73.12%, about 5-6% higher than the other three algorithms, while the results of GA and PSO are within the same range as those of the other methods. In Table 18, the statistical significance of the performance of RF and DDFS with two other methods [33], the fourth model scenario (four feature extractions, i.e., time domain, frequency domain, autoregression, and wavelet coefficients, with two channels), and the sixth model scenario (CSP with two channels) is calculated. A paired t-test is used to obtain the p-values, and methods are considered statistically significant when the p-value is less than 0.05. BGA with CSP achieved p-values of 0.020 and 0.005 with the fourth and sixth model scenarios, respectively.
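The paired t-test compares two methods on matched per-subject accuracies. A minimal sketch of the test statistic is below; the p-values reported in the tables would then be read from the t distribution with n − 1 degrees of freedom (e.g., via `scipy.stats.ttest_rel`, which computes both steps at once):

```python
import math

def paired_t(a, b):
    """Paired t statistic for two matched lists of per-subject accuracies.
    Returns (t, degrees of freedom); the p-value follows from the
    t distribution with n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]          # per-subject differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n), n - 1
```

With nine subjects, as in dataset IIa, the degrees of freedom are 8, and a p-value below 0.05 is taken as statistically significant.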
In Table 19 and Figure 15, the kappa results for GA, BGA, and PSO with three channels and our two proposed models (i.e., FM and CSP) are compared with the methods of articles that use different channels [39]. BGA with CSP obtained the best kappa for most subjects, with an average kappa of 59.06%, which is only 0.84% less than the reference kappa (i.e., 60%). This result is significantly higher (by about 18%) than that of [39] with three channels, while the difference is only about 0.5-1% compared to the results of [39] using all or 8.55 channels. In Table 19, the statistical significance of the performance of CSP with three channels (C3, C4, and CZ) with two other methods [39], the fifth model scenario (four feature extractions, i.e., time domain, frequency domain, autoregression, and wavelet coefficients, with three channels), and the seventh model scenario (CSP with three channels) has been calculated. The methods are considered statistically significant when the p-value is less than 0.05. BGA with CSP in the seventh model scenario and the two models from other papers achieved p-values of 0.032, 0.003, and 0.003, respectively. Table 20 and Figure 16 show the kappa results for GA, BGA, and PSO with three channels and the two proposed models (i.e., FM and CSP), compared with the methods of articles that use different channels and different filter banks [39,48]. The best average kappa in this experiment belongs to FBCSP [48], which was 4.41% higher than BGA with CSP; in this scenario, the other methods generally had 1.5-3% better kappa than BGA with CSP. This is because BGA uses only three channels with 60% of the frequency range, while the other studies use 22 channels with the entire frequency range. Considering this significant difference in channels and frequency range, BGA had only 1.5% less kappa than the smallest kappa of the other works.
Table 21 and Figure 17 show the kappa results for various implementations with CSP [66]. The kappa values are in the range of 50% to 63%. BGA with CSP, three channels, and 60% of the frequency range obtained the second-best average kappa, i.e., 59%, while all the other methods used all channels with the entire frequency range. Figure 16 compares the best kappa of different feature-selection methods (together with feature extraction, feature and channel reduction, etc.) on three channels [39] against methods using more than three channels and combinations of selected bands, the seventh model scenario (CSP with three channels (CSP2)), and some previous methods.
In Table 21, the statistical significance of the performance of CSP (C3, C4, CZ) [39] with two of our model scenarios and other methods is evaluated. We used a paired t-test to compute the p-values; all of the methods have a p-value of less than 0.05. Figure 18 shows one training epoch of the increasing accuracy of subjects 3 and 8 with the algorithms using the seventh model scenario (CSP with three channels (CSP2)); BGA converges best toward the optimal results. In most studies involving filter banks, all filter banks are first passed through spatial noise-reduction filters, features are then extracted from them, the features of all filter banks are pooled together, and a subset is selected; moreover, when reducing features per channel or selecting channels, the entire 8-30 Hz interval is considered. In this study, two or three filter banks with two important channels are used to reduce noise and extract the most important information. We examined various algorithms on brain signals and also proposed a new algorithm based on the Bond Graph method, i.e., BGA. Our design goal for the new algorithm is to effectively reduce the feature dimension together with the effect of the specified filter banks and channels.
Most of these algorithms show similar convergence to one another. Although the accuracy for some subjects was low in some cases, the solutions remained far enough from local-minimum traps. In most cases, BGA showed the best performance among the compared algorithms.

Discussion of Sampling (Pooling) on Brain Signals
In all our study cases, pooling, as a component of deep learning, is used to reduce the dimension in the layers during the deep-learning phase. We apply pooling based on four main functions, i.e., maximum, minimum, the smart maximum/minimum selection function, and the average-without-maximum function, to the data along with filter banks to reduce the dimension per channel. The number of channels is fixed, and the dimensional reduction of each channel depends on the window size. The output sizes per channel are as follows: window size 3 yields four features, window size 4 one feature, window size 5 seven features, window size 6 five features, and window size 7 three features.
Window size 4 yields the lowest number of features (one feature per channel), a 99.88% dimension reduction: in our experiment, a total of 22 features out of 19,250 are selected for classification. Window size 5 yields the highest number of features (seven per channel), a 99.2% dimension reduction: a total of 154 features out of 19,250 are selected for classification.
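The reduction figures follow directly from the counts (22 channels × 875 features = 19,250 inputs) and can be recomputed as:

```python
def reduction_stats(n_channels=22, feats_per_channel=875, kept_per_channel=1):
    """Recompute the selection counts used above.
    Returns (features kept, total features, percent reduction)."""
    total = n_channels * feats_per_channel          # 22 * 875 = 19,250
    kept = n_channels * kept_per_channel            # e.g. 22 * 1 = 22
    return kept, total, 100.0 * (1 - kept / total)  # e.g. 99.886% reduction
```

With one feature per channel this gives 22 of 19,250 features (≈99.89% reduction, reported as 99.88%); with seven per channel, 154 of 19,250 (exactly 99.2% reduction).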
In addition to the earlier functions, we also used an average function for the sampling in each channel. The average function selects only the mean value of the features within each window, so the number of output samples for the average function is 22 features.
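Non-overlapping window pooling on one channel can be sketched as below. This is an illustration, not the paper's implementation: the definitions of the "maximum-minimum" function (taken here as the window range) and the "no-maximum average" function (taken here as the mean excluding the window maximum) are our assumptions about what those names mean.

```python
from statistics import mean

def pool(channel, window, fn):
    """Apply fn to consecutive non-overlapping windows of one channel."""
    return [fn(channel[i:i + window])
            for i in range(0, len(channel) - window + 1, window)]

# Pooling functions; the last two interpret the paper's names as
# "range" (max - min) and "average excluding the maximum" (assumptions).
max_min = lambda w: max(w) - min(w)
no_max_avg = lambda w: mean(v for v in w if v != max(w))

sig = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
print(pool(sig, 3, max))       # [4.0, 8.0]
print(pool(sig, 3, max_min))   # [3.0, 3.0]
print(pool(sig, 3, mean))      # average pooling
```

Each function maps a window to a single value, so the per-channel output length is roughly the channel length divided by the window size, which is where the dimension reduction comes from.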
The average accuracy of the functions mentioned above is in the range of 55% to 73%, which may be due to the excessive reduction of features in each channel. Therefore, future work should retain more features to examine the full performance. Nevertheless, even with this very small feature set, a single filter bank achieves an accuracy of 61.45% to 64.74%, which is very good given the extent of the dimension reduction.
Compared to the basic quantum neural network method [65], where the average accuracy over all subjects based on feature extraction is 66.59% while all channels and the total data are involved in the extraction, that approach is only 2.2% better than our sampling methods with a single filter bank and significant dimension reduction.
Compared to Jing Luo's article [33], where features were selected after feature extraction from two channels (8 and 12) using a wavelet packet, an average accuracy in the range of 67-68% was obtained with a random forest classifier in the 8-30 Hz frequency range, which is only 3-4% better than our sampling methods with a single filter bank and significant dimension reduction.
In Adham Atyabi's article [36], a Mask model was used on 118 channels to reduce the electrodes and features. Their feature reduction rate was 99%, which is close to ours (99.88%). They obtained accuracies in the range of 63-86% for five subjects, with an average accuracy of 74%, which is only 1% higher than our best average accuracy. However, the dimension reduction in our work was greater than in theirs.

Strength and Weakness of the Bond Graph Algorithm
Based on our experiments with the new BGA, it achieved very fast convergence compared to GA and PSO on eight benchmarks with models 3 and 4. This means that the two-domain formulation is well suited for convergence, whereas the convergence of models 1 and 2 was worse than that of the others. Based on our experience, the structure of the formula is very effective in determining the convergence.
It is possible to tune the parameters to find the values most suitable for convergence. However, this may not significantly affect performance, because the underlying structure remains unchanged.
We obtained the best results for feature selection in one or two channels with model 3 when the selected features are sent directly to the ELM classifier. We also obtained the best results for feature selection in two or three channels with model 3 when the selected features are first sent to CSP feature extraction and the extracted features are then sent to the ELM classifier. In both cases, the BGA converged to the global optimum significantly faster than the other algorithms.
However, the BGA converges more slowly and may fall into a local optimum when the feature selection is split across four different parts, i.e., the time domain, frequency domain, autoregression, and wavelet coefficients. In this case, if some parts are selected based on the binary model and other parts are selected based on the random model, the BGA performs poorly compared to the other algorithms. However, if all parts are selected based on either the binary model or the random model, the results are better than those of the other algorithms. To avoid this problem, the two feature-selection models should not be used simultaneously.
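The two selection modes can be sketched as two candidate encodings. This is a hypothetical illustration of the distinction, not the paper's code: the binary mode encodes a keep/drop mask over all features, while the random mode encodes explicit positions to keep; mixing the two encodings within one candidate solution gives the optimizer an inconsistent search space, which matches the poor behavior described above.

```python
import random

def select_binary(features, mask):
    """Binary mode: mask[i] == 1 keeps feature i."""
    return [f for f, m in zip(features, mask) if m]

def select_random(features, k, seed=0):
    """Random mode: keep k features at sampled positions (sorted)."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(features)), k))
    return [features[i] for i in idx]

feats = [10, 20, 30, 40]
print(select_binary(feats, [0, 1, 1, 0]))  # [20, 30]
print(len(select_random(feats, 2)))        # 2
```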

Conclusions
In this article, a new algorithm called BGA is introduced and implemented to reduce the feature dimensions in brain signals. This algorithm shows better performance than GA and PSO on general benchmark functions. Our algorithm also performs better than the other algorithms when reducing the dimension to 100 features on the specific filter banks. Several scenarios for testing this algorithm were implemented, including the following. (1) Reductions of features, electrodes, and frequency range have been evaluated simultaneously for brain signals. (2) Feature selection (with algorithms) and feature extraction by time domain, frequency domain, wavelet coefficients, and autoregression have been studied on selected electrodes and filter banks. (3) The features, electrodes, and frequency range are reduced, followed by the construction of new signals based on the proposed formulas; the CSP is then used for feature extraction. (4) Finally, a separate experiment with the deep learning sampling method was implemented as feature selection in several layers; the dimensional reduction was performed by sampling using three general functions and two new functions. All scenarios, expressed for the left hand and right hand, have been evaluated with between one and three channels. Our algorithm improved accuracy by up to 5 to 8% and kappa by 5% compared to other studies with the same or similar settings.
For future work, we will first investigate even smaller sizes and different combinations of filter banks to optimize noise reduction and strengthen the distinctive patterns. Second, we aim to examine new functions for the deep learning sampling models with the aim of increasing accuracy and performance.