Metaheuristic Optimization-Based Feature Selection for Imagery and Arithmetic Tasks: An fNIRS Study

In recent decades, the brain–computer interface (BCI) has emerged as a leading area of research. The feature selection is vital to reduce the dataset’s dimensionality, increase the computing effectiveness, and enhance the BCI’s performance. Using activity-related features leads to a high classification rate among the desired tasks. This study presents a wrapper-based metaheuristic feature selection framework for BCI applications using functional near-infrared spectroscopy (fNIRS). Here, the temporal statistical features (i.e., the mean, slope, maximum, skewness, and kurtosis) were computed from all the available channels to form a training vector. Seven metaheuristic optimization algorithms were tested for their classification performance using a k-nearest neighbor-based cost function: particle swarm optimization, cuckoo search optimization, the firefly algorithm, the bat algorithm, flower pollination optimization, whale optimization, and grey wolf optimization (GWO). The presented approach was validated based on an available online dataset of motor imagery (MI) and mental arithmetic (MA) tasks from 29 healthy subjects. The results showed that the classification accuracy was significantly improved by utilizing the features selected from the metaheuristic optimization algorithms relative to those obtained from the full set of features. All of the abovementioned metaheuristic algorithms improved the classification accuracy and reduced the feature vector size. The GWO yielded the highest average classification rates (p < 0.01) of 94.83 ± 5.5%, 92.57 ± 6.9%, and 85.66 ± 7.3% for the MA, MI, and four-class (left- and right-hand MI, MA, and baseline) tasks, respectively. The presented framework may be helpful in the training phase for selecting the appropriate features for robust fNIRS-based BCI applications.


Introduction
Brain-computer interface (BCI) technology enables a direct connection between the brain and an external device, bypassing traditional channels such as peripheral nerves and muscles [1]. BCIs are designed to provide impaired people (such as those with locked-in syndrome or spinal cord injuries) with new ways to communicate and manage their surroundings. BCIs may also be employed in industries such as gaming, transportation [2], recreation [3], virtual reality, human-machine interfaces [4], and neurological rehabilitation [5][6][7][8]. BCIs are classified into two types: invasive and non-invasive. Invasive BCIs capture brain activity using electrodes placed directly into the brain. This approach provides the highest-quality signals but poses significant risks, including the chance of infection and lasting brain tissue damage. Non-invasive BCIs, by contrast, do not require electrodes to be implanted in the brain. Instead, they capture signals from the scalp or other areas of the head using methods such as electroencephalography (EEG) [9,10], functional magnetic resonance imaging [11,12], magnetoencephalography [13], and functional near-infrared spectroscopy (fNIRS) [14,15]. Non-invasive BCIs are less dangerous and less obtrusive than invasive BCIs but provide lower-resolution signals. The choice between invasive and non-invasive BCIs is determined by the specific application and the trade-off between resolution and invasiveness. Despite these potential benefits, significant technical and clinical hurdles must be overcome before BCIs can be widely adopted and utilized.
Each non-invasive modality has its own advantages and disadvantages. As a relatively new technique, fNIRS offers a balance between temporal and spatial resolution and a variety of distinctive benefits [16][17][18]. fNIRS measures oxyhemoglobin (HbO/∆HbO) and deoxyhemoglobin (HbR/∆HbR) concentration changes using multiple pairs of near-infrared light sources and detectors (650-1000 nm range) whose light penetrates the superficial cortical areas. Both absolute and relative concentration changes can be measured [19,20]. Neocortex activation caused by brain stimulation results in increased blood flow and oxygenation levels (reflected by increases in ∆HbO or decreases in ∆HbR). These changes can then be used to provide control signals for fNIRS-based BCI applications.
The development of rapid systems for quicker command decoding is one of the three primary research objectives in the field of BCIs. The other two are maximizing the number of decoded commands and improving the classification performance. The features are extracted using various small window sizes (0-2, 0-2.5, 0-5, 2-7, etc.) to speed up the BCI system [21][22][23]. fNIRS is used with other measuring modalities (e.g., EEG) to increase the number of commands that can be decoded from the brain [24,25]. The selection of the channels and features is a key component for enhancing the BCI classification accuracy. In fNIRS-based BCI research, active channel selection has been performed using various methods, such as averaging on a particular region of interest [26,27], computing the Pearson correlation coefficient [28], performing vector phase analyses [21,[29][30][31], averaging across all the channels [32][33][34], applying baseline corrections [24], calculating the contrast-to-noise ratios [35], using t-statistics and z-statistics [23,30,36], using a least absolute shrinkage and selection operator homotopy-based sparse representation [37], and utilizing joint-channel-connectivity methods [38].
Various types of features have been used to decode fNIRS signals in the literature, including statistical features [39], graph-based features [40], Mel frequency cepstral coefficients [41], vector phase analysis-based features [29], and frequency domain-based features [42]. Different ranges of two- and three-feature combinations have been used in prior studies by using temporal statistical data to determine the best combinations for classifying the various activities [43,44]. However, selecting the best features using visual inspection can be challenging, particularly when all the channels are used for the feature extraction and classification. Various studies have proven the impact of the feature selection on BCIs [29,45,46]. It helps to reduce the dataset's dimensionality, increase the computing effectiveness, and enhance the classification performance among the tasks. By identifying the features that are related to the activity, the feature selection increases the robustness of the classification system. Therefore, in one study [47], the authors used a genetic algorithm (GA) to determine the best window size for the computation of the temporal features. Their suggested framework improved the model's capability for classification. A recent study [45] presented a systematic approach for choosing the subject-specific features. They used filter-based techniques to remove the redundant features and boost the classification efficiency. Similarly, the "ReliefF" filter merged with a GA was used to classify upper limb movement intentions [48]. In another study, a minimum-redundancy maximum-relevance filter was used to reduce the feature vector size and enhance the classification performance. The relevant features were also identified using a sparse representation classification method [49]. Recently, Dokeroglu et al. [50] reviewed various wrapper-based feature selection methods.
Wrapper approaches examine the performance of each subset of features and combine a metaheuristic optimization algorithm with a classifier. Except for GAs, other wrapper-based feature selection methods have not been explored for fNIRS-based BCIs in the literature. The no-free-lunch (NFL) theorem states that no heuristic is sufficient to address every optimization issue. Most metaheuristics excel in at least one area compared to the others. The optimal subset of the features from several domains may not always be discovered using a single metaheuristic algorithm.
Therefore, herein, wrapper-based approaches operating in conjunction with metaheuristic optimization algorithms were explored for an fNIRS BCI to improve the accuracy and accelerate the processing. The following are the main highlights of the presented work.

•
First, the data of all the channels of the fNIRS signals were pre-processed to remove physiological noise.

•
The most commonly used statistical temporal features were computed for the fNIRS signals in all the channels.

•
Wrapper approaches with various metaheuristic optimization algorithms, such as particle swarm optimization (PSO), cuckoo search optimization (CSO), the firefly algorithm (FA), the bat algorithm (BA), flower pollination optimization (FPO), the whale optimization algorithm (WOA), and grey wolf optimization (GWO), were applied to observe the classification performance using a k-nearest neighbor (k-NN) approach.

•
The performance of the proposed framework was evaluated using the online motor imagery (MI) and mental arithmetic (MA) datasets.

•
A statistical analysis was also performed to determine the significance of the obtained results.

Proposed Framework
This section explains a framework for the feature selection for the fNIRS signals using optimization algorithms. This framework entailed choosing the most pertinent and significant features from the fNIRS signals. Here, the acquired fNIRS data were preprocessed to remove the physiological noise, and further details are provided in Section 3. After that, the temporal statistical features were extracted for all the channels in a 10-s window. A wrapper-based feature selection method was applied to retrieve the most important features using PSO, CSO, the FA, the BA, FPO, the WOA, and GWO. All the algorithms were implemented using "Jx-WFST," a wrapper feature selection toolbox in MATLAB (https://www.mathworks.com/matlabcentral/fileexchange/84139-wrapper-feature-selection-toolbox, accessed on 7 February 2023). The k-NN model was selected for the classification, as it is a non-parametric machine learning approach that is accurate, easy, and widely used [51,52]. In this approach, the neighbors' votes determine how a sample is classified. The item corresponding to the k training samples (i.e., k, the object's closest neighbors) is categorized into a class based on the greatest class probability [53]. Further details on the k-NN can be found in [54]. The framework of the proposed approach is shown in Figure 1.
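The neighbor-voting rule described above can be sketched in a few lines. This is a minimal illustration of k-NN majority voting, not the toolbox's implementation; the toy `train`/`labels` data and the Euclidean distance are illustrative choices.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    dists = [math.dist(x, query) for x in train]
    # Indices of the k closest training samples
    nearest = sorted(range(len(train)), key=lambda i: dists[i])[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two well-separated clusters of feature vectors
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["rest", "rest", "rest", "task", "task", "task"]
print(knn_predict(train, labels, (0.05, 0.1), k=3))  # → rest
```

In the study, the samples would be the selected statistical features of a trial rather than these toy 2-D points.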

Experimental Data and Pre-Processing
Here, the fNIRS data of 29 participants from an online dataset were used to validate the presented methodology [55]. The prefrontal, motor, and occipital brain regions were covered by 36 channels formed by fourteen sources and sixteen detectors spaced three centimeters apart, utilizing the 10-5 international system with Fp1, Fp2, Fpz, C3, C4, and Oz as the references. The fNIRS data were measured at a 12.5 Hz sampling rate and were down-sampled to 10 Hz. The dataset was composed of trigger, fNIRS, and EEG data from six different sessions. The dataset was divided into datasets A and B corresponding to the left-hand motor imagery (LHMI) and right-hand motor imagery (RHMI) sessions (i.e., 1, 3, and 5) as well as the MA and baseline sessions (i.e., 2, 4, and 6). There was a 1-min pre-rest time at the start of each session, followed by 20 trials (10 for each activity) and a 1-min post-rest interval. Each trial consisted of a 2-s visual introduction, a 10-s task phase, and a rest period randomly allocated to last between 15 and 17 s. Here, only the fNIRS and associated trigger data were used. The fNIRS data (i.e., only ΔHbO in this study) were passed through a 3rd-order Butterworth low-pass filter with a 0.1 Hz cutoff frequency and a Butterworth high-pass filter with a 0.01 Hz cutoff frequency to reduce the physiological noise. Figure 2 shows the fNIRS optode placement and the experimental paradigm [55].

Feature Extraction
The fNIRS trials for both the MI and MA tasks were retrieved after pre-processing, and a 10-s section of each trial was used for the feature extraction. In this investigation, only the five most commonly used statistical features (i.e., the mean, peak, slope, skewness, and kurtosis) were extracted [39,[56][57][58]. The mean, skewness, and kurtosis were determined using the following formulas, wherein the peak value is the maximum value, and the slope is retrieved using curve fitting.
In the above, X represents the fNIRS signals (i.e., ∆HbO), and µ and σ denote the mean and standard deviation, respectively. S_x, K_x, and E_x are the skewness, kurtosis, and expectation of X, respectively.
Herein, initially, all of the fNIRS channels were used for the feature extraction. In total, 180 features (36 × 5) were retrieved for a single trial. Thus, the feature selection was necessary to choose the right features, further minimize the feature vector size, and enhance the classification performance.
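The extraction step above can be sketched as follows. The skewness and kurtosis use the standard standardized-moment definitions, the slope is a least-squares line fit against the sample index, and the synthetic trial data are invented for the example (the study's real windows are 10-s ∆HbO segments sampled at 10 Hz).

```python
import random
import statistics as st

def window_features(sig):
    """Five temporal features of one channel window: mean, peak, slope, skewness, kurtosis."""
    n = len(sig)
    mu = st.fmean(sig)
    sigma = st.pstdev(sig)
    # Slope of a least-squares line fit against the sample index
    t = range(n)
    t_mu = st.fmean(t)
    slope = sum((ti - t_mu) * (x - mu) for ti, x in zip(t, sig)) / sum((ti - t_mu) ** 2 for ti in t)
    # Standardized third and fourth central moments
    skew = sum(((x - mu) / sigma) ** 3 for x in sig) / n
    kurt = sum(((x - mu) / sigma) ** 4 for x in sig) / n
    return [mu, max(sig), slope, skew, kurt]

# 36 channels, 10-s window at 10 Hz → 100 samples per channel
trial = [[random.gauss(0, 1) for _ in range(100)] for _ in range(36)]
vector = [f for ch in trial for f in window_features(ch)]
print(len(vector))  # → 180
```

Concatenating the five features over all 36 channels reproduces the 180-element trial vector described above.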

Feature Selection Method
In the feature selection process, a subset was chosen from the larger set of features to develop the machine learning model. The quality of the created candidate subsets was assessed using a predetermined criterion [59]. The feature selection aimed to improve the model's performance, decrease overfitting, and enhance the interpretability. The test classification accuracy was used to validate the outcome of the feature selection method. In general, feature selection algorithms may be divided into three primary categories, i.e., filter, embedding, and wrapper approaches, depending on the numerous assessment criteria and techniques used to generate the subsets.
Filter techniques are based on the premise that features with a higher variance provide the most relevant details, and they are therefore geared toward retaining the features with a more significant variance [60]. These methods typically take little time to execute, although they can select redundant variables.
Embedding methods do not distinguish between the feature selection procedure and the classification method [61]. The feature selection is conducted within the classification process; therefore, it is incorporated as an algorithmic component or a functionality enhancement.
Wrapper algorithms are machine learning techniques that assess the performance of a subset of features using a particular machine learning model (the "wrapper") [62]. The wrapper's objective is to assess how the chosen features affect the model's performance. Based on the evaluation findings, the wrapper algorithm will either choose the current subset of features or search for a better subset of features. The best subset of features is eventually discovered by repeating this approach. The wrapper approach concept is presented in Figure 3. This work uses wrapper-based techniques and various metaheuristic algorithms for the optimal feature selection.
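For a small feature pool, the wrapper concept reduces to: generate a subset, score it with the learning model, and keep the best subset found. The sketch below makes this concrete with an exhaustive search and a toy scoring function standing in for the classifier's validation accuracy; both are illustrative assumptions, and real wrappers replace the enumeration with a metaheuristic search because the number of subsets grows exponentially.

```python
from itertools import combinations

def wrapper_search(n_features, score):
    """Exhaustively score every non-empty feature subset and keep the best.
    `score(subset)` plays the role of the classifier's validation accuracy."""
    best_subset, best_score = None, float("-inf")
    for r in range(1, n_features + 1):
        for subset in combinations(range(n_features), r):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Toy score: features 1 and 3 are informative; every extra feature costs a small penalty
informative = {1, 3}
score = lambda subset: len(informative & set(subset)) - 0.1 * len(subset)
print(wrapper_search(5, score)[0])  # → (1, 3)
```

With 180 features, 2^180 - 1 subsets exist, which is why the metaheuristics below are used as the search strategy instead.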

Metaheuristic Algorithms for Wrapper-Based Methods
Metaheuristic algorithms aim to approximate solutions to challenging problems. They are called "meta" because they manage high-level optimization problems by integrating several low-level heuristics [63,64]. PSO, CSO, the FA, the BA, FPO, the WOA, and GWO, among others, are examples of metaheuristic algorithms [50,65].
Learning algorithms are used in wrapper feature selection approaches to assess the classification performance of the generated feature subsets. The metaheuristics serve as the search algorithms to generate new potential optimal subsets [66]. The cost function for all of the optimization algorithms is defined as shown in Equation (4) [67]. Here, α and β are selected with the default values of 0.99 and 0.01, respectively.
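The body of Equation (4) did not survive extraction. A form commonly used by wrapper feature selection toolboxes weights the classification error rate against the fraction of selected features; it is reproduced here as an assumption consistent with the stated α = 0.99 and β = 0.01, not as the paper's verbatim formula.

```latex
\mathrm{cost} = \alpha \, \gamma_R + \beta \, \frac{|S|}{|F|}
```

where γ_R is the classification error rate of the k-NN on the candidate subset, |S| is the number of selected features, and |F| is the total number of available features.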

Particle Swarm Optimization (PSO)
PSO is a technique for solving complicated optimization issues. It depends on the coordinated behavior of a collection of particles traveling through a solution space while being directed by the experiences of their nearby neighbors to locate the best solution [68,69]. It is inspired by the coordinated activity of a flock of birds or school of fish. Each particle in PSO represents a potential solution to the issue, is initially placed, and moves randomly. The particles adjust their locations ( X ) after each iteration depending on their current velocity ( V ) and considering their own personal best solution (pbest) and the group's overall best solution (gbest). The pull toward the pbest and gbest solutions and a random element to promote exploration are combined with the particle's current velocity to create a velocity update. The process is continued until a stopping requirement is satisfied, e.g., achieving a suitable solution quality or completing a predetermined number of iterations [70]. The equations below can be used to determine each particle's velocity and update its position.
where α, a_1, a_2, b_1, and b_2 are the parameters of PSO. The b_1 and b_2 can be randomly selected. Further details about PSO can be found in [71]. Algorithm 1 shows the pseudo code of PSO.
Algorithm 1 Pseudo code of PSO [71]
1. Generate a random population of particles (N)
2. while the termination condition is not satisfied do
3. for each particle do
4. Evaluate the fitness of all the particles using the fitness function
5. Update the velocity and position of all the particles using Equation (5)
6. Evaluate the fitness and update pbest and gbest
7. end for
8. end while
9. return gbest
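The velocity and position updates above can be sketched as a minimal real-valued PSO. The inertia weight `w` and acceleration coefficients `c1`, `c2` play the roles of the α, a_1, and a_2 parameters; the search bounds and the toy sphere objective are illustrative assumptions, not the paper's feature selection setup.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimize f over [-5, 5]^dim with a basic global-best PSO."""
    X = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_val = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Velocity update: inertia plus pulls toward pbest and gbest
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                X[i][d] += V[i][d]
            val = f(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = X[i][:], val
    return gbest, gbest_val

# Sphere function: global minimum 0 at the origin
best, val = pso_minimize(lambda x: sum(xi * xi for xi in x), dim=3)
print(val)
```

For feature selection, a binary variant would threshold each particle dimension into a keep/drop decision before evaluating the cost function.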

Cuckoo Search Optimization (CSO)
The CSO metaheuristic optimization method was developed in response to the cuckoo bird's tendency to lay its eggs in the nests of other bird species. The CSO algorithm was proposed by Yang and Deb in 2009 [72]. In CSO, a population of potential solutions to a problem is retained, and an algorithm that uses random walk and exploitation operations iteratively improves it. The cuckoo's propensity to deposit its eggs in other birds' nests served as the inspiration for the random walk operation, which represents the discovery of new solution areas. The exploitation operation allows the algorithm to concentrate on and enhance the most promising solutions. In contrast to other optimization algorithms, CSO finds high-quality solutions more quickly owing to its careful balance between exploration and exploitation. Additionally, CSO can be easily parallelized for complicated optimization issues and is very easy to implement. It also does not require extensive parameter adjustment. The following equations can be utilized to model CSO. where x_i is the solution, α is the step size, and λ is the exponent of the Lévy distribution. Further details about CSO can be found in [73]. Algorithm 2 shows the pseudo code of CSO.
Algorithm 2 Pseudo code of CSO [73]
1. Generate a random population of host nests (N)
2. while the termination condition is not satisfied do
3. Select a random cuckoo (i) and determine a new solution using Equation (6)
4. Evaluate its fitness using the fitness function; F_i
5. Randomly choose a nest (j) from the population
6. if F_i > F_j then
7. Replace j with the new solution
8. end if
9. Abandon a fraction (P_a) of the worst nests and build new ones at new locations using Lévy flights
10. Rank the solutions and find the current best
11. end while
12. return S_best
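The Lévy flight at the heart of the CSO update (Equation (6)) can be sketched via Mantegna's algorithm, a standard way to draw Lévy-distributed steps; the coupling of the step size to the distance from the current best nest is one common variant and is an assumption here.

```python
import math
import random

def levy_step(lam=1.5):
    """One Lévy-distributed step via Mantegna's algorithm (exponent lam in (1, 2])."""
    # Scale of the numerator's normal deviate
    num = math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
    den = math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2)
    sigma_u = (num / den) ** (1 / lam)
    u = random.gauss(0, sigma_u)
    v = random.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

def cuckoo_update(x, best, alpha=0.01):
    """Propose a new solution by a Lévy flight biased toward the current best nest."""
    return [xi + alpha * levy_step() * (xi - bi) for xi, bi in zip(x, best)]

print(cuckoo_update([1.0, 2.0], [0.5, 1.5]))
```

The heavy tail of the Lévy distribution produces occasional long jumps, which is what gives CSO its exploration behavior.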

Firefly Algorithm (FA)
In 2010, Yang presented the FA, which was motivated by the flashing qualities of fireflies [74]. These flashes draw potential mates or warn off predators. In the FA, the flashing properties are developed and applied as functions to solve combinatorial optimization issues [75]. According to their levels of brightness, fireflies attract other fireflies and advance toward brighter ones. The greater the distance between two fireflies, the less attractive they appear to each other. The fireflies move at random when a brighter one is not present. As a result, the critical influences on the FA are the light intensity and the attractiveness level. A selected function can be used to control the brightness at a particular place. The attractiveness perceived by the other fireflies depends on the distance and the light absorption coefficient. The FA can be modeled as follows.
where I and I_o are the current and default light intensities; γ and r are the light absorption coefficient and the distance, respectively; β and β_o are the current and default attractiveness (when the distance is zero); x_i is the position; j is the higher-intensity firefly; and α and ε are random parameters. Further details about the FA can be found in [76]. Algorithm 3 shows the pseudo code of the FA.

Bat Algorithm (BA)
The BA is a nature-inspired optimization algorithm introduced by Yang in 2010 [77]. It is based on how bats use echolocation to locate their prey. The method is modeled after how bats fly around randomly while looking for food, generating sounds, and listening for echoes to find their meal [78]. Using random walk and exploitation procedures, the BA maintains a population of candidate solutions to an optimization problem and iteratively improves them. While the exploitation operation enables the algorithm to concentrate on the most promising solutions, the random walk operation imitates the unpredictable search behavior of bats. Its frequency-tuning mechanism is motivated by the frequency modulation of bat sounds. This mechanism allows the algorithm to escape local optima and discover superior solutions. Furthermore, the BA is easy to deploy and does not require complicated parameter tuning. Echolocation can thus be codified as a method for improving an objective function [79]. The equations below can be used to model this algorithm. where x_i and v_i are the position and velocity in the frequency range between f_max and f_min; β is a random vector; x_* is the best available solution; and α and A_t are a random vector in the range [0, 1] and the maximum loudness of the bats, respectively. Further details about the BA can be found in [80]. Algorithm 4 shows the pseudo code of the BA.

Flower Pollination Optimization (FPO)
The metaheuristic optimization method known as FPO was motivated by the pollination process in flowers [81].
It was developed to address optimization issues in various fields, including computer science, engineering, mathematics, physics, and finance. The optimization procedure in FPO is based on how pollen moves between flowers. The quality of a solution is expressed by how much "nectar" it contains, and each solution in the optimization problem is represented as a "flower." The nectar is transferred between the flowers in a way that is similar to how the optimization problem solutions are selected and merged. FPO is renowned for its ease of use, quick convergence, and simplicity. It has been used to address numerous optimization issues, including function optimization, scheduling, and data clustering [82]. FPO can be modeled using the following equations. where x_i^t is the pollen; g_* is the best available solution; α is the scaling factor in [0, 1]; L(λ) and Γ(λ) are the Lévy step size and the gamma function, respectively; and x_j and x_k are pollen from different flowers of the same plant species. Further details about FPO can be found in [83]. Algorithm 5 shows the pseudo code of FPO.
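The switch between global pollination (toward g_*) and local pollination (mixing x_j and x_k) can be sketched as follows. A Gaussian draw stands in for the Lévy step L(λ) purely to keep the example short, and the switch probability `p` is an illustrative choice; both are assumptions, not the paper's settings.

```python
import random

def pollinate(x, gbest, flowers, p=0.8, alpha=0.1):
    """One FPO update for a single flower (candidate solution)."""
    if random.random() < p:
        # Global pollination: step toward the best flower; the Gaussian draw
        # stands in for the Levy step L(lambda) in this sketch
        return [xi + alpha * random.gauss(0, 1) * (gi - xi) for xi, gi in zip(x, gbest)]
    # Local pollination: move along the line between two random flowers x_j, x_k
    xj, xk = random.sample(flowers, 2)
    eps = random.random()
    return [xi + eps * (a - b) for xi, a, b in zip(x, xj, xk)]

# Toy population of five 2-D flowers; the best is the one closest to the origin
flowers = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(5)]
gbest = min(flowers, key=lambda f: sum(v * v for v in f))
print(pollinate(flowers[0], gbest, flowers))
```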

Whale Optimization Algorithm (WOA)
In 2016, Mirjalili and Lewis [84] introduced the WOA, which is inspired by the hunting habits of humpback whales. These whales hunt in groups, demonstrating their gregarious nature. They blow bubble nets to catch their prey (such as small fish or krill) when they come across clusters of prey. The WOA mathematical model represents a fresh approach to addressing complex optimization issues. The algorithm's primary tasks are finding prey, surrounding it, and moving in spiral bubble-net patterns. The WOA uses a variety of strategies after starting with a random solution. The other agents adjust their positions after choosing the top search agent. They choose a target (either the best search agent or a random whale) and proceed to attack it. The WOA has helped resolve various optimization issues, including function optimization, scheduling, and data clustering [50]. The following equations can be used to model the WOA. where x and x_* are the current and best locations, respectively; b and l are a constant determining the logarithmic spiral shape and a random number, respectively; and D is the distance between the whale and the prey. Further details about the WOA can be found in [85,86]. Algorithm 6 shows the pseudo code of the WOA.

Grey Wolf Optimization (GWO)
Grey wolves live in packs with a strict social hierarchy. Based on their level of authority over others, the group members are referred to as alpha, beta, omega, or subordinates in this social system. Each pack has a single alpha wolf, the dominant wolf and leader of the group. As a result, the alpha is in charge of the majority of tasks. The second most dominating wolf is the beta, which is anticipated to become the dominant wolf (alpha). The beta supports the alpha's decision-making, communicates the alpha's orders to the group, and sees them through. The lowest-ranking wolf in the group (ruled by all the other wolves) is the omega. The remaining wolves are referred to as subordinate or delta wolves. In addition to this social organization, grey wolves also hunt in groups.
The group tracks and pursues a victim when it is within range. They surround the target whenever feasible, and then each wolf attacks in turn. These characteristics of grey wolves are considered for modeling GWO. The mathematical representation of GWO comprises the social structure and the prey-hunting techniques. In GWO, the alpha, beta, and delta wolves are chosen as the top three solutions. The other wolves in the pack are considered the omega wolves and do not influence the choices made for the following iteration. The GWO technique was created to address the optimization issues prevalent in many disciplines, including computer science, engineering, mathematics, physics, and finance [67,89]. Variants of GWO have also been proposed and applied to solve various engineering problems [90,91]. GWO can be modeled using the following equations.
where x and x_p are the grey wolf and prey locations, respectively; A and C are the coefficient vectors; D is the distance between the grey wolf and the prey; r_1 and r_2 are random vectors; and a is a linearly decreasing variable, calculated using the following equation:
a = 2 - t (2 / max_iterations) (12)
where t is the iteration number. The following equations can be used to update the grey wolves' positions.
Further details about GWO can be found in [92]. Algorithm 7 shows the pseudo code of GWO.
Algorithm 7 Pseudo code of GWO [92]
1. Generate a random population (N)
2. Initialize the parameters
3. Evaluate the fitness of each wolf
4. while (t < T) do
5. for each wolf do
6. Update the position using Equations (13) and (14)
7. end for
8. Update the parameters
9. Evaluate the fitness value
10. Update the wolf's position
11. t = t + 1
12. end while
13. return x_α
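One GWO iteration under the equations above can be sketched as follows: rank the pack, derive three prey estimates from the alpha, beta, and delta wolves, and move each wolf to their average. The pack size, search bounds, and toy sphere objective are illustrative assumptions, not the paper's feature selection setup.

```python
import random

def gwo_step(wolves, fitness, t, max_iters):
    """One GWO iteration: rank the pack, then move every wolf toward the
    average of three prey estimates derived from alpha, beta, and delta."""
    a = 2 - t * (2 / max_iters)  # linearly decreases from 2 to 0 (Equation (12))
    ranked = sorted(wolves, key=fitness)
    leaders = ranked[:3]  # alpha, beta, delta: the three best solutions
    new_pack = []
    for x in wolves:
        pos = []
        for d in range(len(x)):
            estimates = []
            for leader in leaders:
                A = 2 * a * random.random() - a   # coefficient A = 2a*r1 - a
                C = 2 * random.random()           # coefficient C = 2*r2
                D = abs(C * leader[d] - x[d])     # distance to this leader
                estimates.append(leader[d] - A * D)
            pos.append(sum(estimates) / 3)        # average of the three estimates
        new_pack.append(pos)
    return new_pack

# Minimize the 2-D sphere function with 15 wolves over 50 iterations
f = lambda x: sum(xi * xi for xi in x)
pack = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(15)]
for t in range(50):
    pack = gwo_step(pack, f, t, 50)
print(min(f(x) for x in pack))
```

As `a` decays, |A| shrinks and the pack shifts from exploration to exploitation around the three leaders.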

Results and Discussion
This study applied wrapper-based metaheuristic feature selection approaches to enhance the classification accuracy of fNIRS signals. The MA and MI datasets were used to test the suggested feature selection algorithms [55]. The discriminations between the MA and baseline; the LHMI and RHMI; and the LHMI, RHMI, MA, and baseline were accomplished by utilizing a feature set composed of five statistical temporal features, as discussed above. The wrapper-based metaheuristic algorithms discussed above (PSO, CSO, the FA, the BA, FPO, the WOA, and GWO) were employed to extract valuable information. The extracted feature subset was classified using a k-NN with a 0.2-holdout validation technique. The k parameter of the k-NN was set to 5; N denotes the population size and T the maximum number of iterations. Each algorithm was run ten times. All the parameters for each algorithm are listed in Table 1.
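For reference, the five statistical temporal features can be computed per channel roughly as follows. This is a sketch assuming a trial window shaped samples × channels; the exact channel count, windowing, and preprocessing are dataset details not reproduced here:

```python
import numpy as np

def temporal_features(window):
    """Five statistical temporal features per channel, concatenated into one vector.
    window: array of shape (n_samples, n_channels) for a single trial."""
    t = np.arange(window.shape[0])
    mean = window.mean(axis=0)
    slope = np.polyfit(t, window, 1)[0]              # per-channel linear-trend slope
    peak = window.max(axis=0)
    centered = window - mean
    std = window.std(axis=0)
    skewness = (centered ** 3).mean(axis=0) / std ** 3
    kurt = (centered ** 4).mean(axis=0) / std ** 4   # (non-excess) kurtosis
    # e.g., 36 channels x 5 features = 180-element training vector
    return np.concatenate([mean, slope, peak, skewness, kurt])
```

The wrapper then searches over subsets of this vector, scoring each subset with the k-NN cost function on the held-out 20% split.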
The results from all three cases are presented in Tables 2–4, using the classification accuracy and the feature vector size as the performance comparison metrics for each optimization algorithm and subject.

Table 2. Classification performance of each subject for the mental arithmetic (MA) and baseline tasks in terms of the accuracy (Acc) and the feature vector size (F.V.S.) (data represented as the mean ± the standard deviation).

Table 3. Classification performance of each subject for the left-hand motor imagery (LHMI) and right-hand motor imagery (RHMI) tasks in terms of the accuracy (Acc) and the feature vector size (F.V.S.) (data represented as the mean ± the standard deviation).

Table 4. Classification performance of each subject for the LHMI, RHMI, MA, and baseline tasks in terms of the accuracy (Acc) and the feature vector size (F.V.S.) (data represented as the mean ± the standard deviation).

In the case of the MA tasks, the full feature vector (180 features for each task) resulted in a 91.67% classification accuracy for subject one. By contrast, all of the wrapper-based optimization algorithms achieved a higher classification accuracy with a significantly reduced feature vector size, as shown in Table 2. A careful analysis of the results shows that the GWO algorithm achieved a classification rate of 99.1 ± 2.6% with a feature vector size of only 24.1 ± 5.1 for subject one; the algorithm enhanced the classification rate by more than 8% while using almost 150 fewer features. The WOA algorithm used the lowest number of features for the classification and produced classification rates comparable to those obtained using all the channel features.
For instance, in the cases of subjects three and four, the WOA utilized only 12.5 ± 14 and 4.8 ± 2.8 features, respectively, to classify the MA task against the baseline and produced classification rates of 81.6 ± 8.6% and 88.3 ± 8.9%. These results were almost identical to those obtained using all the channel features.

In the case of the MI tasks, the full channel features yielded an accuracy of only 41.67% for subject one. By contrast, the WOA produced the highest classification rate of 92.5 ± 8.2% with 35.7 ± 13.7 features. The FA showed a classification performance of 85.8 ± 8.8% for subject 14 with 80.5 ± 6.8 features. The BA showed a classification rate of 84.1 ± 8.2% for subject six with 75.2 ± 6.3 features, i.e., 10% higher than when using all the features. The results for the other subjects in the MI classification tasks are shown in Table 3. Table 4 shows the results from the merged MA and MI datasets; this experiment was performed to observe the proposed strategy's effectiveness in a multi-class environment. As depicted in Table 4, the accuracy when using all the features was reduced for all of the subjects compared to the MA task alone, while the selected feature subsets remained effective in the multi-class setting. For a better understanding of the results, the average classification accuracies, processing times, and feature vector sizes of all the subjects for the various tasks are presented in Figure 4 and Table 5.
After a deeper analysis of the results, it can be seen that GWO produced the highest classification rates of 94.83 ± 5.5%, 92.57 ± 6.9%, and 85.66 ± 7.3% for the MA, MI, and four-class (left- and right-hand MI, MA, and baseline) tasks, respectively, whereas the full set of channel features yielded classification rates of only 81.32%, 71.26%, and 62.79% for the same tasks. All of the optimization algorithms significantly improved the classification rate and reduced the feature vector size. However, GWO produced the best and most stable results, since it is scalable, versatile, simple to use, and does not require any derivative information from the search space. The algorithm's search method benefited from a balance between exploration and exploitation, producing an excellent convergence [93].
A two-sample t-test was also utilized to determine the statistical significance of the results. The results of GWO remained significantly better than those from the full feature set and from all the other optimization methods (p < 0.01). Table 5 presents the average processing time of each algorithm. The WOA had the lowest processing time and the smallest feature vector size with a reasonable accuracy. GWO used 29 features on average, with only 2.17 s of processing time, to yield the highest classification accuracy, as shown in Figure 4 and Table 5. As shown in Table 6, the results from this study were also compared to those from other studies using the same dataset. Several researchers have presented task-related feature selection approaches to improve the performance of fNIRS-based BCIs [36,45,94,95]. Aydin utilized both filter-based and wrapper-based methods to reduce the redundant features and extract useful information [45], and showed that the wrapper-based approach had a better performance. In filtering techniques, the features are selected based on their relevance to the dependent variable, without any reference to the model performance; in some cases, a threshold value is selected to remove the redundant information. Wrapper methods, by contrast, train a model using a subset of features to verify their usefulness. Wrapper-based techniques can take slightly longer to process, but they have been shown to produce an excellent classification accuracy. In this work, a wrapper-based metaheuristic feature selection framework was proposed to enhance the classification performance of fNIRS-based BCIs. The proposed technique and the literature are compared in Table 6; it is evident from the results that the proposed approach had a better classification performance than the others.
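The significance test described above corresponds to a standard two-sample t-test. The sketch below illustrates the procedure; the per-subject accuracy arrays are illustrative placeholders, not the study's values:

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative per-subject accuracies (placeholder values, not the study's data)
gwo_acc  = np.array([99.1, 94.2, 90.5, 96.0, 93.3, 95.8])
full_acc = np.array([91.7, 80.3, 78.9, 84.1, 82.6, 79.4])

# Two-sample t-test comparing the GWO-selected features to the full feature set
t_stat, p_value = ttest_ind(gwo_acc, full_acc)
significant = p_value < 0.01   # the significance threshold used in the study
```

In the study, this comparison was run per algorithm pair, so repeating it for each of the seven optimizers against the full feature set mirrors the reported analysis.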
In this study, only continuous metaheuristic feature selection approaches were applied. However, different binary metaheuristic feature selection approaches can be explored in the future to further improve the classification accuracy and reduce the feature vector size [96][97][98].
This study's practical application lies in the BCI training phase. The model can be trained offline using the features selected by the presented wrapper-based approach. During online testing, only the selected features are fed to the trained model for a better classification performance.

Conclusions
In this study, wrapper-based metaheuristic optimization algorithms were utilized for activity-related feature selection in fNIRS-based BCI applications. The performance of seven metaheuristic optimization algorithms (i.e., PSO, CSO, the FA, the BA, FPO, the WOA, and GWO) was analyzed using a k-NN-based cost function. The results demonstrated that all of the metaheuristic optimization algorithms significantly improved the classification accuracy and reduced the feature vector size. After extensive training and testing, GWO obtained the highest classification rates of 94.83 ± 5.5%, 92.57 ± 6.9%, and 85.66 ± 7.3% for the MA, MI (left- and right-hand), and four-class (left- and right-hand MI, MA, and baseline) tasks, respectively. Furthermore, the statistical analysis (p < 0.01) showed that GWO yielded better and more stable results. Therefore, the wrapper-based metaheuristic optimization algorithms are considered helpful for selecting the appropriate activity-related features for robust fNIRS-based BCI applications. In the future, other state-of-the-art optimization algorithms and their binary versions can be applied to measure their performance in fNIRS-based BCI applications.