1. Introduction
At present, the application of artificial intelligence methods to the analysis of biological and medical data is an important and actively researched scientific task [
1,
2,
3]. Active research in this direction includes the analysis of medical images (CT scans, X-rays, etc.) [
4,
5], diagnostics of the human cardiovascular system [
6], genomic medicine [
7], various methods of brain activity neuroimaging [
8,
9,
10], etc. Within the latter, of particular interest is the diagnosis of various brain conditions and pathologies based on neuroimaging data such as fMRI, fNIRS, MEG, and EEG [
11,
12,
13]. In particular, machine learning methods, including deep learning, have been applied to brain signals in a broad range of tasks, such as mental workload assessment, disease prediction, stroke prediction [
14,
15], classification of EEG/MEG signals [
16,
17,
18], prediction of sleep stages [
19,
20], and finding EEG biomarkers [
21,
22]. When machine learning methods are applied to detect EEG/MEG features in real time, it becomes possible to implement brain–computer interfaces for neurorehabilitation, control of human brain states, and robotics [
23].
One of the central points of the application of artificial intelligence methods to medical and biological tasks is the interpretability of such approaches [
24,
25,
26,
27]. This is important for the creation of various assistive medical decision support systems [
28,
29], when a medical professional must understand and interpret the decision obtained using artificial intelligence methods. In this regard, in neuroscience, it is of great interest to develop and analyze various approaches for the diagnosis of neuroimaging data that are interpretable. Especially important is the interpretation of those features on the basis of which we create a particular machine learning system for application in biology and medicine. In the tasks of creating brain–computer interfaces for rehabilitation, communication with neurological patients, or the control of some external device, such as manipulators or exoskeletons, the most commonly used method is electroencephalography (EEG) to record brain activity and detect its features for subsequent command generation [
23].
Explainable artificial intelligence (XAI) algorithms are considered to follow the principle of explainability [
30]. The concept of explainability does not yet have a commonly accepted definition, so a number of interpretations exist [
31]. One of them is that explainability in machine learning can be considered as “
the collection of features of the interpretable domain, that have contributed for a given example to produce a decision (e.g., classification or regression)” [
32]. Therefore, analyzing the influence of the different features on the classification process is necessary for explainable machine learning methods.
For an EEG-based brain–computer interface (BCI), deep learning interpretability can reveal how different factors contained in the EEG affect the model decisions. For instance, Bang et al. [33] compared sample-wise interpretations obtained with the layer-wise relevance propagation (LRP) approach between two subjects and revealed potential reasons for the worse performance of one of them. The LRP approach was also used by Sturm et al. [34] to analyze a deep learning model designed for a motor imagery task. They attributed the factors leading to incorrect classification to artifacts from visual activity and eye movements, which reside in the EEG channels of the occipital and frontal regions. Ozdenizci et al. [35] proposed an adversarial inference approach to learn robust features from EEG across different subjects. By interpreting the results with the LRP approach, they confirmed that their method allowed the model to focus on neurophysiological features in the EEG while being less affected by electrode artifacts. Cui et al. [36] used the class activation map (CAM) method [37] to examine individual classifications of single-channel EEG signals collected during a sustained driving task. They found that the model had learned to identify brain activity patterns, such as alpha spindles and theta bursts, in addition to features resulting from electromyography (EMG) activity, as evidence to distinguish between drowsy and alert EEG signals. In another work, Cui et al. [38] proposed a novel interpretation technique that takes advantage of the hidden state output of the long short-term memory (LSTM) layer to interpret a CNN-LSTM model designed for driver drowsiness recognition from single-channel EEG. The same authors recently reported a novel interpretation method [39] based on a combination of the CAM approach [37] and the CNN-Fixation technique [40] for multichannel EEG signal classification and discovered stable features across different subjects for the task of driver drowsiness recognition. Using this interpretation method, they also analyzed the reasons behind a few incorrectly classified samples. Despite this progress, it remains unclear to what extent the interpretation results can be relied upon and how well they reflect the model's decisions. Existing works also do not clearly explain why a specific interpretation technique is chosen over others. These studies motivate us to conduct quantitative evaluations and comparisons of these interpretation strategies to gain in-depth knowledge of models designed for classifying perceptual brain states of a group of volunteers using EEG signals.
It should be noted that convolutional neural networks (CNNs), one of the most popular ML methods, are commonly used and have demonstrated great success in image classification, natural language processing, computer vision, etc. [
41,
42]. However, their application to brain signals for achieving good accuracy requires a very deep CNN, which leads to a large number of parameters and high computational costs [
43,
44]. At the same time, multilayer perceptrons are widely used for classifying EEG signals and usually demonstrate high efficiency [
45,
46].
In our earlier works, various machine learning approaches were investigated for EEG/MEG-based classification of brain states during the perception of different visual stimuli by the subject, including ambiguous images (Necker cubes) [
47,
48,
49]. In particular, the use of features based on the physiology of brain processes has been shown to improve the efficiency of the classification of brain states corresponding to the perception of visual images [
50,
51]. In our previous paper [
52], the influence of image contrast (on the example of the famous painting by Leonardo da Vinci “Mona Lisa”) on information processing in the brain was investigated and the coherent resonance effect and reorganization of functional connections in the brain at a certain degree of image contrast were shown by the electroencephalography data. In the present work, we used this experimental dataset to classify brain states corresponding to different degrees of contrast. We also used a similar experimental dataset with EEG data on subjects’ perception of an ambiguous Necker cube with varying degrees of contrast to compare with perception of an artistic painting.
In this paper, we apply different machine-learning methods for the classification of brain states during visual perception and focus on the interpretability of the results. To estimate the influence of different features on the classification process and make the method more interpretable, we use the SHAP library. As data, we use 31-channel EEG recordings with a 250 Hz sampling rate filtered in five frequency bands. We find that a simple deep-learning model gives 100% accuracy for almost every dataset, which indicates an overfitting problem. Therefore, we introduce four models with different combinations of two activation functions and different optimization methods. We find that the best optimization method is Adagrad and the worst one is FTRL. In addition, we find that only Adagrad works well for both linear and tangent models. The contributions of the study are the following.
- We analyze the complex EEG dataset by using machine-learning techniques and find which optimization method is suitable for our dataset;
- We apply SHAP to estimate the influence of different features and make the ML model more interpretable;
- We find the best optimizer that works well for both linear and tangent models.
The paper is organized as follows. In 
Section 2, we describe the EEG datasets used for the analysis, research paradigm, mathematical models of the deep learning approach (activation functions, optimization methods and the method for feature importance estimation), and case studies with the description of the neural network’s structure. 
Section 3 contains the results of application of different models for the classification of image intensity. In 
Section 4, we compare the considered deep learning methods with each other. Finally, in 
Section 5, we draw the conclusions.
  2. Model and Methods
  2.1. Datasets Description
Our computational analysis is based on experimental neurophysiological data obtained earlier in dedicated experiments on the visual perception of images with different contrast levels [
52]. The experiment consisted of observing images of varying brightness from 
 to 
, as shown in 
Figure 1. All EEG data were recorded from 31 channels at a sampling rate of 250 Hz using the BE Plus LTM amplifier (EB Neuro S.p.a., Florence, Italy). A detailed description of the experiment can be found in Ref. [
52]. We used data from 5 subjects for two types of images. Two datasets corresponding to the observation of the ambiguous Necker cube (
Figure 1A) and the Mona Lisa painting (
Figure 1B) were considered. The complete set of all observed stimuli was formed by changing the brightness 
I of the stimuli, as shown in 
Figure 1A,B. A schematic illustration of the used experimental protocols is shown in 
Figure 2. At the start and at the end of the experiment, background activity was registered for 120 s. Each image with intensity 
I was presented to the subject for 60 s. Between the presentations, there was 20 s of rest. This study was conducted in accordance with the Helsinki Declaration and was approved by the Ethics Committee of Kant Baltic Federal University. All EEG data used for the analysis can be found in the repository [
53].
  2.2. EEG Preprocessing
Experimenters performed EEG preprocessing to remove various registration artifacts. After experimental registration, the EEG signals were filtered with a fourth-order Butterworth  Hz bandpass filter and a 50 Hz notch filter. In addition, an independent component analysis (ICA) was performed to remove eye blinking and heartbeat artifacts. It should be noted that in this study, we did not conduct any experimental studies and only used previously recorded data that had already been cleared of artifacts by the above procedures and with which we did not perform any additional manipulations.
  2.3. Research Paradigm
The schematic representation of the research paradigm and overall structure of the research is presented in 
Figure 3.
We begin our consideration with the structure of the data to be analyzed. We consider trials of the 31-channel EEG with a duration of 1 min with a sampling rate of 250 Hz for each of 10 brightnesses 
I, i.e., at each moment of time 
n for brightness 
I, a data vector
        
        is registered. Here, index 
 corresponds to EEG channels {Fp1, Fp2, …O1, O2} (see 
Figure 1C), 
 is the signal registered in the 
i-th channel.
The most frequent and usual practice of EEG analysis is to look at different frequency ranges [
54,
55,
56]. Here, we use the standard EEG delta-(
 Hz), theta-(
 Hz), alpha-(
 Hz), beta-(
 Hz), and gamma-(
 Hz) frequency ranges [
57] for feature extraction and performance evaluation in perceptual brain states classification.
To obtain the data for each of the above frequency ranges, we re-referenced the EEG signals (
1) to the total average, subtracted the mean, and filtered with a fourth-order Butterworth 
-Hz bandpass filter, where 
 and 
 are the boundaries of the frequency domain of interest [
58]. For example, for the 
-range 
 Hz and 
 Hz, and we obtained the alpha-band signal 
. Similarly, we obtain signals 
 for all other 
-, 
-, and 
- frequency bands of interest.
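The band extraction described above (re-referencing to the channel average, mean removal, fourth-order Butterworth band-pass filtering) can be sketched in Python with NumPy and SciPy. This is a minimal sketch under stated assumptions, not the code used in the study: the band boundaries in `BANDS` are commonly used values inserted purely for illustration, zero-phase filtering with `sosfiltfilt` is an assumption, and the function name `extract_band` is hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250.0  # sampling rate in Hz, as stated in the text

# Assumed band boundaries in Hz (illustrative values only).
BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta": (15.0, 30.0),
    "gamma": (30.0, 45.0),
}


def extract_band(eeg, f_low, f_high, fs=FS, order=4):
    """Return band-limited EEG of shape (n_channels, n_samples)."""
    eeg = np.asarray(eeg, dtype=float)
    # Re-reference every time sample to the average over all channels.
    eeg = eeg - eeg.mean(axis=0, keepdims=True)
    # Subtract the temporal mean of every channel.
    eeg = eeg - eeg.mean(axis=1, keepdims=True)
    # Butterworth band-pass design, returned as second-order sections.
    sos = butter(order, [f_low, f_high], btype="bandpass", fs=fs, output="sos")
    # Forward-backward filtering gives zero phase (assumption of this sketch).
    return sosfiltfilt(sos, eeg, axis=1)
```

Calling `extract_band(eeg, *BANDS["alpha"])` on a 31 × 15,000 array returns an array of the same shape containing the alpha-band signals. Note that forward-backward filtering applies the filter twice, so the effective attenuation is steeper than that of a single fourth-order pass.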
Thus, we characterize the brain state during the visual perception of the images of each brightness by 60 s × 250 samples/s = 15,000 features (
1) for each frequency band. We apply two deep learning models to classify brain states for different image brightnesses 
I. We consider three spatial domains of features: (i) all the EEG channels in the left and right hemispheres of the brain, (ii) EEG channels in the left hemisphere only, and (iii) EEG channels in the right hemisphere only.
We consider two separate strategies for learning and analyzing significant features that substantially affect model learning. In the first case, learning was based on the above-described features with separate frequency ranges, for each of which a different classification model was created. The input data were the trait values  and image brightness I. The second case used a single model that combined all the features  to predict image brightness I. In this case, different deep learning models with various neuron activation functions and various types of optimizers were used.
All deep learning models were tested on both datasets collected during the perception of the Necker cube and the Mona Lisa painting. We applied the SHAP (Shapley additive explanations) method for feature importance estimation.
  2.4. Mathematical Model of the Deep Learning Approach
Two types of activation functions are used: tanh for the internal layers and softmax for the output. For a compiler, we use  as the optimizer and  for loss.
The tanh function is described as
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$
Here, $z = W \times x + b$, where $x$ is the input layer value, $b$ is the broadcasting value through the columns (bias), $W$ is the weight value for $x$, and L is the number of layers.
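The standard softmax function $\sigma: \mathbb{R}^K \to (0,1)^K$ is defined as
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K,$$
where $\mathbf{z} = (z_1, \ldots, z_K)$ is the input vector, and K is the number of classes in the multi-class classifier. In our case, $K = 10$ is the number of intensities I (Figure 1).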
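As a small illustration of how a softmax output layer turns ten scores into class probabilities, the following plain-Python sketch may help. The score values are made up for the example, and subtracting the maximum before exponentiation is a standard numerical safeguard rather than something stated in the text.

```python
import math


def softmax(z):
    """Map K real scores to K probabilities that sum to one."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Ten output scores, one per brightness level (hypothetical values).
scores = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
probs = softmax(scores)
# The predicted class is the index of the largest probability.
predicted_class = probs.index(max(probs))
```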
We also use the rectified linear unit (ReLU) in the convolutional layers due to its computational simplicity and representational sparsity:
$$\mathrm{ReLU}(x) = \max(0, x),$$
where x is the input layer value.
Next, we describe the optimization methods used in the paper. We use the following general notation: $\theta$ denotes the model parameters to be optimized, $J(\theta)$ is the objective function, $g_t = \nabla_\theta J(\theta_t)$ is the gradient of the objective function with respect to the parameters $\theta$, $\eta$ is the learning rate determining the size of the steps, t is the step number, and $\epsilon$ is a smoothing term that avoids division by zero.
  2.4.1. Stochastic Gradient Descent (SGD) as Optimizer
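SGD performs a parameter update for each training example. For example, if the dataset has $x^{(i)}$ features and $y^{(i)}$ target or label values, then for each step t, we have the following update rule [59]:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)}),$$
where $i$ is the index of the training example.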
  2.4.2. Root Mean Square Propagation (RMSprop) as an Optimizer
Root Mean Square propagation (RMSprop) [
59,
60] is very similar to gradient descent with momentum; the only difference is that it uses the second-order moment instead of the first-order one, plus a slight change in the parameter update:
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t,$$
where E is the decaying average over past squared gradients, and $\gamma$ is the momentum term.
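A minimal plain-Python sketch of one RMSprop step in its standard formulation is given below. The function name, the default hyper-parameters, and the quadratic toy objective are assumptions chosen for illustration; they are not taken from the study.

```python
import math


def rmsprop_step(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update; returns new parameters and new running averages."""
    # Decaying average of squared gradients, kept per parameter.
    new_avg = [gamma * a + (1 - gamma) * g * g for a, g in zip(avg_sq, grad)]
    # Each parameter is moved by the gradient scaled with the root of its average.
    new_theta = [
        p - lr * g / math.sqrt(a + eps)
        for p, g, a in zip(theta, grad, new_avg)
    ]
    return new_theta, new_avg


# Toy objective J(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta, avg_sq = [1.0, -2.0], [0.0, 0.0]
for _ in range(2000):
    grad = [2.0 * p for p in theta]
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq, lr=0.01)
```

Because the step is normalized by the gradient magnitude, the parameters approach the minimum at a roughly constant pace and then oscillate within a band of the order of the learning rate.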
  2.4.3. Adaptive Gradient Algorithm (Adagrad) as Optimizer
The adaptive gradient algorithm (Adagrad) [
61] is a gradient-based optimization technique that adapts the learning rate to the parameters, producing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. As a result, it is well suited for dealing with sparse data. Dean et al. [62] discovered that Adagrad increased the robustness of SGD and used it to train large-scale neural networks at Google [63], which learned to detect cats in YouTube videos [
64], among other things. In addition, Pennington et al. [
65] employed Adagrad to train GloVe word embeddings, because uncommon words need considerably bigger updates than frequent words. The update rule is given by Equation (7):
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t, \tag{7}$$
where $G_t$ is the sum of the squares of all past gradients, and ⊙ is the element-wise vector multiplication between $G_t$ and $g_t$.
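The effect of Adagrad on rarely updated parameters can be illustrated with a short plain-Python sketch of the standard update. The function name, the hyper-parameter defaults, and the artificial gradient sequence are assumptions made for the example.

```python
import math


def adagrad_step(theta, grad, sum_sq, lr=0.01, eps=1e-8):
    """One Adagrad update; returns new parameters and new accumulators."""
    # Accumulate the squared gradient of every parameter over all steps.
    new_sum = [s + g * g for s, g in zip(sum_sq, grad)]
    # The effective learning rate shrinks as the accumulator grows.
    new_theta = [
        p - lr / math.sqrt(s + eps) * g
        for p, g, s in zip(theta, grad, new_sum)
    ]
    return new_theta, new_sum


# Parameter 0 receives a gradient at every step (a "frequent" feature);
# parameter 1 receives one only at the last step (a "rare" feature).
theta, sum_sq = [0.0, 0.0], [0.0, 0.0]
for step in range(1, 11):
    grad = [1.0, 1.0 if step == 10 else 0.0]
    previous = theta
    theta, sum_sq = adagrad_step(theta, grad, sum_sq)

# Size of the final update for each parameter.
last_step = [abs(a - b) for a, b in zip(theta, previous)]
```

In this toy run the rare parameter moves by about the full learning rate at its single update, while the frequent parameter's tenth update is reduced by a factor of roughly the square root of ten.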
  2.4.4. Extension of Adagrad (Adadelta) as Optimizer
Adadelta [
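66] is an extension of Adagrad that seeks to reduce its monotonically decreasing learning rate. The update rule of this method is as follows [67]:
$$\theta_{t+1} = \theta_t - \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t,$$
where $RMS[\Delta\theta]_{t-1} = \sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}$ and $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$.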
  2.4.5. Adaptive Moment Estimation (Adam) as Optimizer
Adaptive Moment Estimation (Adam) [
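68] is a method that computes adaptive learning rates for each parameter, similarly to Adadelta and RMSprop. Here, the update rule is as follows:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t, \tag{9}$$
where the bias-corrected moment estimates are
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \tag{10}$$
with $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, where $\beta_1$ and $\beta_2$ are decay rates.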
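One Adam step in its standard formulation can be sketched in plain Python as follows. The function name and default hyper-parameters are assumptions for illustration (the defaults follow commonly quoted values), and the step counter `t` is assumed to start at 1.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    # Decaying averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# First step from theta = 1 with gradient 2: the move is close to lr.
theta, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

Thanks to the bias correction, the very first update has magnitude close to the learning rate regardless of the gradient scale.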
  2.4.6. Extension of Adaptive Moment Estimation (AdaMax) as Optimizer
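The $v_t$ factor in the Adam update rule scales the gradient inversely proportionally to the $\ell_2$ norm of the past gradients (via the $v_{t-1}$ term) and current gradient $|g_t|^2$:
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2.$$
Then, we generalize this update to the $\ell_p$ norm with parameterization $\beta_2$ as $\beta_2^p$ [68]:
$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p.$$
Norms for large p values generally become numerically unstable. However, $\ell_\infty$ generally shows stable behavior. That is why the authors propose AdaMax [68] and show that $v_t$ with $\ell_\infty$ converges to the more stable value:
$$u_t = \lim_{p \to \infty} (v_t)^{1/p} = \max(\beta_2 \cdot u_{t-1}, |g_t|).$$
Then, we obtain the AdaMax update rule:
$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\, \hat{m}_t.$$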
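A minimal plain-Python sketch of one AdaMax step in its standard formulation is shown below. The function name, the defaults, and the guard for a zero norm are assumptions of this sketch and are not taken from the study.

```python
def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update at step t (t starts at 1)."""
    # First moment as in Adam.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    # Infinity-norm accumulator: decayed previous value or current |g|.
    u = [max(beta2 * ui, abs(g)) for ui, g in zip(u, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    # Skip the update while u is still zero to avoid division by zero
    # (a safeguard added in this sketch).
    theta = [
        p - lr * mh / ui if ui > 0 else p
        for p, mh, ui in zip(theta, m_hat, u)
    ]
    return theta, m, u


# First step with gradient -4, then a step with zero gradient.
theta, m, u = adamax_step([1.0], [-4.0], [0.0], [0.0], t=1)
theta_after_first = theta[0]
theta, m, u = adamax_step(theta, [0.0], m, u, t=2)
```

The accumulator needs no bias correction because the maximum operation is not biased toward zero in the way the decaying averages in Adam are.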
  2.4.7. Nesterov-Accelerated Adaptive Moment Estimation (Nadam) as Optimizer
Nadam (Nesterov-accelerated Adaptive Moment Estimation) [
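69] combines Adam and Nesterov accelerated gradient. Combining Equations (9) and (10) and noting that the bias-corrected momentum of the previous step, $\hat{m}_{t-1}$, can be replaced by the current estimate $\hat{m}_t$ (the Nesterov look-ahead), we obtain the Nadam update rule:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right).$$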
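For comparison with Adam, one Nadam step in its commonly cited simplified form can be sketched in plain Python. This is an illustrative sketch: the function name and defaults are assumptions, and practical implementations may add a momentum schedule that is omitted here.

```python
import math


def nadam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nadam update at step t (t starts at 1)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Nesterov look-ahead: mix the corrected momentum with the current gradient.
    look_ahead = [
        beta1 * mh + (1 - beta1) * g / (1 - beta1 ** t)
        for mh, g in zip(m_hat, grad)
    ]
    theta = [
        p - lr * la / (math.sqrt(vh) + eps)
        for p, la, vh in zip(theta, look_ahead, v_hat)
    ]
    return theta, m, v


# First step from theta = 1 with gradient 2.
theta, m, v = nadam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

On the first step the look-ahead term equals 1.9 times the gradient, so the move is about 1.9 times the learning rate, compared with about one learning rate for Adam.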
  2.4.8. Follow the Regularized Leader (FTRL) as Optimizer
FTRL [
70] strikes a trade-off between the advantages and disadvantages of forward–backward splitting (FBS) [
71] and regularized dual averaging (RDA) [
72]. The update rule is as follows:
          where 
 is the average of previous sub-gradients. In FTRL, the learning rate is different for different dimensions. If the training data deem that the dimension 
i needs to take a wider step than dimension 
j, the below learning rate will accommodate such updates:
Here,  and  are a non-negative and non-decreasing sequence, and  is the learning rate.
  2.5. SHAP (SHapley Additive exPlanations) for Feature Importance Estimation
Current explainable machine learning methods such as SHAP [73] build on the stochastic game [74] to explain model predictions. The Shapley value comes from cooperative game theory, for example, Shapley regression values [75] or Shapley sampling values [76]. Shapley regression values are feature importances for linear models in the presence of multicollinearity. The technique requires fitting the model on all feature subsets $S \subseteq F$, where $F$ is the set of all features. Every feature receives an importance value that represents its contribution to the model prediction. To compute this value for a given feature $i$, a model $f_{S \cup \{i\}}$ is fitted with the feature present and a model $f_S$ with the feature withheld, and the predictions of the two models are compared on the current example x, $f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)$, where $x_S$ denotes the values of the input features in the set $S$. To take into account the influence of all other features, the differences are computed for every subset $S \subseteq F \setminus \{i\}$. The final Shapley value is calculated as a weighted average of these differences:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right].$$
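The weighted average over feature subsets can be made concrete with a brute-force computation of exact Shapley values for a tiny toy model. This is a sketch for intuition only: the value function, weights, and inputs are invented for the example, and the enumeration is exponential in the number of features, which is why practical tools rely on approximations for inputs with thousands of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley values for a value function defined on feature subsets."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy model: linear terms plus an interaction between features 0 and 1.
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]


def value(subset):
    linear = sum(weights[j] * x[j] for j in subset)
    interaction = 3.0 if {0, 1} <= subset else 0.0
    return linear + interaction


phi = shapley_values(3, value)
```

The interaction term is split equally between the two features involved, and the attributions sum to the difference between the value of the full feature set and that of the empty set.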
  2.6. Case Study
We implement two case studies shown in 
Table 1. In the first study, we obtained the maximum accuracy of 100% for almost every dataset, which indicates an overfitting problem. Therefore, we introduced a second method that has a data validation step and multiple model optimizers. Here, the equations for estimating the result:
  2.6.1. Case Study I
First, we implement a simple ANN model with the RMSprop optimizer (
Section 2.4.2) and a structure described in 
Table 2. The number of connections in the input layer corresponds to the number of features (15,000), and the number of output connections is equal to the number of brightnesses (10). The number of parameters is equal to the number of connections between the layers. For example, for the input and first hidden layer, it is defined as (15,000 + 1) × 2500 = 37,502,500, where “+1” is the additional channel for the intensity. The exception is the number of the parameters for the output layer: (15 + 1) × (10 + 1) = 176. Using this model, we obtain 100% accuracy for both datasets.
  2.6.2. Case Study II
Next, we use models with validation, different activation functions, and different optimizers. For validation, we rely on the mechanism built into the Keras Sequential model: we set aside 10% of the original dataset as the data on which to evaluate the loss and any model metrics at the end of each epoch. The model is not trained on these data; they are only used for tuning hyper-parameters so that the model works well on unseen data.
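The 10% hold-out described above corresponds to the `validation_split` argument of the Keras `fit` method, which takes the held-out fraction from the end of the provided arrays before any shuffling. The plain-Python sketch below mimics that slicing so the behavior is easy to inspect; the helper name `holdout_split` and the toy data are hypothetical.

```python
def holdout_split(samples, labels, validation_split=0.1):
    """Split off the last fraction of the data for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    train = (samples[:split_at], labels[:split_at])
    val = (samples[split_at:], labels[split_at:])
    return train, val


# Toy data: 100 samples labeled with one of 10 brightness classes.
samples = list(range(100))
labels = [s % 10 for s in samples]
(train_x, train_y), (val_x, val_y) = holdout_split(samples, labels)
```

Because the tail of the data is held out, samples that are stored in class order would leave some classes out of training, so shuffling the samples beforehand is advisable in that situation.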
We consider four models, which differ from each other in activation function: tanh, relu or their combinations. The structure of the models is described in 
Table 3. The layers are the same as in Case Study I, so the number of parameters is also the same, except that the output layer now follows the general rule: (15 + 1) × 10 = 160.
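The parameter counts quoted for Case Studies I and II can be checked with a few lines of Python. The helper name `dense_params` is hypothetical; it simply encodes the rule (inputs + 1) × outputs used in the text, and only the layer sizes explicitly mentioned there are evaluated.

```python
def dense_params(n_inputs, n_outputs):
    """Parameters between two layers: (inputs + 1) * outputs."""
    return (n_inputs + 1) * n_outputs


# Input layer (15,000 features) to the first hidden layer (2500 units).
first_hidden = dense_params(15_000, 2_500)
# Output layer in Case Study I, counted with the stated exception.
output_case_1 = (15 + 1) * (10 + 1)
# Output layer in Case Study II, counted with the general rule.
output_case_2 = dense_params(15, 10)
```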
  2.6.3. Computing System
The configuration of the computing system we used to perform ML is as follows:
• RAM: 503 GB;
• CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz;
• OS: Ubuntu 18.04.5 LTS 64 bit;
• GPU: NULL.
  5. Conclusions
We have applied different machine-learning methods for the classification of brain states during visual perception. As data, we used 31 EEG channels filtered in the delta, theta, alpha, beta, and gamma frequency bands corresponding to the perception of Mona Lisa and Necker cube images. In Case Study I, we used a deep learning model with eight layers, the tanh activation function, and the RMSprop optimizer. Using this model, we obtained the maximum accuracy of 100% for almost every dataset, which indicates an overfitting problem. To estimate the influence of different features on the classification process and make the method more interpretable, we used the SHAP library.
To avoid the overfitting problem of the first model, we introduced the second method (Case Study II), which has a data validation step. Here, we used four models with different combinations of two activation functions (tanh and relu) to cross-check which model works better (linear or tangent). We also used different optimizers to check how they perform for different models. We found that the best optimization method is Adagrad; it performs well for most of the frequency bands. In contrast, the FTRL method does not work at all. The list of optimizers from best to worst is: Adagrad > SGD > AdaMax > Adadelta > Nadam > RMSprop > Adam > FTRL. In addition, we found that only Adagrad works well for both linear and tangent models.