Visual Saliency and Image Reconstruction from EEG Signals via an Effective Geometric Deep Network-Based Generative Adversarial Network

Understanding how the brain perceives input data from the outside world is one of the great goals of neuroscience. Neural decoding helps us to model the connection between brain activity and visual stimulation, and this modeling makes it possible to reconstruct images from brain activity. Recent studies have shown that brain activity is influenced by visual saliency, that is, by the important parts of an image stimulus. In this paper, a deep model is proposed to reconstruct the image stimuli from electroencephalogram (EEG) recordings via visual saliency. To this end, the proposed geometric deep network-based generative adversarial network (GDN-GAN) is trained to map the EEG signals to the visual saliency maps corresponding to each image. The first part of the proposed GDN-GAN consists of Chebyshev graph convolutional layers. The input of the GDN part is the functional connectivity-based graph representation of the EEG channels. The output of the GDN is fed to the GAN part of the proposed network to reconstruct the image saliency. The proposed GDN-GAN is trained using the Google Colaboratory Pro platform. The saliency metrics validate the viability and efficiency of the proposed saliency reconstruction network. The weights of the trained network are then used as initial weights to reconstruct the grayscale image stimuli. The proposed network thus realizes image reconstruction from EEG signals.


Introduction
Elucidating the function of the brain in perceiving the input data from the outside world is of particular importance to help improve biometric innovations and BCI challenges. Brain recording techniques play an essential role in realizing this concept. The recordings pave the way to modeling the representation of the information in the brain to recognize how it works to perceive the input data. This modeling has been the subject of a number of studies in the fields of brain encoding and decoding [1].
EEG, one of the most popular non-invasive brain recording methods, has been widely used in studies of brain activity under various circumstances, for example, experiments concerning attention, memory [2], motor control [3], drowsiness in the brain [4], EEG-based driving safety monitoring [5], emotions [6], driver fatigue [7,8], visual decoding [9,10], brain activities during sleep [11], and movement intention detection [12]. Some studies have focused on the relation between visual input to the brain and EEG recordings. In 2010, Ghebreab et al. [13] investigated the EEG signals recorded in response to natural visual stimulation, and predicted the visual inputs from the EEG responses. A better accuracy was achieved in comparison to a similar work by Kay et al. in 2008 [14]. These studies have had the potential to reveal the effects of visual features such as color [15], orientation [16], and position [17] on the brain signals of the visual cortex.
The representation of visual stimuli in the brain relates to the important points of the picture. To express the priority of each location in a visual image for representation in the brain, and to identify such locations efficiently, the concept of a saliency map was first proposed by Koch and Ullman in 1985 [18]. Following the concept introduced by Koch and Ullman, Itti et al. introduced a computational model of the saliency map in 1998 [19]. According to the work of Itti et al. [19], detecting rarity, distinctiveness, and uniqueness in a scene is essential for salient object detection. Based on the model proposed by Itti et al. [19], many models have been developed for predicting image saliency.
Understanding how the salient region affects the brain signal is of great importance to understanding how the visual system works. Although some efforts have been made to explore the relationship between brain activity, as captured by recorded EEG signals, and the salient regions of the visual stimuli, the mapping of the EEG signals to image saliency has not been realized. Moreover, the dynamic information between connected EEG channels, as given by the functional connectivity between different brain regions, has not been used to explore the connection between brain activity and salient regions.
To achieve an efficient mapping of EEG signals to the salient regions of the visual stimuli, a deep network based on the graph representation of EEG records is introduced. The mapping extracts the visual saliency map related to the recorded EEG signals. The proposed network consists of two parts: a geometric deep network and a generative adversarial network. The graph representation of the EEG records makes it possible to exploit the functional connectivity between the channels of each EEG recording in the classification procedure of the geometric deep network part. The overall model realizes visual saliency reconstruction from the EEG records.
The contributions made by this article can be highlighted as follows:
(i) It provides an efficient deep network to extract a saliency map of visual stimuli from visually provoked EEG signals.
(ii) Reconstruction of the visual stimuli is possible through the proposed deep network.
(iii) It provides a geometric visual decoding network for extracting features from the EEG recordings to identify 40 different patterns of EEG signals corresponding to 40 image categories.
(iv) A graph representation of the EEG channels is imposed as an input to the proposed GDN-GAN, in which functional connectivity between 128 EEG channels is employed to construct the graph.
(v) In the proposed method, the time samples of EEG channels are used directly as the graph nodes to remove the feature extraction phase and to decrease the computational burden.
(vi) For the first time, it presents a model to connect the EEG recordings, visual saliency, and visual stimuli together.
(vii) For the first time, it proposes a fine-tuning process to realize image reconstruction from EEG signals via visual saliency reconstruction.
The remainder of this paper is arranged as follows. Section 2 reviews the related works. Section 3 provides the details of the EEG-ImageNet database and reviews the mathematical preliminaries of graph convolution and generative adversarial networks. Section 4 describes the structure of the proposed framework for EEG-based visual saliency detection and visual stimuli reconstruction. Section 5 presents the experimental results and validates the performance of the proposed framework against state-of-the-art methods. Finally, the conclusions are provided in Section 6.

Related Work
In a number of early works on salient region detection [20][21][22], saliency was considered as being unique, and was frequently calculated as center-surround contrast for every pixel. In 2005, Hu et al. [23] used generalized principal component analysis (GPCA) [24] to compute salient regions. GPCA has been used to estimate the linear subspaces of the mapped image without segmenting the image, and salient regions have been determined by considering the geometric properties and feature contrast of regions. Rosin [25] proposed an approach for salient object detection, which has required very simple operations for each pixel, such as moment preserving binarization, edge detection, and threshold decomposition. Valenti et al. [26] proposed an isophote-based framework where isocenter clustering, color boosting, and curvedness have been used for the estimation of the saliency map. In addition, some supervised learning-based models for saliency detection were proposed, such as support vector machine in 2010 with Zhong et al. [27], regression in 2016 with Zhou et al., and neural networks with Duan in 2016 [28].
Some saliency detection methods are based on models developed for simulating visual attention processes. Visual attention is a selective procedure that takes place in understanding the visual input to the brain from the surrounding environment. Neisser, in 1967 [29], suggested that bottom-up and top-down processes occur in the brain while the objects of a visual scene are being processed. The bottom-up process is pre-attentive and driven by primitive features, whereas the top-down process is attentive and task-driven. Accordingly, bottom-up-based models, top-down-based models, and models that consider both processes have been proposed for visual attention.
An analysis of the bottom-up visual attention mechanism has resulted in bottom-up-based saliency detection models. The bottom-up process is fast and uses low-level visual properties such as color, intensity, and orientation. A number of researchers have made efforts to improve the performance of the bottom-up-based saliency models. In 2013, Zhang and Sclaroff measured the contour information of regions using a set of Boolean maps to segment the salient objects from the background, and the efficiency of the model was demonstrated on five eye tracking databases [30]. In 2015, Mauthner et al. proposed an estimation of the joint distribution of motion and color features based on Gestalt theory, in which the local and global foreground saliency likelihoods are described with an encoding vector, and these individual likelihoods generate the final saliency map [31].
The top-down visual attention process has resulted in top-down-based saliency detection models. Intention and thought are involved in this process, and it is influenced by prior knowledge and by the task given to the brain. To see how these two processes affect saliency models differently, consider an image that includes two kinds of fruit. Both kinds have the same saliency level in the bottom-up model; in the top-down model, however, the given task affects the saliency level of each kind. Top-down saliency-based models, as in the work of Xu et al. in 2014 [32], have been built through contextual guidance, through pre-defining discriminant features, and through allocating learned weights to different features, as performed by Zhao and Koch in 2011 [33]; in 2017, Yang [34] adapted the feature space in a supervised manner to obtain the saliency output.
The third category is an integration of the bottom-up and top-down saliency detection models. The detection of possible salient regions is achieved through the bottom-up process, and the effect of the given task is processed according to the top-down model.
After selecting one of these three process models, the features extracted from every pixel in the input, or the spatial attributes of the regions, are used to compute the saliency features. Although real-time saliency detection with hand-crafted features performs well, it struggles to capture salient objects in challenging scenarios. One of the proposed solutions to these challenges is the use of neural networks [35,36]. Convolutional neural networks (CNNs) [35] are among the most popular networks in machine learning, and they have been applied to a number of vision problems such as edge detection, semantic segmentation [37], and object recognition [38]. Recently, He et al. and Li and Yu [39,40] have shown the effectiveness of CNNs when applied to salient object detection. A series of techniques has been proposed to learn saliency representations from large amounts of data by exploiting different CNN architectures. Some of the models proposed for saliency detection via neural networks use multilayer perceptrons (MLPs). In these models, the input image is usually oversegmented into small regions, and feature extraction is performed using a CNN. The extracted features are fed to an MLP to determine the saliency value of each small region. He et al. [39] solved the saliency problem using one-dimensional convolution-based methods. Li and Yu [40] utilized a pre-trained CNN as a feature extractor: the input image is decomposed into a series of non-overlapping regions, and a CNN with three different-scale inputs extracts features from the decomposed regions. Advanced features at different scales are captured using three subnetworks of the proposed CNN and are concatenated to feed into a small MLP with only two fully connected layers. These dense layers act as a regressor that outputs a distribution over binary saliency labels.
Two recently proposed deep learning-based saliency models are SALICON [41,42] and SalNet [43]. Like other saliency detection methods, SALICON aims to predict visual saliency. This model uses the weights of pre-trained AlexNet, VGG-16, and GoogleNet. The last layer of SALICON is a convolutional layer that is used to extract the salient points. The initial parameters are taken from networks pre-trained on the ImageNet dataset, and back propagation is used to optimize the evaluation criterion, in contrast to previous approaches that used support vector machines. The training process in SalNet uses the Euclidean distance between the predicted salient points and the ground truth pixels. A shallow and a deep network have been presented. The shallow network consists of three convolutional layers and two fully connected layers with trained weights, and ReLU is used as the activation function of each layer. The deep network consists of 10 layers and 25.8 million parameters.
In recent years, some efforts have been made to understand the connection between the visual saliency content and the brain activity. In 2018, Zhen Liang et al. [44] presented a model to study this connection and extracted sets of efficient features of EEG signals to map to the visual salient related features of the video stimuli. The model has used the work of Tavakoli et al. in 2017 [45]. The reconstruction of the features of the salient visual points based on the features of the EEG signal has been performed with good accuracy in [44], and prediction of the temporal distribution of salient visual points has been done using EEG signals recorded in a real environment. In another study [46], the identification of the objects in images recorded by robots was the purpose of the study, and a method based on P300 wave was applied to identify the objects. The significant challenge for extracting the objects of interest in navigating the robots is how to use a machine to extract the objects of interest for humans. The combination of a P300-based BCI and a Fuzzy color extractor has been applied to identify the region of interest. Humbeeck et al. [47] have presented a model for calculating the importance of the salient points for the fixation positions. Brain function related to the extracted model has been studied using the eye-tracker and recording the EEG signal. An evaluation of the connection between the importance of salient points and the amplitude of the EEG signal has been done via this modeling. A multimodal learning of EEG and image modalities has been performed in [48] to achieve a Siamese network for image saliency detection. The idea of the work in [48] is the training of a common space of brain signal and image input stimuli by maximizing a compatibility function between these two embeddings of each modality. 
In [48], saliency is estimated by masking the image with image patches of different scales and computing the corresponding variation in compatibility. This process is performed at multiple image scales and results in a saliency map of the image.
In this article, we propose a novel deep network for mapping the visually provoked EEG signals to image saliency. In the next section, we explain the database settings and the mathematical background of the proposed method.

Materials and Methods
The EEG-ImageNet database is used in this paper and is explained in detail in this section. The mathematical background of Chebyshev graph convolution is then presented to clarify the function of a convolutional layer of the geometric deep network. Furthermore, an overview of generative adversarial networks and saliency evaluation metrics is given.

Database Settings
In this section, the details of the EEG-ImageNet database are described. This dataset is publicly available in perceive lab [48,49]. The EEG-ImageNet dataset has been recorded using a 128-channel cap (actiCAP 128Ch) [50]. Figure 1 illustrates the EEG placement according to this standard. The EEG-ImageNet includes the EEG signals of six human subjects produced as the result of visual stimulation.
To record the data as described in [49], each image was shown on the computer screen for 500 ms, and a set of 50 images of each category was shown to each of the subjects in 25 s. A total running time of 1400 s was dedicated to recording the EEG data of each subject. The total number of records used in our experiments is equal to 11,965.

Chebyshev Graph Convolution
In this section, we give a brief overview of graph convolution. The research of Michaël Defferrard et al. [52] popularized graph signal processing (GSP). The functions in GSP take into consideration both the properties of the graph's components and the structure of the graph. GSP extends convolution to the graph domain by applying signal processing tools such as the Fourier transform to graphs. The use of the Fourier transform in GSP leads to graph spectral filtering, also called graph convolution [53].
We explain graph convolution as described in [53]. Let $D \in \mathbb{R}^{N \times N}$ and $W \in \mathbb{R}^{N \times N}$ denote the diagonal degree matrix and the adjacency matrix of a graph, respectively. The $i$-th diagonal element of the degree matrix can be calculated by
$$D_{ii} = \sum_{j} W_{ij} \tag{1}$$
Then, $L$, the Laplacian matrix of the graph, is expressed as
$$L = D - W \tag{2}$$
The basis functions in the graph domain are calculated according to the eigenvectors of the graph Laplacian matrix. The eigenvectors of the graph Laplacian matrix, denoted by $U$, can be acquired via the singular value decomposition (SVD):
$$L = U \Lambda U^{T} \tag{3}$$
in which the columns of $U = [u_0, \ldots, u_{N-1}] \in \mathbb{R}^{N \times N}$ constitute the Fourier basis, and $\Lambda = \mathrm{diag}([\lambda_0, \ldots, \lambda_{N-1}])$ is a diagonal matrix. Calculating the eigenvectors of the Laplacian returns the Fourier basis for the graph. The graph convolution operation is defined as (4). Substituting $f(L)$ in (4) with the Chebyshev polynomial expansion of $L$ gives the Chebyshev graph convolution of $x$.
$$y = f(L)\,x = f(U \Lambda U^{T})\,x = U f(\Lambda) U^{T} x \tag{4}$$
where $L$ can be calculated from $W$ based on (2), and the calculation of $\Lambda$ can be conducted using (3).
The approximation of $f(\Lambda)$ is performed via the $K$-order Chebyshev polynomials, using the normalized version of $\Lambda$. The largest element among the diagonal entries of $\Lambda$ is denoted by $\lambda_{max}$, and the normalized $\Lambda$ is as follows:
$$\tilde{\Lambda} = \frac{2\Lambda}{\lambda_{max}} - I_N \tag{5}$$
where $I_N$ is the $N \times N$ identity matrix, and the diagonal elements of $\tilde{\Lambda}$ lie in the interval $[-1, 1]$. The approximation of $f(\Lambda)$ based on the $K$-order Chebyshev polynomials framework is as follows:
$$f(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda}) \tag{6}$$
where $\theta_k$ denotes the coefficient of the Chebyshev polynomials, and $T_k(\tilde{\Lambda})$ can be acquired according to the following formulas:
$$T_0(\tilde{\Lambda}) = I_N, \quad T_1(\tilde{\Lambda}) = \tilde{\Lambda}, \quad T_k(\tilde{\Lambda}) = 2\tilde{\Lambda}\,T_{k-1}(\tilde{\Lambda}) - T_{k-2}(\tilde{\Lambda}), \; k \geq 2 \tag{7}$$
According to (6), the graph convolution operation defined in (4) can be expressed using (7), as illustrated in (8).
$$y = U f(\Lambda) U^{T} x = \sum_{k=0}^{K-1} \theta_k\, U\, T_k(\tilde{\Lambda})\, U^{T} x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x, \quad \tilde{L} = \frac{2L}{\lambda_{max}} - I_N \tag{8}$$
The expression of Chebyshev graph convolution in (8) shows that it is equivalent to the combination of the convolutional results of $x$ with each component of the Chebyshev polynomial expansion [53].
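The Chebyshev graph convolution can be sketched numerically as follows. This is a minimal NumPy illustration, not the implementation used in this paper: the function name `chebyshev_graph_conv` is hypothetical, the coefficients `theta` are plain scalars (a trainable layer would typically use a weight matrix per polynomial order), and the adjacency matrix is assumed to be symmetric with at least one edge.

```python
import numpy as np

def chebyshev_graph_conv(x, W, theta):
    """Return sum_k theta[k] * T_k(L_tilde) @ x for a symmetric adjacency W.

    x: (N, F) node signals, W: (N, N) adjacency, theta: (K,) scalar coefficients.
    """
    N = W.shape[0]
    D = np.diag(W.sum(axis=1))                # diagonal degree matrix
    L = D - W                                 # graph Laplacian, as in (2)
    lam_max = np.linalg.eigvalsh(L).max()     # largest eigenvalue of L
    L_tilde = 2.0 * L / lam_max - np.eye(N)   # spectrum rescaled to [-1, 1]

    T_prev = x.copy()                         # T_0(L_tilde) x = x
    T_curr = L_tilde @ x                      # T_1(L_tilde) x
    y = theta[0] * T_prev
    if len(theta) > 1:
        y = y + theta[1] * T_curr
    for k in range(2, len(theta)):
        # Chebyshev recurrence applied directly to the filtered signal
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev
        y = y + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return y
```

Because the recurrence is applied to the signal itself, the eigendecomposition of the Laplacian is needed only to obtain the largest eigenvalue, which is the practical appeal of the polynomial form.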

Generative Adversarial Network
Generative deep modeling is an unsupervised learning task that discovers and learns the regularities in the input data in such a way that the learned model can be used to generate new examples that could plausibly have been drawn from the original dataset. A special case of generative models is the generative adversarial network (GAN), which uses two sub-networks, a generator and a discriminator, to solve the problem.
The generator network is trained to generate new examples, and the discriminator sub-network classifies these examples as either real or fake. The two sub-networks are trained in an adversarial way: the generator learns to output examples that resemble the real data, so that the discriminator is fooled and cannot distinguish the generated examples from the real domain.
The two sub-networks are trained simultaneously: a generative model $G$ maps a random vector $y$ drawn from a prior distribution $p_y(y)$ into the data domain, while a discriminative model $D$ tries to distinguish true examples drawn from the training data distribution $P_{data}(x)$ from simulated examples produced by the generator $G$.
Such networks are trained against each other until neither can make additional progress. The GAN objective function is as follows:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{y \sim p_y(y)}[\log(1 - D(G(y)))] \tag{9}$$
In the cost function (9), $x$ is the real data and $y$ is the feature vector fed to the generator; $G(y)$ is the output of the generator given the feature vector $y$. $D(x)$ is the output of the discriminator for real image data, and should be as close as possible to 1. $D(G(y))$ is the output of the discriminator for the generated samples $G(y)$. The probability densities of $x$ and $y$ are denoted by $P_{data}(x)$ and $p_y(y)$, respectively.
In the training procedure of a GAN, $G$ is trained to reduce $\log(1 - D(G(y)))$ in order to mislead the discriminator $D$. Conversely, $D$ is trained to assign the correct label to both real and generated samples, so that $D(x)$ approaches 1 for real data and $D(G(y))$ approaches 0, the value that indicates fake data.

Saliency Metrics
In this section, saliency evaluation metrics are described. Ground truth is necessary for calculating these metrics. Another input would be the saliency map. Considering these two inputs and computing these metrics, the degree of the similarity between them would be available.
Similarity (SIM) is a metric for measuring the intersection between two distributions [54]. The input maps are normalized, and the sum of the minimum values at each pixel is computed as SIM. Considering a saliency map $SM$ and a continuous fixation map $FM$:
$$SIM(SM, FM) = \sum_{j} \min(SM_j, FM_j), \quad \text{where } \sum_{j} SM_j = \sum_{j} FM_j = 1 \tag{10}$$
In (10), the sum runs over the discrete pixel locations $j$. For identical distributions, SIM is equal to one, while if there is no overlap between the distributions, SIM is zero.
Structural similarity (SSIM) is calculated using different windows of an image [55]. Considering two windows $g$ and $h$ of size $K \times K$, SSIM can be calculated as follows:
$$SSIM(g, h) = \frac{(2\mu_g \mu_h + c_1)(2\sigma_{gh} + c_2)}{(\mu_g^2 + \mu_h^2 + c_1)(\sigma_g^2 + \sigma_h^2 + c_2)} \tag{11}$$
where $\mu_g$ is the mean value of $g$; $\mu_h$ is the mean value of $h$; $\sigma_g^2$ is the variance of $g$; $\sigma_h^2$ is the variance of $h$; $\sigma_{gh}$ is the covariance of $g$ and $h$; $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two variables that stabilize the division when the denominator is weak; $L$ is the dynamic range of the pixel values (typically $2^{\text{bits per pixel}} - 1$); and $k_1 = 0.01$ and $k_2 = 0.03$ by default.
Pearson's correlation coefficient (CC) is a metric for evaluating the linear relationship between distributions [54]. Considering saliency and fixation maps, $SM$ and $FM$, CC can be calculated as follows:
$$CC(SM, FM) = \frac{\sigma(SM, FM)}{\sigma(SM)\,\sigma(FM)} \tag{12}$$
In (12), $\sigma(SM, FM)$ is the covariance of $SM$ and $FM$, and $\sigma(SM)$ and $\sigma(FM)$ are their standard deviations. CC is invariant to linear transformations. The metric is a symmetric function, which is why it treats false positives and false negatives equally. If the saliency map and the ground truth have similar magnitudes, high positive CC values occur.
The normalized scanpath saliency (NSS) is computed as the average normalized saliency at fixated locations. Because the mean value of saliency is subtracted during computation, NSS is robust against linear transformations [54]. This metric is sensitive to false positives: false positives lower the normalized saliency value at each fixation location, and the overall NSS is reduced. Given a saliency map $SM$ and a binary map of fixation locations $FB$:
$$NSS(SM, FB) = \frac{1}{K} \sum_{i} \overline{SM}_i \times FB_i, \quad \text{where } K = \sum_{i} FB_i \text{ and } \overline{SM} = \frac{SM - \mu(SM)}{\sigma(SM)} \tag{13}$$
where $i$ indexes the $i$th pixel, and $K$ is the total number of fixated pixels.
The shuffled area under the ROC curve (s-AUC) is a metric that uses the receiver operating characteristic (ROC) curve. The ROC curve is obtained by plotting the true positives against the false positives at various thresholds of the saliency map, so this metric needs sampled thresholds to approximate the ROC curve [54]. An important issue in computing the s-AUC metric is therefore how to sample the thresholds. Here, the thresholds are sampled at a fixed step size (from 0 to 1 in increments of 0.1), and the metric is computed from the resulting curve.
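The SIM, CC, and NSS definitions above translate almost directly into array operations. The following is a minimal NumPy sketch under stated assumptions: the function names are illustrative, the maps are non-negative arrays of equal shape with a positive sum and non-zero variance, and the fixation map passed to `nss` is binary. It is not the evaluation code used in this paper.

```python
import numpy as np

def sim(sm, fm):
    """Similarity: sum of pixel-wise minima after normalizing each map to sum to 1."""
    sm = sm / sm.sum()
    fm = fm / fm.sum()
    return float(np.minimum(sm, fm).sum())

def cc(sm, fm):
    """Pearson correlation coefficient between a saliency map and a fixation map."""
    sm = (sm - sm.mean()) / sm.std()
    fm = (fm - fm.mean()) / fm.std()
    return float((sm * fm).mean())

def nss(sm, fb):
    """Mean standardized saliency at the fixated pixels of the binary map fb."""
    sm = (sm - sm.mean()) / sm.std()
    return float(sm[fb.astype(bool)].mean())
```

A quick sanity check is that `sim(m, m)` is one for any valid map `m`, and that `cc` and `nss` do not change when the saliency map is rescaled by a positive factor and shifted by a constant.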

Proposed Geometric Deep Network-Based Generative Adversarial Network
The details of the proposed geometric deep network-based generative adversarial network (GDN-GAN) for visual saliency and image reconstruction are explained in this section, and the structure of the proposed framework is shown in Figure 2.

The Proposed Network Architecture
The proposed geometric deep network-based generative adversarial network (GDN-GAN) architecture consists of two sequential parts. Each part comprises a number of layers that map the EEG signals to the image saliency and reconstruct the image stimuli. The GDN part extracts discriminative features of the categories to which the input belongs. The GAN part maps the extracted feature vector to the image saliency. The trained weights of the network are then used as initial weights to train the network to map the EEG signal to the image stimuli, realizing image reconstruction from brain activity. The detailed schematic of the proposed network architecture is shown in Figure 2. The recorded visually evoked EEG signals are first embedded as a functional connectivity-based graph, which is fed to the GDN part of the proposed network. Figure 3 shows the detailed structure of the GDN part, which includes four graph convolutional layers. The Laplacian of the input graph is required to compute the graph convolution of the input in each layer; the computation is performed via the Chebyshev polynomial expansion of the graph Laplacian. Batch normalization is applied to the output of each layer. After the fourth graph convolutional layer, the extracted features are passed through a dropout layer. The flattened output of the dropout layer is then fed to a dense (fully connected) layer, and a log-softmax function is applied to the output of the dense layer for classification.
The weights are trained to classify the 40 categories of image stimuli, and the flattened vector before the last dense layer is passed to the GAN part of the network. The dimension of the flattened vector is 6400. Figure 4 illustrates how the dimensions change across the layers of the GDN. As each recorded EEG signal includes 128 channels, the graph constructed as the input to the GDN part in Figure 2 has 128 nodes, and every node holds 440 samples. The input dimension of the first graph convolutional layer, which is independent of the number of graph nodes, is therefore 440, the number of samples in each node. The output of the first graph convolution is a graph of 128 nodes with 440 samples in each node. The second graph convolution outputs a 128-node graph with 220 samples in each node, the third outputs a 128-node graph with 110 samples in each node, and the fourth outputs a 128-node graph with 50 samples in each node. Flattening this 128-node graph with 50 samples per node yields a vector with 6400 elements. The flattened vector is passed through a dense layer whose input and output dimensions are 6400 and 40, respectively. Table 1 shows the dimensions of the weight tensors for the different layers of the GDN part of the proposed GDN-GAN, together with the total number of parameters of the graph convolutional layers according to the order of the Chebyshev polynomial expansion considered for each layer. Figure 5 illustrates the different layers of the GAN part of the proposed network. Tables 2 and 3 give the details of the generator and discriminator parts of the proposed network, respectively.
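The graph input and the GDN dimensions described above can be checked with a few lines of NumPy. This is only a sketch: the text does not name the functional connectivity measure, so absolute Pearson correlation between channels is used here as an explicit assumption, the record is synthetic, and `connectivity_graph` is a hypothetical helper name.

```python
import numpy as np

def connectivity_graph(eeg):
    """Build a weighted adjacency matrix from one EEG record of shape (channels, samples).

    Assumption: absolute Pearson correlation between channels stands in for the
    functional connectivity measure, which the text does not specify.
    """
    W = np.abs(np.corrcoef(eeg))   # (channels, channels), symmetric
    np.fill_diagonal(W, 0.0)       # remove self-loops
    return W

rng = np.random.default_rng(0)
record = rng.standard_normal((128, 440))   # synthetic stand-in for one EEG record
W = connectivity_graph(record)

# Per-node feature sizes: the input, then the outputs of the four graph convolutions.
node_features = [440, 440, 220, 110, 50]
flattened_size = W.shape[0] * node_features[-1]   # 128 nodes * 50 features per node
n_classes = 40                                    # output size of the final dense layer
```

With 128 nodes and 50 features per node after the fourth layer, `flattened_size` matches the 6400-element vector that is passed on to the GAN part.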
The generator part of the GAN consists of two dense layers, followed by four sequential transposed two-dimensional (2D) convolution layers and one 2D convolution layer. The leaky rectified linear unit is used as the activation function of all layers except for the first dense layer. The output of the GDN is fed to the generator, so the input dimension of the generator is 6400, and the output dimension of the first dense layer is 100. The output dimension of the second dense layer is 20,000. The reshape layer converts the 20,000-dimensional vector to a three-dimensional output that can be fed to the convolutional layers: eight two-dimensional feature maps of size (50, 50) are passed to the first transposed 2D convolution layer. The kernel size in each of the transposed convolutional layers is 4 × 4, and the number of filters in each of them is eight. The stride is 2 × 2 in the first transposed convolution layer, 3 × 3 in the second, and 1 × 1 in the next two. The output of the fourth transposed 2D convolution is fed to the 2D convolution layer, whose kernel size is 2 × 2 and whose stride is 2 × 2. The output dimension of this layer is (299, 299), and it is passed to the last reshape layer. The output of the generator is a 299 × 299-dimensional image. The outputs of each layer and the changes in the dimensions across the generator part of the proposed GDN-GAN are illustrated in Figure 6.
The adversarial (discriminator) part of the proposed GAN has three 2D convolution layers with the rectified linear unit as the activation function. The kernel size of each of these convolutional layers is 4 × 4, the stride is 2 × 2, and the number of filters in each of them is four. The output of the third 2D convolution is flattened and fed to a dense layer with an output dimension of one, which discriminates between real images and fake images generated by the generator part of the GAN. Figure 7 gives a schematic view of the dimensions of the different layers and presents a tangible view of the outputs in each phase of the network.

Training and Evaluation
To fit the GDN part of the proposed GDN-GAN to the EEG-ImageNet dataset, a training procedure is implemented in which the weights of the network are optimized. A 10-fold cross-validation strategy is used to train and evaluate the proposed network. Stochastic gradient descent (SGD) is used to optimize the proposed GDN in each iteration, and the optimum parameters of the GDN are determined upon convergence of the training and test accuracy. The trained weights of the GDN are transferred to the GDN-GAN to train the reconstruction part of the network.
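A 10-fold split over the 11,965 records can be sketched as below. The text does not state whether the folds are shuffled or stratified by category, so the random shuffling and the fixed seed are assumptions, and `k_fold_indices` is a hypothetical helper rather than the code used in this paper.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)        # assumption: records are shuffled once
    folds = np.array_split(perm, n_folds)    # folds differ in size by at most one
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(11965))
```

Each record appears in exactly one test fold, so averaging the test accuracy over the ten folds uses every record once for evaluation.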
Binary cross-entropy is used as the loss function for the GAN part of the GDN-GAN. The discriminator loss is the sum of the loss on the original image and the loss on the generated image. For the original image, the cross-entropy is calculated between the discriminator output and a smoothed reference: instead of the ones vector, the ones vector scaled by a coefficient of 0.9 is used. For the generated image, the cross-entropy is calculated between the discriminator output and a zeros vector of the same dimension. The generator loss is the cross-entropy between the discriminator output for the generated image and a ones vector of the same dimension. An Adam optimizer with a learning rate of 0.0001 is used to train both the generator and discriminator networks.
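The loss terms described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the discriminator outputs are probabilities in (0, 1), and the function names and the clipping constant `eps` are hypothetical choices made for the example.

```python
import math


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and target labels."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(predictions)


def discriminator_loss(d_real, d_fake, real_label=0.9):
    """Loss on real images (targets scaled to 0.9) plus loss on generated images."""
    real_loss = binary_cross_entropy(d_real, [real_label] * len(d_real))
    fake_loss = binary_cross_entropy(d_fake, [0.0] * len(d_fake))
    return real_loss + fake_loss


def generator_loss(d_fake):
    """Cross-entropy between discriminator outputs on generated images and ones."""
    return binary_cross_entropy(d_fake, [1.0] * len(d_fake))
```

With the 0.9 target, a discriminator output of 0.99 on a real image is penalized more than an output of 0.9, which is the usual motivation for smoothing the real labels.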
The parameters of the proposed GDN-GAN are tuned through a trial-and-error procedure. Training is performed with the different parameter values listed in Table 4 as the search space. The optimal values, which give good training convergence, are also reported in this table.

Results and Discussion
In this section, the simulation results of the proposed GDN-GAN are presented. Our framework is implemented on a laptop with a 2.8 GHz Core i7 CPU, 16 GB RAM, and a GeForce GTX 1050 GPU, using the EEG-ImageNet database described in Section 3.1 and made available by the PeRCeiVe Lab [48,49]. The proposed network is trained using the Google Colaboratory Pro platform.
First, to illustrate the effect of different visual stimuli on brain activity during visual processing, we average the time-domain samples of each EEG channel across all recordings that belong to a particular category of visual stimulation. A representational similarity analysis of these signals is shown in Figure 9. This representation shows the similarity between brain activity for the different categories. It is good evidence that EEG signals contain the visually related information that leads a person to recognize the surrounding environment.
The functional connectivity estimation of the EEG channels is the first step of the GDN part of the proposed GDN-GAN. The connectivity matrix is approximated at a specific sparsity level, which reduces the number of nonzero elements of the corresponding adjacency matrix and thereby limits the computational complexity. The adjacency matrix is this sparse approximation of the connectivity matrix. Figure 10 illustrates the circular connectivity at the threshold level for sparsifying the connectivity matrix that gave the best training convergence. The circular connectivities for the green (Ch 1 to Ch 32), yellow (Ch 33 to Ch 64), red (Ch 65 to Ch 96), and white (Ch 97 to Ch 128) electrodes according to Figure 1 are shown separately.

Figure 11a shows the training/test accuracy of the GDN part of the proposed GDN-GAN, and Figure 11b shows the variation of the training/test loss with the number of iterations for the classification of 40 different categories of visual stimuli. Figure 12 shows the receiver operating characteristic (ROC) plot for the GDN part of the proposed GDN-GAN and for other state-of-the-art methods for classification of the EEG-ImageNet dataset, including region-level stacked bi-directional LSTMs [56], stacked LSTMs [57], and a Siamese network [48]. The figure shows that the GDN achieves a larger area under the ROC curve than the other existing methods.
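The sparsification of the connectivity matrix described above can be illustrated with a short Python sketch. The thresholding rule is an assumption made for the example (discard a given fraction of the weakest off-diagonal connections and remove self-loops), since the exact rule is not specified here, and the function name and the `sparsity` parameter are hypothetical.

```python
def sparsify_connectivity(connectivity, sparsity):
    """Zero out the weakest off-diagonal entries of a symmetric matrix.

    `sparsity` is the fraction of off-diagonal connections to discard
    (a hypothetical parameterisation of the sparsity level).
    """
    n = len(connectivity)
    # Collect the strengths of the upper-triangular (i < j) connections.
    strengths = sorted(
        abs(connectivity[i][j]) for i in range(n) for j in range(i + 1, n)
    )
    n_drop = int(sparsity * len(strengths))
    # Entries strictly below this threshold are removed; ties are kept.
    threshold = strengths[n_drop] if n_drop < len(strengths) else float("inf")
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and abs(connectivity[i][j]) >= threshold:
                adjacency[i][j] = connectivity[i][j]
    return adjacency
```

For a 128-channel recording the input would be a 128 × 128 connectivity matrix, and the returned adjacency matrix is the kind of sparse graph that a Chebyshev graph convolution layer can take as input.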
Furthermore, the performance of the GDN against the above-mentioned state-of-the-art methods in terms of precision, F1-score, and recall is shown in Table 5.

To demonstrate the efficiency of the GDN, we compare the performance of our method with a traditional feature-based CNN and MLP. For this purpose, an MLP and a CNN with three hidden layers and a learning rate of 0.001 are considered. The maximum, skewness, variance, minimum, mean, and kurtosis of every single channel are used as the feature vector. The results of this comparison are shown in Figure 13.

Further confirmation of the performance of the proposed method is the confusion matrix shown in Figure 14. The confusion matrix is an appropriate illustration of the performance of a network on the test splits in the case of multi-class classification.

This table illustrates the saliency evaluation metrics of the proposed method. The EEG signals are categorized in the first part of the GDN-GAN. The image stimulus is determined according to the extracted label, and this image, together with the features extracted in the first phase of the proposed method, is passed to the GAN part of the network to map the EEG signal to the saliency map of the image stimulus. After training, to test the GAN part, the EEG signals are fed to the GDN-GAN, the extracted images are compared to the original ground truth data through different saliency evaluation metrics, and the average of these metrics is reported in this table for each category. Furthermore, the overall SSIM, CC, NSS, and s-AUC are obtained by averaging the saliency evaluation metrics over all categories.
According to this table, the category-level performance of the proposed visual saliency reconstruction method is over 90% in terms of SSIM and s-AUC for all categories except six: Revolver, Running shoe, Lantern, Cellular phone, Golf ball, and Mountain tent. SSIM computes the structural similarity index from the mean and standard deviation of the pixels in a fixed-size window of the reconstructed image and of the ground truth data, and it provides a reliable measure of similarity. The s-AUC uses the true positives and false positives of the pixels of the reconstructed image at the fixation locations of the ground truth data, and is also a reliable measure of similarity between the two images. Considering these details, SSIM and s-AUC illustrate the limitations of the proposed GDN-GAN. However, considering the detailed values of the four saliency metrics, this table shows that the proposed GDN-GAN is a reliable and efficient method to map the EEG signals to the saliency map of the visual stimuli.
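For concreteness, three of the saliency metrics discussed above can be written in a few lines of Python. The sketch uses the conventional textbook definitions (SSIM for a single window with the usual stabilizing constants, CC as the Pearson correlation, and NSS as the mean z-scored saliency at fixated pixels); it is not taken from the authors' code, the function names are hypothetical, images are assumed to be flattened lists of pixel values, and s-AUC is omitted for brevity.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _cov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = _mean(x), _mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def ssim_window(x, y, dynamic_range=1.0):
    """SSIM of one window, given as flat pixel lists in [0, dynamic_range]."""
    c1 = (0.01 * dynamic_range) ** 2  # conventional stabilising constants
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = _mean(x), _mean(y)
    vx, vy, cxy = _cov(x, x), _cov(y, y), _cov(x, y)
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def cc(pred, truth):
    """Pearson correlation coefficient between two flattened saliency maps."""
    return _cov(pred, truth) / math.sqrt(_cov(pred, pred) * _cov(truth, truth))


def nss(pred, fixations):
    """Mean z-scored saliency at fixated pixels (`fixations` is a 0/1 mask)."""
    mu = _mean(pred)
    sigma = math.sqrt(_cov(pred, pred))
    scores = [(p - mu) / sigma for p, f in zip(pred, fixations) if f]
    return _mean(scores)
```

In practice SSIM is averaged over many sliding windows of the image rather than computed on a single window, so `ssim_window` corresponds to one term of that average.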
The GDN-GAN trained for saliency reconstruction is fine-tuned for image reconstruction. The loss plots obtained when training the generator and discriminator networks for visual saliency and image reconstruction are shown in Figure 15 for three categories. In addition, the SSIM and CC plots of both visual saliency and image reconstruction per epoch for these categories can be seen in this figure.
The loss plots corresponding to both the saliency reconstruction and image reconstruction illustrate that the generator and discriminator losses tend to oscillate around one as the saliency evaluation metrics, including SSIM and CC, start to converge. This behavior is characteristic of GANs, and these plots confirm the effectiveness of the proposed reconstruction with the GDN-GAN.
The results of visual saliency and image reconstruction for all 40 categories of image stimuli are illustrated in Figures 16-19. In addition, the ground truth data and the gray-scale versions of the original input image stimuli are shown in these figures. The visual evaluation of these figures, together with the saliency evaluation metrics, confirms the efficiency of the proposed GDN-GAN.
A comparison of the proposed GDN-GAN with state-of-the-art methods for image saliency extraction is conducted, and the performance metrics are reported in Table 7. The results of SalNet [43], SALICON [42], the visual classifier-driven detector [48], and the neural-driven detector [48] are given in this table. SALICON and SalNet are valuable approaches that use image data for saliency map extraction according to the eye-fixation points recorded by eye tracking while a subject is looking at an image. Another valuable approach, the visual classifier-driven detector and the neural-driven detector by Palazzo et al. [48], merges the two modalities of EEG signals and image data to extract the image saliency map efficiently. Our proposed GDN-GAN is the first method that maps the EEG signals to the corresponding saliency map of the visual stimuli and reconstructs the saliency map and image stimuli. The metrics of these state-of-the-art saliency map extraction methods in Table 7 confirm the efficiency of the proposed GDN-GAN.

Figure 15. Training loss and accuracy plots; from left to right: image used as stimulation; generator and discriminator loss for image saliency reconstruction; SSIM and CC for image saliency reconstruction; generator and discriminator loss for image reconstruction; SSIM and CC for image reconstruction, for three categories, including 8, 21, and 40.

Although the proposed GDN-GAN performs well in the reconstruction process, the limitations of the approach cannot be ignored. The first is that the ground truth data are generated using the pre-trained Open-Salicon on the image samples corresponding to the EEG-ImageNet database. This point should be addressed in future work; the solution is to use a good eye-tracker device and to record the eye fixation maps at the same time as the EEG recordings. These recorded eye fixation maps should be used as the ground truth data in future work.
Another limitation of the proposed GDN-GAN is the two-phase process of saliency reconstruction and the three-phase process of image reconstruction, given that the functional connectivity-based graph representation of the EEG signals is used as the input to the network. An end-to-end deep network should be targeted to reduce the number of training phases, which would reduce the computational complexity and hence increase the speed of the network.

Conclusions
In this paper, an innovative graph convolutional generative adversarial network is proposed to realize the visual stimulation reconstruction using the EEG signals recorded from human subjects while they are looking at images from 40 different categories of the ImageNet database. The graph representation of the EEG records is imposed to the proposed network, and the network is trained to reconstruct the image saliency maps. The effectiveness of the proposed method is demonstrated with different saliency performance metrics. The trained weights are used as the initial weights of the proposed network to reconstruct the gray-scale versions of images used as visual stimulation. The results demonstrate the viability of the proposed GDN-GAN for image reconstruction from brain activity.
This research is applicable to BCI projects that help disabled people to communicate with the world around them. Neural decoding of visually evoked EEG signals in BCI interprets the brain activity of the subject and enables the automatic detection of the stimuli. It paves the way toward mind reading and writing via EEG recordings, and is a preliminary step toward a module that could help blind people by generating the EEG signals corresponding to the surrounding visual environment.
The limitation concerning the ground truth data will be addressed in future work, so that the deep network operates under conditions closer to real-world circumstances. The ground truth data in the proposed GDN-GAN are generated using the Open-Salicon pre-trained weights. These data should instead be recorded using a good eye tracker device at the same time as the EEG recordings. Using the eye fixation maps of the subjects as the ground truth data would increase the efficiency of the proposed GDN-GAN in BCI applications.