A New Framework Combining Diffusion Models and the Convolution Classifier for Generating Images from EEG Signals

The generation of images from electroencephalography (EEG) signals has become a popular research topic in recent research because it can bridge the gap between brain signals and visual stimuli and has wide application prospects in neuroscience and computer vision. However, due to the high complexity of EEG signals, the reconstruction of visual stimuli through EEG signals continues to pose a challenge. In this work, we propose an EEG-ConDiffusion framework that involves three stages: feature extraction, fine-tuning of the pretrained model, and image generation. In the EEG-ConDiffusion framework, classification features of EEG signals are first obtained through the feature extraction block. Then, the classification features are taken as conditions to fine-tune the stable diffusion model in the image generation block to generate images with corresponding semantics. This framework combines EEG classification and image generation means to enhance the quality of generated images. Our proposed framework was tested on an EEG-based visual classification dataset. The performance of our framework is measured by classification accuracy, 50-way top-k accuracy, and inception score. The results indicate that the proposed EEG-Condiffusion framework can extract effective classification features and generate high-quality images from EEG signals to realize EEG-to-image conversion.


Introduction
Decoding the correlation between brain signals and human visual perception has grown to be a research focus.It can promote the progress of cognitive neuroscience and expand the application of brain signals.One of the most popular methods is to reconstruct images from brain signals [1][2][3].
Brain signals can be obtained through both invasive and noninvasive methods.Noninvasive methods have attracted much attention because of their advantages of high security, high acceptance by users, and wide application [4].The electroencephalogram (EEG) is an important non-invasive brain signal, as well as functional magnetic resonance imaging (fMRI) [5].In recent years, owing to the consistency of the structure of fMRI and image data as visual stimuli, many scholars have attempted to reconstruct images from fMRI [6][7][8][9][10].However, owing to the large size of the fMRI equipment and the high level of professional skills needed, its application scope is limited.On the other hand, EEG equipment that can record signals wirelessly, is more portable, and is easier to operate, which reduces the cost and difficulty of signal collection [11].In addition, research in [12] has shown that brain signals recorded by EEG can be used to decode visual stimuli.Therefore, methods that can efficiently extract EEG features and reconstruct high-quality images urgently need to be studied.
Currently, deep learning methods are broadly adopted in EEG decoding.Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are two established Brain Sci.2024, 14, 478 3 of 19 signal passed through the framework first passes through a feature extraction block, which obtains effective input features.The feature extraction block is centered on convolutional neural networks, combining the advantages of EEGNet and residual networks.It can not only extract features from EEG data in both temporal and spatial dimensions but also avoid problems such as gradient explosion, overfitting, and decreased learning ability, which arise from deep networks.Then, we perform position encoding and shape transformation on the obtained EEG features to adapt to the input format of the SD model.Finally, we use EEG features and image pairs to fine-tune the SD model.We integrate the EEG classification task with the image generation task and use features with obvious semantic distinctions to generate images.The principal contributions of our work are outlined below: • We designed a framework called EEG-ConDiffusion for generating images from EEG signals.It integrates the powerful feature extraction capabilities of CNNs and the image generation capabilities of the SD model to achieve decoding from EEG signals to images.In addition, our framework has been validated on an EEG-based visual classification dataset [12].• We propose a convolutional neural network for EEG feature extraction.We design temporal convolutional layers and spatial convolutional layers separately to achieve multi-dimensional, comprehensive extraction of EEG features.At the same time, we introduce multiple residual blocks to improve network depth while avoiding problems such as gradient explosion and overfitting.• We use the features extracted through the positional encoding network to change the size of the input features to adapt to the input of the SD model.In addition, the distribution of the input features more closely matches the text.

•
In addition, we use processed EEG features and image data to fine-tune the stable diffusion model, making it suitable for EEG-to-image tasks.Then, the extracted EEG features are fed into the SD model with updated training parameters to complete the image generation task.
Section 2 introduces the main components of EEG-ConDiffusion and the corresponding implementation methods.In Section 3, we tested the proposed EEG-ConDiffusion method on a public dataset [12] and displayed the results.In Section 4, we discuss the validation results for each stage.Additionally, we summarize the whole paper in Section 5.

Materials and Methods
This section presents the dataset used for experimental verification in this paper.In addition, the various components of the proposed EEG-ConDiffusion framework are presented and explained in detail.

The Public Dataset
We conducted performance verification on the EEG-based visual classification dataset [12].The dataset recorded 12,000 trials of EEG data from six subjects as they viewed visual stimuli.The dataset was recorded using 128-electrode Brainvision EEG equipment with a sampling frequency of 1000 Hz.A total of 2000 images were used as visual stimuli, including 50 for each of the 40 different categories in the ImageNet dataset [13].Each image was rendered for 0.5 s.We adopted the same data preprocessing method as [16] and selected 20-460 ms of EEG data.Therefore, the EEG data shape for each trial was set (1,128,440).The dataset was divided into 70%, 15%, and 15% ratios, which were utilized for training, validation, and testing of experiments, respectively.Furthermore, we separately filtered the signals in the frequency ranges of 1-70 Hz and 5-95 Hz for subsequent model evaluation.

The Proposed EEG-ConDiffusion Framework
Figure 1 shows the overall structure of our proposed EEG-ConDiffusion framework.Our EEG-ConDiffusion framework is mainly composed of three stages: feature extraction, fine-tuning of the SD model, and image generation.EEG signals are extracted by a convo-Brain Sci.2024, 14, 478 lutional neural network model, which we call EEGConvNet.With the EEGConvNet model, we can obtain highly distinguishing class features.We use the category features as the guiding conditions of the SD model, so as to ensure the authenticity of the generated images.The EEG features are processed as the embedding input of the SD model through a position embedding network and a pretrained text encoder, FrozenCLIPEmbedder.We refer to this embedding input as conditional EEG (ConEEG).ConEEG was used as a condition to input the SD model through U-Net, and then model fine-tuning and image generation were performed.
testing of experiments, respectively.Furthermore, we separately filtered the signals in the frequency ranges of 1-70 Hz and 5-95 Hz for subsequent model evaluation.

The Proposed EEG-ConDiffusion Framework
Figure 1 shows the overall structure of our proposed EEG-ConDiffusion framework.Our EEG-ConDiffusion framework is mainly composed of three stages: feature extraction, fine-tuning of the SD model, and image generation.EEG signals are extracted by a convolutional neural network model, which we call EEGConvNet.With the EEGConvNet model, we can obtain highly distinguishing class features.We use the category features as the guiding conditions of the SD model, so as to ensure the authenticity of the generated images.The EEG features are processed as the embedding input of the SD model through a position embedding network and a pretrained text encoder, FrozenCLIPEmbedder.We refer to this embedding input as conditional EEG (ConEEG).ConEEG was used as a condition to input the SD model through U-Net, and then model fine-tuning and image generation were performed.Inspired by EEGNet [16], we jointly use TC and SC blocks to better extract the temporal and spatial information of EEG signals.For the TC block, we set the parameter of out_channels to 32 and the stride to (1,2) in the Conv2d layer.The parameters of the out_channels correspond to the width of the network.Increasing the stride can reduce the amount of computation and quickly obtain important information.For the SC block, we Inspired by EEGNet [16], we jointly use TC and SC blocks to better extract the temporal and spatial information of EEG signals.For the TC block, we set the parameter of out_channels to 32 and the stride to (1,2) in the Conv2d layer.The parameters of the out_channels correspond to the width of the network.Increasing the stride can reduce the amount of computation and quickly obtain important information.For the SC block, we also increased the number of out_channels in Conv2d to 64.In the DC block, we set the out_channels parameter to 128 and the stride parameter to (1,2).To avoid semantic confusion and gradient explosion caused by deepening the network and to further extract richer semantic information from the EEG, we introduce residual networks [40] in EEG-ConvNet.Two double-layer residual blocks are inserted between the TC block, SC block, and DC block.
In addition, the EEG features output by the DC block pass through a linear layer and a Conv2d layer, and the squeeze function is used to reduce the dimension.Thus, the output Brain Sci.2024, 14, 478 5 of 19 EEG feature shape is processed as (77,768), which can serve as the input of the SD model.Our EEGConvNet structure has two output branches.One to output the EEG features for the SD model and the other to output classification results.
Brain Sci.2024, 14, x FOR PEER REVIEW 5 of 19 also increased the number of out_channels in Conv2d to 64.In the DC block, we set the out_channels parameter to 128 and the stride parameter to (1,2).To avoid semantic confusion and gradient explosion caused by deepening the network and to further extract richer semantic information from the EEG, we introduce residual networks [40] in EE-GConvNet.Two double-layer residual blocks are inserted between the TC block, SC block, and DC block.In addition, the EEG features output by the DC block pass through a linear layer and a Conv2d layer, and the squeeze function is used to reduce the dimension.Thus, the output EEG feature shape is processed as (77,768), which can serve as the input of the SD model.Our EEGConvNet structure has two output branches.One to output the EEG features for the SD model and the other to output classification results.
When training the EEGConvNet model, the input shape size of the EEG data was 1,128,440.After passing through the TC block, the SC block, residual blocks, and the DC block, the output shape of the EEG eigenvector is (128,1,7).Then, the EEG eigenvector is further processed by a linear layer and a Conv2d layer with a kernel size of (1,1) to obtain the shape output (77,1,768).Furthermore, the EEG feature vector (77,768) and the classification vector (1,40) were output from the EEG data through two branches.In addition, we apply operations such as BatchNorm, dropout, and average pooling in the EEGConvNet model and use ReLU as the activation function.These layers can expand the receptive field of the model and reduce overfitting.

Fine-Tuning the SD Model with EEG-Image Pairs
Diffusion models are currently one of the hottest directions in the research of artificial intelligence-generated content (AIGC).The SD model is a text-to-image model developed based on latent diffusion models (LDMs) proposed by Rombach et al. [24].LDMs perform the diffusion process on the latent space, which greatly reduces the computational complexity and cost.In addition, the cross-attention method is proposed for multimodal training, and the task of multimodal conditional generation is realized.In this paper, we input When training the EEGConvNet model, the input shape size of the EEG data was 1,128,440.After passing through the TC block, the SC block, residual blocks, and the DC block, the output shape of the EEG eigenvector is (128,1,7).Then, the EEG eigenvector is further processed by a linear layer and a Conv2d layer with a kernel size of (1,1) to obtain the shape output (77,1,768).Furthermore, the EEG feature vector (77,768) and the classification vector (1,40) were output from the EEG data through two branches.In addition, we apply operations such as BatchNorm, dropout, and average pooling in the EEGConvNet model and use ReLU as the activation function.These layers can expand the receptive field of the model and reduce overfitting.

Fine-Tuning the SD Model with EEG-Image Pairs
Diffusion models are currently one of the hottest directions in the research of artificial intelligence-generated content (AIGC).The SD model is a text-to-image model developed based on latent diffusion models (LDMs) proposed by Rombach et al. [24].LDMs perform the diffusion process on the latent space, which greatly reduces the computational complexity and cost.In addition, the cross-attention method is proposed for multimodal training, and the task of multimodal conditional generation is realized.In this paper, we input the EEG features that passed through the feature extraction stage as a conditional input into the cross-attention module in U-Net to guide the training of the model.
Jonathan et al. [19] were the first to provide a rigorous mathematical derivation of diffusion models (DMs) and established a complete framework for the forward process, reverse process, and model training.The positive diffusion propagation can be regarded as a Markov process.Given an image x 0 , x 0 ∈ R H×W×3 , the distribution of the natural image is defined as x 0 ∽ q(x 0 ).In the forward process, Gaussian noise is gradually added to the input x 0 , resulting in multiple noisy samples x 1 , x 2 . .., x T .The variance of the added noise Brain Sci.2024, 14, 478 6 of 19 is a fixed value β t ϵ(0, 1).The mean value of the noise is determined by β t and x t .After the T noise operation, image x T becomes a pure noise image that conforms to the standard normal distribution.The forward process can be represented as follows: The mathematical description of the conditional probability distribution of the process is as follows: q(x t |x t−1 The reverse process is to denoise from the Gaussian noise and reconstruct the original data, which is also a Markov process.In this process, we need to construct a parameter distribution to make an estimate.The mean and variance of the predicted noise are determined by the input time t and the current image x t .The process can be expressed as: The conditional probability distribution for the reverse process is expressed as follows: During the inverse process, our goal is to denoise the pure noise image x T and reconstruct the image that approximates the original x 0 .That is, the noise removed in the reverse process is expected to approximate the noise added in the forward process.Therefore, this process can be transformed into predicting a noise ϵ θ (x t , t) to fit the noise ϵ added at time t.Thus, the optimization of the inverse process can be simplified as follows [19]: Compared with DMs, LDMs undergo the diffusion process in a lower-dimensional latent space.As shown in Figure 1, the given real image x 0 is mapped from the pixel space to the latent space through an AutoEncoderKL ε and is represented as z = ε(x 0 ), z ∈ R h×w×3 .The reverse process can be simplified as the following formula [24]: The optimization of the objective function is achieved by a time-conditional UNet.In this paper, in addition to the time condition, we also need to introduce EEG signals as the control condition y; that is, we also need to model the conditional distribution in the form of p(z|y).This can be achieved by the conditional denoising autoencoder ϵ θ (z t , t, y).Based on the EEG-image pairs, we can train the model via the following formula: where τ θ (y) is the FrozenCLIPEmbedder in Figure 1.The encoder converts the conditional EEG signal into an intermediate expression, which is then mapped to the middle layer of U-Net via cross-attention blocks.The formula for cross attention is described as: Brain Sci.2024, 14, 478 V are learnable projection matrices.
During the model fine-tuning process, we fix the remainder of the SD model and optimize the encoder τ θ (y), cross-attention head, and projection head at the same time.We used loss analysis based on Formula (7) to fine-tune the model.The encoder used in the SD model is a pretrained FrozenCLIPEmbedder.It is a text-image alignment model pretrained by the contrastive language-image pretraining (CLIP) [41] method.As shown in Figure 1, the EEG signal undergoes position embedding after passing through the feature extraction stage to match the text data.In this paper, EEG feature vectors that have undergone feature extraction and position encoding are used instead of text input to the pretrained FrozenCLIPEmbedder.The FrozenCLIPEmbedder was fine-tuned to help align the EEG feature vector space with the image feature space.Fine-tuning the cross-attention head is essential for bridging the pretraining conditional space and the latent space of the EEG features.

Image Generation
Image generation is sampled from the standard normally distributed x T .The reconstructed image corresponding to x 0 is calculated via the inverse diffusion process.In the image generation stage, as shown in Figure 1, the feature vectors obtained from feature extraction and position encoding are fed into the fine-tuned FrozenCLIPEmbedder to form an intermediate expression for input to U-Net.After the fine-tuning of the EEG-image pairs, the cross-attention head and projection head in U-Net learn the relationships between the EEG signals and image features.The alignment of the EEG feature space and image feature space is realized to some extent.Therefore, the feature vector output from the fine-tuned U-Net is regarded as the expression of the generated image in the latent space.AutoEncoderKL is used to decode the latent expression of the image and eventually recover the image of the pixel space.α t = 1 − β t and α t = t ∏ i=1 α t are defined using parametric renormalization.The mean value at time t is calculated by the following formula: The process of image generation is shown in Algorithm 1.
As shown in Algorithm 1, the EEG feature generates the prompt representation x 0 ∈ R 77×768 after feature extraction and position embedding.x 0 ∽ q(x 0 ) represents the data distribution corresponding to the real image.x 0 generates the expression E (x 0 ) of the potential space through the FrozenCLIPEmbedder.x T is formed through the forward noise addition process.Then, step 6 in Algorithm 1 is carried out to complete the image reconstruction of the potential space.According to step 6 and formula (9), we obtain x t−1 = µ θ (x t , t).Moreover, µ θ (x t , t) has completed the prediction in model fine-tuning and can directly use the model with updated parameters for image reconstruction.Then, the image is converted to pixel space through step 7. Therefore, the EEG-to-image task can be realized according to the algorithm in Algorithm 1.

Results
The proposed EEG-ConDiffusion framework was validated on an EEG-based visual classification dataset [12].The visual stimuli in this dataset are derived from 40 classes of images in the ImageNet dataset [13].The proposed framework combines EEG classification and image generation tasks, and we present the results of both tasks separately.

Results of EEG Classification
In the EEG classification experiments, we validated them separately on data within the ranges of 1-70 Hz and 5-95 Hz.The data were divided according to percentages of 70%, 15%, and 15%.The input shape of the EEG data is X ∈ R b×1×128×440 .Additionally, b represents the value of the batch size, which is designed to be 64.To train the model, we used the Adam optimizer to run 1000 epochs to minimize losses.In addition, the initial learning rate is configured as 0.01.Furthermore, in order to reduce computational resources and save model training time, our classification model training in the feature extraction stage and model fine-tuning in the image generation stage are conducted separately.The EEG data are first fed into our proposed ConvNet, and, after sufficient training, the model weights are saved.When generating images, we directly use the pretrained ConvNet to obtain highly discriminative category features.

Evaluation Metrics for the EEG Classification Task
The classification accuracy along with kappa values were used to evaluate the classification efficiency of our proposed ConvNet.The accuracy is calculated as follows: where T P and F P represent true positives and false positives, respectively.T N and F N represent true negatives and false negatives, respectively.The formula for calculating kappa is as follows: P 0 represents the classification accuracy.P e represents a completely random rate of the classification accuracy.In the 40-category classification task in this paper, P e = 0.025.

Classification Results Compared with the Baselines
To evaluate the feature extraction capability of the proposed EEGConvNet, we compared it with recent methods such as LSTM [12], EEGNet [16], and ChannelNet [1].As shown in Tables 1 and 2, we compared the performances of the proposed EEGConvNet and benchmark models on 5-95 Hz and 1-70 Hz data.We trained a single-person model for six subjects and evaluated the accuracy and kappa values of the EEGConvNet model and benchmark models.From Tables 1 and 2, we can observe that the proposed EEGConvNet is verified on 5-95 Hz and 1-70 Hz data, and the average accuracy and average kappa value obtained are better than those of the benchmark models.The average accuracy of EEGConvNet was 67.97% on 5-95 Hz data, which was 56.34%, 30.60%, and 47.79% higher than that of LSTM [12], EEGNet [16], and ChannelNet [1], respectively.The average kappa value of EEGConvNet was 0.67, which was 0.58, 0.31, and 0.49 higher than that of LSTM [12], EEGNet [16], and ChannelNet [1], respectively.On the 1-70 Hz data, the average accuracy and average kappa value of EEGConvNet are 99.87% and 1.00, respec-Brain Sci.2024, 14, 478 9 of 19 tively, which are also better than those of the benchmark models.This indicates that the classification performance and model stability of EEGConvNet are good.In addition, we conducted a Wilcoxon signed-rank test between the proposed EEG-ConvNet and the benchmark models in Table 3. EEGConvNet was tested with LSTM [12], EEGNet [16], and ChannelNet [1], and the p values obtained were 1, 2, and 3, respectively, all of which were less than 0.05.This shows that the improvement in the classification effectiveness of our proposed EEGConvNet model is statistically significant compared to that of the benchmark model.Additionally, by comparing Tables 1 and 2, we observe that the average accuracy and average kappa values obtained by the tested model in the signals of 1-70 Hz are better than those obtained at 5-95 Hz.We list the differences between the results in Table 4.The average accuracies of LSTM [12], EEGNet [16], ChannelNet [1], and EEGConvNet at 1-70 Hz are improved by 1%, 1%, 1%, and 1%, respectively, compared to those at 5-95 Hz.This shows that different frequencies have an important impact on the classification results of the model, and the data make it easier to extract and classify features at 1-70 Hz.Therefore, in the subsequent image generation stage, we use data at 1-70 Hz for model fine-tuning.
Furthermore, the EEGConvNet model has achieved an average classification accuracy of 99.87% on data ranging from 1 to 70 Hz, demonstrating its ability to accurately classify EEG data.However, to further enhance classification accuracy, it is usually necessary to increase the depth and width of the network, which will increase the complexity and training time of the model.Therefore, EEGConvNet is already sufficiently and excellently suited for the feature extraction stage prior to classification tasks.

Results of Generating Images from EEG Signals
From the analysis in Section 2, it can be seen that the data with a frequency range of 1-70 Hz can be better extracted by the proposed EEGConvNet.In the image generation stage, we use the data extracted by EEGConvNet from 1 to 70 Hz as a condition and input a fine-tuned SD model to guide image generation.The images we need to reconstruct come from 40 categories of data from the ImageNet [13] dataset.To evaluate the image generation of the EEGConDiffusion framework, we generated three predicted images for each category for comparison with the real image.When fine-tuning the model, we trained the SD model with 500 epochs using a learning rate of 5.3 × 10 −5 .Due to limited laboratory conditions, we trained with a GPU model RTX 3090 (24 GB) from Intel Corporation in the United States with a batch size set to three.The total training time for each subject is approximately 58 h.We train six individual models in parallel using six servers of the same model.In addition, the number of sampling steps for the SD model is set to 250.In the image generation stage, the number of sampling steps for the SD model is also 250.

Evaluation Metrics for the Image Generation Tasks
To assess the quality of image generation, we used the Fréchet inception distance (FID) [42], inception score (IS) [43,44], and top-k [45] classification task accuracy as evaluation metrics.FID evaluates the differences between the generated image and the ground truth image by measuring the distance between their feature levels.Its mathematical description is as follows: (12) where µ r and µ g represent the feature means of the authentic image and the reconstructed image, respectively.∑ r and ∑ g represent the covariance matrix of the real picture and the generated picture, respectively.T r represents the trace.The IS is used to evaluate the clarity and diversity of image generation.It uses the pretrained Inception model [44] to perform category discrimination and calculates the authenticity score and diversity score of the generated images.Its mathematical expression is: where E x∼p g represents the average value of the traversed images.D KL stands for KL divergence.P(y|x) and P(y) represent the conditional distribution and edge probability distribution of the image category, respectively.The top-k classification task is used to evaluate the category accuracy of image generation.In this paper, we use the pretrained ImageNet1K [45] model to evaluate the classification accuracy of the 50-way top-1 and top-5 classification accuracies of the generated images.

Ablation Experiments
Since the training and fine-tuning of the SD model require considerable time and computational costs, we need to determine the conditions for the input of the SD model before the fine-tuning stage.To this end, we performed four types of ablation experiments before fine-tuning the SD model.They are described as follows: • The EEG data were used directly.The data shape is changed to (77,768) through a linear layer and a conv1d layer and then imported into a fixed SD model;

•
Position embedding was performed on the EEG data.The input shape is then changed to (77,768) through the linear and conv1d layers, and the fixed SD model is then entered; • The pretrained EEGConvNet model was used to extract features from the EEG data.
The features output by the classification model, with a shape of (77,768), are then used as input to the pretrained SD model; • After extracting the EEG features using the pretrained EEGConvNet model, position embedding was performed on the EEG data.The shape features (77,768) are then fed into the pretrained SD model.
As shown in Table 5, we evaluated the image generation performance under four different experiments using FID.We expect the generated image to have a smaller FID value.From the FID results, we can see that the average FIDs for EEGs with shape adaptation are 243.88 and 243.03 for position embedding.Position embedding slightly improves the generation effect of images.The FID after feature extraction is 46.37, which is quite different from the result without feature extraction.This shows that EEG feature extraction through EEGConvNet plays an important role in improving image generation performance.Therefore, we finally use EEG features that have undergone feature extraction and positional embedding as the guiding conditions for the SD model.

Image Generation Results Compared with the Baselines
To test the image generation quality of our EEG-ConDiffusion framework, we compared it with five methods developed in recent years.Kavasidis et al. [25] combined LSTM feature extraction and conditional GAN to improve the image generation effect and verified the effect of VAE and GAN for image generation.Zheng et al. [27] combined feature extraction and the SNGAN to generate images.Khare et al. [29] proposed a NeuroVision framework for image generation in combination with conditional ProGAN (cProGAN).We used the IS to evaluate the image generation performance by comparing the proposed EEG-ConDiffusion framework with the four baseline models.The higher the value of IS, the better the image is generated.In addition, we used the top-1 and top-5 tasks to verify the classification accuracy of the images generated by the proposed EEG-ConDiffusion framework.The IS and top-k classification results of the proposed EEG-ConDiffusion framework are shown in Table 6.In addition, as shown in Figure 3, the classification accuracy of the EEG-ConDiffusion framework on the top-1 and top-5 classification tasks is visualized.fied the effect of VAE and GAN for image generation.Zheng et al. [27] combined feature extraction and the SNGAN to generate images.Khare et al. [29] proposed a NeuroVision framework for image generation in combination with conditional ProGAN (cProGAN).We used the IS to evaluate the image generation performance by comparing the proposed EEG-ConDiffusion framework with the four baseline models.The higher the value of IS, the better the image is generated.In addition, we used the top-1 and top-5 tasks to verify the classification accuracy of the images generated by the proposed EEG-ConDiffusion framework.The IS and top-k classification results of the proposed EEG-ConDiffusion framework are shown in Table 6.In addition, as shown in Figure 3, the classification accuracy of the EEG-ConDiffusion framework on the top-1 and top-5 classification tasks is visualized.Table 7 shows the average IS results of the EEG-ConDiffusion framework and the baselines.The IS result obtained by our EEG-ConDiffusion method is 12.38, which is 7.89, 7.31, 6.85, and 7.23 higher than that of VAE [25], GAN [25], SNGAN [27], and cProGAN [29], respectively.We can see that the IS results of GAN-based image generation methods Table 7 shows the average IS results of the EEG-ConDiffusion framework and the baselines.The IS result obtained by our EEG-ConDiffusion method is 12.38, which is 7.89, 7.31, 6.85, and 7.23 higher than that of VAE [25], GAN [25], SNGAN [27], and cProGAN [29], respectively.We can see that the IS results of GAN-based image generation methods are better than those of VAE.Our EEG-ConDiffusion IS results are better than those of all baseline models, and the improvement effect is obvious.This shows the powerful ability of pretrained SD models to generate images from EEG signals.Figure 4 shows the images of pandas generated by our EEG ConDiffusion and benchmark models.The images generated using the VAE method are the most blurry.The images generated using our EEG-ConDiffusion method not only have higher clarity than the benchmark model but also have higher realism.Figure 4 shows the images of pandas generated by our EEG ConDiffusion and benchmark models.The images generated using the VAE method are the most blurry.The images generated using our EEG-ConDiffusion method not only have higher clarity than the benchmark model but also have higher realism.[25], GAN [25] SNGAN [27], cProGAN [29], and our EEG-ConDiffusion, as well as GT images.GT represents the ground truth images.

Metric
In addition, we compared our EEG-ConDiffusion framework with Visual GAN [46] Shimizu et al. proposed a Sinc-EEGNet model to extract features of brain signals.The Sinc-EEGNet model combines a CNN model and attention mechanism, and its classification accuracy is 45%.Then, the Sinc-EEGNet model was combined with the GAN model for image generation.The average classification accuracy of the images generated with the test set is 18.4%.As can be seen from Tables 2 and 6, the average classification accuracy of our EEG-ConDiffusion framework is 99.87% in the feature extraction stage, and the average classification accuracy of the top-1 task in the image generation stage is 25.21%.As shown in Figure 5, renderings of 40 category images were generated for the EEG-ConDiffusion framework and the Visual GAN [46] model.We can see that the images generated by the EEG-ConDiffusion framework are sharper and more realistic.[25], GAN [25], SNGAN [27], cProGAN [29], and our EEG-ConDiffusion, as well as GT images.GT represents the ground truth images.
In addition, we compared our EEG-ConDiffusion framework with Visual GAN [46].Shimizu et al. proposed a Sinc-EEGNet model to extract features of brain signals.The Sinc-EEGNet model combines a CNN model and attention mechanism, and its classification accuracy is 45%.Then, the Sinc-EEGNet model was combined with the GAN model for image generation.The average classification accuracy of the images generated with the test set is 18.4%.As can be seen from Tables 2 and 6, the average classification accuracy of our EEG-ConDiffusion framework is 99.87% in the feature extraction stage, and the average classification accuracy of the top-1 task in the image generation stage is 25.21%.As shown in Figure 5, renderings of 40 category images were generated for the EEG-ConDiffusion framework and the Visual GAN [46] model.We can see that the images generated by the EEG-ConDiffusion framework are sharper and more realistic.
To further demonstrate the generation effect of our EEG-Diffusion method, we present the generated images of pizzas, elephants, cars, and pianos in Figure 6 and compare them with the ground truth images.
The above discussions are based on intra-subject experiments.In order to evaluate the generalization performance of our proposed model.We used the model of subject 1 with the highest IS as the optimal model.The model was used to combine the test set data of other subjects to generate corresponding images.As shown in Table 8, we calculated the IS of S2~S6, and the average IS result was 11.74, which was better than the four benchmark models.As shown in Figure 7, the images generated by model migration are of high quality.Thus, the generalization performance of our EEG-ConDiffusion framework is demonstrated.To further demonstrate the generation effect of our EEG-Diffusion method, we present the generated images of pizzas, elephants, cars, and pianos in Figure 6 and compare them with the ground truth images.The above discussions are based on intra-subject experiments.In order to evaluate the generalization performance of our proposed model.We used the model of subject 1 with the highest IS as the optimal model.The model was used to combine the test set data of other subjects to generate corresponding images.As shown in Table 8, we calculated the IS of S2~S6, and the average IS result was 11.74, which was better than the four benchmark models.As shown in Figure 7, the images generated by model migration are of high quality.Thus, the generalization performance of our EEG-ConDiffusion framework is demonstrated.To further demonstrate the generation effect of our EEG-Diffusion method, we present the generated images of pizzas, elephants, cars, and pianos in Figure 6 and compare them with the ground truth images.The above discussions are based on intra-subject experiments.In order to evaluate the generalization performance of our proposed model.We used the model of subject 1 with the highest IS as the optimal model.The model was used to combine the test set data of other subjects to generate corresponding images.As shown in Table 8, we calculated the IS of S2~S6, and the average IS result was 11.74, which was better than the four benchmark models.As shown in Figure 7, the images generated by model migration are of high quality.Thus, the generalization performance of our EEG-ConDiffusion framework is demonstrated.4, all the tested models achieved better results on the 1-70 Hz data than on the 5-95 Hz data.This shows that the data in the dataset with a filter range of 1-70 Hz can be better extracted for features.Therefore, in the subsequent SD model fine-tuning and image generation stages, we use data with a frequency range of 1-70 Hz.In addition, Table 1 shows that the average accuracy of EEGConvNet is 67.97%, which is 56.34%, 47.79%, and 30.60% higher than that of LSTM [12], ChannelNet [1], and EEGNet [16], respectively.As shown in Table 2, the average accuracy of EEGConvNet is 99.87%, which is 38.55%, 6.26%, and 0.79% higher than that of LSTM [12], ChannelNet [1], and EEGNet [16], respectively.When EEGConvNet is validated in both frequency ranges, the average kappa values are also greater than those of the baseline models.In addition, as shown in Table 3, we also performed a Wilcoxon signed-rank test on the results between the EE-GConvNet and the baseline models.The obtained p values were all less than 0.05, which proved the statistical significance of EEGConvNet.
In the model fine-tuning stage, we use the EEG features that have gone through the process of feature extraction, shape adaptation, and position embedding as the conditions of the SD model to train the model.The SD model is a pretrained text-to-image model   4, all the tested models achieved better results on the 1-70 Hz data than on the 5-95 Hz data.This shows that the data in the dataset with a filter range of 1-70 Hz can be better extracted for features.Therefore, in the subsequent SD model fine-tuning and image generation stages, we use data with a frequency range of 1-70 Hz.In addition, Table 1 shows that the average accuracy of EEGConvNet is 67.97%, which is 56.34%, 47.79%, and 30.60% higher than that of LSTM [12], ChannelNet [1], and EEGNet [16], respectively.As shown in Table 2, the average accuracy of EEGConvNet is 99.87%, which is 38.55%, 6.26%, and 0.79% higher than that of LSTM [12], ChannelNet [1], and EEGNet [16], respectively.When EEGConvNet is validated in both frequency ranges, the average kappa values are also greater than those of the baseline models.In addition, as shown in Table 3, we also performed a Wilcoxon signed-rank test on the results between the EEGConvNet and the baseline models.The obtained p values were all less than 0.05, which proved the statistical significance of EEGConvNet.
In the model fine-tuning stage, we use the EEG features that have gone through the process of feature extraction, shape adaptation, and position embedding as the conditions of the SD model to train the model.The SD model is a pretrained text-to-image model based on LDMs [24].LDMs transform the diffusion process of DMs [19] into a low-dimensional latent space, which reduces the computational complexity and cost.This allows us to fine-tune the SD model with just a single RTX3090 GPU.In addition, LDMs introduce cross-attention in U-Net, thus enabling multimodal transitions.This is also the key to the realization of text-to-image in the SD model.This inspired us to use LDMs to implement EEG-to-image tasks.However, training LDMs from the start to the finish requires a large computational cost, long computation time, and problems such as schema crashes during training.To overcome this difficulty, we used a pretrained SD model, which has gained great attention in the text-to-image field.Only small EEG-image data pairs are needed to fine-tune the SD model to realize the consistency of image and EEG features.
In the image generation stage, we send the EEG features that have gone through the process of feature extraction, shape change, and position embedding into the cross-attention module of the fine-tuned SD model.To validate the performance of the images generated by our EEG-Condiffusion framework, we tested them on the EEG-based visual classification dataset [12] and the ImageNet dataset [13].As shown in Table 5, we performed ablation experiments for different EEG preprocessing methods and evaluated them with FID results.The results show that the FID of the EEG input obtained through EEGConvNet and position embedding is 41.62, which is 202.26, 201.41, and 4.75 lower than that obtained through shape adaptation only, shape adaptation, and position embedding and feature extraction only, respectively.The process of position embedding slightly reduces the FID results.The feature extraction process has a significant impact on the FID results.This suggests that feature extraction plays an important role in image generation.As shown in Tables 6  and 7, we show the image generation results of the EEG-ConvDiffusion framework and benchmark models and evaluate them with the IS results.The results show that the average IS of the EEG-ConDiffusion framework is 12.38, which is 7.89, 7.31, 6.85, and 7.23 higher than those of the VAE [25], GAN [25], SNGAN [27], and cProGAN [29], respectively.In addition, as shown in Figure 3, we show the accuracy of the top-1 and top-5 classification tasks for images generated by the model for each subject based on the EEG-ConDiffusion framework.The results show that this method can guarantee the authenticity of image generation to a certain extent.Figure 4 shows the panda images generated by our EEG-ConDiffusion framework and benchmark models and compares them with the ground truth images.The results show that the images generated by the VAE [25] method are the fuzziest, while the clarity of the images generated by the GAN [25], SNGAN [27], and cProGAN [29] methods is improved, but the generated images lack authenticity.The images generated by our EEG-ConDiffusion method are improved in terms of clarity and image authenticity.In addition, we compare the classification accuracy of the proposed framework with Visual GAN [46] in the feature extraction and generation stages.This proves that our EEG-ConDiffusion framework improves the quality of image generation with effective EEG features.In addition, as shown in Table 8, we performed model transfer experiments between subjects based on the model of subject 1 and obtained high-quality generated images.The generalization performance of our proposed EEG-ConDiffusion framework is demonstrated.To further demonstrate the generative performance of our EEG-ConDiffusion framework, we show the images of pizzas, elephants, cars, and pianos generated by the framework and the corresponding ground truth images in Figure 5.The images generated by each single-person model are similar to the ground truth images.

Conclusions
In summary, we combined EEG feature extraction and image generation tasks to construct an EEG-ConDiffusion framework for generating images from EEG signals.Our work explored the feasibility of using multimodal large models to study EEG signals.The EEG-ConDiffusion framework uses EEGConvNet to extract highly discriminating semantic information from EEG signals and then uses the LDM-based SD model to generate highresolution images.We evaluated the feature extraction stage and the image generation stage and revealed the powerful feature extraction ability and image generation performance of the EEG-ConDiffusion framework.
In future work, we will make more efforts in EEG-image feature matching.Additionally, the use of multi-level EEG features to match image features proposed by Shen et al. [47] is a very good inspiration.In addition, we will optimize the inter-subject transfer experiments to enhance the model's generalizability.

Figure 1 .
Figure 1.The overall structure of the proposed EEG-ConDiffusion framework. represents the output noise of the diffusion model.2.2.1.Feature Extraction Extracting highly discriminative semantic features from EEG signals is a key step in realizing EEG-to-image.Hence, we designed a CNN structure called EEGConvNet for EEG feature extraction.As shown in Figure 2, the EEGConvNet model is composed of a temporal convolution (TC) block, two residual (Res) blocks, a spatial convolution (SC) block, a deep convolution (DC) block, and two-branch output networks.Inspired by EEGNet[16], we jointly use TC and SC blocks to better extract the temporal and spatial information of EEG signals.For the TC block, we set the parameter of out_channels to 32 and the stride to(1,2) in the Conv2d layer.The parameters of the out_channels correspond to the width of the network.Increasing the stride can reduce the amount of computation and quickly obtain important information.For the SC block, we

Figure 1 .
Figure 1.The overall structure of the proposed EEG-ConDiffusion framework.ε θ represents the output noise of the diffusion model.2.2.1.Feature Extraction Extracting highly discriminative semantic features from EEG signals is a key step in realizing EEG-to-image.Hence, we designed a CNN structure called EEGConvNet for EEG feature extraction.As shown in Figure 2, the EEGConvNet model is composed of a temporal convolution (TC) block, two residual (Res) blocks, a spatial convolution (SC) block, a deep convolution (DC) block, and two-branch output networks.Inspired by EEGNet[16], we jointly use TC and SC blocks to better extract the temporal and spatial information of EEG signals.For the TC block, we set the parameter of out_channels to 32 and the stride to(1,2) in the Conv2d layer.The parameters of the out_channels correspond to the width of the network.Increasing the stride can reduce the amount of computation and quickly obtain important information.For the SC block, we also increased the number of out_channels in Conv2d to 64.In the DC block, we set the out_channels parameter to 128 and the stride parameter to (1,2).To avoid semantic confusion and gradient explosion caused by deepening the network and to further extract richer semantic information from the EEG, we introduce residual networks[40] in EEG-ConvNet.Two double-layer residual blocks are inserted between the TC block, SC block, and DC block.In addition, the EEG features output by the DC block pass through a linear layer and a Conv2d layer, and the squeeze function is used to reduce the dimension.Thus, the output

Figure 2 .
Figure 2. The structure and parameters of the EEGConvNet model. ∽  represent the out_channels for the different Conv2d layers. indicates the out_features of different linear layers. ∽  ,  ∽  , and  ∽  represent the kernel_size, stride, and padding of different Conv2d layers, respectively. and  represent num_features of different BatchNorm2d layers. and  are used to refer to the deactivation probabilities  for different Dropout layers.Layers represent the number of residual blocks.

Figure 2 .
Figure 2. The structure and parameters of the EEGConvNet model.F 1 ∽ F 5 represent the out_channels for the different Conv2d layers.F 6 indicates the out_features of different linear layers.K 1 ∽ K 6 , S 1 ∽ S 4 , and P 1 ∽ P 4 represent the kernel_size, stride, and padding of different Conv2d layers, respectively.Num 1 and Num 2 represent num_features of different BatchNorm2d layers.Rate 1 and Rate 2 are used to refer to the deactivation probabilities P for different Dropout layers.Layers represent the number of residual blocks.

Algorithm 1 :
The Pipeline of the Image Generation Stage 1: input EEG x ∈ R 1×128×440 2: do: Feature extraction and position embedding

Figure 3 .
Figure 3.The top-1 and top-5 classification results of the proposed EEG-ConDiffusion framework.

Figure 3 .
Figure 3.The top-1 and top-5 classification results of the proposed EEG-ConDiffusion framework.

Figure 4 .
Figure 4. Comparison of the proposed EEG-ConDiffusion framework and benchmark models for generating images of pandas.This figure includes images generated with VAE[25], GAN[25] SNGAN[27], cProGAN[29], and our EEG-ConDiffusion, as well as GT images.GT represents the ground truth images.

Figure 5 .
Figure 5.Comparison between the images generated by our EEG ConDiffusion and the Visual GAN [46].

Figure 6 .
Figure 6.Comparison between the images generated by our EEG ConDiffusion and the ground truth.S1, S2, S3, S4, S5, and S6 represent the 6 subjects.GT represents the ground truth.

Figure 5 .
Figure 5.Comparison between the images generated by our EEG ConDiffusion and the Visual GAN [46].

Figure 5 .
Figure 5.Comparison between the images generated by our EEG ConDiffusion and the Visual GAN [46].

Figure 6 .
Figure 6.Comparison between the images generated by our EEG ConDiffusion and the ground truth.S1, S2, S3, S4, S5, and S6 represent the 6 subjects.GT represents the ground truth.

Figure 6 .
Figure 6.Comparison between the images generated by our EEG ConDiffusion and the ground truth.S1, S2, S3, S4, S5, and S6 represent the 6 subjects.GT represents the ground truth.

Figure 7 .
Figure 7.Comparison between the images generated by our EEG ConDiffusion and the ground truth in the inter-subject transfer learning experiment.S1, S2, S3, S4, S5, and S6 represent the 6 subjects.GT represents the ground truth.
In this work, we propose a framework called EEG-ConDiffusion that combines feature extraction and pretrained conditional diffusion to generate images from EEG signals.We divide the EEG-ConDiffusion framework into three stages: feature extraction, SD model fine-tuning, and image generation.In the feature extraction stage, we propose an EEGConvNet to perform feature extraction on EEG signals.The output EEG features are reshaped and positionally embedded to accommodate the input of the SD model.We validated the classification performance of the EEGConvNet and baselines on EEG data at 5-95 Hz and 1-70 Hz, respectively.The EEG data are derived from the EEG-based visual classification dataset [12].As shown in Table

Figure 7 .
Figure 7.Comparison between the images generated by our EEG ConDiffusion and the ground truth in the inter-subject transfer learning experiment.S1, S2, S3, S4, S5, and S6 represent the 6 subjects.GT represents the ground truth.
In this work, we propose a framework called EEG-ConDiffusion that combines feature extraction and pretrained conditional diffusion to generate images from EEG signals.We divide the EEG-ConDiffusion framework into three stages: feature extraction, SD model fine-tuning, and image generation.In the feature extraction stage, we propose an EEGConvNet to perform feature extraction on EEG signals.The output EEG features are reshaped and positionally embedded to accommodate the input of the SD model.We validated the classification performance of the EEGConvNet and baselines on EEG data at 5-95 Hz and 1-70 Hz, respectively.The EEG data are derived from the EEG-based visual classification dataset [12].As shown in Table

Table 1 .
Accuracy and kappa values of the EEG classification task on the EEG-based visual classification dataset.The frequency range of the EEG data is 5-95 Hz.

Table 2 .
Accuracy and kappa values of the EEG classification task on the EEG-based visual classification dataset.The frequency range of the EEG data is 1-70 Hz.

Table 3 .
Wilcoxon signed-rank test between the proposed EEGConvNet and the baselines.

Table 4 .
Differences in the classification performances of EEGConvNet and the baselines on 1-70 Hz and 5-95 Hz data.

Table 5 .
FID results of four ablation experiments and the proposed EEG-ConDiffusion framework.

Table 6 .
The IS and top-k classification results of the proposed EEG-ConDiffusion framework.

Table 6 .
The IS and top-k classification results of the proposed EEG-ConDiffusion framework.

Table 7 .
The IS results of the EEG-ConDiffusion framework and baseline models.

Table 8 .
The IS results of the EEG-ConDiffusion framework in the inter-subject transfer learning experiment.

Table 8 .
The IS results of the EEG-ConDiffusion framework in the inter-subject transfer learning experiment.