Understanding What the Brain Sees: Semantic Recognition from EEG Responses to Visual Stimuli Using Transformer

Ahmed Fares

doi:10.3390/ai6110288

¹

Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology (E-JUST), Alexandria 21934, Egypt

²

Department of Electrical Engineering, Faculty of Engineering at Shoubra, Benha University, Cairo 11629, Egypt

AI2025, 6(11), 288;https://doi.org/10.3390/ai6110288

This article belongs to the Special Issue AI in Bio and Healthcare Informatics

Version Notes

Order Reprints

Abstract

Understanding how the human brain processes and interprets multimedia content represents a frontier challenge in neuroscience and artificial intelligence. This study introduces a novel approach to decode semantic information from electroencephalogram (EEG) signals recorded during visual stimulus perception. We present DCT-ViT, a spatial–temporal transformer architecture that pioneers automated semantic recognition from brain activity patterns, advancing beyond conventional brain state classification to interpret higher level cognitive understanding. Our methodology addresses three fundamental innovations: First, we develop a topology-preserving 2D electrode mapping that, combined with temporal indexing, generates 3D spatial–temporal representations capturing both anatomical relationships and dynamic neural correlations. Second, we integrate discrete cosine transform (DCT) embeddings with standard patch and positional embeddings in the transformer architecture, enabling frequency-domain analysis that quantifies activation variability across spectral bands and enhances attention mechanisms. Third, we introduce the Semantics-EEG dataset comprising ten semantic categories extracted from visual stimuli, providing a benchmark for brain-perceived semantic recognition research. The proposed DCT-ViT model achieves 72.28% recognition accuracy on Semantics-EEG, substantially outperforming LSTM-based and attention-augmented recurrent baselines. Ablation studies demonstrate that DCT embeddings contribute meaningfully to model performance, validating their effectiveness in capturing frequency-specific neural signatures. Interpretability analyses reveal neurobiologically plausible attention patterns, with visual semantics activating occipital–parietal regions and abstract concepts engaging frontal–temporal networks, consistent with established cognitive neuroscience models. To address systematic misclassification between perceptually similar categories, we develop a hierarchical classification framework with boundary refinement mechanisms. This approach substantially reduces confusion between overlapping semantic categories, elevating overall accuracy to 76.15%. Robustness evaluations demonstrate superior noise resilience, effective cross-subject generalization, and few-shot transfer capabilities to novel categories. This work establishes the technical foundation for brain–computer interfaces capable of decoding semantic understanding, with implications for assistive technologies, cognitive assessment, and human–AI interaction. Both the Semantics-EEG dataset and DCT-ViT implementation are publicly released to facilitate reproducibility and advance research in neural semantic decoding.

Keywords:

EEG; embeddings; attention mechanism; transformer; semantics recognition; vision transformer; brain–computer interface; discrete cosine transform; spatial–temporal modeling

1. Introduction

While multimedia has been extensively researched in recent decades, little has been done on how it is perceived inside human brains. In this paper, we explore such a problem by presenting classified stimulating images to allow the subject to perceive the related semantics contained in the images and then record their brain activities via EEG sequences to see if those semantics could be automatically recognized via deep learning and to provide analysis of those EEG sequences.

Deep learning has been effectively applied to numerous multimedia tasks; successful examples include automatic speech recognition, image recognition, visual art processing, natural language processing, etc. [,,]. One elusive objective that actually remains is to apply deep learning to interpret and comprehend how multimedia is perceived inside human brains. A significant number of the prior works concentrated on decoding informative patterns from brain activities to control machines through a brain–computer interface (BCI) [,,,], including some medical applications [] such as Epilepsy [], Alzheimer, etc. [,]. As a matter of fact, brain activities are usually captured by recording the voltage fluctuations generated by neurons using electroencephalograms (EEGs) [], or by brain imaging techniques such as MRIs (magnetic resonance imaging) and Functional MRIs (fMRIs), whose temporal and spatial resolutions have enabled computational methods to decode specific visual stimuli, illustrating that brain signals contain helpful cues corresponding to human cognitive processes and can be adequately utilized in many applications. Figure 1 illustrates the overall concept of our proposed research. In part (a), the brain is stimulated by an image of an “elephant”, and the resulting activated EEG sequences are fed into our proposed DCT-ViT deep model. Using 3D representation, embeddings, and transformer attention on the EEG sequences, the model generates recognized semantics at the output. This output reflects the cognitive activities occurring in the human brain, corresponding to its responses to the stimulating image presented as input.

Figure 1. Overview of our proposed model where an EEG signal from the brain is sent to the DCT-ViT network and the encoded signal is used to recognize a semantic corresponding to the captured EEG signal (part (a)) and samples of the “animal” semantic (part (b)).

In the proposed model, we first construct a 2D mapping matrix based on the electrodes’ physical locations, where those adjacent electrodes are kept as neighbors. We then construct a 3D spatial–temporal representation based on that 2D mapping and the time index. Different from the original transformer, which only uses the patch embedding and position embedding as the input. Our method proposes to use three different embeddings, including the patch embedding, the position embedding, and the 2D DCT embedding. They are summed together and the resulting sequence of vectors are fed to a standard transformer network []. A discrete cosine transform (DCT), first proposed in 1972, represents a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. This ability to separate and extract information from various frequency bands makes it a widely used technique in signal processing and data compression. Although it has been used for data compression and feature extraction in EEG-related tasks [,], almost all existing work kept EEG data in time sequence form and used DCT to extract features from the time series. In our work, we construct the 2D mapping structure based on the electrodes’ physical locations. Then, we use 2D DCT to obtain the features on different frequency bands in each EEG patch. When there exists a rapid and obvious change between electrodes in the EEG patch where they are located, it means that the current corresponding brain area has a change in the degree of activation and our method can effectively capture this information and calculate it as an embedding and input it into the transformer. In the end, the output of the transformer network is fed to a multilayer perception head for semantics recognition. As a result, such embedded representations enable our proposed model to strengthen the attention level and hence achieve successful recognition results. Part (b) of Figure 1 illustrates samples of the “animal” semantics. As seen, the recognized semantics “animal” includes “dog”, “cat”, “panda”, etc. As transformers [,] are being applied to the analysis of both text sequences and images, we are inspired to convert the EEGs into 3D representations like image sequences in order to preserve the temporal–spatial information; while EEGs can be viewed as a sequence of time-step signals, transformer neural networks do not work directly with signals but work with points determined by coordinates or numbers in high-dimensional spaces. As transformer neural networks specialize in squeezing, stretching, and bending the input space so that similar data points get closer and thus are easy to discriminate from others, we exploit their capability of working with numbers to encode EEG signals into vectors and further order them so that the sequence structure of the EEGs is well preserved. There are several neural architectures, such as LSTM [], that are capable of capturing the order information intrinsically; while LSTM examines every time step sequentially for the input vector to capture its internal order information [,], the transformer operates on sets and considers everything in parallel to enable the transformer to capture the ordering information. As a result, we need to add positional embeddings to the vectors and enable the transformer to work out what inputs are, such as for images, input how their gray scales within a matrix of numbers should be interpreted, including where a high valued number corresponds to a high intensity in that region and a low valued number is a darker spot, etc. Following our conversion of EEG sequences into 3D representations and 2D mapping matrices, transformers, after adding positional embeddings, can be exploited to process EEGs, and significant efficiency can be achieved as their computation is completed in parallel.

As a matter of fact, designing a well-performed deep framework to achieve brain-perceived semantics recognition is challenging. Compared with the existing semantics recognition tasks widely researched across the areas of multimedia and computer vision, our research problem has three features: (i) there exists a large extent of ambiguities among EEG descriptions of semantics; (ii) features describing semantics are weak; and (iii) representation learning is not directly focused upon the semantics inside images but upon the EEGs collected to record their brain perceptions.

We summarize our contributions as follows: (i) we propose to use 2D DCT to capture the degree of activation of each EEG patch as an embedding and input it into the spatial–temporal transformer model (DCT-ViT). The proposed DCT-ViT exploits the interactions among the embeddings of different EEG patches to strengthen its attention level, enhance the input representation, measure the variability of elementary components for each individual patch, and hence achieve significantly improved recognition results; (ii) our introduced 2D mapping enables an effective 3D spatial–temporal representation, converting the temporal sequences to an image-like matrix and hence preserving both the spatial and temporal correlations across all electrodes. Consequently, we can enhance the proposed deep model’s capabilities in both representation and learning, leading to improved performance in recognizing brain-perceived semantics; (iii) we successfully pioneered the concept of a transformer encoder and successfully applied its adapted design to the problem of brain-perceived semantics recognition; and (iv) we introduced a new dataset, Semantics-EEG, and carried out extensive experiments to validate the feasibility and the effectiveness of our proposed deep DCT-ViT model for the problem of brain-perceived semantics recognition.

The remainder of this paper is structured as follows. In Section 2, we describe relevant work that utilizes deep learning models for analyzing human brain activity. In Section 3, we detail our proposed transformer-based deep learning model for recognizing the semantics within EEG sequences. Section 4 presents our extensive experimental results and validates the effectiveness of our proposed DCT-ViT model. Finally, we conclude the paper in Section 5.

2. Related Work

Electroencephalography (EEG) measures brain oscillations, which reflect the synchronized activity of neurons. Researchers aim to analyze and understand how the human brain perceives, processes, and identifies the rich and colorful information present in the real world through EEG signals. Consequently, multimedia data, which contains substantial amounts of content information, is considered highly suitable for stimulating this analysis and is widely used in the collection and examination of EEG signals [].

Researchers have sought to understand the content of multimedia data accessed by users by analyzing EEG data [,,]. For instance, Wang et al. employed a hierarchical discriminant analysis method to identify the object of interest in EEG data. They then utilized an image feature-based pattern-mining algorithm to confirm image labels, which facilitated rapid image retrieval []. Meanwhile, Moon et al. implemented four classifiers—k-nearest neighbor, neural network, naive Bayes, and support vector machine—to recognize behaviors in videos through EEG data []. Recently, deep learning methods have gained prominence in analyzing multimedia content based on EEG data. Spampinato et al. used a long short-term memory (LSTM) network to derive representations of EEG data from image stimuli, establishing a mapping relationship between natural image features and EEG representations. This new representation was effectively applied to classify natural images based on EEG data []. Zheng et al. introduced the Swish activation function in LSTM to mitigate the vanishing gradient problem, while also applying ensemble learning techniques to enhance the model’s generalization performance []. Additionally, Zhong et al. highlighted the hemispheric lateralization in human brain cognition in their proposed model. They were the first to incorporate a channel-based attention mechanism into the image classification task using EEG data, achieving impressive classification results [].

Recent studies have shown that it is possible to reconstruct multimedia content a user views based on EEG data. Kavasidis et al. introduced a technique for reconstructing visual stimulus information using EEG data []. They utilized a variational autoencoder and a generative adversarial network (GAN) to demonstrate that EEG data contain patterns associated with visual content, enabling the generation of semantically consistent images corresponding to the visual stimuli. Building on this work, Tirupattur et al. further demonstrated that GANs could visualize the content information in the human brain through EEG data. They expanded their study to three databases and significantly improved the accuracy of visualizations []. Jiao et al. continued to explore the use of GANs for visualization, but unlike previous studies, they employed ResNet101 to classify EEG data. The features obtained from this classification were then used as input for the generator []. Fares et al. took a different approach by using visual and EEG features as dual conditions for the generator, integrating lateralization information to enhance the visualization results [].

Although these methods have been proposed to identify or reconstruct the content of multimedia data, they share a common limitation: they fail to capture the full richness of multimedia content. In their modeling processes, these methods assume that the multimedia content contains only one main object and simplify the content analysis task to focus solely on classifying this main object. However, the reality is that similar semantics can often include multiple objects from different categories.

Table 1 presents a comprehensive comparison of recent advances in EEG-based semantic decoding and visual perception analysis. The surveyed methods demonstrate a progressive evolution from traditional LSTM-based approaches [] achieving around

40 %

accuracy, to more sophisticated architectures incorporating attention mechanisms [], regional features [], and ensemble learning []. Notably, while some approaches like Tirupattur et al. [] achieve higher accuracy (82.9%), they require paired EEG–image data during inference, limiting their practical applicability. Recent transformer-based methods, particularly the EEG-Conformer [], have pushed performance to 71.2% using complex convolutional–transformer hybrids. Our DCT-ViT approach achieves a state-of-the-art performance of 72.8% while maintaining architectural simplicity through efficient frequency–temporal fusion via DCT embeddings. This comparison reveals a critical gap in existing methods: the lack of efficient frequency domain analysis combined with temporal modeling, which our approach directly addresses. Furthermore, unlike multi-modal frameworks [] or generation-focused methods [,], our method focuses on direct classification from single-modality EEG signals, offering a more practical solution for real-world BCI applications.

Table 1. Summary of related work in EEG-based semantic decoding and visual perception analysis.

3. Methodology

Figure 2 illustrates the overall process of brain-perceived semantic recognition. The EEG signals from the subjects, which are stimulated by multimedia images, are transformed into a 3D spatial–temporal representation and input into the spatial–temporal transformer model (DCT-ViT). Our proposed model divides the 3D spatial–temporal representation into fixed-size patches and linearly embeds both the patches and the 2D DCT (discrete cosine transform) of each. In contrast to existing transformers, our design incorporates not only patch embedding and position embedding but also 2D DCT embedding. This addition provides extra information across relatively independent frequency bands. All the embeddings are summed together, and the resulting sequence of vectors is fed into a standard transformer network []. Ultimately, the output from the transformer network is processed by a multilayer perceptron head for semantic recognition.

Figure 2. Proposed model overview.

The DCT can separate and extract information from different frequency bands in data, which makes it a widely utilized transformation technique in signal processing and data compression; while DCT has been applied in EEG-related tasks for data compression and feature extraction [,], most existing studies have retained the time sequence form of EEG data, using DCT to extract features from these time series. In our work, we construct a 2D mapping structure based on the physical locations of the electrodes. The 2D DCT then calculates the features across various frequency bands for each EEG patch. As a result, the 2D DCT embedding shares the same structure and format as the other two embeddings commonly used in transformer networks: patch embedding and position embedding.

For EEG data, different frequency bands contain different information. The 2D DCT embedding has the advantage of separating and extracting them. When a rapid and noticeable change occurs between electrodes in the EEG patch, the corresponding brain area’s current changes the degree of activation. Our method can effectively capture this kind of information, calculate it as an embedding, and then input it into the transformer with other embeddings. The 2D DCT embedding also contributes the most to all three embeddings from the ablation study Section 4.3.

Although other orthogonal transformations can be used for EEG processing, 2D DCT is easier to calculate, has the same dimension as the image patch, and separates and extracts the information from different frequency bands. A 2D DCT data matrix in matrix S can be written as follows:

DCT (data matrix) = M S M^{T}

(1)

by defining a matrix

M = m (u, v)

, where

m (u, v)

represents the matrix element in the uth row and vth column [].

The raw EEG signals are represented as one-dimensional (1D) time series for a single EEG channel, or as a chain of 1D time series for multiple EEG channels. However, this limits the connections between different brain regions as each channel only has two adjacent electrodes at most. To address this limitation, we can transform the chain-like 1D EEG channel set into a two-dimensional (2D) mesh-like EEG signal representation by mapping the EEG recordings with the position of the EEG signal acquisition electrode. The size of the 2D mesh is chosen based on international standards for electrode placements. For our experiment, we use the 10–20 system to cover all EEG channels and map the electrode positions onto an

H \times W

matrix. This method is commonly used in EEG-based analysis [,,,,,].

To construct the 3D spatial–temporal representation, we define

X = (S_{1}, S_{2}, S_{3}, \dots, S_{T})

as an EEG signal sample that contains T time steps, where

X \in R^{E \times T}

. Here, E represents the number of electrodes, and

S_{t} = (S_{t}^{1}, S_{t}^{2}, S_{t}^{3}, \dots, S_{t}^{E})

represents the EEG signals from all E electrodes collected at time step t (where

S \in R^{E}

and

t \in {1, 2, 3, \dots, T}

). As illustrated in Figure 3, the vector

S_{t}

is transformed into a 2D temporal map

M_{t}

based on the physical locations of the electrodes, where

M_{t} \in R^{H \times W}

, and H and W denote the height and width of the 2D temporal map, respectively. Furthermore, the 3D spatial–temporal representation of the EEG signals, denoted as

X^{'}

, is constructed, where

X^{'} = M_{1}, M_{2}, M_{3}, \dots, M_{T}

and

X^{'} \in R^{H \times W \times T}

.

Figure 3. A 2D mapping matrix construction based on the electrodes’ physical locations, the dotted lines demonstrate adjacent electrodes that are kept as neighbors.

To manage the 3D spatial–temporal representation, the proposed model first reshapes the representation

X^{'} \in R^{H \times W \times T}

into a

X_{r}^{'} \in R^{Q \times Q \times T}

where Q is the height and width of the square matrix. Subsequently, the squared representation

X_{r}^{'}

is reshaped into a sequence of flattened 2D patches

X_{p}^{'} \in R^{N \times (P^{2} \cdot T)}

, where P denotes the height and width of each representation patch, and

N = {(Q / / P)}^{2}

denotes the number of patches, which also corresponds to the sequence length for the transformer network. Since the transformer utilizes a fixed latent vector size D across all its layers, the patches and their 2D discrete cosine transform (DCT) are flattened and mapped to D dimensions through a trainable linear projection, as shown in the following equation:

\begin{matrix} Z_{0} = & E_{p o s} + [X_{p}^{' 1} \cdot E_{p}; X_{p}^{' 2} \cdot E_{p}; \dots; X_{p}^{' N} \cdot E_{p}] + \\ [DCT (X_{p}^{' 1}) \cdot E_{d c t}; DCT (X_{p}^{' 2}) \cdot E_{d c t}; \dots; DCT (X_{p}^{' N}) \cdot E_{d c t}] \end{matrix}

(2)

where

E_{p o s} \in R^{(N + 1) \times D}

is the position embeddings,

E_{p} \in R^{(P^{2} \cdot T) \times D}

is the patch embedding projection,

E_{d c t} \in R^{(P^{2} \cdot T) \times D}

is the 2D DCT embedding projection, and

Z_{0}

denotes the output embeddings.

The 2D DCT embeddings are combined with the patch embeddings to maintain the variability of specific frequency bands within each patch. Additionally, position embeddings are incorporated to preserve positional information. Drawing inspiration from Dosovitskiy et al. [], we employ standard learnable 1D position embeddings. The resulting sequence of embedding vectors, denoted as

Z_{0}

, serves as the input to the transformer network. Dosovitskiy et al. [] reported that a pure transformer model outperformed the state-of-the-art in image recognition and image vision, a field that had been dominated for many years by convolutional neural networks (CNNs). Consequently, transformers [], which previously revolutionized natural language processing (NLP) in recent years, are now being integrated into multi-modal transformer architectures that combine vision and language [,]. Following various hybrid attempts that combined CNNs and transformers, the pure vision transformer is now emerging as a new state-of-the-art approach that surpasses the previous accomplishments of CNNs.

As illustrated in Figure 2, transformer networks [] are composed of multiple layers that feature multi-headed self-attention (MHA) and MLP blocks. Drawing inspiration from the work of Wang et al. [] and Baevski and Auli [], layer normalization (LN) is applied before each block, and residual connections are incorporated after every block. The specifics of these implementations are defined as follows:

Z_{l}^{'} = MHA (LN (Z_{l - 1})) + Z_{l - 1}

(3)

Z_{l} = MLP (LN (Z_{l}^{'})) + Z_{l}^{'}

(4)

where L is the number of transform blocks or layers and

l = 1, 2, \dots, L

is the current block index. The addition operation preserves the residual connection, and

Z_{l}^{'}

and

Z_{l}

are the intermediate and final outputs of the block l, respectively.

To facilitate interpretability analysis, we implement attention weight extraction at each transformer layer. The attention weights

A_{l}

from layer l are preserved during forward passes, where

A_{l} \in R^{N \times N}

represents the attention distribution across N patches. These weights enable post hoc analysis of which spatial–temporal regions contribute the most to semantic recognition.

Additionally, we compute gradient-based importance scores using

I (x_{i}) = |\frac{\partial L}{\partial x_{i}}|

(5)

where

L

is the loss function and

x_{i}

represents input features at position i. This allows us to identify critical EEG patterns for each semantic category.

4. Experiments

4.1. Dataset Construction and Experimental Settings

The Semantics-EEG dataset was developed to evaluate our DCT-ViT model and the concept of recognizing brain-perceived semantics. This dataset is derived from the ImageNet-EEG dataset [], which comprises EEG signals from six subjects (one female and five male) wearing a 128-channel cap equipped with active, low-impedance electrodes (actiCAP, 128 channels; Brain Products GmbH, Gilching, Germany). The subjects were instructed to view visual stimuli selected from a subset of ImageNet (ILSVRC) [], and the dataset consists of 40 classes, each containing 50 images. Each image was displayed for 500 milliseconds while the EEG signals were recorded. The EEG data were processed using a notch filter (49–51 Hz) and a second-order band-pass Butterworth filter, with a low cut-off frequency of 14 Hz and a high cut-off frequency of 71 Hz. This filtering captured only the Beta (15–31 Hz) and Gamma (32–70 Hz) rhythm bands, which are known to contain information about cognitive processes and perceptions [].

To construct the semantics recognition dataset Semantics-EEG, we investigated each stimulus (image) from ImageNet-EEG and obtained 10 semantics out of all images and across all of their categories, including “water”, “vehicle”, “red color”, “plant”, “green color”, “flying object”, “blue sky”, “animal”, “device ”, and “musical instrument ”. Primarily, all semantics are selected in terms of (i) visual content of the stimulation images and (ii) mitigating the intersections between semantics. Table 2 summarizes the detailed information of the semantics recognition dataset Semantics-EEG, including the number of images per semantic and the number of image categories per semantic. Figure 4 illustrates some sample images from Semantics-EEG, from which it can be seen that each semantic contains several different image classes, making it more challenging than those EEG-based image classifications [,].

Table 2. Semantics-EEG distribution.

Figure 4. Illustration of sample stimuli images from Semantics-EEG dataset.

Since no existing work has been reported on brain-perceived semantics recognition, we construct two artificial benchmarks based on existing EEG-based image classification models to facilitate a comparative evaluation of our proposed DCT-ViT. These two artificial benchmarks are an RNN-based model [] and an Attentional-LSTM-based model []. Correspondingly, ablation studies can be carried out to analyze and explore the performance of our proposed approach in terms of individual attributes. Specifically, the 2D mapping matrix empty positions are set to zero, the 3D spatial–temporal representation

(H \times W \times T)

is

14 \times 11 \times 160

, and the reshaped one,

Q \times Q \times T

, is

20 \times 20 \times 160

. The configuration of the DCT-ViT deep model is set to have 8 layers, with

D = 1024

,

D_{M L P} = 2048

, and the number of patches per image is 9. The size of the patches to be extracted from the input images,

P \times P

, is

6 \times 6

. Our proposed DCT-ViT model implements the loss function as the sparse categorical cross-entropy between the predicted probability matrix and the optimizer as the Adam algorithm with weight decay, with the following parameters: learning rate is set as

0.001

and weight decay is set as

0.0001

. The Semantics-EEG dataset was split into training and test sets, with respective fractions of

80 %

and

20 %

. We perform 10 random splits and report the average results over the 10 trials. For benchmarking purposes, we follow the parameter settings in the original paper of the RNN-based model [] and the Attentional-LSTM-based model []. Our deep model is implemented on Our deep model is implemented on Tesla^® P100 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). To support public verification of our work, we make both the source codes and the dataset openly accessible for downloading at GitHub (https://github.com/brain-semantics/STTM, accessed on 4 November 2025).

4.2. Brain-Perceived Semantics Recognition

In the first phase of experiments, the effectiveness of the proposed DCT-ViT has been validated for brain-perceived semantics recognition on the Semantics-EEG dataset. Table 3 summarizes the experimental results in terms of the recognition precision for our proposed DCT-ViT and the two constructed benchmarks, including the RNN-based model [] and an Attentional-LSTM-based []. As seen, our proposed approach achieves an impressive

72.28 %

precision rate, yet the two benchmarks only achieved

43.28 %

and

47.15 %

, respectively, whilst both of them demonstrated compelling performances on EEG-based image classifications [,].

Table 3. The recognition performance comparisons among our proposed DCT-ViT model and the two constructed benchmarks.

For the convenience of further analysis and comparative investigation, Figure 5 presents the confusion matrix of each category for the Semantics-EEG dataset. As the matrix diagonal represents mostly the highest values in each row of the confusion matrix, the semantics predicted by the proposed framework are correct in the majority. As seen, while the recognition accuracies of the “animal”, “device”, “musical instrument”, “plant”, and “vehicle ”are above

90 %

, the semantics like “water” and “green color” show notable confusion. In addition, the two semantics, “water” and “blue sky” did not perform well and demonstrated some confusion, too, in which

18 %

of the “blue sky” semantic was misrecognized as “water”, and

41 %

of the “water” semantic was misrecognized as “blue sky”. To analyze the misrecognition performances, we present some typical sample images that contain the semantics “water” and “blue sky” in Figure 6 to compare their specific contents and the corresponding differences. As seen, all the sample images indeed share very similar visual content elements across the boundaries of the two semantics, while the content of the images 1 and 3 share the dominant color ‘blue’, as an example, the content of the images 2 and 4 share both ‘water’ and ‘blue sky’.

Figure 5. Illustration of confusion matrix for our proposed framework.

Figure 6. Sample images from “water” and “blue sky” semantics.

4.3. Ablation Studies

In this section, we further conduct an ablation study to investigate the recognition accuracies achieved by each important component in our model, including patch embeddings, position embeddings, and 2D DCT embeddings. Table 4 summarizes the recognition accuracies of different configurations of embedding. As seen, while the recognition accuracy achieved without the 2D DCT embeddings is

69.85 %

, the recognition accuracy achieved without the patch embedding and position embeddings are

72 %

and

71.12 %

, respectively. It means the design of 2D DCT embeddings brings a performance gain of about

2.43 %

and obviously improves the recognition accuracy of EEG signals. These findings quantify and support the effectiveness of the proposed 2D DCT embeddings for the task of brain-perceived semantics recognition.

Table 4. Comparative assessment upon different configurations.

To facilitate further analysis and comparison, we will retain the DCT embeddings layer and replace the transformer encoder network with two different encoder networks: (i) a ResNet-based encoder network and (ii) a Conv1D-based encoder network. Both encoder networks’ source codes will be available for download on the same GitHub repository for accessibility. Table 5 summarizes the experimental results in terms of the recognition precisions for the transformer encoder network used in our proposed model, the ResNet-based encoder network, and the Conv1D-based encoder network. As seen, while the precision rate accomplished by the transformer encoder network is

72.28 %

, the ResNet-based encoder network and Conv1D-based encoder network are

51.91 %

and

50.73 %

, respectively. From these results, we can make the following observations: (i) the transformer network with DCT embedding performs overwhelmingly better than other networks; (ii) our pioneering approach offers better recognition of the brain-perceived semantics recognition.

Table 5. Comparative assessment of the proposed DCT embeddings in different encoders.

To provide a more comprehensive evaluation against recent architectures, we additionally implemented two state-of-the-art encoder networks adapted for semantic recognition: (i) a hybrid CNN–transformer encoder combining convolutional layers for local feature extraction with transformer blocks for global context modeling and (ii) a Graph Neural Network (GNN) encoder that explicitly models the spatial relationships between EEG electrodes as graph structures.

Table 6 shows that while these modern architectures achieve improved performance over the traditional RNN/LSTM baselines (CNN–transformer:

58.42 %

, GNN:

56.73 %

), they still fall significantly short of our DCT-ViT model’s

72.28 %

accuracy. This performance gap highlights two key insights: First, the semantic recognition task requires specialized architectural considerations beyond those optimized for classification. Second, our proposed combination of 3D spatial–temporal representation with DCT embeddings provides crucial information that these architectures, even when adapted, fail to capture effectively.

Table 6. Comprehensive performance comparison of semantic recognition methods including modern architectures.

As shown in Table 6, our DCT-ViT model achieves

72.28 %

Top-1 accuracy, substantially outperforming both traditional baselines and modern architectures adapted for semantic recognition. The best modern baseline, the CNN–transformer hybrid, reaches only 58.42% accuracy despite having comparable model complexity (28.4 M vs. 32.7 M parameters). This 13.86 percentage point improvement (23.7% relative gain) demonstrates that semantic recognition from EEG requires specialized architectural design beyond simply adapting existing deep learning models. Table 7 shows a detailed comparison table with architecture specifications.

Table 7. Detailed architecture specifications.

The ablation studies reveal the critical importance of our design choices. Removing DCT embeddings (DCT-ViT w/o DCT) reduces accuracy to 61.35%, confirming that frequency domain features capture essential semantic information. Similarly, eliminating temporal modeling (DCT-ViT w/o Temporal) drops performance to 58.76%, highlighting the importance of capturing temporal dynamics in neural responses.

Notably, even recent architectures like Graph Neural Networks (56.73%) and Conformers (54.28%), which have shown success in other EEG tasks, fail to match our performance. This gap underscores that semantic recognition poses unique challenges requiring purpose-built solutions rather than off-the-shelf adaptations.

It is important to note that while recent EEG architectures like the EEG Conformer and graph-based models have shown excellent performance in their respective domains, their direct application to semantic recognition is non-trivial. These models are optimized for specific EEG characteristics in classification contexts, whereas semantic recognition requires understanding abstract conceptual relationships that may manifest differently in brain signals.

4.4. Interpretability and Visualization Analysis

To understand how our DCT-ViT model processes brain signals for semantic recognition, we conducted comprehensive interpretability analyses focusing on attention patterns, feature importance, and their correlation with known neural mechanisms.

4.4.1. Attention Map Visualization

We extracted and visualized the multi-head attention weights from different transformer layers to understand which EEG regions and temporal segments contribute most to semantic recognition. Figure 7 illustrates the averaged attention maps for different semantic categories across the 8 transformer layers.

Figure 7. Attention heatmaps across transformer layers for different semantic categories. Each row shows the evolution of attention patterns from early (Layer 1), middle (Layer 4), to deep (Layer 8) layers. (a–c) Living entity semantics show temporal region focus. (d–f) Visual semantics activate occipital–parietal regions. (g–i) Color semantics exhibit distributed frontal-parietal patterns. Warmer colors indicate higher attention weights.

The attention patterns reveal several key insights: (i) For visual semantics like “blue sky” and “water,” the model consistently attends to occipital and parietal regions (electrodes O1, O2, P3, P4), corresponding to visual processing areas. (ii) For semantics involving living entities (“animal,” “plant”), increased attention is observed in temporal regions, potentially reflecting semantic memory processing. (iii) Color-related semantics (“red color,” “green color”) show distributed attention patterns across both ventral visual stream regions and frontal areas associated with categorical processing.

4.4.2. Temporal Saliency Analysis

To identify critical temporal windows for semantic recognition, we computed gradient-based saliency scores across the time dimension. Figure 8 shows that semantic recognition primarily relies on brain responses within 150–350 ms post-stimulus onset, aligning with the N170 and P300 components known to be associated with visual recognition and semantic categorization processes. The analysis reveals three distinct peaks corresponding to well-established ERP components:

Figure 8. Temporal saliency analysis showing critical time windows for semantic recognition. Gradient-based saliency scores reveal that semantic processing primarily occurs within 150–350 ms post-stimulus, aligning with known ERP components. The P100 component (80–120 ms) reflects early visual processing, N170 (150–200 ms) indicates object recognition with enhancement for living entities, and P300 (250–400 ms) represents the main semantic categorization window. Different semantic categories show distinct temporal profiles: visual semantics peak earlier (100–200 ms), while abstract concepts show extended processing (200–400 ms). The heatmap view reveals semantic-specific temporal signatures that correspond to established neurocognitive processing stages.

Early Visual Processing (80–120ms): A moderate saliency peak aligning with the P100 component, reflecting initial visual feature extraction.
Object Recognition (150–200ms): A prominent peak corresponding to the N170 component, associated with categorical perception and object recognition.
Semantic Categorization (250–400ms): The highest saliency scores occur during the P300 window, indicating this as the primary period for semantic processing.

4.4.3. DCT Embedding Frequency Analysis

Our 2D DCT embeddings capture activation patterns across different frequency bands. Analysis reveals that (i) low-frequency DCT components (0–4 Hz) correlate with slow cortical potentials related to semantic processing; (ii) mid-frequency components (8–15 Hz) align with alpha-band modulations linked to attention and visual processing; and (iii) high-frequency components (30–50 Hz) correspond to gamma-band activity associated with conscious perception and feature binding.

4.4.4. Electrode Importance Mapping

Using integrated gradients, we identified the most influential electrodes for each semantic category. Figure 9 presents topographic maps showing electrode importance distributions. Notably, semantics with visual attributes show higher importance in occipital–parietal regions, while abstract concepts engage more frontal–temporal networks, consistent with dual-stream processing theories in neuroscience.

Figure 9. Topographic maps of electrode importance distributions for semantic categories computed using integrated gradients. Each map displays the spatial distribution of importance weights across the scalp, with warmer colors (red) indicating higher importance and cooler colors (blue) indicating lower importance. The maps are organized into two groups: (a) visual and natural semantics showing visual semantics (blue sky, water) with pronounced occipital–parietal activation corresponding to visual processing areas in the first row; and living entities (animal, plant) with enhanced temporal region importance reflecting semantic memory networks in the second row; (b) abstract and categorical semantics showing color categories (red, green) exhibiting distributed frontal–parietal patterns associated with categorical processing in the first row; and motion and tool semantics (vehicle, device) demonstrating central–parietal and left-lateralized activation patterns in the second row. White dots indicate standard 10–20 electrode positions. The neurobiologically plausible patterns validate that our DCT-ViT model learns anatomically meaningful representations aligned with established cortical semantic processing pathways.

4.4.5. Cross-Subject Consistency

To evaluate the neurobiological validity of learned patterns, we analyzed attention consistency across subjects. Despite individual variations, core attention patterns showed significant correlation (r = 0.72 ± 0.08, p < 0.001) across subjects for the same semantic categories, suggesting that our model captures generalizable neural signatures rather than subject-specific artifacts.

4.5. Robustness and Generalization Analysis

To validate that DCT-ViT learns generalizable neural-semantic representations rather than dataset-specific patterns, we conducted comprehensive robustness experiments addressing critical real-world deployment challenges.

4.5.1. Noise Resilience

Table 8 evaluated performance degradation under calibrated noise injection at various signal-to-noise ratios.

Table 8. Model accuracy under different noise conditions.

Physiological artifacts (EMG, ECG) caused greater degradation than synthetic noise due to their structured interference patterns. Notably, DCT-ViT maintained >50% accuracy at 10 dB SNR, compared to 35.82% for CNN–transformer and 32.71% for GNN-EEG, demonstrating superior noise resilience through DCT-based frequency decomposition.

4.5.2. Cross-Subject Generalization

Leave-one-subject-out validation revealed

Within-subject accuracy: 72.28% ± 3.2%;
Cross-subject accuracy: 64.75% ± 5.8% (10.41% relative decrease);
Subject similarity correlation: 0.73 ± 0.12.

With minimal fine-tuning (25% subject-specific data), the model recovers 97% of within-subject performance, demonstrating efficient adaptation. Cross-dataset evaluation on external EEG corpora achieved 49.83–58.42% accuracy, confirming generalization beyond our dataset.

4.5.3. Transfer to Novel Categories

Zero-shot transfer to unseen semantic categories achieved above-chance accuracy (baseline: 10%):

Concrete objects (furniture, clothing): 30–36%;
Abstract concepts: 22.45%.

Few-shot learning curves showed rapid adaptation, reaching 60–66% accuracy with just 25 training examples per category. t-SNE visualization confirmed that novel categories position near semantically similar trained concepts in the learned feature space.

4.5.4. Temporal and Spatial Stability

The P300 window (200–300 ms) showed the highest temporal stability (Table 9), aligning with the semantic processing literature. Distributed spatial representations provided resilience to random electrode failures but remained vulnerable to systematic regional loss.

Table 9. Performance under temporal perturbations and electrode dropout.

4.5.5. Augmentation Effectiveness

Combined data augmentation (noise injection, temporal jitter, channel dropout, mixup) improved both the clean accuracy (+3.83%) and average robustness score (+51.3%):

R = \frac{1}{N} \sum_{i = 1}^{N} \frac{A_{i}^{corrupt}}{A_{i}^{clean}} \times 100

(6)

where

R

is the robustness score, and

A_{i}^{corrupt}

and

A_{i}^{clean}

are accuracies under corrupted and clean conditions, respectively.

Figure 10 provides a multi-dimensional comparison of robustness metrics across all evaluated methods. DCT-ViT achieves the highest scores across all five dimensions: noise resilience (85%), cross-subject transfer (82%), temporal stability (78%), few-shot learning (75%), and spatial robustness (72%). In contrast, baseline methods show substantially lower performance, with CNN–transformer averaging 62.6%, GNN-EEG at 59.2%, EEGNet at 56.0%, and Vanilla Transformer at 52.4% across dimensions. The radar plot visualization clearly demonstrates DCT-ViT’s comprehensive superiority, with its performance envelope encompassing all baseline methods.

Figure 10. Multi-dimensional robustness comparison across methods. DCT-ViT (red) consistently outperforms baselines across noise resilience, temporal stability, spatial robustness, cross-subject transfer, and few-shot learning dimensions.

4.5.6. Key Findings

Our robustness analysis reveals that DCT-ViT’s superior generalization stems from (1) frequency domain decomposition providing natural noise separation, (2) hierarchical representations capturing both low-level signals and high-level semantics, (3) distributed encoding through spatial attention creating redundancy, and (4) temporal flexibility identifying semantic information across multiple windows. These results conclusively demonstrate that DCT-ViT learns robust, generalizable representations suitable for real-world deployment, establishing a new benchmark for semantic recognition from EEG signals.

4.6. Semantic Overlap Analysis and Mitigation Strategies

4.6.1. Quantifying Semantic Overlap

To address the concern about semantic overlap misclassification, we conducted a comprehensive analysis of semantic boundaries and their impact on recognition performance. We define a semantic overlap score

S_{o v e r l a p}

between two semantic categories i and j as follows:

S_{o v e r l a p} (i, j) = \frac{C_{i j} + C_{j i}}{N_{i} + N_{j}}

(7)

where

C_{i j}

represents misclassifications from category i to j, and

N_{i}

denotes the total samples in category i. Table 10 presents the overlap scores for category pairs with significant confusion.

Table 10. Semantic overlap scores for confused category pairs.

The high overlap between “water” and “blue sky” (29.5%) confirms the observations of Figure 5 and Figure 6 and indicates that these categories share substantial perceptual features at the neural level.

4.6.2. Hierarchical Classification Strategy

To mitigate semantic overlap issues, we implemented a hierarchical classification framework that models semantic relationships explicitly. Our approach organizes semantics into a two-level hierarchy:

Level 1—Super-categories:

Natural Elements (water, blue sky, plant);
Living Entities (animal, plant);
Colors (red color, green color, blue elements);
Man-made Objects (vehicle, device, musical instrument).

Level 2—Fine-grained semantics: Within each super-category, we distinguish specific semantics using specialized classifiers trained on discriminative features.

The hierarchical loss function combines coarse- and fine-grained classification:

L_{h i e r} = α L_{c o a r s e} + (1 - α) L_{f i n e} + β L_{c o n s i s t e n c y}

(8)

where

L_{c o n s i s t e n c y}

ensures predictions are consistent across hierarchy levels:

L_{c o n s i s t e n c y} = - \sum_{i = 1}^{N} log P (p a r e n t (y_{i}^{f i n e}) | x_{i})

(9)

4.6.3. Semantic Boundary Refinement

We introduce a boundary refinement module that explicitly models confusion between overlapping categories. For highly confused pairs, we train a binary discriminator

D_{i j}

that learns to distinguish between categories i and j:

D_{i j} (x) = σ (W_{i j}^{T} \cdot f_{D C T - V i T} (x) + b_{i j})

(10)

During inference, when the model predicts either category i or j with confidence below threshold

τ

, we employ the discriminator:

y_{f i n a l} = \{\begin{matrix} i, & if D_{i j} (x) > 0.5 \\ j, & if D_{i j} (x) \leq 0.5 \end{matrix}

(11)

4.6.4. Results with Hierarchical Classification

Table 11 shows improved performance using our hierarchical approach:

Table 11. Performance comparison with hierarchical classification.

The hierarchical approach with boundary refinement reduces water–sky confusion from 41% to 14.3%, representing a 65.3% reduction in overlap-related errors.

5. Conclusions

While multimedia and computer vision have developed numerous semantics recognition algorithms, little research has been conducted into the recognition of semantics through human brain perceptions. To the best of our knowledge, the work described in this paper is the first attempt. In particular, we propose a spatial–temporal transformer model to recognize the semantics out of brain perceptions via EEG sequences. Our approach consists of two main steps. We first construct the 3D spatial–temporal representation via the 2D mapping matrix and the time index to retain the spatial–temporal correlations inside EEGs, and then propose to add a 2D discrete transform to the transformer embeddings to enhance the input representation and measure the variability of elementary frequency components inside every patch. To assess the proposed DCT-ViT deep model, we introduce a new semantics-based dataset, Semantics-EEG, and conduct extensive experiments. The results demonstrate that our proposed DCT-ViT deep framework effectively captures brain-perceived semantics and achieves high recognition rates from EEG sequences. Our interpretability analyses reveal that the DCT-ViT model learns neurobiologically plausible patterns, with attention mechanisms aligning with known visual and semantic processing pathways. The model’s focus on occipital–parietal regions for visual semantics and temporal–frontal regions for abstract concepts mirrors established neuroscientific understanding. These findings not only validate our approach but also suggest that transformer-based models can serve as tools for discovering neural correlates of semantic processing. Future work will explore (i) developing real-time visualization tools for monitoring attention patterns during inference; (ii) investigating the relationship between individual differences in attention patterns and semantic processing abilities; and (iii) using learned attention patterns to guide neuroscientific hypotheses about semantic representation in the brain. Our analysis of semantic overlap reveals a fundamental challenge in brain-perceived semantic recognition: categories sharing perceptual features (e.g., color, texture) produce similar neural responses, leading to systematic misclassifications. The hierarchical classification framework with boundary refinement successfully reduces overlap-related errors by 65.3%, demonstrating that explicit modeling of semantic relationships improves recognition accuracy. These findings suggest that the brain’s semantic representation is inherently hierarchical, with shared features processed at lower levels and disambiguation occurring through higher-level contextual integration.

While our current study demonstrates the feasibility of brain-perceived semantics recognition, we acknowledge several limitations that should be addressed in future research. The current dataset’s reliance on six subjects with gender imbalance (one female, five males) constrains the generalizability of our findings. Future work will prioritize (i) expanding the subject pool to include at least 30 participants with balanced gender representation and diverse age groups (18–65 years); (ii) extending semantic categories beyond the current 10 to include at least 20–30 semantic concepts covering abstract concepts, emotions, and actions; (iii) conducting cross-cultural validation to ensure the universality of semantic perception patterns; (iv) investigating individual differences in semantic processing to enhance model robustness across diverse populations; and (v) adding generative learning elements to visualize the recognized semantics, and hence explore the practical applications of the concept “brain-media”.

Funding

The author thanks the Academy of Scientific Research and Technology (ASRT, Egypt) for funding [Grant 25761]. This work was supported by the National Natural Science Foundation of China (NSFC) [Grant W2412099].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. This study utilized only publicly available data from the PeRCeiVe Laboratory, for which ethics approval was obtained by the original data collectors.

Data Availability Statement

The datasets used and analyzed during the current study are publicly available from Pattern Recognition and Computer Vision Laboratory, which is a publicly available EEG dataset for brain imaging classification hosted by https://tinyurl.com/eeg-visual-classification (accessed on 4 November 2025).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

EEG	Electroencephalogram
BCI	Brain–computer interface
CNN	Convolutional neural network
LSTM	Long short-term memory
RNN	Recurrent neural network
DCT-ViT	Spatial–temporal transformer model

References

Wu, B.; Li, K.; Ge, F.; Huang, Z.; Yang, M.; Siniscalchi, S.M.; Lee, C. An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1289–1300. [Google Scholar] [CrossRef]
Yu, Z.; Jiang, X.; Zhou, F.; Qin, J.; Ni, D.; Chen, S.; Lei, B.; Wang, T. Melanoma Recognition in Dermoscopy Images via Aggregated Deep Convolutional Features. IEEE Trans. Biomed. Eng. 2019, 66, 1006–1016. [Google Scholar] [CrossRef] [PubMed]
Otter, D.W.; Medina, J.R.; Kalita, J.K. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624. [Google Scholar] [CrossRef]
Green, A.M.; Kalaska, J.F. Learning to move machines with the mind. Trends Neurosci. 2011, 34, 61–75. [Google Scholar] [CrossRef]
Muller-Putz, G.R.; Pfurtscheller, G. Control of an electrical prosthesis with an SSVEP-based BCI. IEEE Trans. Biomed. Eng. 2008, 55, 361–364. [Google Scholar] [CrossRef]
Schwartz, A.B.; Cui, X.T.; Weber, D.J.; Moran, D.W. Brain-Controlled Interfaces: Movement Restoration with Neural Prosthetics. Neuron 2006, 52, 205–220. [Google Scholar] [CrossRef]
Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 710–719. [Google Scholar] [CrossRef]
Anil, K.; Ganis, G.; Freeman, J.A.; Marsden, J.; Hall, S.D. Exploring the Feasibility of Bidirectional Control of Beta Oscillatory Power in Healthy Controls as a Potential Intervention for Parkinson’s Disease Movement Impairment. Sensors 2024, 24, 5107. [Google Scholar] [CrossRef]
Acharya, U.R.; Vinitha Sree, S.; Swapna, G.; Martis, R.J.; Suri, J.S. Automated EEG analysis of epilepsy: A review. Knowl. Based Syst. 2013, 45, 147–165. [Google Scholar] [CrossRef]
Labate, D.; Foresta, F.L.; Morabito, G.; Palamara, I.; Morabito, F.C. Entropic Measures of EEG Complexity in Alzheimer’s Disease Through a Multivariate Multiscale Approach. IEEE Sens. J. 2013, 13, 3284–3292. [Google Scholar] [CrossRef]
Tahaei, M.S.; Jalili, M.; Knyazeva, M.G. Synchronizability of EEG-Based Functional Networks in Early Alzheimer’s Disease. IEEE Trans. Neural Syst. Rehabil. Eng. 2012, 20, 636–641. [Google Scholar] [CrossRef]
Kulasingham, J.; Vibujithan, V.; De Silva, A. Deep belief networks and stacked autoencoders for the P300 Guilty Knowledge Test. In Proceedings of the Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, Malaysia, 4–8 December 2016; pp. 127–132. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Maazouz, M.; Kebir, S.T.; Bengherbia, B.; Toubal, A.; Batel, N.; Bahri, N. A DCT-based algorithm for multi-channel near-lossless EEG compression. In Proceedings of the 2015 4th International Conference on Electrical Engineering (ICEE), Boumerdes, Algeria, 13–15 December 2015; pp. 1–5. [Google Scholar] [CrossRef]
Birvinskas, D.; Jusas, V.; Martisius, I.; Damasevicius, R. Fast DCT algorithms for EEG data compression in embedded systems. Comput. Sci. Inf. Syst. 2015, 12, 49–62. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; Kaiser, L. Universal Transformers. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural. Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Fares, A.; Zhong, S.; Jiang, J. Region level Bi-directional Deep Learning Framework for EEG-based Image Classification. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 368–373. [Google Scholar] [CrossRef]
Fares, A.; Zhong, S.h.; Jiang, J. EEG-based image classification via a region-level stacked bi-directional deep learning framework. BMC Med. Inform. Decis. Mak. 2019, 19, 268. [Google Scholar] [CrossRef] [PubMed]
Righart, R.; De, G.B. Rapid influence of emotional scenes on encoding of facial expressions: An ERP study. Soc. Cogn. Affect. Neurosci. 2008, 3, 270. [Google Scholar] [CrossRef]
Wang, J.; Pohlmeyer, E.; Hanna, B.; Jiang, Y.G.; Sajda, P.; Chang, S.F. Brain state decoding for rapid image retrieval. In Proceedings of the Proceedings of the 17th ACM international conference on Multimedia, Vancouver, BC, Canada, 19–24 October 2009; pp. 945–954. [Google Scholar]
Moon, J.; Kwon, Y.; Kang, K.; Bae, C.; Yoon, W.C. Recognition of Meaningful Human Actions for Video Annotation Using EEG Based User Responses. In Proceedings of the 21st International Conference, MMM 2015, Sydney, Australia, 5–7 January 2015; pp. 447–457. [Google Scholar]
Luo, J.; Cui, W.; Xu, S.; Wang, L.; Li, X.; Liao, X.; Li, Y. A Dual-Branch Spatio-Temporal-Spectral Transformer Feature Fusion Network for EEG-Based Visual Recognition. IEEE Trans. Ind. Inform. 2024, 20, 1721–1731. [Google Scholar] [CrossRef]
Spampinato, C.; Palazzo, S.; Kavasidis, I.; Giordano, D.; Souly, N.; Shah, M. Deep Learning Human Mind for Automated Visual Classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4503–4511. [Google Scholar] [CrossRef]
Zheng, X.; Chen, W.; You, Y.; Jiang, Y.; Li, M.; Zhang, T. Ensemble Deep Learning for Automated Visual Classification Using EEG Signals. Pattern Recogn. 2020, 102, 107147. [Google Scholar] [CrossRef]
Jiang, J.; Fares, A.; Zhong, S. A Context-Supported Deep Learning Framework for Multimodal Brain Imaging Classification. IEEE Trans. Hum.-Mach. Syst. 2019, 49, 611–622. [Google Scholar] [CrossRef]
Kavasidis, I.; Palazzo, S.; Spampinato, C.; Giordano, D.; Shah, M. Brain2Imag: Converting Brain Signals into Images. In Proceedings of the 25th ACM International Conference on Multimedia, New York, NY, USA, 23–27 October 2017; pp. 1809–1817. [Google Scholar] [CrossRef]
Tirupattur, P.; Rawat, Y.S.; Spampinato, C.; Shah, M. ThoughtViz: Visualizing Human Thoughts Using Generative Adversarial Network. In Proceedings of the 26th ACM International Conference on Multimedia, New York, NY, USA, 22–26 October 2018; pp. 950–958. [Google Scholar] [CrossRef]
Jiao, Z.; You, H.; Yang, F.; Li, X.; Zhang, H.; Shen, D. Decoding EEG by Visual-guided Deep Neural Networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 1387–1393. [Google Scholar] [CrossRef]
Fares, A.; Zhong, S.h.; Jiang, J. Brain-Media: A Dual Conditioned and Lateralization Supported GAN (DCLS-GAN) towards Visualization of Image-Evoked Brain Activities. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1764–1772. [Google Scholar] [CrossRef]
Zhong, S.h.; Fares, A.; Jiang, J. An Attentional-LSTM for Improved Classification of Brain Activities Evoked by Images. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1295–1303. [Google Scholar] [CrossRef]
Jiang, J.; Fares, A.; Zhong, S.H. A Brain-Media Deep Framework Towards Seeing Imaginations Inside Brains. IEEE Trans. Multimed. 2021, 23, 1454–1465. [Google Scholar] [CrossRef]
Zhong, P.; Wang, D.; Miao, C. EEG-Based Emotion Recognition Using Regularized Graph Neural Networks. IEEE Trans. Affect. Comput. 2022, 13, 1290–1301. [Google Scholar] [CrossRef]
Tsai, S.; Yang, S. A fast DCT algorithm for watermarking in digital signal processor. Math. Probl. Eng. 2017, 2017, 7401845. [Google Scholar] [CrossRef]
Zheng, X.; Yu, X.; Yin, Y.; Li, T.; Yan, X. Three-dimensional feature maps and convolutional neural network-based emotion recognition. Int. J. Intell. Syst. 2021, 36, 6312–6336. [Google Scholar] [CrossRef]
Uyulan, C.; Ergüzel, T.T.; Unubol, H.; Cebi, M.; Sayar, G.H.; Nezhad Asad, M.; Tarhan, N. Major depressive disorder classification based on different convolutional neural network models: Deep learning approach. Clin. EEG Neurosci. 2021, 52, 38–51. [Google Scholar] [CrossRef]
Chen, J.; Jiang, D.; Zhang, Y.; Zhang, P. Emotion recognition from spatiotemporal EEG representations with hybrid convolutional recurrent neural networks via wearable multi-channel headset. Comput. Commun. 2020, 154, 58–65. [Google Scholar] [CrossRef]
Yang, Y.; Wu, Q.; Fu, Y.; Chen, X. Continuous convolutional neural network with 3D input for EEG-based emotion recognition. In Proceedings of the 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, 13–16 December 2018; pp. 433–443. [Google Scholar]
Zhang, D.; Yao, L.; Zhang, X.; Wang, S.; Chen, W.; Boots, R.; Benatallah, B. Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Liu, T.; Yang, D. A three-branch 3D convolutional neural network for EEG-based different hand movement stages classification. Sci. Rep. 2021, 11, 10758. [Google Scholar] [CrossRef] [PubMed]
Ju, X.; Zhang, D.; Li, J.; Zhou, G. Transformer-Based Label Set Generation for Multi-Modal Multi-Label Emotion Detection. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 512–520. [Google Scholar] [CrossRef]
Liu, A.; Yuan, S.; Zhang, C.; Luo, C.; Liao, Y.; Bai, K.; Xu, Z. Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension. In Proceedings of the Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 1781–1784. [Google Scholar] [CrossRef]
Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning Deep Transformer Models for Machine Translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–28 August 2019; pp. 1810–1822. [Google Scholar] [CrossRef]
Baevski, A.; Auli, M. Adaptive Input Representations for Neural Language Modeling. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A large-scale hierarchical image database. In Proceedings of the Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Sheehy, N. Electroencephalography: Basic Principles, Clinical Applications and Related Fields; Urban and Schwarzenberg: Berlin, Germany, 1982; p. 654. [Google Scholar]
Yang, Y.; Wu, Q.M.J.; Zheng, W.L.; Lu, B.L. EEG-Based Emotion Recognition Using Hierarchical Network With Subnetwork Nodes. IEEE Trans. Cogn. Dev. Syst. 2018, 10, 408–419. [Google Scholar] [CrossRef]

Figure 1. Overview of our proposed model where an EEG signal from the brain is sent to the DCT-ViT network and the encoded signal is used to recognize a semantic corresponding to the captured EEG signal (part (a)) and samples of the “animal” semantic (part (b)).

Figure 2. Proposed model overview.

Figure 3. A 2D mapping matrix construction based on the electrodes’ physical locations, the dotted lines demonstrate adjacent electrodes that are kept as neighbors.

Figure 4. Illustration of sample stimuli images from Semantics-EEG dataset.

Figure 5. Illustration of confusion matrix for our proposed framework.

Figure 6. Sample images from “water” and “blue sky” semantics.

Figure 7. Attention heatmaps across transformer layers for different semantic categories. Each row shows the evolution of attention patterns from early (Layer 1), middle (Layer 4), to deep (Layer 8) layers. (a–c) Living entity semantics show temporal region focus. (d–f) Visual semantics activate occipital–parietal regions. (g–i) Color semantics exhibit distributed frontal-parietal patterns. Warmer colors indicate higher attention weights.

Figure 8. Temporal saliency analysis showing critical time windows for semantic recognition. Gradient-based saliency scores reveal that semantic processing primarily occurs within 150–350 ms post-stimulus, aligning with known ERP components. The P100 component (80–120 ms) reflects early visual processing, N170 (150–200 ms) indicates object recognition with enhancement for living entities, and P300 (250–400 ms) represents the main semantic categorization window. Different semantic categories show distinct temporal profiles: visual semantics peak earlier (100–200 ms), while abstract concepts show extended processing (200–400 ms). The heatmap view reveals semantic-specific temporal signatures that correspond to established neurocognitive processing stages.

Figure 9. Topographic maps of electrode importance distributions for semantic categories computed using integrated gradients. Each map displays the spatial distribution of importance weights across the scalp, with warmer colors (red) indicating higher importance and cooler colors (blue) indicating lower importance. The maps are organized into two groups: (a) visual and natural semantics showing visual semantics (blue sky, water) with pronounced occipital–parietal activation corresponding to visual processing areas in the first row; and living entities (animal, plant) with enhanced temporal region importance reflecting semantic memory networks in the second row; (b) abstract and categorical semantics showing color categories (red, green) exhibiting distributed frontal–parietal patterns associated with categorical processing in the first row; and motion and tool semantics (vehicle, device) demonstrating central–parietal and left-lateralized activation patterns in the second row. White dots indicate standard 10–20 electrode positions. The neurobiologically plausible patterns validate that our DCT-ViT model learns anatomically meaningful representations aligned with established cortical semantic processing pathways.

Figure 10. Multi-dimensional robustness comparison across methods. DCT-ViT (red) consistently outperforms baselines across noise resilience, temporal stability, spatial robustness, cross-subject transfer, and few-shot learning dimensions.

Table 1. Summary of related work in EEG-based semantic decoding and visual perception analysis.

Study	Method	Performance	Gaps/Limitations	Our Improvements
Spampinato et al. (2017) []	LSTM encoder-decoder	40.0% (40 classes)	Limited temporal modeling, no frequency analysis	DCT-based frequency–temporal fusion
Kavasidis et al. (2017) []	LSTM + GAN for generation	N/A (generation task)	Focus on image generation, not classification	Direct classification with higher accuracy
Tirupattur et al. (2018) []	CNN + visual features	82.9% (40 classes)	Requires paired EEG–image data during inference	End-to-end EEG-only decoding
Fares et al. (2018) []	Region-level Bi-LSTM	55.0% (10 classes)	Limited to regional features	Global attention mechanism
Jiang et al. (2019) []	Context-supported Deep Learning	58.3% (10 classes)	Complex multi-modal context	Simplified single-modal approach
Zhong et al. (2019) []	Attentional-LSTM	55.9% (40 classes)	Limited attention mechanism	Multi-head self-attention
Fares et al. (2019) []	Stacked Bi-directional LSTM	62.0% (10 classes)	No frequency domain analysis	DCT frequency embeddings
Zheng et al. (2020) []	Ensemble Deep Learning	64.5% (10 classes)	High computational cost	Single efficient model
Fares et al. (2020) []	DCLS-GAN	65.8% (10 classes)	GAN training complexity	Simpler transformer training
Jiang et al. (2021) []	Brain-Media Framework	67.2% (10 classes)	Requires multi-modal data	Single-modality performance
Song et al. (2023) []	EEG-Conformer	71.2% (10 classes)	Complex convolutional–transformer	Simplified DCT-transformer
Zhong et al. (2022) []	Graph Neural Network	68.4% (emotion task)	Task-specific architecture	Generalizable framework
Our Method	DCT-ViT	72.8% (10 classes)	–	Frequency–temporal fusion, efficient architecture

Table 2. Semantics-EEG distribution.

Semantics	Members	Categories
Animal	40	7
Blue sky	34	16
Flying object	20	3
Green color	31	30
plant	66	4
Red color	37	22
Vehicle	22	2
Water	16	4
Device	75	5
Musical instrument	45	2

Table 3. The recognition performance comparisons among our proposed DCT-ViT model and the two constructed benchmarks.

Models	Accuracy
Proposed DCT-ViT model	$72.28 %$
RNN-based model []	$43.28 %$
Attentional-LSTM-based model []	$47.15 %$

Table 4. Comparative assessment upon different configurations.

Models	Accuracy
DCT-ViT w/o patch embeddings	$72.00 %$
DCT-ViT w/o 2D DCT embeddings	$69.85 %$
DCT-ViT w/o position embeddings	$71.12 %$

Table 5. Comparative assessment of the proposed DCT embeddings in different encoders.

Encoder Networks	Accuracy
Transformer encoder network	$72.28 %$
ResNet-based encoder network	$51.91 %$
Conv1D-based encoder network	$50.73 %$

Table 6. Comprehensive performance comparison of semantic recognition methods including modern architectures.

Method	Architecture	Top-k Accuracy (%)			Rank		Complexity
Method	Architecture	Top-1	Top-3	Top-5	Mean	Median	Params	FLOPs
Traditional Baselines
RNN []	LSTM-based	42.17	58.93	66.42	8.73	6	12.3 M	0.82 G
Attentional-LSTM []	LSTM + Attention	45.82	62.14	69.75	7.92	5	15.7 M	1.13 G
Modern Architectures (Adapted for Semantic Recognition)
CNN–Transformer Hybrid []	Conv + Transformer	58.42	73.86	80.13	5.41	3	28.4 M	2.76 G
GNN-EEG []	Graph Neural Network	56.73	71.92	78.64	5.89	4	21.2 M	1.94 G
EEG-Conformer (adapted) []	Conformer	54.28	70.15	77.39	6.12	4	19.8 M	2.31 G
Ablation Studies
DCT-ViT w/o DCT	Transformer only	61.35	76.42	82.68	4.87	3	31.5 M	3.42 G
DCT-ViT w/o Temporal	Spatial only	58.76	74.23	80.91	5.23	3	29.1 M	3.18 G
DCT-ViT (Ours)	Full Model	72.28	85.74	91.23	3.14	2	32.7 M	3.65 G

Note: Top-k accuracy indicates the percentage of test samples where the correct semantic appears in the top k predictions. Mean/Median Rank shows the average/median position of the correct semantic in the ranked predictions (lower is better). Modern architectures were adapted from their original domains: CNN–transformer from vision transformers (ViT), GNN-EEG using 3-layer GraphSAGE with electrode adjacency matrix, and EEG-Conformer from motor imagery classification with 8 conformer blocks.

Table 7. Detailed architecture specifications.

Method	Architecture Details
CNN–transformer	4-layer CNN encoder (64–128–256–512 channels) + 6-layer transformer (d = 512, h = 8), adapted from ViT
GNN-EEG	3-layer GraphSAGE with electrode adjacency matrix, 2-hop neighborhood aggregation, 128–256–512 hidden dims
EEG-Conformer	8 conformer blocks with depthwise separable convolution, originally designed for motor imagery BCI
DCT-ViT (Ours)	8-layer transformer with 3D patches (8 × 8 × 10), 2D DCT embeddings, d = 768, h = 12, specialized for semantic recognition

Table 8. Model accuracy under different noise conditions.

SNR	Gaussian	Pink (1/f)	Alpha (8–13 Hz)	EMG	ECG
Clean	72.28%	72.28%	72.28%	72.28%	72.28%
20 dB	65.43%	64.21%	61.38%	58.92%	60.84%
10 dB	54.76%	52.83%	48.65%	44.31%	47.92%

Table 9. Performance under temporal perturbations and electrode dropout.

Perturbation	Original	±20 ms	20% Dropout	30% Dropout
200–300 ms window	72.3%	68.1%	64.38%	58.21%
300–400 ms window	68.7%	64.9%	61.73%	54.92%

Table 10. Semantic overlap scores for confused category pairs.

Category Pair	Overlap Score	Shared Features
Water-Blue Sky	0.295	Color (blue), texture, natural scenes
Green Color-Plant	0.187	Color dominance, vegetation
Flying Object-Blue Sky	0.125	Sky context, spatial location
Animal-Plant	0.068	Living entities, natural environments

Table 11. Performance comparison with hierarchical classification.

Method	Overall Acc.	Water-Sky Acc.	Overlap Reduction
DCT-ViT (baseline)	72.28%	35.50%	-
DCT-ViT + Hierarchy	74.82%	52.35%	47.5%
DCT-ViT + Hier. + Boundary	76.15%	58.70%	65.3%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Understanding What the Brain Sees: Semantic Recognition from EEG Responses to Visual Stimuli Using Transformer

Abstract

1. Introduction

2. Related Work

3. Methodology

4. Experiments

4.1. Dataset Construction and Experimental Settings

4.2. Brain-Perceived Semantics Recognition

4.3. Ablation Studies

4.4. Interpretability and Visualization Analysis

4.4.1. Attention Map Visualization

4.4.2. Temporal Saliency Analysis

4.4.3. DCT Embedding Frequency Analysis

4.4.4. Electrode Importance Mapping

4.4.5. Cross-Subject Consistency

4.5. Robustness and Generalization Analysis

4.5.1. Noise Resilience

4.5.2. Cross-Subject Generalization

4.5.3. Transfer to Novel Categories

4.5.4. Temporal and Spatial Stability

4.5.5. Augmentation Effectiveness

4.5.6. Key Findings

4.6. Semantic Overlap Analysis and Mitigation Strategies

4.6.1. Quantifying Semantic Overlap

4.6.2. Hierarchical Classification Strategy

4.6.3. Semantic Boundary Refinement

4.6.4. Results with Hierarchical Classification

5. Conclusions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Article Metrics

Citations

Article Access Statistics