Abstract
Feature extraction (FE) is an important step in electroencephalogram (EEG)-based classification for brain–computer interface (BCI) systems and neurocognitive monitoring. However, the dynamic and low-signal-to-noise nature of EEG data makes achieving robust FE challenging. Recent deep learning (DL) advances have offered alternatives to traditional manual feature engineering by enabling end-to-end learning from raw signals. In this paper, we present a comparative review of 88 DL models published over the last decade, focusing on EEG FE. We examine convolutional neural networks (CNNs), Transformer-based mechanisms, recurrent architectures including recurrent neural networks (RNNs) and long short-term memory (LSTM), and hybrid models. Our analysis focuses on architectural adaptations, computational efficiency, and classification performance across EEG tasks. Our findings reveal that efficient EEG FE depends more on architectural design than model depth. Compact CNNs offer the best efficiency–performance trade-offs in data-limited settings, while Transformers and hybrid models improve long-range temporal representation at a higher computational cost. Thus, the field is shifting toward lightweight hybrid designs that balance local FE with global temporal modeling. This review aims to guide BCI developers and future neurotechnology research toward efficient, scalable, and interpretable EEG-based classification frameworks.
1. Introduction
Electroencephalography (EEG) is a non-invasive and affordable technique for studying brain activity across clinical, cognitive, engineering, and security domains [1]. The classification of different brain states or conditions, such as motor imagery (MI), emotional responses, seizure detection, or cognitive workload, is a major focus in EEG research [2]. EEG is also utilized for biometric authentication and identification, treating neural signals as unique signatures to verify or differentiate individuals [3,4]. Success in these tasks depends on FE to transform raw, noisy, and variable EEG signals into informative and robust representations for machine learning (ML) algorithms [5]. Traditional EEG FE combined common spatial patterns (CSP), wavelet transforms (WTs), and spectral power analysis (SPA) techniques with shallow ML methods like support vector machines (SVMs) for classification. While effective in controlled settings, they are sensitive to noise, intrasubject variability, and the high-dimensional nature of EEG signals [6,7,8]. These approaches require domain expertise and extensive preprocessing. In contrast, modern DL models offer the advantage of learning features directly from raw or minimally processed EEG data. DL models have been applied in classification tasks and user authentication [9,10,11]. Although promising, they face large computational requirements, limited training data, and difficulties in generalizing across subjects and sessions. Many studies have recently focused on designing DL architectures that balance representational power with efficiency. Researchers have adapted CNNs, recurrent networks, Transformer-based models, and hybrid designs to capture temporal, spatial, and frequency features and reduce the computational complexity for EEG signals [12,13]. Efficient FE is key to improve the classification accuracy (Acc.) and enable practical implementation in BCIs and mobile neurotechnology systems [14]. In this review, we survey and compare DL models developed for efficient FE in EEG-based classification between 2015 and 2025. We examine their design principles, computational trade-offs, and reported performance across various EEG applications. The contributions of this paper are as follows:
- A comprehensive review of 114 DL-based EEG classification papers, including systematic reviews, CNN-based models, Transformer-based models, CNN–Transformer hybrids, and other recurrent-based hybrids.
- An evaluation and discussion of 88 DL-based EEG models, covering the most common network architectures, along with an analysis of efficiency and performance challenges.
- An in-depth trade-off analysis using multiple evaluation approaches to cover a wider spectrum of possible trade-offs.
- The identification of current challenges in DL-based EEG classification and potential directions to inform future research.
The remainder of this review is organized as follows. We first describe the methodology and paper collection process. Next, we provide a brief background for EEG-based classification and existing surveys. We then survey DL models for efficient FE, grouped by architectural family—namely convolutional, Transformer-based, recurrent-based, and hybrid networks. Comparative trade-off insights across these models are highlighted, followed by a discussion of emerging trends, future directions, and recommendations.
2. Methods
To ensure transparency in conducting our comparative review, we followed a procedure inspired by the PRISMA reporting principles (Figure 1), without claiming a formal systematic review. For a comprehensive review, we collected a total of 114 papers about DL-based EEG classification. The initial screening was based on titles and abstracts. Then, we removed duplicate and irrelevant studies. The remaining papers were reviewed in full to assess their methodologies, FE techniques, DL architectures, and performance metrics. We categorized key information from each study to structurally compare FE strategies.
2.1. Identification
In the selection process used to identify relevant studies, we implemented a title and abstract-based strategy to search major academic research databases, namely IEEE Xplore, Scopus, ScienceDirect, PubMed, and Google Scholar. These databases were chosen to cover engineering, biomedical, ML, and interdisciplinary venues where EEG-based DL studies are published. The IEEE Xplore database consists of articles in engineering and interdisciplinary fields. The Scopus and ScienceDirect databases consist of articles in engineering, computing, and scientific-related research. PubMed comprises bioengineering and neuroscience articles. Google Scholar indexes papers across multiple disciplines and publishers, including engineering, computer science, neuroscience, and biomedicine. For EEG-based DL classification, which spans these domains, Google Scholar helps to capture studies that may not be indexed in a single database. We only considered papers published between 2015 and 2025 to ensure the inclusion of recent and rapid advancements in DL architectures. Using relevant keywords including “EEG”, “feature extraction”, “deep learning”, “classification”, and their combinations, an initial pool of 893 papers was retrieved. The keywords were selected based on common usage in the literature. The acronym “EEG” was used because it is a widely adopted form in paper titles and abstracts across relevant fields, which made it effective for identification during the search phase. The full term “deep learning” was used instead of its acronym to avoid ambiguity, as certain acronyms have different meanings across disciplines. The full term is also commonly used in article titles, which improved the search accuracy.
2.2. Screening
The collected papers were subjected to a multistage screening process. First, titles and abstracts were examined to identify studies relevant to DL-based EEG classification. We removed duplicate records, non-English publications, and studies unrelated to our scope. A total of 256 papers were retained for abstract and full-text examination. Studies were considered eligible if they met the following inclusion criteria: (1) surveys on EEG-based classification, (2) employed EEG-based deep learning classification models, and (3) included DL-based design strategies to improve feature extraction. Surveys and studies lacking methodological detail were excluded at this stage.
Figure 1.
Workflow model of our methodology.
2.3. Inclusion
The full-text assessment resulted in 114 papers that we kept for an in-depth analysis. Instead of applying inclusion/exclusion rules typical of systematic reviews, we organized the studies based on their methodological focus. The papers were categorized as follows: (1) 40 papers on CNN-based EEG classification, (2) 15 papers on Transformer-based EEG classification, (3) 22 papers on CNN–Transformer hybrid architectures, (4) 11 papers using miscellaneous deep learning approaches, and (5) 26 surveys on EEG-based classification.
2.4. Analysis
For each study, input representations, feature extraction techniques, network architectures, and reported performance and efficiency metrics were extracted and compared to identify trends, strengths, and challenges in the field. This structured grouping allowed us to perform a fair comparison of DL strategies without imposing the constraints of a formal systematic review protocol.
3. Related Work
3.1. Background
CNNs are DL models that can automatically extract patterns from grid-structured data. The core convolution in CNNs relies on shared weights across an image [15]. Many concepts have been developed to enhance FE, and many important CNN models have emerged: VGGNet, emphasizing a simple but deep design for hierarchical feature learning [16]; Inception networks, with parallel modules capturing multiscale features [17]; ResNet, enabling deeper networks through residual connections [18]; DenseNet, improving the learning efficiency by connecting all layers [19]; MobileNet, creating lightweight networks for mobile devices [20]; EfficientNet, balancing Acc. and efficiency through smart scaling [21]; RegNet, focusing on scalable network design [22]; and ConvNeXt, modernizing CNNs with Transformer updates while keeping them suitable for resource-limited applications [23].
CNNs have been used for EEG classification tasks due to their built-in spatial biases, which allow them to learn from proximity and locality relationships (Figure 2). They learn features from raw data, forming complex representations layer by layer [15]. This makes them suitable when EEG signals are converted into image-like formats such as spectrograms or scalograms (Figure 3). Common 2D array formats are [channels × time], [channels × frequency], or [frequency/scales × time]. Initial layers learn basic waveforms or frequency bands and deeper layers extract intricate spatiotemporal relationships across channels and time steps [24].
Figure 2.
EEG analysis using a fundamental CNN structure. Adapted from Takahashi et al., 2024. Comparison of Vision Transformers and Convolutional Neural Networks in Medical Image Analysis: A Systematic Review, J Med Syst, 48 (1), 84 [9], licensed under Creative Commons Attribution 4.0 (CC BY 4.0). The original figure was modified by replacing the MRI input image with an EEG input representation; all other elements remain unchanged.
Figure 3.
Different image-like input representations for EEG data.
CNNs efficiently recognize spatial patterns across EEG electrodes, where nearby signals reflect related brain activity [21]. They also offer translation invariance, detecting features that appear at slightly different times or forms [25]. This allows them to generalize across trials and subjects despite signal variability [25,26]. Their low computational overhead also makes them practical for real-time systems with limited processing power, like wearable EEG headsets or embedded systems [26]. CNNs still have limitations for EEG despite these strengths. The use of fixed convolutional filters limits CNNs’ flexibility in modeling event-related potential (ERP) signals, which vary in length or timing [27]—for instance, late ERP components like P600, which occurs hundreds of milliseconds after a stimulus [27]. Thus, CNNs’ limited temporal receptive fields mean that they may struggle to model the global temporal context that describes how brain activity evolves over long periods [28].
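To make the [channels × time] convention concrete, the sketch below shows a minimal CNN of this kind in PyTorch. It is an illustrative example only, not a reproduction of any cited architecture; the channel count, kernel sizes, and filter numbers are assumptions chosen for readability.

```python
# A minimal sketch, assuming a [channels x time] input of 22 electrodes x
# 500 samples and 4 classes; kernel sizes and filter counts are illustrative.
import torch
import torch.nn as nn

class TinyEEGCNN(nn.Module):
    def __init__(self, n_channels=22, n_classes=4):
        super().__init__()
        # Temporal convolution: learns frequency-like filters along time.
        self.temporal = nn.Conv2d(1, 8, kernel_size=(1, 64), padding=(0, 32), bias=False)
        # Spatial convolution: mixes information across all electrodes.
        self.spatial = nn.Conv2d(8, 16, kernel_size=(n_channels, 1), bias=False)
        self.bn = nn.BatchNorm2d(16)
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.classify = nn.Linear(16, n_classes)

    def forward(self, x):                        # x: (batch, 1, channels, time)
        x = self.temporal(x)
        x = torch.relu(self.bn(self.spatial(x)))
        x = self.pool(x)
        return self.classify(x.flatten(1))

model = TinyEEGCNN()
logits = model(torch.randn(2, 1, 22, 500))       # -> shape (2, 4)
```

The temporal convolution acts as a learned filter bank along time, while the spatial convolution mixes electrodes, mirroring the layer roles described above.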
The Transformer architecture (Figure 4) was initially proposed for natural language processing (NLP). It replaces recurrence in traditional RNNs and LSTM with self-attention (SA) to improve the scalability. It is a seq2seq model with an encoder and a decoder composed of identical layers. Encoder layers consist of multihead SA and feedforward networks with residual connections, followed by normalization. Decoder layers add a cross-attention module between the SA and feedforward blocks and modify SA to prevent access to future positions [13,29,30]. In 2018, GPT showed that Transformers could extract features for coherent text generation [31]. In 2019, BERT introduced bidirectional attention to improve language understanding [32]. In 2020, T5 reframed NLP tasks under a text-to-text setup to unify FE [33]. In 2021, Vision Transformer (ViT) surpassed CNNs by extending FE to images [34]. In 2022, PaLM scaled FE for stronger reasoning and generalization [35]. In 2023, LLaMA optimized FE for smaller models [36]. In 2024, DeepMind’s Gemini integrated vision and language features in a single model [37]. Today, Transformers continue to advance FE across language, vision, audio, and multiagent reasoning.
Figure 4.
Transformer architecture: from left to right, encoder–decoder, attention mechanism, multihead attention [13,29,30]. Reproduced from Vafaei and Hosseini, 2025. Transformers in EEG Analysis: A Review of Architectures and Applications in Motor Imagery, Seizure, and Emotion Classification, Sensors, 25 (5), 1293 [13], licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
Transformers have been adopted in EEG applications due to their ability to model long-term dependencies in data via SA mechanisms [34,38,39,40,41,42,43,44,45,46]. The SA mechanism computes pairwise attention between all input positions. Unlike CNNs, which focus on local patterns, Transformers detect relationships across long time periods or between distant brain regions [13]. This is critical in EEG tasks like identifying late ERPs that appear seconds from earlier signals but are semantically correlated. Transformers offer several benefits for EEG analysis. First, their global temporal modeling allows them to track complex EEG/ERP components spanning long periods [39,42,45]. This ability makes them suitable for cognitive or emotional states where brain activity is less time-locked. Their attention weights indicate relevant signal segments for better interpretability [45,46]. In addition, their dynamic receptive fields allow them to adapt to variable timing across users or trials without relying on fixed filters, like CNNs [45,46]. Despite their potential, Transformers face challenges in EEG. They lack built-in spatial or temporal biases, so they must learn these relationships from data, which is difficult with small or noisy datasets [39,42,45]. Their quadratic time and memory complexity also makes them computationally expensive for long recordings [42,44]. Lastly, Transformers need careful regularization since they are prone to overfitting on limited or highly variable EEG data [42,45].
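Because SA is central to the trade-offs discussed above, the following minimal sketch (PyTorch; shapes and projection matrices are illustrative assumptions) shows scaled dot-product self-attention over an EEG token sequence and makes the quadratic cost in sequence length explicit.

```python
# A minimal sketch of scaled dot-product self-attention applied to an EEG
# sequence of T time steps with d-dimensional embeddings (shapes assumed).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, T, d); w_q/w_k/w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, T, T)
    weights = F.softmax(scores, dim=-1)   # attention over all positions
    return weights @ v                    # each step attends globally

T, d = 256, 64                            # e.g., 256 EEG time steps, 64-dim tokens
x = torch.randn(1, T, d)
w = [torch.randn(d, d) for _ in range(3)]
out = self_attention(x, *w)               # (1, 256, 64); cost grows as O(T^2)
```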
3.2. Recent Reviews
Many surveys have focused on broad and comparative reviews by systematically assessing ML and DL methods for EEG classification across diverse domains. Roy et al. [8] reviewed 154 DL studies published during 2010–2018 about epilepsy, sleep, BCIs, and cognitive research. They reported CNNs as the most common architectures, with a 5.4% Acc. improvement over traditional methods. These were followed by RNNs, adopted in about half of the studies and applied to raw or preprocessed EEG time series. They pointed out reproducibility issues caused by limited access to open datasets and code, and they recommended standards for experimental reporting. Craik et al. [2] also surveyed 90 DL studies on emotion recognition (ER), MI, seizure detection, mental workload, sleep stage scoring, and ERP detection. They found that CNNs, RNNs, and deep belief networks (DBNs) repeatedly outperformed conventional shallow classifiers. A workflow diagram to guide input selection and hyperparameter tuning was provided. Saeidi et al. [47] expanded on this and mapped preprocessing, FE, and classification pipelines across supervised ML and DL methods. Their work emphasized the rapid adoption of DL methods, while classic ML methods remain relevant. They created a catalog of common feature choices and listed the persistent standardization and reproducibility gaps. Li et al. [48] framed EEG classification in broader psychological and physiological contexts, clarifying conceptual pathways for research. Prabowo et al. [49] and Vempati and Sharma [50] summarized recent DL models, covering theoretical foundations and technical methods. They pointed out improvements in performance across EEG tasks. More recent studies, including those of Mohammed et al. [51] and Gatfan [52], provide comparative analyses across EEG applications. Mohammed et al. [51] presented an overview of CNNs, RNNs, Transformers, and hybrid approaches. They noted that hybrid and DL pipelines now dominate alongside classical ML, owing to the transformative efficacy of both domains. Gatfan [52] compared ML and DL methods for disease diagnosis and brain behavior analysis. They showed that DL (especially CNNs) outperformed traditional ML, which often requires combining multiple algorithms to achieve similar Acc. These surveys point out the continuing need for methodological rigor and reproducibility.
Other authors have focused on application-specific reviews. EEG-based ER has been a prominent application area. Suhaimi et al. [53] reviewed studies from 2016 to 2019. They examined how emotions were evoked, the sample sizes used, and the EEG hardware and ML techniques. Their review pointed out virtual reality (VR) as an emerging method to deliver emotional stimuli. Rahman et al. [54] combined theoretical foundations with practical EEG-based ER techniques, detailing common emotion theories, FE methods, and classification strategies. Khare et al. [55] analyzed 142 multimodal studies that combined EEG with ECG, GSR, eye tracking, speech, and facial cues. Emotion models, datasets, and key challenges like variable signal lengths, a lack of trust in model outputs, and limited real-time support were reviewed. Jafari et al. [56] focused on real-time DL applications and their feasibility. They discussed hardware-focused solutions like system-on-chip (SoC), field-programmable gate array (FPGA), and application-specific integrated circuit (ASIC) implementations and provided a comprehensive examination of technical advances. Ma et al. [57] distinguished subject-dependent and subject-independent DL models, clarifying pathways for personalized and large-scale ER. Using a PRISMA review of 64 DL studies, Gkintoni et al. [58] reported that multimodal approaches can achieve Acc. above 90% and emphasized dataset variety. They recommended adaptive models and standardized protocols.
MI EEG and BCIs have also been the focus of many researchers. Al-Saegh et al. [59] reviewed DL approaches and their input representations. They summarized preprocessing and architectural strategies that yield strong MI BCI performance. Ko et al. [60] surveyed techniques to reduce the BCI calibration time, pointing out generative data augmentation and explicit transfer learning as promising avenues. Pawan et al. [61] compiled over 220 ML-based BCI studies, offering practical guidance from acquisition to classification. Saibene et al. [62] emphasized benchmarking and wearable technology integration. Recent contributions have also addressed clinical translation. Moreno-Castelblanco et al. [63] reviewed lower-limb MI in neurorehabilitation, underlining multimodal fusion, minimal channels, and portable BCI designs to improve accessibility and usability. Wang et al. [64] compared 13 DL models, offering practical ablation-based architectural design recommendations. They showed that multistream CNNs with LSTM and spatial features enhance performance, while fully connected layers increase the complexity (Complex.) without improving Acc.
Cognitive workload and neuropsychology are other areas of focus. Hassan et al. [65] conducted a PRISMA review of cognitive workload estimation. SVMs and DL models were reported as the most dominant techniques. They suggested that, to translate lab findings into deployable systems, multimodal dataset integration, standard testing protocols, and real-world validation are a necessity. Bardeci et al. [66] assessed CNNs and LSTM in psychiatric EEG, stressing the weak reporting of clinical features, flawed validation, and non-independent test data. They provided recommendations to improve the methodological rigor.
Others have surveyed specific contexts. Nwagu et al. [67] reviewed EEG in immersive environments by analyzing BCIs in virtual and augmented reality (AR). They focused on techniques like steady-state visual evoked potential (SSVEP) with AR for control and MI with VR for rehabilitation. Discomfort and low information transfer rates were noted as challenges. Dadebayev et al. [68] compared consumer-grade EEG devices to research-grade equipment. They revealed limitations in data quality and classification performance when using commercial headsets as practical challenges in applied research.
Other reviews have examined DL architectures for EEG. Klepl et al. [69] surveyed graph neural networks (GNNs) for ER, MI, and neurological diagnosis tasks. They cataloged design patterns, including graph construction and node and edge features. Spectral graph convolutions were found to be more prevalent than spatial ones. The review suggested transfer learning and cross-frequency modeling as future directions. Vafaei and Hosseini [13] reviewed Transformer-based EEG models. They categorized them into time-series (TS), vision, graph attention, and hybrid Transformers and discussed augmentation and transfer learning as solutions for data scarcity.
Several studies have emphasized methodological shortcomings in EEG research. Bardeci et al. [66] highlighted weak clinical rigor in psychiatric DL EEG studies, calling for stronger reporting and validation. Roy et al. [8] also flagged reproducibility issues due to unavailable data and code. Dadebayev et al. [68] noted reliability concerns with consumer-grade EEG for ER. Wang et al. [64] provided ablation-based evaluations to show which architectural design choices affect performance. They also offered practical recommendations. Table 1 sums up these reviews and their contributions.
Table 1.
Recent surveys on EEG-based classification.
3.3. Study Scope
It is important to note that our analysis of prior surveys is limited to their thematic scopes and findings to identify gaps within the literature. We did not perform a formal methodological quality or risk-of-bias assessment of the review papers, as our objective was the comparative evaluation of EEG-based DL models. Despite the growth of DL in EEG-based classification, most reviews focus on either broad architectural comparisons [8,13,51,69] or pipelines for specific applications, mainly ER, MI, and cognitive workload [53,59,65]. While these surveys examine Acc. trends and design choices, they do not explain the role of FE and its effects on both performance and efficiency. Only a few studies [54,59,61] have compared FE strategies across models or tasks, leaving gaps for researchers seeking efficient and generalizable pipelines. In this review, we address this gap by providing a comparative analysis of FE methods for EEG-based classification across CNNs, RNNs, Transformers, and hybrid architectures. Different from earlier reviews, this work examines the most recent studies in the field, with particular attention to their network architectures, FE strategies, computational efficiency, and performance outcomes. The studies were categorized by model type to identify common architectural patterns and variations in FE approaches. Special focus was placed on how design choices, such as feature fusion, input representation, and regularization methods, influence classification performance and system efficiency. This efficiency-driven comparative review shifts the focus from performance comparisons to deployment evaluations, which have not been addressed in prior EEG DL surveys. The aim is to draw out practical insights that can guide both researchers and applied system developers.
4. Feature Extraction for EEG-Based Classification
4.1. Efficiency in CNN-Based Models
Efficient FE aims to boost the classification performance (Acc., robustness) while reducing the overall cost (fewer parameters, fast inference, simple models). Over the past decade, many studies (Table 2) have proposed CNN architectures to improve the efficiency of EEG FE. In this section, we analyze a selection of studies to show how each approach enhances FE and impacts both performance and efficiency. Most works focus on EEG-based biometric mechanisms to address traditional challenges. For clarity, we group the studies by their main methodological focus, although overlap across categories exists.
4.1.1. Applications to Raw EEG
Initial studies of CNNs showed that temporal and spatial features could be directly extracted from minimally preprocessed signals, eliminating the need for manually engineered features. The work by Ma et al. [70] showed that CNNs could automatically extract useful patterns from raw resting-state EEG signals and create reliable “brain fingerprints” to identify individuals. Their end-to-end pipeline was jointly optimized using gradient descent. It uses two convolutional layers to extract invariant temporal patterns, followed by two average pooling layers to reduce the dimensionality and computational cost (Comp. Cost), and ends with a fully connected layer for classification. This approach achieved 88% Acc. in a 10-class identification task and maintained strong results with very low-frequency bands (0–2 Hz) and short temporal segments (<200 ms), reaching 76% Acc. with only 62.5 ms of data. Following this direction, Mao et al. [71] used a CNN to learn discriminative EEG features for person identification. Their pipeline processed raw EEG signals from a large-scale driving fatigue experiment. The CNN architecture utilized three convolutional layers with ReLU and max pooling modules after each, followed by two fully connected layers and a softmax output for classification. This method achieved 97% Acc. from 14,000 testing epochs and was trained in only 0.3 h for over 100 K epochs, outperforming traditional shallow classifiers. Schirrmeister et al. [11] proposed shallow and deep ConvNets for raw EEG decoding. The shallow network, inspired by Filter Bank Common Spatial Patterns (FBCSP), combined temporal convolutions, spatial filtering, squaring nonlinearity, mean pooling, and logarithmic activation to extract band power features. The deep network captured hierarchical spatiotemporal modulations, splitting the first block into temporal and spatial layers for regularization. Cropped training with sliding windows augmented the dataset and reduced overfitting. Although training was slower than for FBCSP, the predictions were efficient. Batch normalization and ELU activation improved its performance. Visualizations revealed meaningful frequency modulation, demonstrating robust decoding and interpretable feature learning. This approach outperformed FBCSP pipelines, achieving a higher mean Acc. of 84.0% compared to 82.1%. Similarly, Schöns et al. [72] presented a deep CNN-based biometric system trained on raw EEG. Recordings were segmented into 12 s overlapping windows to expand the training data. The network, composed of convolution, ReLU activation, pooling, and normalization layers, processed the signals, and classification layers were later discarded. CNN outputs served as compact feature vectors for verification. The system achieved an equal error rate (EER) of 0.19%, with the gamma band providing the most discriminative signal. Sliding-window augmentation and the shallow CNN enabled both scalability and near-perfect identification Acc. Di et al. [73] also proposed a CNN-based EEG biometric identification system. EEG signals were preprocessed and segmented before being input into the network. Temporal convolutions captured frequency patterns, while spatial convolutions modeled correlations across electrodes. Deeper layers extracted subject-specific representations to distinguish individual participants. Zhang et al. [74] designed HDNN and CNN4EEG for EEG-based event classification.
HDNN divides EEG epochs into sub-epochs processed by child DNNs with shared weights, boosting the training speed and cutting the memory usage, while improving the Acc. over standard DNNs. CNN4EEG uses custom convolutional filters adapted to EEG’s spatial and temporal structure to effectively capture its spatiotemporal patterns. CNN4EEG outperformed all baselines, achieving 13% higher Acc. than shallow methods, 9% higher than that of a canonical CNN, and 6% higher than that of a DNN. These works demonstrate feasibility, but the systems’ performance depends on large models and extensive training data.
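Several of these raw-EEG pipelines rely on sliding-window ("cropped") segmentation to multiply the number of training examples, as in [11,72]. The sketch below is a minimal NumPy illustration under assumed window and stride values, not the exact settings of any cited study.

```python
# A minimal sketch of sliding-window ("cropped") augmentation: one trial is
# cut into many overlapping crops that share a label (values illustrative).
import numpy as np

def sliding_crops(trial, win=500, stride=125):
    """trial: (channels, time) array -> (n_crops, channels, win)."""
    crops = [trial[:, s:s + win]
             for s in range(0, trial.shape[1] - win + 1, stride)]
    return np.stack(crops)

trial = np.random.randn(64, 2000)          # 64 channels, 2000 samples
crops = sliding_crops(trial)               # -> (13, 64, 500) training crops
```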
4.1.2. Frequency- and Spectral-Domain Approaches
Further applications have transformed EEG signals into compact representations to highlight spectral discriminants. To adapt CNNs to consumer-grade EEG, González et al. [75] proposed a 1D CNN operating on power spectral density (PSD) estimates computed from 6 s EEG windows with Welch’s method. The PSD inputs offered compact frequency-domain representations, reducing the dimensionality while retaining discriminative information. Convolution and max pooling layers operated like wavelet decomposition by extracting coarse spectral features; combined with downsampling, subsequent layers captured finer temporal–spectral patterns. An inception block further enabled multiscale feature learning. This approach achieved 94% Acc., surpassing SVM baselines and showing that PSD-based CNNs can efficiently extract subject-specific features from low-cost hardware. For SSVEP classification, Waytowich et al. [76] compared traditional FE methods such as canonical correlation analysis (CCA) with the compact EEGNet [12]. The compact CNN extracted frequency, phase, and amplitude features directly from raw EEG. This contrasts with CCA, which requires prior knowledge of the stimulus signals and performs well in synchronous paradigms but fails in asynchronous ones. Using temporal convolutions as bandpass filters and depthwise spatial convolutions for spatial filtering, its compact design reduces the parameters and supports training on small datasets. It achieved 80% cross-subject Acc., outperforming traditional methods and demonstrating robust, calibration-free performance for BCI applications. Expanding this line of work, Yu et al. [77] focused on low-frequency SSVEP components (<20 Hz) as discriminative features for user authentication. These signals were marked by high intersubject and low intrasubject variability. They were isolated using a Chebyshev low-pass filter to suppress less informative high-frequency oscillations. For classification, the authors extended Schirrmeister’s shallow ConvNet [11] to a multiclass variant (M-Shallow) by introducing parallel temporal filters and additional layers to improve the scalability without increasing the complexity. Despite its lightweight design of 30 k parameters, it preserved fast training and inference and delivered high Acc. across multiple tasks. This integration yielded 97% cross-day authentication Acc. on eight subjects and shows that shallow architectures can be adapted to complex EEG biometrics with high efficiency. Compared with raw-signal CNNs, frequency-domain CNNs balance interpretability and efficiency by focusing on known neural rhythms. They draw on neuroscience principles to simplify models and enhance their robustness. These models show that embedding spectral knowledge into the network is more effective than increasing its depth.
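As an illustration of the PSD-based inputs used in this line of work, the following sketch computes Welch periodogram features from a single EEG window with SciPy; the sampling rate, window length, and band limits are assumptions rather than the settings of [75].

```python
# A minimal sketch of a PSD-based input: Welch periodograms computed on a
# 6 s multichannel window (sampling rate and band limits are assumed).
import numpy as np
from scipy.signal import welch

fs = 128                                   # assumed sampling rate (Hz)
window = np.random.randn(16, 6 * fs)       # 16 channels x 6 s of EEG

freqs, psd = welch(window, fs=fs, nperseg=2 * fs, axis=-1)
band = (freqs >= 1) & (freqs <= 40)        # keep 1-40 Hz
features = psd[:, band]                    # compact (channels x freq) input
```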
4.1.3. EEG Representation and Topology Strategies
Other EEG strategies focus on how signals are structured before entering CNNs. They range from spatial electrode mappings to graph-based connectivity encodings to exploit the spatial organization and network properties of brain activity. Lai et al. [78] examined how EEG input representation affects CNN-based biometrics. The matrix of amplitude vs. time presents raw EEG amplitude values in their default channel order. It preserves spatial patterns, minimizes the preparation time, and achieves high identification Acc. Converting data into images introduces slight information loss but reduces storage, proving an acceptable trade-off for handling large datasets. Normalizing energy in an image of energy vs. time stabilizes large power values, simplifies CNN calculations, and improves FE. Rearranging channels by correlation degrades performance by disrupting the 2D spatial patterns that are essential for CNN feature learning. The study shows that CNN performance depends on input data structuring. Wang et al. [79] represented EEG signals as functional connectivity (FC) graphs generated using the phase locking value (PLV) in the beta and gamma bands, producing fused representations. Applying graph CNNs (GCNNs) to these graphs resulted in higher correct recognition rates (CRRs), robust generalization, reduced training time, and efficient transfer learning. In a related approach, 1D EEG data were converted into 2D layouts and processed with 3D CNNs for the simultaneous extraction of spatial and temporal features. This 3D method outperformed 2D CNNs, reduced the input dimensionality, and captured complex spatiotemporal patterns in ERP classification. Graph-based connectivity methods [80] and 3D convolutional representations [81] extended this further, capturing interchannel relationships and depth. Wang et al. [80] evaluated three EEG biometric methods—RHO + CNN, UniFeatures + CNN, and Raw + CNN—in terms of efficiency and performance. Firstly, RHO + CNN integrates FC with a CNN. The FC module calculates beta-band synchronization using the RHO index to generate 2D maps, which the CNN processes for FE and classification. Training converges in about 10 epochs, and the model achieves a CRR of 99.94% with low equal error rates (EERs). Its success comes from stable identity information in FC. Secondly, UniFeatures + CNN extracts univariate features such as autoregressive (AR) coefficients, fuzzy entropy (FuzzyEn), and PSD. While efficient, it yields lower CRRs and higher EERs in cross-state authentication, due to the state dependence of these features. Lastly, Raw + CNN directly processes raw EEG, requiring around 25 epochs for training and producing the lowest CRRs and highest EERs. It is highly sensitive to mental states, noise, and artifacts, resulting in poor generalization. Overall, RHO + CNN is the fastest, most accurate, and robust method, effectively handling mental state variabilities. Zhang et al. [81] applied 3D convolution to high-density EEG by modeling the electrode layout as a 2D grid with time as a third dimension to jointly learn scalp topology-aware spatial filters and temporal dynamics. The model’s computation was optimized through kernel and dimensionality choices. It outperformed 2D CNNs in tasks using spatial patterns like frontal–occipital activity and achieved strong results in ER and seizure detection. These representation strategies emphasize spatial locality and generalize better across cognitive tasks in comparison with frequency-based models.
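The functional-connectivity inputs in [79,80] are built from pairwise synchrony measures such as the PLV. A minimal NumPy sketch of the PLV computation is given below; band-pass filtering is omitted, and the channel count is an arbitrary assumption.

```python
# A minimal sketch of the phase locking value (PLV): pairwise phase synchrony
# between band-filtered channels (filtering itself is omitted here).
import numpy as np
from scipy.signal import hilbert

def plv_matrix(x):
    """x: (channels, time) band-filtered EEG -> (channels, channels) PLV."""
    phase = np.angle(hilbert(x, axis=-1))              # instantaneous phase
    diff = phase[:, None, :] - phase[None, :, :]       # pairwise phase differences
    return np.abs(np.exp(1j * diff).mean(axis=-1))     # mean phase coherence

x = np.random.randn(8, 1000)         # 8 channels of (assumed) beta-band EEG
adjacency = plv_matrix(x)            # symmetric matrix fed to a graph CNN
```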
4.1.4. ERP- and VEP-Based CNN Models
In contrast to continuous EEG, discrete approaches leverage stimulus-locked potentials as discriminative signatures, embedding spatial and temporal filtering directly into CNN architectures. Das et al. [82] investigated visual evoked potentials (VEPs) as stable biometric signatures. Data from 40 subjects across two sessions underwent preprocessing, including common average referencing (CAR), bandpass filtering, downsampling, z-score normalization, and detrending, before generating averaged VEP templates. Template averaging enhanced the signal-to-noise ratio and reduced the input dimensionality for efficient training. These inputs were processed by stacked convolutional and max pooling layers to extract spatiotemporal features and improve the robustness to subject variability. The system achieved 98.8% rank-1 identification Acc. and strong temporal stability across sessions performed a week apart. Cecotti et al. [83] conducted a systematic evaluation of CNN architectures of varying depth for single-trial ERP detection. They showed that well-designed CNNs can efficiently extract discriminative ERP features with minimal preprocessing. Integrating convolutional spatial filtering and shift-invariant temporal convolution achieved an average area under the curve (AUC) of 0.905 across subjects, while weight sharing reduced the parameters and accelerated training. Training multisubject classifiers eliminated subject-specific calibration and enhanced the efficiency and practicality of ERP-based BCI systems. Although the initial training was computationally intensive, the resulting models offered robustness and scalability beyond traditional linear methods. Extending this work, Cecotti et al. [84] evaluated 1D, 2D, and 3D CNNs for single-trial ERP detection. They introduced a volumetric representation by remapping 64 EEG sensors into a 2D scalp layout and adding time as a third dimension. These multidimensional convolutions simultaneously captured robust spatiotemporal features, improved generalization, and reduced the sensitivity to subject variability. The best 3D CNN required fewer inputs and achieved a mean AUC of 0.928, demonstrating superior efficiency and scalability. Chen et al. [85] proposed the Global Spatial and Local Temporal Filter CNN (GSLT-CNN). It integrates global spatial convolutions, which capture interchannel dependencies, with local temporal convolutions that extract fine dynamics. Trained directly on raw EEG signals, it achieved 96% Acc. across 157 subjects and up to 99% in rapid serial visual presentation (RSVP)-based cross-session tasks. The model was trained with 279,000 epochs in under 30 min, while remaining lightweight, robust, and highly adaptable. This demonstrates strong potential for EEG biometrics and broader BCI applications. These studies show how CNNs can be adapted to task-specific temporal alignments, in contrast to generic time–frequency approaches.
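Template averaging of stimulus-locked trials, as used in the VEP pipeline of [82], can be expressed in a few lines. The sketch below is illustrative only; the trial counts, group size, and shapes are assumptions.

```python
# A minimal sketch of template averaging: averaging stimulus-locked trials
# raises SNR and shrinks the input fed to the CNN (shapes are illustrative).
import numpy as np

def average_template(trials, group_size=5):
    """trials: (n_trials, channels, time) -> (n_templates, channels, time)."""
    n = (trials.shape[0] // group_size) * group_size
    grouped = trials[:n].reshape(-1, group_size, *trials.shape[1:])
    return grouped.mean(axis=1)

trials = np.random.randn(60, 17, 256)   # 60 VEP epochs, 17 channels
templates = average_template(trials)    # -> (12, 17, 256) higher-SNR inputs
```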
4.1.5. Multiscale and Temporal Modeling Strategies
Other strategies have expanded the receptive fields and decomposed EEG into finer sub-bands. This allows CNNs to capture both short- and long-range dynamics efficiently. Salami et al. [86] introduced EEG-ITNet, an Inception-based temporal CNN, for efficient MI classification in BCI systems. FE was performed through inception modules and causal convolutions with dilation. These modules separated multichannel EEG into informative sub-bands using parallel convolutional layers with different kernel sizes. Depthwise convolutions combined electrode information. The temporal convolution block employed dilated causal convolutions in residual layers to increase the receptive field. This hierarchical design efficiently integrated spectral, spatial, and temporal information with fewer parameters and enhanced interpretability comparable to other architectures. EEG-ITNet achieved 76.74% mean Acc. in within-subject and 78.74% in cross-subject scenarios and also generalized well on the OpenBMI dataset. Bai et al. [87] leveraged temporal convolutional networks (TCNs), which apply causal and dilated convolutions to efficiently capture long-range temporal features from sequential EEG data. Dilated convolutions enabled exponentially expanding receptive fields that could model dependencies far into the past, while residual connections stabilized the training and architecture depth. TCNs processed sequences in parallel to avoid the bottlenecks of recurrent networks, reduce memory usage during training, and allow faster inference. Compared to LSTM and generative recurrent units (GRUs), TCNs delivered superior performance across tasks. They improved FE and the predictive Acc., making TCNs a scalable and powerful alternative for EEG sequence modeling. For MI decoding in BCIs, Riyad et al. [88] introduced Incep-EEGNet, a deep ConvNet architecture. The pipeline processed raw EEG signals through bandpass filtering and trial segmentation before end-to-end FE and classification. Incep-EEGNet applied filters similar to EEGNet [12], followed by an Inception block. This block integrated parallel branches with varying convolutional kernels, pointwise convolution, and average pooling to extract richer temporal features. Depthwise convolution reduced the Comp. Cost, while pointwise convolution served as a residual connection. The model achieved 74.07% Acc. and a kappa of 0.654, outperforming traditional methods. Liu et al. [89] also proposed a parallel spatiotemporal self-attention CNN for four-class MI classification. The spatial module assigns higher weights to motor-relevant channels and reduces artifacts. The temporal module emphasizes MI-relevant sampling points and encodes continuous temporal changes. The model achieved 78.51% Acc. for intrasubject and 74.07% for intersubject classification. This lightweight design supports real-time BCI applications and outperformed both traditional and DL models. Multiscale strategies were further refined by Zhu et al. [90] in RAMST-CNN, a residual and multiscale spatiotemporal convolutional network for EEG-based personal identification. The model combined residual connections, multiscale grouping convolutions, global average pooling, and batch normalization. Parallel convolutions extracted both coarse and fine temporal patterns, while residual links facilitated gradient flow. Despite being lightweight, RAMST-CNN achieved 99.96% Acc. Similarly, Lakhan et al. [91] introduced a broadband network. 
It processes multiple EEG frequency bands in parallel through convolution branches tailored to distinct ranges. By integrating multiband FE into a single model, it achieved higher Acc. than single-band approaches. It maintained lower parameter counts than training separate models for each band. Ding et al. [92] introduced Tsception, a multiscale CNN for ER from raw EEG. It consisted of three layers: a dynamic temporal layer, an asymmetric spatial layer, and a high-level fusion layer. The temporal layer used 1D kernels of varying lengths scaled to the EEG sampling rate to capture both long-term low-frequency and short-term high-frequency dynamics. The spatial layer used global and hemisphere-specific kernels to exploit EEG asymmetry and model interhemispheric relationships. Features were then integrated by the fusion layer for classification. Tsception achieved higher Acc. and F1 scores than other methods while remaining compact with fewer parameters, supporting online BCI use. These approaches trade off efficiency for robustness. They capture dynamic EEG patterns beyond fixed time windows, situating them between raw CNNs and sequential models.
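The dilated causal convolutions underlying TCN-style models [86,87,94] can be illustrated with the minimal PyTorch sketch below, which shows how the receptive field grows exponentially with depth; the kernel size, channel count, and dilation schedule are assumptions, not published hyperparameters.

```python
# A minimal sketch of stacked dilated causal convolutions (TCN-style);
# values are illustrative.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation            # left-pad only: causal
        self.conv = nn.Conv1d(ch, ch, kernel, dilation=dilation)

    def forward(self, x):                             # x: (batch, ch, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

layers = nn.Sequential(*[CausalConv1d(16, dilation=2 ** i) for i in range(4)])
x = torch.randn(1, 16, 500)
y = layers(x)                                         # same length as input
# receptive field = 1 + (3 - 1) * (1 + 2 + 4 + 8) = 31 past samples
```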
4.1.6. Compact and Lightweight Architectures
Further CNN approaches prioritize efficiency by reducing the parameter counts to improve their distinctive power. A key milestone was given by Lawhern et al. [12], who introduced EEGNet, a compact CNN for general-purpose BCI applications. The network processed raw EEG trials directly, using temporal convolutions for frequency filters, depthwise convolutions for spatial patterns, and a separable convolution to summarize temporal dynamics. This design performed well with limited data and no augmentation. EEGNet’s efficiency and interpretable features made it robust and practical for both ERP- and oscillatory-based BCIs. Subsequent variants [93,94] optimized EEGNet [12]. Salimi et al. [93] applied EEGNet [12] to extract cognitive signatures for biometric identification. They combined the N-back task cognitive protocol with an optimized EEGNet model to elicit and capture robust EEG patterns. This approach reduced the recording time and Comp. Cost. Using single EEG segments of only 1.1 s, the lightweight network achieved up to 99% identification Acc. Ingolfsson et al. [94] also developed an EEGNet [12] and TCN-based [87] model optimized for edge devices. EEG-TCNet used temporal convolutions with depthwise separable filters and dilation to capture EEG dynamics. It had only 4272 parameters and 6.8 M multiply–accumulate operations (MACs) per inference, but it achieved 83.8% Acc. on four-class MI, matching larger networks. Its low memory and compute footprint supports real-time on-device BCI processing, indicating that high Acc. is achievable with very few parameters. The model also generalized well across 12 datasets, surpassing prior results on the MOABB benchmark. Similarly, Kasim et al. [95] showed that even simple 1D CNNs could achieve competitive performance with minimal parameter counts. Their model extracted temporal features from each EEG channel using small convolution kernels, fused them efficiently, and avoided 2D operations, resulting in a lightweight network with only 10 K parameters. It matched the Acc. of more complex models in ER and MI tasks while being faster and easier to train. Lightweight convolutional strategies continued with Wu et al. [96]. They introduced MixConv CNNs by embedding varying convolution kernels within each layer to emulate an FBCSP approach. Different kernel lengths targeted frequency bands from delta to gamma to allow simultaneous temporal filtering. Mixed-FBCNet used 0.45 M parameters and only 10 EEG channels yet achieved high Acc. and low EERs. Building on this compactness, Altuwaijri et al. [97] developed a multibranch CNN for MI. EEG signals were split into three branches, processed by EEGNet [12] modules, and recombined using squeeze and excitation (SE) attention blocks, which reweighted features to improve the discriminability and efficiency. This compact architecture achieved 70% Acc. on challenging four-class MI tasks, outperforming the single-branch EEGNet [12]. Autthasan et al. [98] proposed MIN2Net, a multitask learning model for subject-independent MI EEG classification, to eliminate calibration for new users. EEG signals were bandpass-filtered and encoded into latent vectors by a multitask autoencoder. Deep metric learning (DML) with triplet loss refined these embeddings by clustering same-class samples and separating different-class samples, and a supervised classifier performed the final prediction. This design reduced preprocessing and kept the model small. 
MIN2Net improved the F1 score by 6.7% on SMR-BCI and 2.2% on OpenBMI, with ablation studies confirming the contribution of the DML module. Latent feature visualizations showed clearer feature clustering than baselines, and its consistent performance across binary and multiclass tasks demonstrates its suitability for calibration-free online BCI applications. Bidgoly et al. [99] used Siamese and triplet CNNs for EEG biometrics. Their model learned compact embeddings of EEG segments for subject matching, rather than multiclass classification. This approach avoided large output layers, ensuring scalability as the number of subjects increased, and achieved a 98.04% CRR with a 1.96% error rate on 105 subjects from the PhysioNet EEG dataset. Alsumari et al. [100] also studied EEG-based identification under controlled recording conditions as a matching task between distinct brain states (open vs. closed eyes). Using a CNN to differentiate individuals, their approach achieved a 99.05% CRR with only a 0.187% error rate on PhysioNet. As in the work of Bidgoly et al. [99], the model was compact, with small output layers. The results indicated robust FE for efficient and reliable identification. These studies emphasize deployment and robustness in resource-limited environments. Networks like EEGNet [12] and its variants [93,94] show that, for EEG tasks, efficiency and inductive biases matter more than large architectures. Unlike computer vision, where depth means richer features, EEG’s low SNR and small datasets make shallow models more effective. This is why EEGNet [12] has become a baseline for research and applied settings.
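The parameter economy of EEGNet-style models comes largely from depthwise and separable convolutions. The following sketch assembles such a block in PyTorch and counts its weights; the filter counts (F1, D, F2) and kernel lengths are illustrative assumptions and do not reproduce the published EEGNet [12] configuration.

```python
# A minimal sketch of a depthwise + separable convolutional block in the
# spirit of compact EEG CNNs (hyperparameters are illustrative assumptions).
import torch
import torch.nn as nn

n_ch, F1, D, F2 = 22, 8, 2, 16
blocks = nn.Sequential(
    nn.Conv2d(1, F1, (1, 64), padding=(0, 32), bias=False),   # temporal filters
    nn.Conv2d(F1, F1 * D, (n_ch, 1), groups=F1, bias=False),  # depthwise spatial
    nn.BatchNorm2d(F1 * D), nn.ELU(), nn.AvgPool2d((1, 4)),
    nn.Conv2d(F1 * D, F1 * D, (1, 16), groups=F1 * D,
              padding=(0, 8), bias=False),                    # separable: depthwise...
    nn.Conv2d(F1 * D, F2, (1, 1), bias=False),                # ...then pointwise
    nn.BatchNorm2d(F2), nn.ELU(), nn.AvgPool2d((1, 8)),
)
n_params = sum(p.numel() for p in blocks.parameters())
print(n_params)   # about 1.4 K weights with these settings, far below deep CNNs
```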
4.1.7. Benchmarking and Next-Generation CNN Designs
Comparative benchmarks highlight trade-offs across CNN families. Recently, Yap et al. [101] benchmarked CNN architectures such as GoogLeNet, InceptionV3, ResNet50/101, DenseNet201, and EfficientNet-B0 for EEG-based classification. EEG signals were converted to time–frequency images as input to these networks. They found that these CNNs could achieve high Acc. on EEG tasks, although trade-offs exist between model complexity and speed. EfficientNet-B0 matched the Acc. of deeper models while using fewer parameters, making it faster. ResNet101 and DenseNet201 offered only slight Acc. gains at the cost of a higher Comp. Cost and memory. The study concluded that balancing performance with computational efficiency is crucial and highlighted the potential of lightweight transfer learning for EEG. Chen et al. [102] introduced EEGNeX, a novel CNN that leverages the latest DL advances. The model combines expanded receptive fields, attention mechanisms, depthwise convolutions, bottleneck layers, residual connections, and optimized convolutional blocks to enhance spatiotemporal FE in EEG signals. As with EEGNet [12] and prior lightweight models, this model achieved high Acc. across multiple EEG classification tasks. Shakir et al. [103] proposed a CNN-based EEG authentication system with three convolutional layers, max pooling, two ReLU-activated dense layers, dropout, and a softmax output optimized using root mean square propagation (RMSprop). Feature selection used Gram–Schmidt orthogonalization (GSO) to identify three key channels (Oz, T7, Cz). For FE, the CNN was a “fingerprint model”, with the penultimate layer’s vectors compared against stored templates using cosine similarity (CS). Supporting both single-task and multitask FE, the system’s Acc. improved from 71% to 95% after optimizing the final dense layer to 30 neurons. This makes it a robust approach for practical EEG authentication. These works illustrate that simplified vision architectures can be adapted for EEG. This reinforces that EEG benefits from shallow designs rather than from scaled ones.
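The template-matching step described for the "fingerprint" CNN of Shakir et al. [103] reduces to a cosine-similarity comparison between a probe embedding and stored user templates. The sketch below illustrates this with NumPy; the embedding size, user identifiers, and acceptance threshold are hypothetical.

```python
# A minimal sketch of cosine-similarity matching between a penultimate-layer
# feature vector and stored templates (all values are hypothetical).
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

templates = {"user_01": np.random.randn(30), "user_02": np.random.randn(30)}
probe = np.random.randn(30)                 # embedding of the probe EEG segment

scores = {uid: cosine_similarity(probe, t) for uid, t in templates.items()}
best = max(scores, key=scores.get)
accepted = scores[best] > 0.8               # accept only above a threshold
```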
4.1.8. Fusion and Hybrid Feature Extraction Strategies
Beyond unimodal pipelines, fusion strategies integrate complementary FE pathways, like spatial, temporal, spectral, or multimodal information, for improved robustness and efficiency. Wu et al. [104] fused EEG with blinking features for multitask authentication. Using an RSVP paradigm, the system evoked stable EEG and electrooculography (EOG) signals. Hierarchical EEG features were extracted by selecting discriminative channels and time intervals via pointwise biserial correlation, averaging, and forming spatiotemporal maps for a CNN. Blinking signals were processed into time-domain morphological features through a backpropagation network. The two modalities were fused at the decision level using least squares, achieving 97.6% Acc. Subsampling and weight sharing ensured computational efficiency. Özdenizci et al. [105] incorporated adversarial training into CNN FE by adding an auxiliary domain discriminator to enforce subject-invariant EEG features. This approach reduced the need for calibration and improved cross-subject generalization, with only minor computational overhead. Musallam et al. [106] advanced hybrid feature fusion by introducing TCNet-Fusion, an enhanced version of EEG-TCNet [94] with a fusion layer combining features from multiple depths. Shallow outputs from an initial EEGNet [12] module were concatenated with the deep temporal features of a TCN [87]. This multilevel fusion improved the MI classification Acc. compared to the base EEG-TCNet [94] while adding a negligible Comp. Cost. The model was small due to the efficiency of EEGNet [12] and the TCN [87]. Mane et al. [107] introduced FBCNet, a CNN designed to handle limited training data, noise, and multivariate EEG for MI classification in BCIs. The network processed raw EEG through multiview spectral filtering and isolated MI-relevant frequency bands. The spatial convolution block captured discriminative spatial patterns for each band. The variance layer extracted temporal features by computing the variance in non-overlapping windows, emphasizing event-related desynchronization/synchronization (ERD/ERS)-related activity. These strategies simplified the processing of complex EEG, enhanced the robustness, and reduced overfitting. The compact FBCNet performed well across diverse MI datasets, including stroke patients. These implementations show CNNs’ flexibility, transitioning from standalone classifiers to modular components in multimodal or sequential pipelines. They function as efficient feature extractors when paired with TCNs or RNNs or integrated into adversarial/fusion strategies.
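The variance layer of FBCNet [107] can be illustrated compactly: temporal variance over non-overlapping windows approximates band power and highlights ERD/ERS activity. The sketch below is a minimal PyTorch version under an assumed window count, not the published implementation.

```python
# A minimal sketch of a variance layer: temporal variance over
# non-overlapping windows as a compact band-power-like feature.
import torch

def variance_layer(x, n_windows=4):
    """x: (batch, features, time) -> (batch, features, n_windows)."""
    b, f, t = x.shape
    x = x[:, :, : (t // n_windows) * n_windows]   # trim to a multiple of n_windows
    windows = x.reshape(b, f, n_windows, -1)
    return windows.var(dim=-1)

x = torch.randn(8, 32, 1000)          # spatially filtered sub-band signals
features = variance_layer(x)          # -> (8, 32, 4) compact temporal features
```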
4.1.9. Cross-Integration Overlaps
Compact CNNs [12,93] also process raw EEG inputs (Section 4.1.1); we classify them as compact designs (Section 4.1.6) to emphasize their efficiency contributions rather than input modalities. Other compact CNNs [94,95,96] also adopt spectral representations (Section 4.1.2). We also categorize these as compact designs (Section 4.1.6), since architectural parsimony is their primary focus. Some methods [78,79] combine frequency decomposition (Section 4.1.2) with topological electrode mapping. We discuss them in Section 4.1.3, since spatial encoding is their focus. Temporal CNNs [88,90,92] are also hybrid architectures that incorporate sequential models. We include them in Section 4.1.5 to emphasize their CNN-specific contributions, while broader hybridization is covered in Section 4.1.8. Benchmark studies [101,102] include compact and deeper architectures. We discuss them under benchmarking (Section 4.1.7) to highlight comparative findings, rather than compactness (Section 4.1.6). Table 2 below presents a chronological overview of the studies.
Table 2.
CNN-based architectures for EEG classification.
| Ref. | Author/Year | Model | Protocol | Samples | Channels | Inspiration Basis | Acc. |
|---|---|---|---|---|---|---|---|
| [70] | Ma et al. 2015 | CNN-based | Resting | 10 | 64 | CNN | 88.00 |
| [71] | Mao et al. 2017 | CNN-based | VAT | 100 | 64 | CNN | 97.00 |
| [75] | Gonzalez et al. 2017 | ES1D | AVEPs | 23 | 16 | 1D CNN [Inception] | 94.01 |
| [82] | Das et al. 2017 | CNN-based | MI, VEPs | 40 | 17 | CNN | 98.80 |
| [83] | Cecotti et al. 2017 | CNN1-6 | RSVP | 16 | 64 | CNN | 83.10–90.50 |
| [11] | Schirrmeister et al. 2017 | ConvNets | MI | 9/54 | 22/54/3 | ResNet, Deep/Shallow ConvNet | 81.00–85.20 |
| [87] | Bai et al. 2018 | TCN | – | – | – | CNN | 97.20–99.00 |
| [12] | Lawhern et al. 2018 | EEGNet | ERPs, ERN, SMR, MRCP | 15/26/13/9 | 64/56/22 | CNN | 0.91 [AUC] |
| [104] | Wu et al. 2018 | CNN-based | RSVP | 10/15 | 16 | CNN | 97.60 |
| [72] | Schons et al. 2018 | CNN-based | Resting | 109 | 64 | CNN | 99.00 |
| [73] | Di et al. 2018 | CNN-based | ERPs | 33 | 64 | CNN | 99.30–99.90 |
| [74] | Zhang et al. 2018 | HDNN/CNN4EEG | RSVP | 15 | 64 | CNN | 89.00 |
| [76] | Waytowich et al. 2018 | Compact EEGNet | SSVEP | 10 | 8 | EEGNet | 80.00 |
| [78] | Lai et al. 2019 | CNN-based | Resting | 10 | 64 | CNN | 83.21/79.08 |
| [85] | Chen et al. 2019 | GSLT-CNN | ERPs, RSVP | 10/32/157 | 28/64 | CNN | 97.06 |
| [79] | Wang et al. 2019 | CNN-based | SSVEP | 10 | 8 | CNN | 99.73 |
| [77] | Yu et al. 2019 | M-Shallow ConvNet | SSVEP | 8 | 9 | CNN | 96.78 |
| [84] | Cecotti et al. 2019 | 1/2/3D CNN | RSVP | 16 | 64 | CNN | 92.80 |
| [80] | Wang et al. 2019 | CNN-based | Resting | 109/59 | 64/46 | Graph CNN | 99.98/98.96 |
| [105] | Özdenizci et al. 2019 | Adversarial CNN | RSVP | 3/10 | 16 | CNN [Adversarial] | 98.60 |
| [93] | Salimi et al. 2020 | N-Back-EEGNet | N-back | 26 | 28 | EEGNet | 95.00 |
| [94] | Ingolfsson et al. 2020 | EEG-TCNet | MI | 9 | 22 | EEGNet-TCN | 77.35–97.44 |
| [88] | Riyad et al. 2020 | Incep-EEGNet | MI | 9 | 22 | Inception-EEGNet | 74.08 |
| [89] | Liu et al. 2020 | PSTSA-CNN | MI | 9/14 | 22/44 | CNN-S Attention | 74.07–97.68 |
| [95] | Kasim et al. 2021 | 1DCNN | Photic stimuli | 16 | 16 | CNN | 97.17 |
| [90] | Zhu et al. 2021 | RAMST-CNN | MI | 109 | 64 | CNN [ResNet] | 96.49 |
| [106] | Musallam et al., 2021 | TCNet-Fusion | MI | 9/14 | 22/44 | EEG-TCNet | 83.73–94.41 |
| [107] | Mane et al. 2021 | FBCNet | MI | 9/54/37/34 | 22/20/27 | CNN | 74.70–81.11 |
| [86] | Salami et al. 2022 | EEG-ITNet | MI | 54/9 | 20/22 | Inception-TCN | 76.19/78.74 |
| [81] | Zhang et al. 2022 | 3D CNN | VEPs | 70 | 16 | CNN | 82.33 |
| [99] | Bidgoly et al. 2022 | CNN-based | Resting | 109 | 64/32/3 | CNN | 98.04 |
| [96] | Wu et al. 2022 | Mixed-FBCNet | MI | 109/9/10 | 64/22/10 | FBCNet | 98.89–99.48 |
| [97] | Altuwaijri et al. 2022 | MBEEG-SE | MI | 9 | 22 | EEGNet-S Attention | 82.87–96.15 |
| [98] | Autthasan et al. 2022 | MIN2Net | MI | 9/14/54 | 20/15 | AE [CNN] | 72.03/68.81 |
| [92] | Ding et al. 2023 | TSception | EMO | 32/27 | 32/32 | GoogleNet | 61.27/63.75 |
| [100] | Alsumari et al. 2023 | CNN-based | Resting | 109 | 3 | CNN | 99.05 |
| [101] | Yap et al. 2023 | GoogleNet, ResNet, EfficientNet, DenseNet, Inception | ERPs | 30 | 14 | CNN | 80.00 |
| [102] | Chen et al. 2024 | EEGNeX | ERPs, MI, SMR, ERN | 1/54/6/26 | 14/20/22/56 | EEGNet | 78.81–93.81 |
| [103] | Shakir et al. 2024 | STFE/MTFE-R-CNN | MI | 109 | 64 | CNN | 89.00/95.00 |
| [91] | Lakhan et al. 2025 | EEG-BBNet | MI, ERPs, SSVEP | 54 | 62/14/8 | CNN–Graph CNN | 99.26 |
Acc.: Accuracy, AE: Autoencoder, Adversarial: Adversarial CNN, AVEP: Auditory Visual Evoked Potential, Deep/Shallow ConvNet: Deep/Shallow Convolutional Network, EMO: Emotion Protocol, ERN: Error-Related Negativity, ERP: Event-Related Potential, Inception: Inception Variant, MI: Motor Imagery, MRCP: Movement-Related Cortical Potential, MHA: Multihead Attention, N-Back Memory: N-Back Memory Task, Photic Stimuli: Light Stimulation Protocol, ResNet: Residual Neural Network, Resting: Resting State Protocol, RSVP: Rapid Serial Visual Presentation, S Attention: Self-Attention, SMR: Sensory Motor Rhythm, SSVEP: Steady-State Visual Evoked Potential, TN: Tensor Network, T Encoder: Transformer Encoder, VAT: Visual Attention Task, VEPs: Visual Evoked Potential, Y [X]: Indicates that model Y has components from model X.
4.2. Efficiency in Transformer-Based Models
Transformers provide a flexible and scalable framework for EEG FE (Table 3). By leveraging SA, they capture complex temporal dynamics, cross-channel spatial relationships, and long-range dependencies. Innovations such as generative pretraining, modular designs, and ensemble strategies enhance their robustness, discriminative power, and computational efficiency. Their ability to process raw signals, integrate multiple feature domains, and adapt across tasks has established them as a versatile tool in EEG analysis.
4.2.1. Applications to Raw EEG
Several studies have focused on Transformers that directly process raw EEG to eliminate manual feature engineering while achieving high Acc. and efficiency. Arjun et al. [38] adapted Vision Transformer (ViT) [34] for EEG ER, comparing CWT-based scalograms with raw multichannel signals. The model used multihead self-attention (MHSA), with the raw-input ViT treating each time window as a patch to capture transient emotional patterns. The raw model outperformed the CWT version, achieving 99% Acc. on DEAP. Shorter windows (6 s) boosted the performance by emphasizing local dynamics and increasing the sample size. A design with six layers, 512-D embeddings, and eight heads made the ViT two to three times smaller than its NLP counterparts. The study showed that end-to-end attention-based learning can surpass CNN/LSTM baselines for applied use. Siddhad et al. [41] applied a pure Transformer model with four stacked Transformer encoders and MHSA. Positional encoding integrated temporal order and channel information to enable the joint learning of spatiotemporal patterns. The model size was tuned per dataset to reduce overfitting and the Comp. Cost, simplify preprocessing, and improve efficiency. This approach removed the need for features like PSD or entropy. It achieved >95% Acc. for mental workload and >87% for age/gender classification, matching state-of-the-art results.
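To make the patch-based treatment of raw EEG concrete, the sketch below shows one way a ViT-style encoder can turn fixed-length time windows of a multichannel segment into tokens for multihead self-attention. It is a minimal PyTorch illustration with arbitrary layer sizes; the class name `RawEEGViT`, the 1 s patch length, and the toy dimensions are our assumptions, not the architectures of [38] or [41].

```python
import torch
import torch.nn as nn

class RawEEGViT(nn.Module):
    """Illustrative ViT-style encoder for raw EEG: time windows become patch tokens."""
    def __init__(self, n_channels=32, patch_len=128, d_model=512, n_heads=8,
                 n_layers=6, n_classes=2):
        super().__init__()
        # Each patch is one time window across all channels, flattened and embedded.
        self.patch_embed = nn.Linear(n_channels * patch_len, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 64, d_model))  # supports up to 64 patches
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                                   dim_feedforward=4 * d_model,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)
        self.patch_len = patch_len

    def forward(self, x):                      # x: (batch, channels, time)
        b, c, t = x.shape
        n_patches = t // self.patch_len
        # Split the time axis into non-overlapping windows and flatten each window.
        patches = x[:, :, :n_patches * self.patch_len]
        patches = patches.reshape(b, c, n_patches, self.patch_len)
        patches = patches.permute(0, 2, 1, 3).reshape(b, n_patches, -1)
        tokens = self.patch_embed(patches)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed[:, :n_patches + 1]
        z = self.encoder(tokens)
        return self.head(z[:, 0])              # classify from the [CLS] token

# Example: a 32-channel, 6 s segment sampled at 128 Hz -> 6 patches of 1 s each.
logits = RawEEGViT()(torch.randn(4, 32, 768))
```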
4.2.2. Generative and Self-Supervised Foundation Models
Transformers have also been used as foundational models by using self-supervised or generative learning to extract robust representations, synthesize EEG signals, and support cross-task generalization. Dosovitskiy et al. [34] introduced ViT. It replaces convolutional feature extractors by dividing images into fixed-size patches, embedding them, and applying MHSA. This design removes CNNs’ inductive biases and enables flexible and generalizable feature learning. ViT models (Base, Large, Huge) were pretrained on large datasets like ImageNet-21k and JFT-300M. MHA encodes diverse interactions in parallel, and pretrained representations allow high Acc. on smaller datasets. ViT shows that pretrained Transformers can match or surpass CNNs in FE while offering faster, more adaptable performance. This approach inspired EEG-specific Transformers like EEGPT [46]. Omair et al. [44] developed the Generative EEG Transformer (GET), a GPT-style [31] model that learns long-range EEG features through self-supervised signal generation. By predicting future EEG samples, GET forces attention layers to extract oscillations and distant dependencies. Pretraining on diverse EEG datasets rendered GET a foundational model. This improved subsequent tasks like epilepsy detection and BCI control while generating robust synthetic EEG for data augmentation, surpassing generative adversarial network (GAN) methods. Once pretrained, GET reduces the reliance on manual preprocessing and customized design. Lim et al. [45] proposed EEGTrans, a two-stage generative Transformer that integrates a vector-quantized autoencoder (VQ-VAE) with a Transformer decoder for EEG synthesis. The VQ-VAE compresses EEG into discrete latent tokens while filtering noise. The Transformer decoder models these tokens autoregressively, reconstructing realistic EEG with preserved spectral characteristics. This efficient separation of local and global modeling reduces the sequence length and allows a focus on informative features. EEGTrans resulted in realistic data augmentation, surpassing GAN-based methods. It also demonstrated strong cross-dataset generalization and robust unsupervised pretraining. Wang et al. [46] introduced EEGPT, a Transformer pretrained with masked self-supervised learning. EEG signals were segmented along time and channels with large masked portions (50% of time patches, 80% of channels). Then, an encoder was trained to reconstruct missing data while aligning latent embeddings via a momentum encoder. This strategy forced EEGPT to learn spatiotemporal dependencies to produce robust representations. With 10 M parameters and minimal fine-tuning, EEGPT achieved high performance across diverse EEG tasks. The masking strategy also doubled as data augmentation and improved generalization while avoiding task-specific model design. These models provide scalable, reusable EEG representations that reduce fine-tuning costs.
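The masked-reconstruction idea behind such foundation models can be summarized in a few lines. The sketch below applies random time and channel masking (using the 50%/80% ratios described for EEGPT) and computes a reconstruction loss on the masked positions only; the tiny autoencoder and all names are illustrative placeholders, not the EEGPT architecture.

```python
import torch
import torch.nn as nn

def mask_eeg(x, time_ratio=0.5, chan_ratio=0.8):
    """Randomly zero out time steps and channels, returning the mask for the loss.
    The ratios follow the masking scheme described for EEGPT; the rest is illustrative."""
    b, c, t = x.shape
    time_mask = torch.rand(b, 1, t) < time_ratio      # True = masked
    chan_mask = torch.rand(b, c, 1) < chan_ratio
    mask = time_mask | chan_mask                      # broadcasts to (b, c, t)
    return x.masked_fill(mask, 0.0), mask

class TinyMaskedAutoencoder(nn.Module):
    """Toy encoder-decoder that reconstructs masked EEG values (one token per time step)."""
    def __init__(self, n_channels=58, d_model=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_channels, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))
        self.decoder = nn.Linear(d_model, n_channels)

    def forward(self, x):                             # x: (batch, channels, time)
        z = self.encoder(x.transpose(1, 2))           # tokens = time steps
        return self.decoder(z).transpose(1, 2)

x = torch.randn(8, 58, 256)                           # a batch of raw EEG segments
x_masked, mask = mask_eeg(x)
model = TinyMaskedAutoencoder()
recon = model(x_masked)
# The reconstruction loss is computed only on the masked positions.
loss = ((recon - x)[mask] ** 2).mean()
loss.backward()
```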
4.2.3. Modular and Dual-Branch Spatiotemporal Transformers
Some approaches model spatial and temporal EEG features using attention to improve the interpretability, discriminative power, and efficiency for tasks like ER and MI. Song et al. [39] proposed S3T, a two-branch Transformer for EEG MI that separates spatial and temporal FE. A spatial filtering module (CSP) improves the signal quality, followed by an encoder to capture dependencies between brain regions via SA. A Transformer then focuses on sequential time steps to extract evoked patterns. With only three MHA layers, this modular design directs each encoder to a specific signal aspect to improve efficiency and reduce overfitting. Trained end-to-end on BCI Competition IV-2a, S3T achieved 82–84% Acc. Its shallow design and integrated spatial filtering enhanced its generalization without large datasets. Du et al. [42] proposed ETST, a dual-encoder Transformer that separately models EEG features for person identification. The Temporal Transformer Encoder (TTE) treats each time point as a token to capture correlations, rhythms, and ERPs, while the Spatial Transformer Encoder (STE) treats each channel as a token to model connectivity. Preprocessing with bandpass filtering, artifact removal, and z-score normalization reduced noise. Trained without pretraining or augmentation, ETST achieved high Acc. The results also showed that aligning modular attention with the EEG structure in a compact model yields rich discriminative representations. Hu et al. [108] presented HASTF, a hybrid spatiotemporal attention network for EEG-based ER. EEG was filtered into five frequency bands, and differential entropy (DE) features were extracted from 1 s windows. These features were arranged into 3D patches to reflect the scalp topology. Spatial features were extracted by the Spatial Attention Feature Extractor (SAFE), which combines a U-shaped convolutional fusion module, skip connections, and parameter-free spatial attention. The Temporal Attention Feature Extractor (TAFE) applied positional embeddings and SA to model temporal dynamics. HASTF achieved high Acc. (>99%). Ablation studies showed temporal attention as the most impactful component. Muna et al. [109] introduced SSTAF, or the Spatial–Spectral–Temporal Attention Fusion Transformer, for upper-limb MI classification. EEG undergoes bandpass (8–30 Hz) and notch filtering, common average referencing, segmentation, and normalization. A Short-Time Fourier Transform (STFT) produces 4D time–frequency features. These features were processed by spectral and spatial attention modules and an encoder to model temporal and channel interactions. SSTAF achieved 76.83% on EEGMMIDB and 68.30% on BCI Competition IV-2a, outperforming prior CNN and Transformer models. Ablation studies highlighted the role of the encoder in capturing temporal dynamics and attention modules in improving the Acc. Wei et al. [110] further refined dual-encoder modeling for EEG ER by separating temporal and spatial dependencies. A time step attention encoder extracted sequence-level features per channel, while a channel attention encoder captured interchannel relationships. A weighted fusion module combined these representations to improve the discriminative power and eliminate redundant computation. The approach achieved 95.73% intrasubject and 87.38% intersubject Acc.
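A minimal sketch of the dual-encoder idea follows: one branch tokenizes time points (all channels per token) and the other tokenizes channels (all time points per token), and their pooled outputs are fused for classification. Dimensions and names are our assumptions, intended only to illustrate the decomposition, not to reproduce S3T or ETST.

```python
import torch
import torch.nn as nn

class DualBranchEEGEncoder(nn.Module):
    """Illustrative dual encoder: one branch attends over time steps, the other over channels."""
    def __init__(self, n_channels=64, n_times=128, d_model=64, n_heads=4, n_classes=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads,
                                                        dim_feedforward=2 * d_model,
                                                        batch_first=True)
        self.time_proj = nn.Linear(n_channels, d_model)   # token = one time point (all channels)
        self.chan_proj = nn.Linear(n_times, d_model)      # token = one channel (all time points)
        self.temporal_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.spatial_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, x):                                 # x: (batch, channels, time)
        t_tokens = self.temporal_encoder(self.time_proj(x.transpose(1, 2)))  # (b, time, d)
        s_tokens = self.spatial_encoder(self.chan_proj(x))                   # (b, chan, d)
        fused = torch.cat([t_tokens.mean(dim=1), s_tokens.mean(dim=1)], dim=-1)
        return self.head(fused)

logits = DualBranchEEGEncoder()(torch.randn(4, 64, 128))  # 4 trials, 64 channels, 128 samples
```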
4.2.4. Ensemble and Multidomain Transformers
Ensembles and specialized Transformers capture spectral and temporal patterns to enhance robustness and overall performance. Zeynali et al. [43] proposed a Transformer ensemble for EEG classification, where each model focuses on a different feature domain. A temporal Transformer captures waveform dynamics and short-term dependencies from raw EEG. A spectral Transformer processes PSD inputs to extract frequency-domain features and cross-frequency relationships. Their combination produced rich representations that led to strong results (F1 98.9% for cognitive workload). PSD preprocessing and targeted attention reduced noise and emphasized discriminative patterns. This fusion increased the computation, but the lightweight models mitigated overfitting, sped up convergence, and generalized across tasks without heavy pretraining. Ghous et al. [111] proposed a Transformer-based model for ER. Training proceeded in two stages: attention-enhanced base model development (AE-BMD) on SEED-IV and cross-dataset fine-tuning adaptation (CD-FTA) on SEED-V and MPED for generalization. Preprocessing was performed using Kalman and Savitzky–Golay (SG) filtering. FE included mel-frequency cepstral coefficients (MFCCs), gammatone frequency cepstral coefficients (GFCCs), PSD, DE, Hjorth parameters, band power, and entropy measures. Class imbalance was addressed using the Synthetic Minority Oversampling Technique (SMOTE). Spectral and temporal attention, positional encoding, MHA, and RNN/MLP-RNN layers captured precise patterns. The model achieved Acc. of 84% (SEED-IV), 90% (SEED-V), and 79% (MPED).
4.2.5. Specialized Attention Mechanisms and Dual Architectures
Mechanisms like gating, capsules, or regularization stabilize training, capture long-term dependencies, and refine EEG feature representations. Tao et al. [40] proposed a gated Transformer (GT) for EEG decoding. They integrated GRU/LSTM-inspired gating mechanisms into SA blocks to preserve the relevant EEG context, suppress noise, and stabilize FE over long sequences. The gating provides implicit regularization that improves convergence and maintains session stability. Trained from scratch without pretraining or augmentation, the model efficiently learns discriminative features for tasks such as continuous EEG decoding or ERP analysis. Wei et al. [112] introduced TC-Net, a Transformer capsule network, for EEG-based ER. They segmented EEG signals into non-overlapping windows and then processed them via a temporal Transformer module. A novel EEG PatchMerging strategy was used to balance global and local representations. Features were refined using an emotion capsule module that captured interchannel relationships before classification. TC-Net achieved strong performance on the DEAP and DREAMER datasets. It combines global context modeling, localized feature merging, and capsule refinement to improve the discriminative power, reduce redundancy, and enable efficient, robust recognition.
4.2.6. Cross-Integration Overlaps
Transformer-based EEG models have conceptual overlaps, where raw, generative, modular, ensemble, and hybrid approaches aim to improve efficiency, generalization, and scalability. All models [34,38,39,40,41,42,43,44,45,46,108,109,110,111,112] use SA mechanisms for FE. However, they differ in how they exploit it for efficiency. We classify the system in [46] under foundation models (Section 4.2.2) since its primary focus is masked self-supervised pretraining, despite combining it with spatiotemporal decomposition principles (Section 4.2.3) through time-channel segmentation. EEGTrans [45] incorporates a generative reconstruction mechanism within a two-stage architecture like modular Transformers (Section 4.2.3), but we place it under generative models (Section 4.2.2) since unsupervised representation learning is its central innovation. Models like those in [39,42] share parallel attention modules (Section 4.2.4) and dual-encoder hierarchies (Section 4.2.5). However, we place them under modular spatiotemporal designs (Section 4.2.3) to emphasize their efficient decomposition rather than dual fusion. Moreover, Refs. [43,111] utilize spectral and temporal attention (Section 4.2.3) within ensemble frameworks, but we group these models into multidomain ensembles (Section 4.2.4) because they focus on cross-domain integration. The works in [40,112] employ gating and capsule mechanisms, which can be considered as extensions of modular (Section 4.2.3) or generative Transformers (Section 4.2.2), but we classify them as specialized and hybrid attention mechanisms (Section 4.2.5) to highlight their focus on efficiency and model stability, rather than representational design.
Table 3.
Transformer-based architectures for EEG classification.
| Ref. | Author/Year | Model | Protocol | Samples | Channels | Inspiration Basis | Acc. |
|---|---|---|---|---|---|---|---|
| [38] | Arjun et al. 2021 | ViT-CWT, ViT-Raw EEG | EMO-VAT | 32 | 32 | T Encoder (ViT) | 97.00/95.75 99.40/99.10 |
| [34] | Dosovitskiy et al. 2021 | ViT (Base/Large/Huge) | - | - | - | T Encoder (ViT) | 77.63–94.55 |
| [39] | Song et al. 2021 | S3T | MI | 9/9 | 22/3 | T Encoder [CNN] | 82.59/84.26 |
| [40] | Tao et al. 2021 | Gated Transformer | MI, VAT | 109/6 | 64/128 | T Encoder [GRU] | 61.11/55.4 |
| [42] | Du et al. 2022 | ETST | Resting | 109 | 64 | T Encoder | 97.29–97.90 |
| [43] | Zeynali et al. 2023 | Ensemble Transformer | VEPs | 8 | 64 | T Encoder | 96.10 |
| [112] | Wei et al. 2023 | TC-Net | EMO, AVP | 32/23 | 48/15 | T Encoder [CapsNet, ViT] | 98.59–98.82 |
| [41] | Siddhad et al. 2024 | Transformer-based | Resting | 60/48 | 14 | T Encoder | 95.28 |
| [44] | Omair et al. 2024 | GET | MI/Alpha EEGs | 9/20 | 3/16 | Transformer | 85.00 |
| [46] | Wang et al. 2024 | EEGPT | ERPs, MI, SSVEP, EMO | 9–2383 | 58/3–128 | Transformer [BERT, ViT] | 58.46–80.59 |
| [108] | Hu et al. 2024 | HASTF | EMO | 32/15 | 32/62 | Transformer [BERT] | 98.93/99.12 |
| [110] | Wei et al. 2025 | Fusion Transformer | EMO | - | 62 | T Encoder | 87.38/95.73 |
| [45] | Lim et al. 2025 | EEGTrans | MI | 1/7/9/14 | 3/22/59/128 | Transformer | 80.69–90.84 |
| [111] | Ghous et al. 2025 | AE-BMD, CD-FTA | EMO | 15/20/23 | 62 | T Encoder [RNN, MLP] | 79.00–95.00 |
| [109] | Muna et al. 2025 | SSTAF | MI | 103/9 | 64/22 | Transformer | 68.30/76.83 |
Acc.: Accuracy, AVP: Audiovisual Evoked Potential, BERT: Bidirectional Encoder Representations from Transformers, CapsNet: Capsule Network, EMO: Emotion Protocol, EMO-VAT: Emotion with Visual Attention Task, ERP: Event-Related Potential, GRU: Gated Recurrent Unit, MI: Motor Imagery, MI/Alpha EEG: Motor Imagery with Alpha EEG, MLP: Multilayer Perceptron, Resting: Resting Protocol, RNN: Recurrent Neural Network, SSVEP: Steady-State Visual Evoked Potential, T Encoder: Transformer Encoder, Transformer: Transformer Model, VAT: Visual Attention Task, ViT: Vision Transformer, VEP: Visual Evoked Potential.
4.3. Efficiency in CNN–Transformer-Based Hybrids
Recent advances in EEG decoding have combined Transformer-based architectures with CNNs or TCNs to capture local and global features efficiently (Table 4). Key ideas have included hybrid CNN–Transformer designs, multibranch networks for parallel FE, temporal convolution modules, and self-supervised pretraining for richer feature maps. Many models have also used spatiotemporal attention and hierarchical encoding like patching or multistage FE to improve the Acc. and efficiency. The following sections review these models.
4.3.1. Sequential Pipelines
Sequential pipelines apply CNNs first to extract localized spatiotemporal representations. These features are then fed into Transformer encoders, which extract long-range relationships. These designs preserve the hierarchical nature of EEG while improving efficiency through convolutional prefiltering. Sun et al. [113] combined CNN-based spectral and temporal filtering with Transformer attention. Although detailed architectural specifications were not disclosed, the hybrid model demonstrated significant Acc. gains over CNN or RNN baselines. Omair et al. [114] proposed ConTraNet, a hybrid CNN–Transformer architecture for both EEG and electromyography (EMG). CNN layers extracted local patterns, and the Transformer modeled long-range dependencies. Designed to generalize across modalities and tasks, ConTraNet achieved top performance on 2–10-class datasets. With limited data, CNN filtering improved efficiency by reducing overfitting and helping the Transformer to focus on key global patterns. Wan et al. [115] developed EEGFormer. They employed a depthwise 1D CNN frontend to extract channel-wise features before feeding them into a Transformer encoder for spatial and temporal SA. The depthwise CNN reduced the parameters, allowing end-to-end training on raw EEG. EEGFormer achieved high performance across multiple tasks (emotion, depression, SSVEP). Ma et al. [116] proposed a hybrid CNN–Transformer for MI classification. Preprocessing included 4–40 Hz bandpass filtering, z-score normalization, and one versus rest (OVR)-CSP. A two-layer CNN extracted local spatiotemporal features, complemented by MHA across channels and frequency bands. The model achieved 83.91% Acc. on BCI-IV 2a. It outperformed CNN-only (58.10%) and Transformer-only (46.68%) baselines, showing efficient joint local–global FE. Zhao et al. [117] introduced CTNet, a CNN–Transformer hybrid for MI EEG. CNN layers extracted spatial and temporal features, followed by Transformer attention across channels and time. CTNet showed improvements of 2–3% over prior hybrids (82.5% BCI-IV 2a, 88.5% BCI-IV 2b). Liu et al. [118] developed ERTNet, an interpretable CNN–Transformer framework for emotion EEG. Temporal convolutions isolated important frequency bands, spatial depthwise convolutions captured the channel topology, and a Transformer fused abstract spatiotemporal features. Achieving 74% on DEAP and 67% on SEED-V, the model provided attention-based interpretability and reduced feature dimensionality. These models reduce overfitting and memory use and perform well in MI and ER.
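The sequential pattern can be illustrated with a short PyTorch sketch: a temporal convolution plus a depthwise spatial convolution compress raw EEG into feature tokens, which a shallow Transformer encoder then relates globally. Kernel sizes, pooling, and dimensions are arbitrary choices of ours, in the spirit of these pipelines rather than any specific reviewed model.

```python
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    """Illustrative sequential hybrid: a small convolutional frontend compresses raw EEG
    into feature tokens, which a Transformer encoder then relates globally."""
    def __init__(self, n_channels=22, d_model=64, n_heads=4, n_classes=4):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=(1, 25), padding=(0, 12)),   # temporal filters
            nn.Conv2d(d_model, d_model, kernel_size=(n_channels, 1),
                      groups=d_model),                                     # depthwise spatial filters
            nn.BatchNorm2d(d_model), nn.ELU(),
            nn.AvgPool2d(kernel_size=(1, 8)),                              # downsample the time axis
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=2 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                       # x: (batch, channels, time)
        f = self.frontend(x.unsqueeze(1))       # (batch, d_model, 1, time/8)
        tokens = f.squeeze(2).transpose(1, 2)   # one token per pooled time step
        z = self.encoder(tokens)
        return self.head(z.mean(dim=1))

logits = ConvThenTransformer()(torch.randn(4, 22, 1000))   # e.g., 4 s of EEG at 250 Hz
```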
4.3.2. Parallel and Multibranch Blocks
These hybrids process different feature types (temporal, spatial, or spectral) in dedicated branches before global fusion via attention or Transformer blocks. This allows simultaneous multiscale analysis with reduced redundancy. Li et al. [119] introduced a dual-branch CNN where one branch learned spatiotemporal features from raw EEG and the other processed scalograms. These features were fused and passed to a Transformer. This network achieved 96.7% Acc. on SEED and outperformed over ten previous methods. Xie et al. [120] introduced CTrans, a CNN–Transformer hybrid with dedicated spatial and temporal attention branches and optimized positional encodings. Certain variants embed EEG signals with CNNs before applying SA across channels or time. By separating attention dimensions, the model reduced the complexity, achieved 83% Acc. on two-class MI, and demonstrated efficient intersubject generalization. Si et al. [121] proposed TBEM, an ensemble combining a pure CNN and a CNN–Transformer hybrid. The CNN captured robust local features, and the hybrid model extracted global attention-based representations. Each block in the ensemble is lightweight, which enhances feature reliability. Averaging predictions improved Acc. and generalization, winning an IEEE EEG-ER competition. Yao et al. [122] introduced EEG ST-TCNN, a parallel spatiotemporal Transformer–CNN network. Separate Transformers modeled channels and time, with CNN fusion combining the outputs. This separation reduced Complex., and it achieved 96–96.6% Acc. on SEED/DEAP. Lu et al. [123] developed CIT-EmotionNet, a CNN–interactive Transformer for emotion EEG. Raw signals were converted to spatial–frequency maps; a parallel CNN extracted local features; and a Transformer captured global dependencies. Iterative interaction between the CNN and Transformer feature maps fused global and local information efficiently. The model achieved 98.57% on SEED and 92.09% on SEED-IV using a compact 10-layer network. These models improve performance due to their modular design and shared parameters, but the fusion stage slightly increases the computational overhead.
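A minimal sketch of the parallel pattern is shown below, assuming two convolutional branches whose pooled features are fused by a one-layer Transformer before classification. All sizes and names are illustrative assumptions and do not correspond to any of the reviewed architectures.

```python
import torch
import torch.nn as nn

class ParallelBranchFusion(nn.Module):
    """Illustrative parallel design: a temporal-conv branch and a channel-mixing branch
    process the same EEG segment, and a small Transformer fuses their pooled feature tokens."""
    def __init__(self, n_channels=62, d_model=32, n_classes=3):
        super().__init__()
        self.temporal_branch = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=64, stride=4), nn.ELU(),
            nn.AdaptiveAvgPool1d(1))
        self.spatial_branch = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=1), nn.ELU(),   # 1x1 conv mixes channels only
            nn.AdaptiveAvgPool1d(1))
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                                  dim_feedforward=2 * d_model, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                  # x: (batch, channels, time)
        t = self.temporal_branch(x).squeeze(-1)            # (batch, d_model)
        s = self.spatial_branch(x).squeeze(-1)
        tokens = torch.stack([t, s], dim=1)                # two feature tokens, one per branch
        fused = self.fusion(tokens).mean(dim=1)            # attention-based fusion
        return self.head(fused)

logits = ParallelBranchFusion()(torch.randn(4, 62, 800))
```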
4.3.3. Integrated and Hierarchical Attention
These techniques integrate CNN and Transformer components within attention modules. Hierarchical attention captures spatial, spectral, and temporal dependencies across scales, yielding rich multilevel representations. Bagchi et al. [124] designed a ConvTransformer for single-trial visual EEG classification, integrating temporal CNN filters and MHSA within a single hybrid block. This design avoids the need for deep networks by capturing local temporal patterns and global cross-channel relationships simultaneously. The model outperformed prior CNN architectures on five visual tasks, highlighting its efficiency in extracting informative EEG features. Song et al. [125] proposed EEG Conformer, a compact CNN–Transformer hybrid inspired by audio conformers. Shallow CNNs extract local spatiotemporal features, and six Transformer layers capture the global context. The model achieved strong MI decoding performance while providing interpretable attention maps. Si et al. [126] developed MACTN, a hierarchical CNN–Transformer for ER. A temporal CNN captures local patterns, and sparse attention learns temporal dependencies, while channel attention highlights key electrodes. This mixed attention strategy achieved top Acc. on DEAP datasets and won the 2022 BCI ER Challenge. Gong et al. [127] designed ACTNN, a CNN–Transformer hybrid for multivariate emotion EEG. Spatial and spectral attention emphasize relevant channels and frequency bands, convolutional layers extract local features, and a Transformer encodes the temporal context. ACTNN efficiently focused on emotion-relevant brain regions while providing interpretable attention visualizations. It achieved 98.47% SEED and 91.90% SEED-IV Acc. Such models yield high Acc. but at a moderate cost in complexity compared to sequential hybrids (Section 4.3.1).
4.3.4. TCN-Enhanced Models for Lightweight Temporal Modeling
TCNs complement attention mechanisms by efficiently modeling long-range temporal dependencies with dilated convolutions that reduce the Transformer depth and improve latency. Altaheri et al. [128] proposed ATCNet, integrating MHSA with a TCN [87] and CNN spatial filters for MI EEG. Dilated depthwise separable convolutions in the TCN efficiently modeled temporal patterns, while attention focused on discriminative features. The model achieved 84–87% Acc. on BCI-IV 2a datasets with low latency. This shows that shallow attention blocks with lightweight TCN and CNN modules yield high-performance EEG decoding. Nguyen et al. [129] implemented EEG-TCNTransformer by combining a TCN [87] with a Transformer for MI EEG. Dilated convolutions in the TCN capture long-range patterns, while a Transformer models residual global dependencies. The model achieved 83.41% on BCI-IV 2a without bandpass filtering, learning optimal frequency filters internally with a low Comp. Cost. Cheng et al. [130] proposed MSDCGTNet, a fully integrated emotion EEG framework combining multiscale dynamic 1D CNNs, a GT encoder, and a TCN [87]. The CNN extracted spatial–spectral features directly from raw signals, avoiding spectrogram transformations. The GT used MHA with a GLU to capture global dependencies, while the TCN modeled the temporal context via dilated causal convolutions. The model achieved 98–99.7% across DEAP, SEED, and SEED-IV, maintaining low processing times and parameters per sample. These efficient models can maintain high Acc. without much computing power, even when using shallow attention mechanisms.
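The efficiency of TCNs comes from dilated causal convolutions, which the following sketch illustrates: each block left-pads its input so that outputs depend only on past samples, and stacking blocks with dilations 1, 2, 4, 8 grows the receptive field exponentially at a roughly constant parameter cost. The block structure is a simplified, assumed variant in the spirit of [87], not its exact residual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    """One TCN-style block: a dilated causal convolution with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-pad so outputs never see the future
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ELU()

    def forward(self, x):                                  # x: (batch, channels, time)
        y = self.conv(F.pad(x, (self.pad, 0)))             # pad only on the left (causal)
        return self.act(y) + x                             # residual connection

# Four blocks with dilations 1, 2, 4, 8; output length equals input length.
tcn = nn.Sequential(*[CausalDilatedBlock(16, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(4, 16, 500))
```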
4.3.5. Pretrained and Self-Supervised Transformers
To learn EEG-transferable representations across tasks, datasets, and modalities, the following approaches have been used. Kostas et al. [131] developed BENDR, a self-supervised EEG pretraining framework. The hierarchical encoder had multiple 1D convolutional blocks with grouped convolutions, compressing raw EEG into latent representations. A Transformer encoder with MHSA operated on these sequences. Pretraining via contrastive self-supervision enabled cross-task generalization. Convolutional position encodings reduced the memory requirements, allowing a 10-M-parameter Transformer to train on limited data. Yang et al. [132] introduced ViT2EEG, fine-tuning a ViT [34] pretrained on ImageNet to process EEG represented as channel × time patches with a CNN embedding. Transfer learning from vision tasks improved the regression performance on EEG. This approach efficiently extracted spatiotemporal features without training a large Transformer from scratch, showing the value of visual priors for EEG. Jiang et al. [133] proposed LaBraM, a foundation Transformer model for cross-dataset EEG representation. EEG signals were segmented into channel-wise patches and tokenized via a vector quantized (VQ) neural codebook. A masked Transformer predicted missing patch codes, pretraining on 2500 h of multi-dataset EEG. LaBraM generalized across emotion, anomaly, and gait tasks. It also reduced the sequence length, enabling compact fine-tuning without retraining new models for each dataset. Li et al. [134] proposed a Multitask Learning Transformer (MTL-Transformer) with an auxiliary EEG reconstruction head. The model was trained to reconstruct EEG alongside the main task; this auxiliary objective enriched the learned representations and improved downstream tasks such as eye-tracking regression. These frameworks shift efficiency from model compression to data efficiency for better generalization with minimal fine-tuning. However, building them requires a large amount of upfront computing power.
4.3.6. Cross-Integration Overlaps
All these models [113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134] integrate convolutional and attention mechanisms for EEG feature extraction. Sequential pipeline models [115,116,118] also employ convolutional frontends that compress raw EEG and facilitate cross-task generalization, aligning them with those in Section 4.3.5. Architectures like those in [120,122,123] (Section 4.3.2) integrate hierarchical attention mechanisms (Section 4.3.3) within their fusion stages that improve selective spatiotemporal feature processing and reduce redundancy. TCN-enhanced models [128,130] (Section 4.3.4) incorporate hierarchical attention (Section 4.3.3). Ref. [128] uses multiscale attention to emphasize temporal features and reduce the Transformer depth. Ref. [130] combines multiscale CNNs with a GT encoder and attention, fusing spatial, spectral, and temporal features for richer feature integration. The models in [131,132,133,134] utilize convolutional or tokenization frontends like sequential pipelines (Section 4.3.1), supporting efficient attention computation and compact feature representation. These overlaps reduce computation and enhance feature efficiency and generalization.
Table 4.
Hybrid CNN–Transformer architectures for EEG classification.
| Ref. | Author/Year | Model | Protocol | Samples | Channels | Inspiration Basis | Acc. |
|---|---|---|---|---|---|---|---|
| [113] | Sun et al. 2021 | Fusion-CNN-Trans | MI | 109 | 64 | CNN–T Encoder | 87.80 |
| [131] | Kostas et al. 2021 | BENDR | Raw EEG | >10,000 | 20 | CNN–T Encoder | 86.70 |
| [124] | Bagchi et al. 2022 | EEG-ConvTransformer | VEPs | 10 | 128 | CNN–T Encoder (MHA) | 89.64 |
| [120] | Xie et al. 2022 | CTrans | MI | 109 | 64 | CNN–T Encoder | 83.31 |
| [128] | Altaheri et al. 2023 | ATCNet | MI | 9 | 22 | EEGNet–T Encoder (MHA)–TCN | 70.97–85.38 |
| [132] | Yang et al. 2023 | ViT2EEG | Raw EEG | 27 | 128 | EEGNet–T Encoder (ViT) | 55.40–61.70 |
| [119] | Li et al. 2023 | Dual-TSST | MI, EMO | 9/15 | 22/3/62 | CNN–T Encoder | 96.65 |
| [125] | Song et al. 2023 | EEG Conformer | MI | 9/9/15 | 22/3/62 | CNN–T Encoder (MHA) | 78.66/95.30 |
| [115] | Wan et al. 2023 | EEGFormer | SSVEP | 70/15/12 | 64/62/6 | CNN–T Encoder-CNN | 92.75 |
| [121] | Si et al. 2023 | TBEM | EMO | 80/6 | 30/30 | CNN–T Encoder-CNN (HybridNet–PureConvNet) | 42.50 |
| [127] | Gong et al. 2023 | ACTNN | EMO | 15 | 62 | CNN–T Encoder | 95.30 |
| [116] | Ma et al. 2023 | CNN-Transformer | MI | 9 | 22 | CNN–Transformer | 83.91 |
| [114] | Omair et al. 2024 | ConTraNet | MI | 9/105 | 3/64 | CNN–T Encoder | 86.98 |
| [126] | Si et al. 2024 | MACTN | EMO | 80/32 | 30/28 | CNN–T Encoder (MHA) | 67.80 |
| [117] | Zhao et al. 2024 | CTNet | MI | 9 | 22/3 | CNN–T Encoder | 83.11–97.81 |
| [133] | Jiang et al. 2024 | LaBraM | Resting, MI, Raw EEG | >140 | 19–64 | CNN–T Encoder | 82.58 |
| [118] | Liu et al. 2024 | ERTNet | EMO-AAT | 32/16 | 32/62 | CNN–T Encoder | 74.23 |
| [122] | Yao et al. 2024 | EEG ST-TCNN | EMO | 15/32 | 62/32 | T Encoder–CNN | 95.73–96.95 |
| [134] | Li et al. 2024 | MTL-Transformer1-2 | EEG, Eye tracking | 356 | 128 | ViT2EEG–CNN | - |
| [123] | Lu et al. 2024 | CIT-EmotionNet | EMO | 15 | 62 | ResNet II–T Encoder | 92.09/98.57 |
| [129] | Nguyen et al. 2024 | EEG-TCNTransformer | MI | 9 | 22 | EEG-TCNet–T Encoder (MHA) | 83.41 |
| [130] | Cheng et al. 2024 | MSDCGTNet | EMO | 32/15 | 32/62 | CNN–T Encoder–TCN | 98.85/99.67 |
Acc.: Accuracy, AAT: Auditory Attention Task, EEG: Electroencephalogram, EEGNet: EEG-Specific Neural Network, HybridNet: Hybrid Neural Network, MHA: Multihead Attention, MI: Motor Imagery, Raw EEG: Raw EEG Signals, ResNet: Residual Neural Network, SSVEP: Steady-State Visual Evoked Potential, TCN: Temporal Convolutional Network, Transformer: Standard Transformer Model, T Encoder: Transformer Encoder, ViT: Vision Transformer, VEP: Visual Evoked Potential, ViT2EEG: Vision Transformer Adapted to EEG.
4.4. Efficiency in Recurrent Deep Learning Models
Beyond CNN–Transformer hybrids, other DL architectures also excel in classifying EEG signals. Recurrent models are popular for their ability to process sequential EEG data. While effective, their computational efficiency has varied. Newer hybrid models, such as LSTM-CNN and LSTM–Transformer combinations, balance temporal modeling with spatial feature extraction. They improve the Acc. at the cost of computation. This section examines efficiency-oriented designs among recurrent hybrid models (Table 5). We focus on how attention, convolutional integration, temporal alignment, and multimodal fusion influence the Comp. Cost and deployment feasibility in real-time or resource-constrained applications.
4.4.1. Attention-Based Architectures
Attention mechanisms have been paired with frequency-domain decomposition to isolate discriminative EEG features in low-frequency bands. Zhang et al. [135] proposed an attention-based encoder–decoder RNN with XGBoost (XGB) for EEG-based person identification. Delta-band EEG was isolated and passed through an RNN with attention to emphasize informative channels. The resulting features were classified with XGB, achieving high Acc. across single-trial, multi-trial, and public datasets. Training was computationally intensive, but inference took less than 1 s, allowing real-world deployment. The method performed well with limited training data and across various EEG setups. Zhang et al. [136] developed a multimodal authentication system using EEG and gait data, combining an attention-based RNN with a one-class SVM and a nearest neighbor (NN) classifier. Delta-band EEG and gait sequences were modeled with an LSTM and attention to extract temporal features. An EEG-based 1D filter first rejected impostors before gait data were processed. With a 0% FAR and 1% FRR, DeepKey achieved strong security with 0.39 s latency, despite its longer setup time. Its architecture scaled to new users without retraining, supporting high-security applications. Balci [137] proposed DM-EEGID, a hybrid model combining an attention-based LSTM-MLP with random forest (RF)-based feature selection. EEG signals were decomposed into sub-bands, and delta was identified as the most distinctive pattern. RF feature selection determined optimal electrode subsets, reducing the channel count while maintaining high Acc. The LSTM attention mechanism focused on salient signal segments, and the MLP finalized classification. The system reached 99.96% Acc. for eyes-closed and 99.70% for eyes-open data. Its reliability with fewer electrodes shows a practical balance between efficiency and performance. These attention-based designs highlight efficiency through selective focus and data reduction rather than network size. Despite heavier training, their low latency and reduced channels make them practical for user-specific and real-time applications.
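The attention-over-recurrence idea common to these systems can be sketched as follows: an LSTM encodes the (e.g., delta-band) sequence and a learned scoring layer weights its hidden states before classification. The module below is a generic, assumed illustration, not the MindID, DeepKey, or DM-EEGID implementation.

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Illustrative attention-over-LSTM design: the LSTM encodes the EEG sequence, and a
    learned attention weighting emphasizes the most informative time steps."""
    def __init__(self, n_channels=14, hidden=64, n_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)            # scores each time step
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                           # x: (batch, time, channels)
        h, _ = self.lstm(x)                         # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        context = (w * h).sum(dim=1)                # weighted summary of the sequence
        return self.head(context)

logits = AttentionLSTM()(torch.randn(4, 256, 14))   # 4 trials, 256 samples, 14 channels
```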
4.4.2. CNN–Recurrent Hybrids
Joint spatial–temporal representation learning has been achieved through the integration of convolutional and recurrent layers to encode EEG dynamics. Wilaiprasitporn et al. [138] presented a cascaded CNN-RNN architecture, evaluating both CNN-LSTM and CNN-GRU for EEG-based person identification using the DEAP dataset. EEG signals were converted into sequences of 2D spatial meshes, which CNN layers processed for spatial pattern learning, followed by RNN layers to model temporal dynamics. This two-stage pipeline provided expressive features with low latency. CNN-GRU variants trained faster, reached up to 100% CRRs, and remained effective with as few as five electrodes, showing practicality for portable systems. Sun et al. [139] introduced a 1D convolutional LSTM that combined 1D CNN layers with LSTM units for spatiotemporal EEG modeling. After preprocessing, CNN layers extracted localized features from segmented EEG signals, and LSTM layers captured temporal dependencies. Despite its complexity, the model reached 99.58% rank-1 Acc. with only 16 channels, outperforming deeper CNN and LSTM baselines. It remained robust across mental states and tasks, and the small number of channels makes it suitable for real-time use. Chakravarthi et al. [140] proposed a hybrid DL model for EEG-based ER, combining a CNN-LSTM framework with ResNet152 to address the limitations of traditional methods in PTSD-related applications. Using the SEED-V dataset (happiness, disgust, fear, neutral, and sadness), EEG signals were normalized and bandpass-filtered (1–75 Hz). FE used MFCCs from FP1, FP2, FC6, and F3, along with the sample entropy, Hurst exponent (R/S analysis), and average power from the alpha, beta, gamma, and theta bands. These features were converted into topographic maps and fed into the model for training using categorical cross-entropy and the Adam optimizer. The model achieved 98% Acc., with a low MSE, outperforming SVM and ANN baselines. Its integration of spatial, temporal, and spectral features supported the reliable recognition of non-verbal emotional cues, pointing out potential for emotion-aware interfaces. CNN-RNN hybrids improve efficiency by distributing the computational load; CNNs handle low-cost spatial abstraction, while RNNs capture essential temporal dependencies. This supports faster training and robust inference when electrode counts or sampling durations are constrained.
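A minimal sketch of the cascade is given below: convolution and pooling reduce the sequence before the LSTM models what remains, which is where much of the efficiency gain comes from. Layer sizes and the class name are our assumptions and purely illustrative.

```python
import torch
import torch.nn as nn

class CNNThenLSTM(nn.Module):
    """Illustrative cascade: a 1D CNN extracts local features from the EEG sequence,
    and an LSTM models the temporal dependencies of the reduced sequence."""
    def __init__(self, n_channels=16, n_feats=32, hidden=64, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, n_feats, kernel_size=7, padding=3), nn.ELU(),
            nn.MaxPool1d(4),                           # downsample before the recurrent stage
            nn.Conv1d(n_feats, n_feats, kernel_size=7, padding=3), nn.ELU(),
            nn.MaxPool1d(4))
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                              # x: (batch, channels, time)
        f = self.cnn(x)                                # (batch, n_feats, time/16)
        _, (h, _) = self.lstm(f.transpose(1, 2))       # feed the reduced sequence to the LSTM
        return self.head(h[-1])                        # classify from the final hidden state

logits = CNNThenLSTM()(torch.randn(8, 16, 1024))       # e.g., 16 channels, 8 s at 128 Hz
```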
4.4.3. Stimulus-Locked Models
Temporally aligned brain responses to visual or auditory stimuli have been used to extract consistent EEG features. Puengdang et al. [141] utilized a personalized LSTM model for person authentication with dual stimuli: 7.5 Hz SSVEP and ERP components evoked by target images. Preprocessed EEG signals were shaped into fixed-length time series and fed into individual-specific LSTM networks trained on authorized and impostor data. Using only seven channels, the model reached 91.44% verification Acc., balancing performance and setup efficiency. Dual stimulation improved user specificity, although Acc. varied across individuals, revealing a need for further personalization. Zheng et al. [142] developed an ERP-guided LSTM framework for EEG-based visual classification. ERPs were averaged across trials to enhance signal quality and reduce noise. Then, they were fed into an LSTM encoder to learn temporal patterns, followed by softmax. The model reached 66.81% Acc. for six-class and 27.08% for 72-class classification, outperforming several raw EEG baselines. These designs improve efficiency through temporal alignment, reducing signal variability and feature redundancy. By leveraging evoked responses, they achieve compact yet reliable representations, although subject dependence limits scalability.
4.4.4. Multimodal and Parameter-Efficient Hybrids
Cross-modal fusion frameworks have combined EEG with other biometric cues using contrastive learning or joint embedding strategies to improve classification. Jin et al. [143] introduced the Convolutional Tensor-Train Neural Network (CTNN), a hybrid model integrating CNNs and tensor-train (TT) networks. EEG signals were segmented into 1 s trials and processed by depthwise separable CNNs. Features were then transformed into high-order tensors by the TT layer, capturing multilinear dependencies with up to 800 times fewer parameters than fully connected layers. This low-rank representation reduced memory use in multitask EEG-based brain print recognition. CTNN achieved over 99% Acc., making it suitable for real-world applications. Kumar et al. [144] proposed a bidirectional LSTM neural network (BLSTM-NN) authentication framework combining EEG and dynamic signatures. The discrete Fourier transform (DFT) was used to derive angular features from signatures. Separate BLSTM models were trained independently for each modality and fused at the decision level using the Borda count and max rule. This efficient modular design avoided complex feature alignment and achieved 98.78% Acc. in security-critical scenarios. It even maintained a low FAR (3.75%) and HTER (1.87%) with noisy individual modalities. Chakladar et al. [145] proposed a multimodal Siamese neural network (mSNN) for user verification by fusing EEG and offline signatures. The architecture used parallel CNN and LSTM encoders for FE from both modalities. The embeddings were compared via contrastive loss, enabling one-shot learning with minimal data. The model achieved 98.57% Acc. and a 2.14% FAR. This demonstrates the robustness of multimodal fusion against single-trait spoofing attempts. These multimodal and tensorized designs achieve efficiency through shared representations and modular training. Compression and late fusion reduce memory and retraining, offering scalable solutions for multisensor or privacy-sensitive EEG systems.
4.4.5. Cross-Integration Overlaps
The work in [136] (Section 4.4.1) overlaps with that in Section 4.4.4 as its attention–RNN was fused with gait-based classifiers to improve efficiency through shared feature representations. Balci [137] in Section 4.4.1 intersects with Section 4.4.2 by combining an attention–LSTM with MLP and RF selection, achieving CNN-like compression through selective sub-band and electrode optimization. Ref. [140], discussed in Section 4.4.2, also relates to Section 4.4.4. It integrates ResNet152 with spectral features in a fusion-like design to enhance efficiency through feature reuse and pretrained transfer. While we include [141] in Section 4.4.3, it shares traits with Section 4.4.1 due to its temporal focus on stimulus-aligned epochs to mimic selective attention for efficiency. Lastly, Ref. [143] (Section 4.4.4) aligns with Section 4.4.2. It uses CNN feature extraction coupled with tensor-train compression, using low-rank representation for scalable and efficient learning.
Table 5.
Other hybrid architectures for EEG classification.
| Ref. | Author/Year | Model | Protocol | Samples | Channels | Inspiration Basis | Acc. |
|---|---|---|---|---|---|---|---|
| Pure Architectures | |||||||
| [144] | Kumar et al. 2019 | BLSTM-NN | VEPs | 33/58 | 14/16 | LSTM | 97.57 |
| [141] | Puengdang et al. 2019 | LSTM-based | SSVEP, ERPs | 20 | 6 | LSTM | 91.44 |
| [142] | Zheng et al. 2020 | ERP-LSTM | VEP | 10 | 128 | LSTM | 66.81 |
| [145] | Chakladar et al. 2021 | mSNN | MI | 70 | 14 | SNN | 98.57 |
| Other Hybrids | |||||||
| [138] | Wilaiprasitporn et al. 2015 | CNN-LSTM, CNN-GRU | ERPs, EMO | 32/40 | 5/32 | CNN-LSTM/GRU | 99.17–99.90 |
| [135] | Zhang et al. 2018 | MindID | Resting | 8 | 14/64 | Attention RNN | 98.20–99.89 |
| [139] | Sun et al. 2019 | 1DCNN-LSTM | Resting | 109 | 64/32/4 | 1D CNN-LSTM | 94.34–99.58 |
| [136] | Zhang et al. 2020 | DeepKey | Relaxing | 7 | 14 | Attention RNN | 99.00 |
| [143] | Jin et al. 2021 | CTNN | Resting, MI, EEG | 105/20/32 | 64/32/7 | CNN-TN | 99.50 |
| [140] | Chakravarthi et al. 2022 | ResNet152-LSTM | Resting | 20 | 4 | ResNet-LSTM | 98.00 |
| [137] | Balci et al. 2023 | DM-EEGID | Resting | 109 | 48 | Attention LSTM | 99.70–99.97 |
Acc.: Accuracy, Attention: Selective Neural Focus Mechanism, Attention RNN: Attention-Based Recurrent Neural Network, GRU: Gated Recurrent Unit, TN: Tensor Network, ERP: Event-Related Potential, EMO: Emotion Protocol, LSTM: Long Short-Term Memory, MI: Motor Imagery, Resting: Resting Protocol, ResNet: Residual Network, RNN: Recurrent Neural Network, SNN: Siamese Neural Network, SSVEP: Steady-State Visual Evoked Potential, VEP: Visual Evoked Potential.
5. Comparative Analysis of Efficiency Trade-Offs
5.1. Proxy Metric Development
In our method, for each DL architectural category (CNNs, Transformers, CNN–Transformers), we performed the following:
- We collected any metrics reported by the authors based on four axes: (1) accuracy (Acc.) to represent the overall performance of the model—we kept the highest reported value; (2) computational resources to represent system costs, such as the architectural cost (parameters), compute cost (FLOPs/MACs), memory footprint, and inference latency; (3) operational costs for acquisition costs like the epoch length (EEG segment) and channel count—for the channels and sample size, we kept the minimum reported values; (4) training costs such as training time and the GPU/TPU/cloud environment used for training. Thus, we were able to approximate the total cost to train and deploy these systems.
- Due to the inconsistent and incomplete reporting of system metrics, we employed a mixed approach to create four proxy metrics, each scored on a scale from 1 (best/lowest cost) to 5 (worst/highest cost). This calculation was based on evaluating a set of quantitative metrics (white background in tables) and the authors’ qualitative claims (orange background in tables) to define a unified cost dimension across all models (a minimal tabulation sketch is provided after Table 9).
- Complex. Proxy (Complex.): This metric reflects the size and depth of the architecture using the number of parameters (k/M) and EEG channels used. It quantifies the memory cost and architectural burden. Low Complex. is ideal for edge devices (Table 6).
Table 6.
Complexity Proxy Development Rationale.
| Scale | Rationale |
|---|---|
| 1 (Very Low) | Very simple or lightweight design. Lowest parameter counts. Designed for few channels. Minimal depth. |
| 2 (Low) | Small or simplified models. Low parameter counts. Low channel count. Uses efficient blocks. |
| 3 (Medium) | Standard deep learning models. Moderate parameter count. Low channel count. |
| 4 (High) | Deep, complex, or specialized models. High parameter count. Applied to high-channel datasets. |
| 5 (Very High) | Very complex models. Very high parameter count. Requires full channel count. |
- Computational Cost Proxy (Comp. Cost): This metric reflects the hardware resources required for a single prediction (inference) after training is complete. It is measured by MACs/FLOPs and the model parameters. It quantifies the processing cost. A low cost is crucial for real-time BCIs (Table 7).
Table 7.
Computational Cost Proxy Development Rationale.
| Scale | Rationale |
|---|---|
| 1 (Minimal) | Very low FLOPs/MACs. Designed for mobile/embedded focus. |
| 2 (Low) | Low parameter count. Claims to be more efficient/faster than standard models. |
| 3 (Medium) | Standard operational load (for most desktop/GPU systems). Claims of optimization but lacks high efficiency. |
| 4 (High) | High operational load (a powerful discrete GPU is required). High resource use optimized for Acc. |
| 5 (Very High) | Very high FLOPs/MACs. Model’s Complex. indicates high computational demand. |
- Operational Cost Proxy (Oper. Cost): This metric reflects the time required for a model to generate a prediction (latency) and the data requirements. It quantifies the real-world usability and user burden needed for the system. This is critical for systems needing seamless and immediate user interaction (Table 8).
Table 8.
Operational Cost Proxy Development Rationale.
| Scale | Rationale |
|---|---|
| 1 (Real-Time) | Reported latency is in sub-seconds. Uses few channels, short segments, and raw signals. No calibration. |
| 2 (Near Real-Time) | Latency is slightly longer (suitable for basic BCI control). Minimal preprocessing. |
| 3 (Acceptable) | Latency is half to a few seconds (adequate for user-paced, non-immediate tasks). Uses data transformation. |
| 4 (Slow) | Latency exceeds seconds (unsuitable for immediate interaction). Extensive pretraining and fine-tuning. |
| 5 (Very Slow) | Relies on a high number of channels or long segments. Large-scale pretraining or high input dimensions. |
- Training Cost Proxy (Train. Cost): This metric reflects the training time and the number of training epochs required to train the model from scratch. It quantifies the initial development and time investment before deployment. A low cost is desirable for research and rapid development (Table 9).
Table 9.
Training Cost Proxy Development Rationale.
| Scale | Rationale |
|---|---|
| 1 (Instant) | Extremely low epoch count. Training time reported in minutes. Efficient transfer learning for new users. |
| 2 (Fast) | Low to moderate epoch counts (meaning quick convergence). No high-cost hardware. |
| 3 (Standard) | Typical training time for most deep learning models on one GPU. Common epoch count for the domain. |
| 4 (Long) | Extended training time on one GPU (suggesting a larger dataset or deeper model). High epoch count. |
| 5 (Very Long) | Extensive training time (hours/days) or complex iterative process on high GPUs. Very high epoch count. |
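As noted above, the sketch below shows one possible way to tabulate the four proxy scores and rank models by an unweighted mean cost. The entries are illustrative placeholders rather than values from Tables 10–12, and the review itself does not prescribe a single aggregate formula; the class and field names are our assumptions.

```python
# Minimal sketch of tabulating the four proxy scores (1 = best, 5 = worst) and comparing models.
from dataclasses import dataclass

@dataclass
class ProxyScores:
    model: str
    complexity: int      # Complex. proxy
    comp_cost: int       # Comp. Cost proxy
    oper_cost: int       # Oper. Cost proxy
    train_cost: int      # Train. Cost proxy

    def mean_cost(self) -> float:
        # Unweighted mean is only one possible summary; weights could reflect deployment priorities.
        return (self.complexity + self.comp_cost + self.oper_cost + self.train_cost) / 4

models = [
    ProxyScores("compact-CNN-example", 1, 2, 3, 3),
    ProxyScores("transformer-example", 5, 5, 5, 5),
    ProxyScores("hybrid-example", 2, 2, 2, 3),
]

for m in sorted(models, key=ProxyScores.mean_cost):
    print(f"{m.model:25s} mean proxy cost = {m.mean_cost():.2f}")
```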
5.1.1. Efficiency Insights: CNNs
The collected data (Table 10) reveal clear trade-offs between metrics across studies.
- Acc. vs. Complex. and Comp. Cost: Compact architectures can maintain strong performance with a limited Comp. Cost. Ref. [12] achieved 91% Acc. using only 1.066 K parameters. In [94], the results showed suitability for low-power deployment, with 4.27 K parameters, 6.8 M MACs, and 77.35% Acc. Ref. [106] used 17.58 K parameters and 20.69 M MACs to achieve 83.73% Acc., although the authors noted that the increased MACs might limit use on lightweight devices for a modest Acc. gain.
- Acc. vs. Oper. Cost: Reducing the number of channels lowers the setup complexity and hardware requirements without necessarily reducing Acc. With only three channels, Refs. [99,100] achieved 98.04% and 99.05% Acc., respectively. Ref. [100] also used 95 s epochs, balancing Acc. and recording efficiency. The epoch length is another important factor for real-time operation. Ref. [70] achieved only 76% Acc. with 62.5 ms segments, showing a speed–Acc. trade-off. In contrast, Refs. [79,80] reached around 99.9% Acc. using 1 s segments, trading speed for better performance. Low latency is critical for online BCI and authentication systems. Ref. [97] reported 1.79 ms latency with 96.15% Acc., suitable for real-time use. Ref. [94] showed 197 ms latency, representing a slower but computationally efficient design.
- Acc. vs. Train. Cost: Training requirements affect deployment feasibility. Ref. [11] reported 24.77 min training time, compared to 33 s for the FBCSP baseline, reflecting the high Comp. Costs of deep models for a moderate Acc. gain (85.20%). Refs. [79,80] required only 2–4 fine-tuning epochs (<1 min) to adapt to new users, reducing the calibration time. Ref. [96] reported 8–18 min enrollment time per subject, showing the time cost of subject-specific model adaptation.
5.1.2. Efficiency Insights: Transformers
From Table 11, we can observe trade-offs between the Acc., Comp. Cost, and Oper. Cost.
- Acc. vs. Complex. and Comp. Cost: Model size influences both Acc. and computational efficiency. The ViT and EEGPT models [34,46] lack inductive biases useful for small datasets, which resulted in lower Acc. compared to ResNet. Ref. [34] used large-scale pretraining with 14–300 M images to overcome this trade-off. In [46], performance was scaled with up to 101 M parameters, but this increased the model size and memory usage. In contrast, Ref. [39] achieved 84.26% Acc. with only 6.50–8.68 K parameters, showing that compact models can maintain competitive performance. Architectural choices also affect computational efficiency. Sequential RNNs/LSTMs have long training times for long sequences. Transformer-based models [40,42] address this by using faster parallel attention mechanisms. Expanding the input signal window provides a richer context and better performance but increases the computational demands and memory usage. Ref. [44] mitigated this by projecting the input into a lower-dimensional latent space (dimension 100). Ref. [45] had to limit its codebook size to fit GPU memory constraints during training.
- Acc. vs. Oper. Cost: Transformer-based models can forgo heavy FE, which reduces the operational complexity. Using raw signals [38] led to 99.4% Acc. with lower Oper. Costs compared to the CWT method (97% Acc.), showing that reduced preprocessing can even improve performance. However, the raw data approach [41] yielded lower performance compared to traditional feature engineering. In [43], the simple temporal Transformer had a low Oper. Cost, while the spectrotemporal ensemble model had a high Oper. Cost but provided 96.1% Acc.
Table 10.
CNN-based models’ metrics.
| Ref. | Author/Year | Parameters (K/M) | MACs/FLOPs (M/G) | Latency (s) | Training Time/Epochs (s/m/h) | Memory Footprint | Epoch Length (Segment) (ms/s) | GPU/TPU/Cloud | Acc. (%) | Sample Size | Channels | Complex. | Comp. Cost | Oper. Cost | Train. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [70] | Ma et al. 2015 | - | - | - | - | - | 62.5 ms | - | 88.00 | 10 | 64 | 3 | 3 | 2 | 3 |
| [71] | Mao et al. 2017 | - | - | - | 0.3 h | - | - | - | 97.00 | 100 | 64 | 3 | 3 | 3 | 2 |
| [75] | Gonzalez et al. 2017 | - | - | - | 1 M iterations | - | 1 s | - | 94.01 | 23 | 16 | 2 | 3 | 3 | 4 |
| [82] | Das et al. 2017 | - | - | - | 50 epochs | - | 600 ms | - | 98.80 | 40 | 17 | 2 | 2 | 3 | 2 |
| [83] | Cecotti et al. 2017 | - | - | - | - | - | 800 ms | NVIDIA GTX 1080 | 90.50 | 16 | 64 | 2 | 3 | 3 | 5 |
| [11] | Schirrmeister et al. 2017 | - | - | - | 24 m 46 s | - | 4 s | NVIDIA GeForce GTX 980 | 85.20 | 9 | 3 | 3 | 4 | 3 | 4 |
| [87] | Bai et al. 2018 | 70 K | - | - | Fast convergence | Low | 1 s | - | 99.00 | - | - | 3 | 2 | 3 | 2 |
| [12] | Lawhern et al. 2018 | 1.066 K | - | - | 500 epochs | - | 4 s | NVIDIA Quadro M6000 GPU | 91.00 | 9 | 22 | 1 | 2 | 3 | 3 |
| [104] | Wu et al. 2018 | - | - | 7 s | 500 epochs | - | 2 s | - | 97.60 | 10 | 16 | 2 | 4 | 4 | 3 |
| [72] | Schons et al. 2018 | - | - | - | - | - | 1 s | - | 99.00 | 109 | 64 | 3 | 3 | 3 | 3 |
| [73] | Di et al. 2018 | - | - | - | - | - | 1 s | GPU | 99.90 | 33 | 64 | 4 | 3 | 3 | 3 |
| [74] | Zhang et al. 2018 | - | - | - | Long training | 10x | 1.25 s | - | 89.00 | 15 | 64 | 2 | 3 | 3 | 4 |
| [76] | Waytowich et al. 2018 | - | - | - | - | - | 1 s | - | 80.00 | 10 | 8 | 2 | 2 | 2 | 3 |
| [78] | Lai et al. 2019 | - | - | - | 30 repetitions | - | - | - | 83.21 | 10 | 64 | 2 | 2 | 3 | 2 |
| [85] | Chen et al. 2019 | - | - | - | 0.5 h | - | - | NVIDIA GeForce GTX TITAN X | 97.06 | 10 | 28 | 4 | 2 | 3 | 2 |
| [79] | Wang et al. 2019 | - | - | - | 2–4 epochs (fine-tune) | - | 1 s | - | 99.73 | 10 | 8 | 2 | 2 | 2 | 1 |
| [77] | Yu et al. 2019 | - | - | - | 50 iterations | - | - | - | 96.78 | 8 | 9 | 2 | 2 | 3 | 2 |
| [84] | Cecotti et al. 2019 | - | - | - | - | - | 800 ms | - | 92.80 | 16 | 64 | 4 | 3 | 3 | 3 |
| [80] | Wang et al. 2019 | - | - | - | <1 min, 0 epochs | - | 1 s | - | 99.98 | 59 | 46 | 2 | 2 | 3 | 1 |
| [105] | Özdenizci et al. 2019 | - | - | - | 100 epochs | - | 0.5 s | - | 98.60 | 3 | 16 | 3 | 3 | 3 | 2 |
| [93] | Salimi et al. 2020 | - | - | - | 100 epochs | - | 1.1 s | NVIDIA Tesla K80 | 95.00 | 26 | 28 | 1 | 2 | 3 | 2 |
| [94] | Ingolfsson et al. 2020 | 4.27 K | 6.8 M | 197 ms | 750 epochs | 396 kB | 4 s | NVIDIA GTX 1080 Ti GPU | 97.44 | 9 | 22 | 1 | 1 | 4 | 4 |
| [88] | Riyad et al. 2020 | - | - | - | 180 epochs | - | 4 s | NVIDIA P100 GPU | 74.08 | 9 | 22 | 3 | 3 | 4 | 2 |
| [89] | Liu et al. 2020 | - | - | - | - | - | 4 s | NVIDIA RTX 2080Ti GPU | 97.68 | 9 | 22 | 4 | 4 | 4 | 3 |
| [95] | Kasim et al. 2021 | - | - | - | 1200 epochs | - | 3 s | - | 97.17 | 16 | 16 | 3 | 3 | 3 | 5 |
| [90] | Zhu et al. 2021 | - | - | - | - | - | 1 s | - | 96.49 | 109 | 64 | 4 | 3 | 3 | 3 |
| [106] | Musallam et al. 2021 | 17.58 K | 20.69 M | - | 1000 epochs | 1188 kB | 4.5 s | TensorFlow | 94.41 | 9 | 22 | 3 | 4 | 4 | 4 |
| [107] | Mane et al. 2021 | - | - | - | 600/1500 epochs | - | 4 s | - | 81.11 | 9 | 20 | 2 | 3 | 4 | 5 |
| [86] | Salami et al. 2022 | 3 K | - | - | 500 epochs | - | 4 s | - | 78.74 | 9 | 20 | 1 | 2 | 4 | 3 |
| [81] | Zhang et al. 2022 | - | - | - | Hour level | - | 4 s | - | 82.33 | 70 | 16 | 4 | 5 | 4 | 5 |
| [99] | Bidgoly et al. 2022 | - | - | - | 30 epochs | - | 1 s | - | 98.04 | 109 | 3 | 2 | 2 | 1 | 2 |
| [96] | Wu et al. 2022 | 450.626 K | - | 4 s | 8–10 m (enrollment) | - | 4 s | - | 99.48 | 9 | 10 | 4 | 4 | 2 | 2 |
| [97] | Altuwaijri et al. 2022 | 10.17 K | - | 1.79 ms | 1000 epochs | - | 4.5 s | Google Colab | 96.15 | 9 | 22 | 2 | 1 | 4 | 4 |
| [98] | Autthasan et al. 2022 | 55.232 K | - | 0.1–0.3 s | 0.47–1.36 s/epoch | - | 2 s | NVIDIA Tesla V100 GPU | 72.03 | 9 | 15 | 3 | 2 | 3 | 2 |
| [92] | Ding et al. 2023 | 12.56 K | - | - | 500 epochs | - | 2–4 s | - | 63.75 | 27 | 32 | 2 | 3 | 3 | 3 |
| [100] | Alsumari et al. 2023 | 74.071 K | - | - | 20 epochs | - | 5 s | Google Colab | 99.05 | 109 | 3 | 2 | 3 | 1 | 1 |
| [101] | Yap et al. 2023 | 5–45 M | - | 2 s | 30 epochs | - | 4.5 s | GTX 1080 Ti | 80.00 | 30 | 14 | 2 | 1 | 2 | 4 |
| [102] | Chen et al. 2024 | - | - | - | - | - | 2 s | - | 93.81 | 1 | 56 | 4 | 3 | 3 | 3 |
| [103] | Shakir et al. 2024 | - | - | - | - | - | 1 s | - | 95.00 | 109 | 3 | 2 | 2 | 1 | 3 |
| [91] | Lakhan et al. 2025 | - | - | - | 20 epochs | - | - | NVIDIA Tesla V100 GPU | 99.26 | 54 | 8 | 4 | 3 | 2 | 1 |
Table 11.
Transformer-based models’ metrics.
| Ref. | Author/Year | Parameters (K/M) | MACs/FLOPs (M/G) | Latency (s) | Training Time Epochs (s/m/h) | Memory Footprint | Epoch Length (Segment) (ms/s) | GPU TPU Cloud | Acc. (%) | Sample Size | Channels | Complex. | Comp. Cost | Oper. Cost | Train. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [38] | Arjun et al. 2021 | - | - | - | - | - | 6 s | - | 99.40 | 32 | 32 | 1 | 1 | 1 | 2 |
| [34] | Dosovitskiy et al. 2021 | - | - | - | - | - | 14–300 M image patches | TPU v3 core days | 94.55 | - | - | 3 | 2 | 3 | 4 |
| [39] | Song et al. 2021 | 6.50–8.68 K | - | - | - | - | Small segments | - | 84.26 | 9 | 3 | 1 | 1 | 1 | 1 |
| [40] | Tao et al. 2021 | - | - | - | - | - | 20–460 ms | - | 61.11 | 6 | 64 | 3 | 2 | 2 | 2 |
| [41] | Siddhad et al. 2024 | - | - | - | - | - | - | - | 95.28 | 48 | 14 | 2 | 2 | 1 | 2 |
| [42] | Du et al. 2022 | - | - | - | - | - | 1 s | - | 97.90 | 109 | 64 | 2 | 2 | 1 | 2 |
| [43] | Zeynali et al. 2023 | - | - | - | 1000 epochs | - | - | - | 96.10 | 8 | 64 | 4 | 3 | 3 | 5 |
| [44] | Omair et al. 2024 | - | - | - | - | Latent dim. 100 | 150 time stamps | - | 85.00 | 9 | 3 | 3 | 3 | 2 | 2 |
| [45] | Lim et al. 2025 | (Embed. size 256) | - | - | <1 day | >24 GB | 2 s | RTX 4090 GPU | 90.84 | 1 | 3 | 5 | 5 | 4 | 4 |
| [46] | Wang et al. 2024 | 10–101 M | - | - | 200 epochs | - | 4 s | 8 NVIDIA 3090s GPUs | 80.59 | 9 | 3 | 5 | 5 | 5 | 5 |
| [108] | Hu et al. 2024 | - | - | - | 100 epochs | - | - | NVIDIA Tesla T4 Tensor Core GPU | 99.12 | 15 | 32 | 4 | 3 | 2 | 3 |
| [109] | Muna et al. 2025 | - | - | - | 20 epochs | - | - | CUDA Cloud | 76.83 | 9 | 22 | 3 | 3 | 3 | 1 |
| [111] | Ghous et al. 2025 | - | - | - | 50 epochs | - | - | - | 95.00 | 15 | 62 | 4 | 4 | 4 | 3 |
5.1.3. Efficiency Insights: CNN–Transformer Hybrids
The data (Table 12) highlight many trade-offs between the different costs.
- Acc. vs. Complex. and Comp. Cost: Models with high parameter counts deliver higher accuracy at the expense of computational efficiency. For instance, architectures with up to 23.55 M [131] or even 369 M [133] parameters achieved strong performance but required more computation. In contrast, Ref. [117] achieved very high Acc. (97.81%) with a very low parameter count (24.9–25.7 K), indicating high architectural efficiency. Similarly, Ref. [128] is noted for its small memory footprint and low parameter count (115.2 K), making it suitable for resource-constrained applications or embedded BCI applications. The models in [120,126] show that longer EEG data segments or window lengths generally increase Acc. but also raise the computational complexity.
- Acc. vs. Complex.: Pure Transformer models lack convolutional inductive biases, so they need large amounts of data to prevent overfitting. Hybrid models [114,117] reduce this data dependency and complexity by incorporating CNN FE modules, improving training efficiency and performance stability.
- Oper. Cost vs. Train. Cost: Some models optimize real-time usability at the expense of training overhead. The model in [130] showed a low latency of 0.0043 s with high Acc. (99.67%), showing a strong design for real-time operation (low Oper. Cost). Others [129] had up to 5000 epochs, indicating a high Train. Cost.
5.1.4. Efficiency Insights: Recurrent Hybrids
Based on Table 13, the following are key efficiency trade-offs.
- Acc. vs. Complex. and Comp. Cost: Multimodal EEG systems increase user complexity yet yield high security. Ref. [144] boosted the Acc. from 97.57% (unimodal) to 98.78% (EEG and signature). Similarly, Ref. [136] achieved 99.57% overall Acc. by combining EEG and gait. Advanced models can reduce the computational burden without compromising Acc. Ref. [143] used tensor-train decomposition for computational efficiency gains. It required only 1.6 K parameters for classification, compared to a traditional model at 1.28 M parameters. This led to a reduced memory footprint with 99.50% Acc.
- Acc. vs. Oper. Costs: Channels’ dimensionality reduction increases efficiency and user practicality while maintaining high Acc. Refs. [139,140] demonstrated high Acc. (99.58% and 98.00%, respectively) using a minimal number of four channels. Ref. [138] achieved a 100% CRR with 32 channels and still maintained a 99.17% CRR when reduced to five, making the system practical and efficient. Ref. [137] found the optimal efficiency–Acc. balance at 48 channels out of 64. An operational burden during data collection is sometimes accepted for FE gains to improve both the signal-to-noise ratio and Acc. Ref. [142] accepted a very high Oper. Cost, requiring over 50,000 trials, for an enhanced feature space. This resulted in a 30.09% improvement in classification Acc. over comparable methods.
- Acc. vs. Train. Cost: Many high-performing systems [135,137,139] exhibit long training times. For example, Ref. [139] achieved 99.58% Acc. by accepting a longer training time but balanced this with a fast-testing latency of 0.065 s.
- Oper. Cost vs. Train. Cost: A few authors have accepted high Train. Costs in exchange for very low Oper. Costs. Ref. [135] had a long training time in exchange for less than 1 s latency for a better authentication decision and practical deployment. Ref. [139] achieved lower batch testing latency of 0.065 s. Ref. [145] used one-shot learning training on as few as six pairs. It resulted in a reduced initial Oper. Cost for user enrollment despite a long total training time (870 min).
Table 12.
CNN–Transformer hybrids’ metrics.
| Ref. | Author/Year | Parameters (K/M) | MACs/FLOPs (M/G) | Latency (s) | Training Time Epochs (s/m/h) | Memory Footprint | Epoch Length (Segment) (ms/s) | GPU TPU Cloud | Acc. (%) | Sample Size | Channels | Complex. | Comp. Cost | Oper. Cost | Train. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [113] | Sun et al. 2021 | - | - | - | - | - | - | - | 87.80 | 109 | 64 | 3 | 3 | 2 | 3 |
| [131] | Kostas et al. 2021 | - | Quadratic | - | - | - | - | - | 86.70 | >10,000 | 20 | 5 | 5 | 4 | 3 |
| [124] | Bagchi et al. 2022 | 4.56–23.55 M | Quadratic | - | 35–80 epochs | - | - | - | 89.64 | 10 | 128 | 5 | 5 | 4 | 3 |
| [120] | Xie et al. 2022 | - | - | - | - | - | - | - | 83.31 | 109 | 64 | 3 | 3 | 3 | 3 |
| [128] | Altaheri et al. 2023 | 115.2 K | - | - | - | Small | - | - | 85.38 | 9 | 22 | 1 | 1 | 1 | 2 |
| [132] | Yang et al. 2023 | 86 M | - | - | 15 epochs | - | - | - | 61.70 | 27 | 128 | 4 | 4 | 3 | 2 |
| [119] | Li et al. 2023 | - | - | - | - | - | - | - | 96.65 | 9 | 3 | 4 | 3 | 3 | 3 |
| [125] | Song et al. 2023 | - | - | 0.27 | - | - | - | GPU | 95.30 | 9 | 3 | 3 | 3 | 2 | 2 |
| [115] | Wan et al. 2023 | Avoids huge complexity | - | - | - | - | - | - | 92.75 | 12 | 6 | 3 | 2 | 2 | 3 |
| [121] | Si et al. 2023 | Hybrid: slightly lower | - | - | - | - | 14 s | - | 42.50 | 6 | 30 | 4 | 3 | 3 | 3 |
| [127] | Gong et al. 2023 | - | - | - | - | - | - | - | 95.30 | 15 | 62 | 3 | 3 | 3 | 3 |
| [116] | Ma et al. 2023 | - | - | - | 200 epochs | - | - | - | 83.91 | 9 | 22 | 3 | 3 | 3 | 4 |
| [114] | Omair et al. 2024 | - | - | - | 100 epochs | - | - | - | 86.98 | 9 | 3 | 3 | 3 | 3 | 3 |
| [126] | Si et al. 2024 | - | - | - | - | - | 14 s Optimal | - | 67.80 | 32 | 28 | 3 | 4 | 3 | 3 |
| [117] | Zhao et al. 2024 | 24.9–25.7 K | - | - | - | - | - | RTX3090 | 97.81 | 9 | 3 | 2 | 2 | 2 | 3 |
| [133] | Jiang et al. 2024 | 5.8–369 M | - | - | Fine tuning costly | Costly | 1 s | - | 82.58 | >140 | 19 | 5 | 5 | 5 | 5 |
| [118] | Liu et al. 2024 | - | - | - | - | - | - | - | 74.23 | 16 | 32 | 2 | 2 | 2 | 3 |
| [122] | Yao et al. 2024 | - | - | - | - | - | 3 s | - | 96.95 | 15 | 32 | 3 | 3 | 2 | 3 |
| [134] | Li et al. 2024 | - | - | - | 15 epochs | - | - | RTX4090 | - | 356 | 128 | 3 | 2 | 2 | 2 |
| [123] | Lu et al. 2024 | - | - | - | - | - | - | - | 98.57 | 15 | 62 | 4 | 3 | 3 | 3 |
| [129] | Nguyen et al. 2024 | - | - | - | Up to 5000 epochs | - | - | - | 83.41 | 9 | 22 | 3 | 3 | 3 | 5 |
| [130] | Cheng et al. 2024 | Linear complexity | - | 0.0043 | 200–300 epochs | - | 2–17 s | 2080Ti | 99.67 | 15 | 32 | 2 | 1 | 1 | 4 |
Table 13.
Recurrent-based models’ metrics.
| Ref. | Author/Year | Parameters (K/M) | MACs/FLOPs (M/G) | Latency (s) | Training Time Epochs (s/m/h) | Memory Footprint | Epoch Length (Segment) (ms/s) | GPU/TPU/Cloud | Acc. (%) | Sample Size | Channels | Complex. | Comp. Cost | Oper. Cost | Train. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [144] | Kumar et al. 2019 | - | - | - | - | - | - | - | 97.57 | 33 | 14 | 4 | 2 | 4 | 2 |
| [141] | Puengdang et al. 2019 | - | - | - | 28.5 m 30–50 epochs | - | - | - | 91.44 | 20 | 6 | 3 | 2 | 3 | 1 |
| [142] | Zheng et al. 2020 | - | - | - | - | - | - | - | 66.81 | 10 | 128 | 3 | 4 | 5 | 3 |
| [145] | Chakladar et al. 2021 | - | - | - | 870 m 150 epochs | - | - | - | 98.57 | 70 | 14 | 4 | 3 | 3 | 5 |
| [138] | Wilaiprasitporn et al. 2015 | - | - | - | Fast | - | - | - | 99.90 | 32 | 5 | 4 | 2 | 2 | 2 |
| [135] | Zhang et al. 2018 | - | - | <1 | Increased | - | - | - | 99.89 | 8 | 14 | 4 | 1 | 3 | 4 |
| [139] | Sun et al. 2019 | - | - | 0.065 | Long | - | - | GPU | 99.58 | 109 | 4 | 4 | 1 | 1 | 4 |
| [136] | Zhang et al. 2020 | - | - | 0.39 | - | - | - | - | 99.00 | 7 | 14 | 5 | 2 | 5 | 3 |
| [143] | Jin et al. 2021 | 1.6 K (vs. 1.28 M) | - | - | - | Reduced | - | - | 99.50 | 20 | 7 | 5 | 1 | 2 | 2 |
| [140] | Chakravarthi et al. 2022 | - | - | - | - | - | - | - | 98.00 | 20 | 4 | 4 | 3 | 1 | 3 |
| [137] | Balci et al. 2023 | - | - | - | Long | - | - | - | 99.97 | 109 | 48 | 4 | 3 | 4 | 4 |
5.2. Comprehensive Weighted Sum Model
We used the previously collected metrics and cost proxies to implement three weighted sensitivity analysis (WSA) scenarios (Table 14) and provide comparative rankings under specific operational priorities. We applied five criteria (Acc. plus the four cost proxies) so that each ranking reflects the full picture. We focus on three scenarios to illustrate the overall classification efficiency.
Table 14.
WSA Scenarios.
We used these five criteria weights (sum to 100) for all DL categories (Table 15).
Table 15.
WSA Criteria Weights.
We used the normalized five criteria to calculate the WSA utility score using the following formula:
$$U_i = \sum_{j=1}^{5} w_j \, x_{ij},$$
where $U_i$ is the utility score for model $i$, $w_j$ is the weight of criterion $j$, and $x_{ij}$ is the normalized value of model $i$ on criterion $j$.
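To make the scoring procedure concrete, the following minimal Python sketch applies min–max normalization (with the cost criteria inverted) and the weighted sum above to a few illustrative metric vectors; the metric values and weights are placeholders, not figures taken from the reviewed studies.

```python
import numpy as np

# Minimal WSA sketch: rows = models, columns = [Acc., Complex., Comp. Cost, Oper. Cost, Train. Cost].
# The metric values below are illustrative, not taken from the reviewed studies.
metrics = np.array([
    [99.4, 1, 1, 1, 2],   # hypothetical model A
    [85.4, 1, 1, 1, 2],   # hypothetical model B
    [80.6, 5, 5, 5, 5],   # hypothetical model C
], dtype=float)

# Criterion directions: accuracy is a benefit (higher is better); the four cost proxies
# are costs (lower is better), so they are inverted during normalization.
benefit = np.array([True, False, False, False, False])

# Example weight set for one scenario (sums to 100, mirroring Table 15's convention).
weights = np.array([40, 15, 15, 15, 15], dtype=float) / 100.0

def normalize(col, is_benefit):
    """Min-max normalize one criterion to [0, 1]; flip the direction for cost criteria."""
    lo, hi = col.min(), col.max()
    if hi == lo:                      # avoid division by zero for constant columns
        return np.ones_like(col)
    scaled = (col - lo) / (hi - lo)
    return scaled if is_benefit else 1.0 - scaled

normalized = np.column_stack(
    [normalize(metrics[:, j], benefit[j]) for j in range(metrics.shape[1])]
)

# Weighted sum utility U_i = sum_j w_j * x_ij on the normalized criteria.
utility = normalized @ weights
print(utility)   # one score per model; higher = more efficient under this scenario
```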
We then plotted the WSA utility scores using heatmaps to show which models are the most robust to changing weights. The heatmaps illustrate the strengths of the networks across the three scenarios. Lighter green colors indicate higher efficiency, and darker blue colors indicate lower efficiency.
5.2.1. CNN Heatmap Analysis
Across 40 studies (Figure 5), a few studies demonstrate consistently high efficiency by scoring well in all three scenarios. These models [79,80] show the highest scores and are the highest ranked for the S1 and S3 scenarios. Their versatile performance is driven by very high accuracy and a near-perfect score for the Train. Cost (very fast fine tuning/convergence). The studies in [99,100] also rank highly in all scenarios. They excel because of their low Oper. Costs (three channels) and high Acc. Ref. [100] stands out with the highest score for S2. The S2 scenario highlights models prioritized under hardware constraints. These models [12,93] jump in rank compared to the others due to their low Complex. (tiny size). Ref. [94] is an example of structural efficiency specialization. Despite its overall low scores, it has a distinctively higher score in S2 than in S1 and S3. This shows a clear trade-off, sacrificing speed and Acc. for model compactness. Models clustered at the bottom [81,92,106] show low scores across all columns. This indicates a poor balance among the five criteria, often due to low Acc. ([92] at only 63.75%) or very high resource requirements, such as long Train. Costs, long epoch lengths, or high computational demands.
Figure 5.
Heatmap depicting CNNs’ WSA scores [11,12,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107].
The WSA analysis quantifies the trade-offs of DL model design for EEG tasks.
- Acc. vs. Oper. Cost: In S1, a high weight on Oper. Cost means accepting constraints like low channel counts or very short epoch lengths. The most balanced models [79,80] achieved very high Acc. with very short epoch lengths (1 s) and efficient fine tuning. Ref. [104] scores highly in S1, but its 7 s latency shows a poor speed/operational trade-off, reducing its score despite its good Acc.
- Acc. vs. Complex.: In S2, a high weight for Complex. means accepting a lower peak Acc. or longer training. The model in Ref. [12] is designed to be ultra-compact (1.066 K parameters), resulting in a high score for S2 but with a modest Acc. of 91.00% and a high number of training epochs (500).
- Acc. vs. Training Cost: A heavy Train. Cost is often accepted in pursuit of a well-performing model. Studies like [81,83] have the lowest scores due to long training, which shows that a high Train. Cost does not guarantee high operational efficiency. However, models like that in [97], despite a high number of training epochs (1000), achieved excellent latency, showing a trade-off between training effort and post-deployment speed.
5.2.2. Transformer Heatmap Analysis
The heatmap (Figure 6) represents the overall efficiency of the 13 Transformer-based models across the three scenarios. The consistent leader [38] remains the top-ranked model across all scenarios, confirming its robust efficiency due to high Acc. and minimal resource requirements. The models in [39,41,42], ranked second, third, and fourth, are consistently the most efficient. For Refs. [41,42], the high scores arise from using raw EEG and tiny architectures, resulting in high Acc., low Complex., and low Oper. Costs. Ref. [42] scores highest in the S1 scenario, highlighting its speed. Ref. [39] scores highest in the S2 scenario, pointing out its small parameter count. The lowest-scoring models [45,46] are significantly less efficient. This is due to large parameter counts (101 M for [46]) or high memory requirements (>24 GB for [45]). Overall, the relative rankings are highly consistent across all three scenarios, which means that the underlying efficiency trade-offs of the models are fundamental to their designs and less sensitive to the specific weight distribution.
Figure 6.
Heatmap depicting Transformers’ WSA scores [34,38,39,40,41,42,43,44,45,46,108,109,111].
The WSA rankings reveal critical trade-offs between model Acc. and resource costs.
- Acc. vs. Oper. Cost: This trade-off is critical in S1, as it determines whether an Acc. gain justifies added preprocessing Complex. and slowness. Some studies [38,41,42] use raw EEG signals and achieve the best overall performance with a minimal Oper. Cost. Ref. [38] best embodies an efficient lightweight solution that balances Complex., Oper., and Comp. Costs. In contrast, Ref. [43] combined raw temporal and spectral (PSD) features, which resulted in a slight increase in Acc. but incurred a higher Oper. Cost due to the added computation.
- Acc. vs. Complex.: This trade-off is central for S2, where the model size and computational load are minimized. The tiny model in [39] had few parameters and obtained the best Complex. This shows that attention mechanisms can be effective at low parameter counts. This model trades Acc. (84.26%) for very high efficiency. On the other hand, Ref. [46] demonstrates a large-model penalty, with millions of parameters and high Complex. While its size boosts its performance via scalability, the resulting high Comp. Cost cancels out any Acc. benefits.
- Acc. vs. Training Cost: This trade-off reflects the computational and time resources required to train a model, heavily influencing the all-rounder efficiency. The studies in [43,45] represent the resource-heavy end of the spectrum, both requiring 1000 training epochs. This high demand for training time indicates that these models need a high development cost to reach high Acc. In contrast, Ref. [109] achieved fast convergence, completing training in only 20 epochs. This reflects a highly efficient training process, balancing a low development cost with moderate Acc. (76.83%).
5.2.3. CNN–Transformer Heatmap Analysis
The heatmap (Figure 7) demonstrates that a model’s efficiency determines its overall rank, rather than slight variations in the criteria weights. The models in [118,130] are consistently the most efficient, with the highest scores (ranking first and second, respectively). They offer the best overall efficiency trade-offs, as their low Complex., Comp. Cost, and Oper. Cost offset any reduction in performance. Thus, they are the most practical choices for resource-limited applications. Notably, Ref. [118] ranks first across all scenarios, demonstrating its superior and robust efficiency trade-off. Ref. [130] is consistently the second-best performer across all scenarios. Conversely, the least efficient models, including [125,133] and the two large-scale models [115,131] with the lowest scores, rank 17th–21st. Irrespective of their potential for high Acc., these models are resource-intensive and are poorly suited for edge environments or real-time systems. The stability in the relative rankings of the models across all three weighting scenarios means that small and fast models are consistently preferred in this efficiency-focused analysis. Finally, models such as those in [114,126] occupy the mid-range, offering a moderate trade-off. This makes them good general-purpose options when computational constraints are not a defining factor.
Figure 7.
Heatmap depicting CNN–Transformer hybrids’ WSA scores [113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133].
The WSA analysis points out the following trade-offs.
- Acc. vs. Total Cost Efficiency: The model in [118] demonstrates the maximum trade-off in favor of efficiency. It achieves good but not the best raw Acc. of 85.38%, which is compensated for by a minimal-cost profile across Complex., Comp. Cost, and Oper. Cost. The model’s minimal resource footprint makes it the ideal edge solution.
- Acc. vs. Oper. Cost: The model in [130] represents the most desirable trade-off. It provides peak performance across the dataset with the highest raw Acc. of 99.67%, while simultaneously demonstrating minimal latency thanks to its very low Oper. Cost. This secures its status as the optimal real-time solution.
- Acc. vs. Complex. and Oper. Cost: The model in [129] clearly prioritizes performance, as shown by its near-best raw Acc. of 98.08%. However, this high performance comes at the cost of its resource profile, with only average performance in Complex. and Oper. Cost. It can be chosen only when high Acc. is mandatory and the system can tolerate its high resource utilization. Overall, it is less efficient than the two top models [118,130].
- Acc. vs. Complex. and Comp. Cost: The model in [133] demonstrates a poor trade-off. It yields modest raw Acc. of 70.11% with the maximum resources across the Complex. and Comp. Costs. It proves the diminishing returns of architectural scaling that do not translate into superior performance. This is the least practical choice in constrained environments.
5.2.4. Recurrent Hybrid Heatmap Analysis
The heatmap (Figure 8) shows the following insights for 11 recurrent hybrid models. The system in [139] is the top performer for both the S1 and S3 scenarios due to its high Acc., minimal channel use, and fast latency. Ref. [141] is the best for S2, driven by its excellent Complex. and Train. Cost scores. Systems like that in [143] score highly in S1 but low in S2 due to high Complex. scores. Ref. [142] consistently shows the lowest scores due to its very low Acc. and high Oper. Cost (128 channels). The models in [136,137] rank poorly (10th and 8th, respectively) despite their high Acc. This is due to their heavily penalized Complex. and Oper. Costs in the WSA.
Figure 8.
Heatmap depicting recurrent hybrids’ WSA scores [135,136,137,138,139,140,141,142,143,144,145].
The WSA reveals specific efficiency trade-offs among the top-ranked systems.
- Acc. vs. Oper./Comp. Cost: In S1, the model in [139] ranks first and achieves 99.58% Acc. by minimizing resource use during operation. It secures efficiency by using only four channels and a fast batch testing latency of 0.065 s. This performance comes at the cost of a long training time due to its complex recurrent architecture. Ref. [140] ranks second, achieves 98.00% Acc., and uses four channels, but its Comp. Cost score is lower because of its latency of 1.7 s.
- Acc. vs. Complex./Train. Cost: In S2, the model in [141] ranks first despite having the lowest Acc. among the top tier (91.44%), in exchange for being the most resource-efficient system; it requires only minutes of training time. In contrast, Ref. [139] ranks second and maintains 99.58% Acc., but its recurrent architecture results in high Complex., increasing its memory footprint.
- Acc. vs. Total System Cost: In S3, Ref. [139] ranks first, achieving 99.58% Acc. and combining this with superior operational efficiency (four channels and 1 s latency). Its score confirms that its fast low-channel operation outweighs the penalty of its long initial training time. Ref. [138] ranks second and achieves high Acc. (99.17%) and reliable performance across all cost metrics, with six channels and 1.25 s latency. This allows it to avoid the extreme cost trade-offs of the rank-1 models.
5.3. The Efficiency Frontier
This section provides visual insight into the performance vs. efficiency trade-off and identifies architectural designs that can serve as efficiency benchmarks. These non-dominated (Pareto-optimal) models achieve the best performance-to-cost ratios. We plotted the models’ Acc. against their WSA S3 scores. From each DL architectural category, we selected the five top-performing models in the all-rounder scenario, while highlighting whether any model is the highest scorer in any scenario (S1, S2, S3).
5.3.1. Trade-off Analysis
The scatterplot (Figure 9) offers key insights into the trade-offs between model Acc. and the all-rounder WSA scores. The analysis focuses on the concept of Pareto optimality, which is represented by the solid black line connecting the non-dominated models. These models achieve the best possible combination of high Acc. and high S3 utility compared to all other models. Their WSA scores cannot be improved without decreasing their Acc. and vice versa. The most efficient models are those lying directly on or very close to the black line. The model in [38], with the highest WSA score and Acc., and the CNN model in [79], with a slightly lower WSA score, define the extremes of the most efficient set. The hybrid CNN–Transformer models [118,130] and the recurrent hybrid [139] also lie at this frontier. The curve generally slopes upward and to the right, illustrating the positive correlation between the two metrics: higher Acc. generally comes with a higher S3 score. However, moving along the frontier reveals the increasing cost of improvement. The frontier flattens out as we move toward 100% Acc. (approached by [80] at 99.98%), indicating that achieving the final few percentage points of Acc. requires significant sacrifices or yields diminishing returns in the S3 score. All points that fall below the Pareto frontier are dominated. This means that there is at least one model on the line that is superior in both Acc. and the WSA score or equal in one and superior in the other. For instance, the CNN model in [99] is dominated by the model in [79]. These dominated models are inherently suboptimal choices for an all-rounder system.
Figure 9.
Acc. vs. S3 WSA scores.
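For readers who wish to reproduce this step, the short sketch below identifies the non-dominated set when both Acc. and the S3 WSA score are maximized; the model names and values are illustrative placeholders rather than the actual scores plotted in Figure 9.

```python
# Identify non-dominated (Pareto-optimal) models when both Acc. and the S3 WSA
# score are to be maximized. Values are illustrative placeholders.
models = {
    "A": (99.40, 0.95),   # (Acc. %, S3 utility)
    "B": (99.98, 0.88),
    "C": (95.00, 0.70),
    "D": (85.38, 0.90),
}

def pareto_front(points):
    """Return names of models not dominated by any other model (maximize both axes)."""
    front = []
    for name, (acc, util) in points.items():
        dominated = any(
            (a >= acc and u >= util) and (a > acc or u > util)
            for other, (a, u) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(models))   # -> ['A', 'B']; models C and D are dominated
```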
5.3.2. Architectural Insights
The plot reveals performance clusters among the different architectural groups. The Transformer architecture shows the strongest potential for S3 utility, with the absolute highest WSA score [38]. This shows that, while maintaining high Acc., these models are particularly efficient across the WSA criteria (latency, energy efficiency). The CNN group dominates the high-Acc., mid-range utility segment. The highest Acc. (99.98%) belongs to the CNN model in [80]. This architecture is a solid choice when pure Acc. is the priority, even if the S3 score for [80] is slightly lower than that of the top Transformer [38]. The CNN–Transformer and recurrent hybrid groups show models scattered across the mid-to-high Acc. range. They both offer competitive, non-dominated representations on the Pareto frontier ([118,130] for CNN–Transformers and [139] for recurrent). The CNN–Transformer models have a balance among their individual constituent models’ strengths.
5.3.3. Scenario Performance Insights
The markers indicate the models’ statuses across the three scenarios (S1, S2, S3). The top-ranked performer in all three scenarios (S1, S2, and S3) is the Transformer [38]. This is the choice for systems requiring multi-criteria excellence. Another multi-champion model is that in [118]. The CNN in [79] has the highest Acc. among these champions; it performs well in S1 and S3, but not S2. The recurrent model in [139] also shares this champion status.
5.4. The Pareto Frontier
In this step, we selected the three true Pareto champions from the previous step. Since the three models were from only two architectural categories, we added the best high-performing representatives for the two remaining architectural groups to give better comparative insights. The analysis is based on a scaled score, where 4 is the best outcome (highest Acc./lowest cost) and 1 is the worst outcome (lowest Acc./highest cost). The models are sorted by the calculated polygon area (as an overall utility score, Figure 10).
The model in [38] achieves the highest overall utility (largest area) by dominating the cost metrics. It scores the maximum in three out of four cost metrics (Complex., Comp. Cost, and Oper. Cost). This means that it is the least costly model in these areas. Its high score in cost-effectiveness is offset by only a moderate score (2) in the Train. Cost. Its Acc. score is the lowest of the top three models (slightly above 3), but this dip in performance is outweighed by its operational efficiency. In applications where the runtime cost (Complex., computation, and operation) is the primary constraint, the model in Ref. [38] is the optimal choice as it delivers high Acc. with minimal operational expenses. The models in [79,80] trade higher Oper. Costs for marginal gains in Acc. The model in [80] has the highest Acc. (4, representing 99.98%), but scores 1 for its Oper. Cost, making it the most expensive to run among this set. Its costs are unbalanced, with a low Train. Cost but high Complex. and Oper. Costs. The model in [79] is a compromise between those in [38,80]. It maintains very high Acc. and a low Train. Cost like [80], but it slightly improves on the Complex., Comp. Cost, and Oper. Cost, scoring 3 in each. The two models in [130,139] are dominated by other models on key axes, resulting in lower overall utility. Ref. [130] achieves high scores in three metrics (Acc., Comp. Cost, Oper. Cost) but is penalized by its score of 1 for the Train. Cost. This model is best only if the Train. Cost can be completely ignored, which is rarely the case. The model in [139] has the worst Complex. score, although it attains the best Comp. Cost and Oper. Cost scores (4). Its low Complex. and Train. Cost scores decrease its overall area, positioning it as the option with the lowest overall utility. The cost-efficient profile of the Transformer architecture [38] contrasts with the high Acc. of the CNN architectures [79,80], pointing out the architectural trade-off that exists at the top of the performance frontier.
Figure 10.
Comparison of Pareto frontier optimum models.
5.5. Performance vs. Total Real Cost Analysis
In this step, we performed an Acc. vs. total real cost analysis. The total real cost is defined as follows:
$$\text{Total Real Cost} = \frac{\text{Complex.} + \text{Comp. Cost} + \text{Oper. Cost} + \text{Train. Cost}}{4},$$
where Complex., Comp. Cost, Oper. Cost, and Train. Cost are defined in Section 5.1.
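As a small worked check of this definition, the snippet below recomputes the total real cost for two frontier models using their cost proxies from the metric tables above ([38]: 1, 1, 1, 2; [80]: 2, 2, 3, 1), reproducing the 1.25 and 2.00 values discussed below.

```python
def total_real_cost(complexity, comp_cost, oper_cost, train_cost):
    """Mean of the four 1-5 cost proxies defined in Section 5.1."""
    return (complexity + comp_cost + oper_cost + train_cost) / 4.0

# Proxy scores taken from the metric tables for two of the frontier models.
print(total_real_cost(1, 1, 1, 2))   # Transformer [38] -> 1.25
print(total_real_cost(2, 2, 3, 1))   # CNN [80]         -> 2.00
```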
The plot below (Figure 11) maps the performance against the resource consumption for all 85 models. The objective is to maximize Acc. while minimizing the cost value. The ideal region is the top-left corner (high Acc., low cost). The data points are skewed towards the high Acc. range (90–100%) across costs between 1.5 and 4.0. This implies that achieving high Acc. is generally feasible, but doing so with the minimum cost is difficult. CNNs form the largest group, with a wide distribution, achieving some of the lowest costs and highest Acc. Recurrent hybrids and Transformers cluster within the high-Acc. band near or within the Pareto frontier region, indicating strong performance-to-cost ratios. CNN–Transformers have a higher average cost, with the densest cluster around cost = 3.25.
Figure 11.
Acc. vs. total real cost and Pareto frontier.
The Pareto frontier optimum (black dashed line) represents the set of optimal, non-dominated solutions. Any model below this line is inferior to at least one point on the frontier in terms of both cost and Acc. The first steep segment of the frontier (cost of 1.0–2.0) defines the most critical trade-off. The Transformer in [38], with a 1.25 cost and nearly perfect Acc. (99.40%), presents an excellent balance. The CNN in [80] at cost = 2.00 sets the performance ceiling at 99.98%. The CNN–Transformer in [130], with a 2.0 cost and 99.67% Acc., is close in Acc. at a similar cost. The analysis shows that a cost greater than 2.00 does not yield higher Acc. Beyond a cost of 2.00, the Pareto frontier flattens out almost completely, forming a dense optimal plateau in the 99.0–99.9% range. Models like the CNN in [73] at cost = 3.25 (99.90%) and the recurrent hybrid in [137] at cost = 3.75 (99.97%) are technically on the frontier (as they are the best options at their specific cost points), but they demonstrate a clear case of diminishing and negative returns. These models offer no significant Acc. advantage over the CNN at cost = 2.00 but require up to twice the resource investment. In summary, the most effective strategy is to target models in the cost range of 1.25 to 2.00, as this provides the highest performance return per unit of cost.
6. Analysis of Design Trends
The reviewed EEG classification models [11,12,34,38,39,40,41,42,43,44,45,46,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145] vary in Complex., input representation, and learning strategies but share several design patterns within each architectural category. This section analyzes these design trends and methodological practices, highlighting the techniques that are proven effective across tasks.
6.1. CNN Design Patterns
Researchers have explored specialized CNN architectures and FE methods designed for the spatial and temporal structures of brain signals. This has resulted in compact networks with improved performance, efficiency, and generalization. Below is a summary of these trends.
6.1.1. Depthwise Separable Convolutions and Compact CNNs
A dominant trend initiated by EEGNet [12] involves reduced parameters and temporal–spatial decomposition (Figure 12). This design has achieved strong performance across tasks with a small memory footprint [12]. Subsequent lightweight CNN models [76,88,93,94,97,102,106] surpassed heavy models. These compact architectures are easy to train on limited data for real-time BCIs on low-power devices [94,97].
Figure 12.
Convolutions: (a) standard, (b) depthwise, and (c) pointwise [12,20].
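As an illustration of this pattern, the following PyTorch sketch stacks a depthwise temporal convolution and a pointwise (1 × 1) convolution; the channel counts, kernel size, and normalization are assumptions for illustration and not the exact EEGNet [12] configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Temporal depthwise convolution followed by a 1x1 pointwise convolution.

    Illustrative channel counts and kernel size; not the exact EEGNet [12] configuration.
    Input shape: (batch, channels, samples) for a single EEG segment.
    """
    def __init__(self, in_ch=22, mid_ch=44, out_ch=16, kernel=64):
        super().__init__()
        # Depthwise: a small number of temporal filters per EEG channel (groups=in_ch).
        self.depthwise = nn.Conv1d(in_ch, mid_ch, kernel_size=kernel,
                                   groups=in_ch, padding=kernel // 2, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(8, 22, 1000)               # 8 trials, 22 channels, 4 s at 250 Hz
print(DepthwiseSeparableBlock()(x).shape)  # torch.Size([8, 16, 1001])
```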
6.1.2. Outperforming Traditional Methods
A consistent theme is that CNN-based models could surpass traditional FE methods and replace complex feature engineering. ConvNets [11] matched filter-bank CSP’s Acc. with automatic learning, while Compact-EEGNet [76] outperformed CCA for SSVEP. This reduces the pipeline overhead and supports fast one-pass inference, crucial for real-time BCIs.
6.1.3. Multiscale and Multibranch Architectures
Models like those in [86,90] use inception-style modules, while those in [91,97] employ multibranch networks, processing different frequency bands or timescales. These models capture rich EEG dynamics to enhance performance without increasing the model size.
6.1.4. Residual Connections and Feature Fusion
RAMST-CNN [90] uses residual links for deeper training without vanishing gradients. TCNet-Fusion [106] applies the layered concatenation of shallow and deep features, combining their strengths. These strategies show that combining features from different layers improves generalization and Acc. with minimal computational overhead.
6.1.5. Attention Mechanisms
A trend is to use attention blocks to focus on the most relevant channels, frequencies, or time periods, performing internal feature selection without stacking extra layers. MBEEG-SE [97] incorporates SE blocks to adjust the feature maps and improve their quality without adding significant Complex. By dynamically weighting the feature maps, noisy features are filtered, improving the performance with minimal overhead.
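The sketch below shows a generic squeeze-and-excitation block of the kind used in MBEEG-SE [97] to reweight feature maps; the reduction ratio and tensor shapes are illustrative assumptions rather than the published settings.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over feature maps of shape (batch, channels, time).

    Illustrative sketch of the channel-reweighting idea used in MBEEG-SE [97];
    the reduction ratio is an assumption, not the published value.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        squeeze = x.mean(dim=-1)            # global average pool over time: (B, C)
        weights = self.fc(squeeze)          # per-channel importance in [0, 1]
        return x * weights.unsqueeze(-1)    # reweight (excite) the feature maps

feats = torch.randn(8, 16, 1001)
print(SEBlock(16)(feats).shape)             # torch.Size([8, 16, 1001])
```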
6.1.6. Adversarial and Transfer Learning
Özdenizci et al. [105] used adversarial CNNs to learn subject-invariant features, allowing a single model to generalize across users. Similarly, Zhang et al. [74] applied transfer learning to boost cross-task generalization, showing that models retained their Acc. when retrained or fine-tuned on new tasks with minimal data. These strategies do not require large models or additional data and address EEG scarcity by reusing learned representations.
6.1.7. Real-Time, Embedded Focus and Cloud/Edge Deployment
Systems like those in [76,94] focus on real-time, low-power BCI applications for wearable deployment. Techniques like model compression, parameter reduction, and low-bit hardware implementations make these architectures efficient for edge devices. EEG-TCNet [94] achieved over 70% Acc. on multiclass MI with fewer than 5 K parameters. All authors highlight that efficiency is not only theoretical but essential for practical use.
6.1.8. Lightweight 1D/2D/3D CNNs
Refs. [81,94] exploited depth or spatial structures. These models achieved strong performance with fewer layers and parameters.
6.2. Transformer Design Patterns
Across the Transformer-based EEG models, researchers have also implemented a range of architectural and data FE techniques to improve the classification Acc. and computational efficiency. These can be grouped into six strategies.
6.2.1. Raw Signal Use with Minimal Preprocessing
Models such as those in [38,42] process raw or minimally preprocessed EEG using positional encodings. This excludes manual feature engineering and reduces the computational overhead. SA captures temporal dependencies and cross-channel interactions and achieves better performance.
6.2.2. Time–Frequency Transform Integration
ViT-CWT [38] and Ensemble-Transformer [43] models apply transformations like CWT or PSD to generate time–frequency representations before attention-based modeling. These compact inputs reduce the load on deeper layers and help the model to capture short, frequency-specific neural patterns crucial to cognitive and emotional recognition.
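As a simple example of this preprocessing style, the following sketch computes per-channel band-power (PSD) features with Welch's method before any attention-based modeling; the sampling rate and band edges are assumptions, not the settings used in [38,43].

```python
import numpy as np
from scipy.signal import welch

def psd_band_features(eeg, fs=128, bands=((4, 8), (8, 13), (13, 30))):
    """Per-channel band-power features from raw EEG of shape (channels, samples).

    Illustrative preprocessing in the spirit of PSD-based Transformer inputs [43];
    the sampling rate and band edges are assumptions.
    """
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2, axis=-1)   # psd: (channels, freqs)
    feats = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=-1))              # mean power per band
    return np.stack(feats, axis=-1)                           # (channels, n_bands)

eeg = np.random.randn(32, 6 * 128)     # 32 channels, 6 s at 128 Hz
print(psd_band_features(eeg).shape)    # (32, 3)
```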
6.2.3. Modular Attention Architectures (Spatiotemporal Separation)
Architectures like those in [39,42,45] adopt modular designs that separate spatial and temporal attention into dedicated Transformer blocks. This structure reduces the model Complex. and focuses on targeted learning. By preserving electrode relationships and temporal dynamics, these efficient models generalize better across subjects and tasks.
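A minimal sketch of this separation is given below: one attention block operates across electrodes and a second across time steps; the embedding size, head count, and block order are illustrative assumptions rather than any specific published architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Two dedicated attention blocks: one across electrodes, one across time steps.

    Illustrative sketch of the modular spatial/temporal separation used in several
    Transformer EEG models; dimensions and head counts are assumptions.
    """
    def __init__(self, d_model=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, x):                 # x: (batch, channels, time, d_model)
        b, c, t, d = x.shape
        # Spatial attention: electrodes attend to each other at every time step.
        xs = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(b, t, c, d).permute(0, 2, 1, 3)
        # Temporal attention: time steps attend to each other within every electrode.
        xt = x.reshape(b * c, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(b, c, t, d)

tokens = torch.randn(2, 22, 50, 64)              # batch, electrodes, time tokens, embedding
print(SpatioTemporalAttention()(tokens).shape)   # torch.Size([2, 22, 50, 64])
```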
6.2.4. Attention-Enhanced Gating and Graph Mechanisms
Others [40,44] have integrated gating units and graph-based spatial modeling to filter noise and regulate information flows. These mechanisms stabilize attention and improve long-range dependency modeling. As a result, the models are both computationally efficient and more robust to noisy or variable EEG inputs.
6.2.5. Self-Supervised and Generative Pretraining
Designs like those in [44,45,46] use self-supervised and generative pretraining to learn from unlabeled EEG. Techniques like masked prediction or sequence generation enable generalizable feature extraction. This reduces the Train. Cost and improves performance when datasets are limited.
6.2.6. Specialized Tokenization and Representation Learning
Models such as those in [38,45] apply tokenization methods to compress EEG into low-dimensional representations. These compact inputs support efficient attention processing and preserve essential temporal and spectral features, enhancing both speed and Acc.
6.3. CNN–Transformer Architecture Design Patterns
Modern hybrids use rich architectures like CNN–Transformer designs, multibranch networks for parallel FE, temporal convolution modules, and self-supervised pretraining for richer representations. Many models use spatiotemporal attention mechanisms and hierarchical encoding to boost efficiency. Below, we discuss a few models.
6.3.1. Dual-Branch Fusion Strategies
Models like those in [119,121] introduce dual-branch structures where time, frequency, or spatial information is processed separately. The outputs are fused to enrich the features.
6.3.2. CNN Precompression for Efficient Attention
Other studies, like [113,114,125], have embedded CNN filters early to precompress data before passing them to the Transformer. This makes attention computation faster and more effective.
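The following sketch illustrates the idea of precompression with strided convolutions ahead of a small Transformer encoder; all layer sizes and strides are assumptions for illustration and do not reproduce any particular model from [113,114,125].

```python
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    """Strided CNN front-end that shortens the sequence before self-attention.

    Illustrative of the CNN-precompression pattern; channel counts, strides, and
    the encoder configuration are assumptions, not any single published model.
    """
    def __init__(self, eeg_ch=22, d_model=64, n_classes=4):
        super().__init__()
        self.frontend = nn.Sequential(          # (B, eeg_ch, T) -> (B, d_model, ~T/16)
            nn.Conv1d(eeg_ch, d_model, kernel_size=16, stride=4, padding=6),
            nn.ELU(),
            nn.Conv1d(d_model, d_model, kernel_size=8, stride=4, padding=2),
            nn.ELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: (batch, eeg_ch, samples)
        z = self.frontend(x).transpose(1, 2)     # tokens: (batch, seq_len, d_model)
        z = self.encoder(z)
        return self.head(z.mean(dim=1))          # pooled logits: (batch, n_classes)

x = torch.randn(8, 22, 1000)                     # 4 s MI trial at 250 Hz
print(ConvThenTransformer()(x).shape)            # torch.Size([8, 4])
```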
6.3.3. Attention Optimization
Models such as those in [115,120,128] apply channel-wise or time-aware attention optimization. They focus computation on the most informative parts of EEG signals.
6.3.4. Smart Feature Enhancement
Some architectures [132,133] leverage pretraining and self-supervision. This makes FE richer with fewer labeled data.
6.3.5. Interactive Learning Modules
A few designs [123,134] have adopted interactive modules or multitask learning. These approaches encourage feature generalization by reconstructing inputs or exchanging information across CNN and Transformer layers dynamically.
6.4. Recurrent Hybrid Design Patterns
Other papers have explored recurrent hybrids combined with CNNs or Transformers. The best designs efficiently integrate temporal context that CNNs might miss, without excessive Complex. Often, one network serves as a feature extractor for the other.
6.4.1. Frequency-Focused Fusion
Several models [135,137] decompose EEG into sub-bands such as delta and apply attention-based RNN or LSTM layers. This targets discriminative frequency components and improves the robustness with limited training data.
6.4.2. Multimodal Fusion with Recurrent Backbones
Systems like those in [136,144,145] fuse EEG with gait or signatures, using LSTM or BLSTM encoders to model sequential patterns. Fusion occurs at either the representation or decision level and supports multi-biometric authentication.
6.4.3. Two-Stage Pipelines
Models such as those in [138,139,140] combine CNNs for spatial or spectral feature extraction with LSTM for temporal modeling. This pipeline balances Acc. with efficiency and works well for real-time EEG classification.
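A compact sketch of such a two-stage pipeline is shown below, with a CNN front-end feeding an LSTM classifier; the layer sizes are illustrative assumptions, not the configurations used in [138,139,140].

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Stage 1: CNN extracts per-window spatial/spectral features.
    Stage 2: LSTM models their temporal evolution for classification.

    Illustrative two-stage pipeline; layer sizes are assumptions.
    """
    def __init__(self, eeg_ch=4, feat_dim=32, hidden=64, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(eeg_ch, feat_dim, kernel_size=32, stride=8, padding=12),
            nn.BatchNorm1d(feat_dim),
            nn.ELU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                    # x: (batch, eeg_ch, samples)
        f = self.cnn(x).transpose(1, 2)      # (batch, time_steps, feat_dim)
        _, (h, _) = self.lstm(f)             # h: (1, batch, hidden), last hidden state
        return self.head(h[-1])              # (batch, n_classes)

x = torch.randn(16, 4, 512)                  # 16 trials, 4 channels
print(CNNLSTM()(x).shape)                    # torch.Size([16, 2])
```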
6.4.4. Lightweight and Portable Personalization
The authors in [141,142] designed personalized LSTM models that maintain Acc. with few electrodes and small datasets. These methods reduce the setup cost and support portable EEG systems.
6.4.5. Efficiency-Driven Architectures
Jin et al. [143] used tensor-train layers to compress the parameters by up to 800 times with similar Acc. Other works [135,137] have emphasized fast inference and modular designs. These approaches show how recurrent models can be adapted for real-time deployment.
6.4.6. Multitask and Feature Generalization
In [140,145], the authors integrated auxiliary features such as entropy, MFCCs, and angular features or applied multitask setups. These strategies support generalization and allow LSTM hybrids to adapt across contexts and tasks.
A taxonomy of these design trends is presented in Figure 13. The success of recurrent hybrids and emerging CNN–Transformer models suggests that future EEG feature extractors may remain CNN-based at their core, with smart add-ons such as SA to enhance performance on complex tasks while keeping the model size manageable.
Figure 13.
A taxonomy of emerging design trends in DL-EEG classification models.
7. Findings and Discussion
To give a pictorial summary of the research insights from the reviewed papers, we plot an alluvial diagram, as shown in Figure 14. The vertical volume represents the weight of each category in the reviewed studies. The diagram’s sections cover the year of publication, architectural category (CNN, Transformer, CNN–Transformer, recurrent hybrid), domain of application (Bio, MI, EP, ER, GF), performance (very low–very high), and efficiency (very low–very high). The diagram thus depicts the research trends across each section’s variables, making insights and gaps visually clear. In this section, we discuss these insights and future directions.
Figure 14.
A recapitulative alluvial diagram.
7.1. Domains of Application
The table below (Table 16) summarizes the domains and tasks that the reviewed papers focus on.
Table 16.
Domains of application.
7.1.1. Trends over Time—Architectures
CNNs were the dominant architecture, well ahead of recurrent hybrids, in the 2015–2018 and 2019–2020 periods. In 2021–2022, there was a shift toward more diverse and complex architectures (Transformers, CNN–Transformer hybrids), and these architectures have become a major focus in 2023–2025. Recurrent-based models have become less frequent, largely being replaced by newer architectures. Despite the rise of Transformers, CNNs remain relevant and frequently used across all time periods, reconfirming their foundational utility.
7.1.2. Architecture–Domain Mapping
CNNs are versatile and show the broadest application across all domains, including Bio, EP, MI, and GF models. The Bio domain demonstrates the highest number of studies across all architectures and time periods, followed closely by EP and MI. Both ER and GF emerge only in the 2021–2022 and 2023–2025 periods and are the major foci of Transformer and CNN–Transformer models.
7.1.3. Performance–Efficiency Trade-Offs
Most of the research across all years, architectures, and domains focuses on achieving very high or high performance. This indicates that maximizing Acc. is the primary goal. A large portion of very high-performing studies are linked to low or very low efficiency (mainly 2019–2020). This highlights the common trade-off whereby complex and computationally expensive models are needed for top performance. The newer Transformer architectures (particularly in the 2021–2022 period) frequently link very high performance with very low or low efficiency. This implies that early attempts to use them for high performance came at a high Comp. Cost.
7.1.4. Notable Trends in the Latest Period
The 2023–2025 period shows a promising trend of achieving high or very high performance and efficiency with either Transformer or CNN–Transformer models [45,46,119,123]. This reveals that these architectures have matured to be both powerful and efficient. In the same period, CNNs have frequently been associated with very low efficiency despite their very high performance [93,100,103]. This indicates that even established architectures are being pushed to their limits in terms of performance, sometimes sacrificing efficiency. Moreover, in the 2023–2025 period, Transformer and CNN–Transformer models [111,119,123] have successfully balanced very high performance with high efficiency within the ER domain.
7.2. Future Directions
Based on the alluvial diagram, below are our recommendations for future research and underexplored development directions in the EEG field.
7.2.1. Bridging the Performance–Efficiency Gap
The data consistently show a trade-off where the highest-performing models often have low efficiency. Future work should focus on the following:
- Developing novel Transformer and CNN–Transformer architectures that maintain very high performance while improving efficiency beyond the current median. This could be achieved by (1) exploring knowledge distillation from complex and less efficient models to smaller and faster ones; (2) implementing sparsity techniques in Transformer layers; and (3) researching hardware-aware network designs specific to BCI/EEG applications.
- Given the stability of foundational CNNs, revisiting and optimizing lightweight, high-performing CNN variants that are deployable on low-power devices.
7.2.2. Expanding Domain Specialization and Generalization
A strong focus remains on Bio, EP, and MI. Future directions are as follows:
- Increase research attention to the ER domain, which has emerged in the most recent period. The latest models show promising high performance/high efficiency in this area, revealing that it is an impactful research domain.
- Invest in the GF domain, using Transformer and CNN–Transformer architectures to reduce the need for domain-specific models. The goal should be to build models that can achieve high performance across multiple domains (Bio, MI, EP) without extensive retraining.
- Explore novel and niche BCI domains outside personalized medicine with newer architectures to see if the performance gains translate to these areas.
7.2.3. Deeper Analysis of Architecture Components
The emergence of hybrid models signifies that the combination of components leads to efficient and well-performing models. Future directions include the following:
- Perform ablation studies on hybrid CNN–Transformers by isolating the contributions of the CNN part vs. the Transformer part across different domains. This will determine the optimal split to maximize performance and efficiency.
- Standardize performance and efficiency metrics, since qualitative low-to-high scales are relative. Future research should adopt standardized quantitative metrics for reporting to allow the rigorous and fair comparison of architectures.
7.2.4. Longitudinal Studies and Reproducibility
- It is necessary to conduct studies to track architectural lifecycles (how long a design remains relevant)—for instance, investigate whether the efficiency gains seen in early CNNs can be replicated with modern training techniques on new architectures.
- Since the initial focus has been on peak performance, future work should prioritize measuring robustness and generalization. Very high performance is less valuable if it is not reproducible by other researchers.
- It is important to develop foundational models that are pretrained on low-density EEG to support robust edge deployment.
7.2.5. Generative Models as the Next Frontier
The new generative models are primarily used for data augmentation in EEG [46]. However, their encoder components may yield compact latent-space embeddings for more efficient FE and classification [45]. These models (Section 4.2.2 and Section 4.3.5) address the data scarcity and subject-specific calibration challenges for better efficiency. Pretraining models, as in [131,133], for long periods using multi-dataset EEG will create transferable representations. This shifts the computational challenges from the end-user to the large-scale model developer. This practice is still not well established.
7.3. Limitations
Our review of 114 DL EEG classification studies shows consistent methodological gaps in reporting (see the blue-shaded areas in Table 6, Table 7, Table 8, and Table 9). Most papers focus on Acc. and neglect key details like computational complexity, memory use, and latency. Without this information, it is challenging to compare the performance and efficiency of different models or to assess their practicality on wearables or embedded systems. Only a few studies, such as [46,94,106,132], have reported these metrics. To ensure a unified comparison despite these reporting gaps, we implemented specific data extraction decisions, such as choosing the highest reported accuracy and the lowest channel count. Our proxy metrics (Complex., Comp. Cost, Oper. Cost, and Train. Cost) represent normalized approximations based on a mixed evaluation of quantitative data and qualitative author claims regarding model efficiency and architectural design. These scores should be viewed as expert-estimated cost dimensions rather than absolute hardware benchmarks, reflecting the constraints of non-standardized reporting. To address these gaps and support meaningful reproducibility and deployable EEG systems, we recommend that future research report the following parameters (a minimal measurement sketch is given after the list):
- Performance metrics—Acc., EER, AUC, CRR, etc.;
- Model size—parameters in K or M;
- Computational complexity—MACs or FLOPs per inference window;
- Memory footprint—memory usage at inference, including weights and activations;
- Inference latency—measured on embedded, mobile, desktop, or server systems;
- Training details—training time, number of epochs, batch size, and hardware used;
- Validation protocol—within/cross-subject or cross-session and number of subjects;
- Operational setup—number of EEG channels, epoch duration, and calibration/enrollment time per subject/session.
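As a minimal illustration of this reporting practice, the sketch below counts parameters and measures CPU inference latency for a placeholder network; the network, input shape, and run count are assumptions, and MACs/FLOPs or memory profiling would additionally require a profiler such as fvcore or thop.

```python
import time
import torch

def report_model_card(model, input_shape=(1, 22, 1000), n_runs=50):
    """Report model size and CPU inference latency for a dummy EEG window.

    Minimal sketch of the reporting practice recommended above; MACs/FLOPs and
    memory profiling would require an additional profiler (e.g., fvcore or thop).
    """
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        model(x)                                  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        latency_ms = (time.perf_counter() - start) / n_runs * 1e3
    print(f"Parameters: {n_params / 1e3:.2f} K")
    print(f"Mean inference latency: {latency_ms:.2f} ms per {input_shape[-1]}-sample window")

# Example with a tiny placeholder network (shapes are illustrative):
report_model_card(torch.nn.Sequential(
    torch.nn.Conv1d(22, 8, kernel_size=64),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 4),
))
```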
8. Conclusions
Recent research shows that careful architectural design may improve FE, advancing EEG-based classification. Across CNNs, RNNs, Transformers, and hybrid designs, the surveyed studies reveal progress toward models that are both accurate and computationally practical. CNNs are strong at extracting spatiotemporal features, but they struggle with long-term temporal dependencies. Transformers are better at capturing these dependencies but require large datasets at a higher Comp. Cost. Hybrid networks are more promising since they attempt to capture the best of both, using CNNs for efficient local FE and Transformers for powerful global context modeling. A recurring theme is that design choices—namely temporal–spatial fusion, attention mechanisms, and lightweight architectures—can influence both efficiency and performance. Despite these gains, challenges persist in reproducibility, cross-subject generalization, and adapting models to noisy or resource-limited environments. Addressing these issues will require larger and diverse datasets, standardized evaluation protocols, and alignment between model design and practical deployment. Integrating approaches such as self-supervised learning, multimodal fusion, adaptive architectures, or generative models may help in creating EEG systems that are accurate, scalable, interpretable, and practical for real-world applications.
The objective of this paper was to contribute to research on EEG-based classification by comparing DL models and their FE strategies. We reviewed 114 papers overall and synthesized 88 studies from the past decade, focusing on trade-offs between Acc. and computational efficiency. The distinctive contribution of this review is its efficiency-oriented, scenario-based comparative perspective. Unlike prior surveys focusing mainly on classification accuracy for system evaluation, we applied WSA to assess models under different computational and deployment constraints. This approach identified the optimal architectures for real-time interaction vs. edge deployment environments. In doing so, we demonstrate how design choices such as input representation, model architectures, fusion techniques, and training strategies influence performance. Our aim is to provide researchers with insights to develop lightweight but reliable EEG pipelines. We also hope that this work encourages future efforts toward creating standardized, explainable, and robust approaches that can help to move EEG-based systems from the lab to practical BCIs and neurotechnology.
Author Contributions
Conceptualization, L.H.; methodology, L.H.; software, L.H.; validation, L.H. and J.R.; formal analysis, L.H.; investigation, L.H.; resources, L.H.; data curation, L.H.; writing—original draft preparation, L.H.; writing—review and editing, L.H., J.R. and A.N.; visualization, L.H.; supervision, J.R.; project administration, J.R. and R.V.; funding acquisition, R.V. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article and its tables. Further inquiries can be directed to the corresponding author(s).
Acknowledgments
This research was undertaken, in part, thanks to funding from the David Sobey Retailing Centre, Sobey School of Business. We acknowledge the support of the NeuroCognitive Imaging Lab (NCIL) at Dalhousie University. This research was conducted as part of Saint Mary’s University Computer Engineering Research Lab (CERL).
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
| Acc. | Accuracy |
| ANN | Artificial Neural Network |
| AR | Autoregressive |
| AUC | Area Under the Curve |
| BCI | Brain–Computer Interface |
| BLSTM-NN | Bidirectional Long Short-Term Memory Neural Network |
| Borda Count | Rank-Based Aggregation Method |
| CAR | Common Average Referencing |
| CCA | Canonical Correlation Analysis |
| CNN | Convolutional Neural Network |
| CRR | Correct Recognition Rate |
| CS | Cosine Similarity |
| CSP | Common Spatial Patterns |
| CWT | Continuous Wavelet Transform |
| DBN | Deep Belief Network |
| DE | Differential Entropy |
| DL | Deep Learning |
| DML | Deep Metric Learning |
| ECG | Electrocardiography |
| EEG | Electroencephalography |
| EER | Equal Error Rate |
| EOG | Electrooculography |
| FAR | False Acceptance Rate |
| FBCSP | Filter Bank Common Spatial Patterns |
| FC | Fully Connected layer |
| FE | Feature Extraction |
| FLOPs | Floating-Point Operations |
| FRR | False Rejection Rate |
| FuzzyEn | Fuzzy Entropy |
| GAN | Generative Adversarial Network |
| GAT | Graph Attention Network |
| GET | Generative EEG Transformer |
| GFCC | Gammatone Frequency Cepstral Coefficient |
| GNN | Graph Neural Network |
| GRU | Gated Recurrent Unit |
| GSO | Gram–Schmidt Orthogonalization |
| GSR | Galvanic Skin Response |
| GT | Gated Transformer |
| HTER | Half Total Error Rate |
| ITR | Information Transfer Rate |
| LSTM | Long Short-Term Memory |
| MACs | Multiply–Accumulate Operations |
| Max Rule | A Decision Fusion Strategy Selecting the Maximum Score |
| MFCC | Mel-Frequency Cepstral Coefficient |
| MI | Motor Imagery |
| ML | Machine Learning |
| MLP | Multilayer Perceptron |
| MSE | Mean Squared Error |
| NN | Nearest Neighbor |
| PLV | Phase Locking Value |
| PSD | Power Spectral Density |
| RF | Random Forest |
| RMSprop | Root Mean Square Propagation |
| RNN | Recurrent Neural Network |
| R/S Analysis | Rescaled Range Analysis |
| SA | Self-Attention |
| SAFE | Spatial Attention Feature Extractor |
| SE | Squeeze-and-Excitation |
| SMOTE | Synthetic Minority Oversampling Technique |
| SNN | Siamese Neural Network |
| SPA | Spectral Power Analysis |
| STE | Spatial Transformer Encoder |
| STFT | Short-Time Fourier Transform |
| TTNN | Tensor-Train Neural Network |
| TAFE | Temporal Attention Feature Extractor |
| TTE | Temporal Transformer Encoder |
| TCN | Temporal Convolutional Network |
| ViT | Vision Transformer |
| WT | Wavelet Transform |
| XGB | XGBoost |
The following dataset abbreviations are used in this manuscript:
| BCI Competition IV-2a | Motor Imagery EEG Benchmark Dataset (22 channels, 9 subjects). |
| CD FTA | Cross-Dataset Fine-Tuning Adaptation. Benchmarking for adapting EEG models across datasets. |
| DEAP | Database for Emotion Analysis using Physiological Signals. EEG and peripheral physiological signals recorded while subjects watched music videos. |
| DREAMER | Dataset for emotion analysis using EEG and ECG while subjects watched affective videos. |
| EEGMMIDB | EEG Motor Movement/Imagery Database. Large-scale PhysioNet dataset (109 subjects, 64 channels). |
| HGD | High Gamma Dataset. High-density EEG dataset for motor decoding. |
| ImageNet | Large-scale visual dataset (1.2 M images, 1000 categories), widely used for pretraining. |
| ImageNet 21k | Extended ImageNet with 21,000 categories for large-scale vision pretraining. |
| JFT 300 M | Google’s large-scale proprietary dataset of 300 M images, used for pretraining. |
| MOABB | Mother of All BCI Benchmarks. A standardized benchmarking framework for EEG datasets. |
| MPED | Multimodal Physiological Emotion Database. EEG and physiological modalities for emotion recognition. |
| PhysioNet EEG | Collection of EEG datasets on PhysioNet for various clinical and cognitive studies. |
| SEED-IV | SJTU Emotion EEG Dataset (four-class emotion recognition for 15 participants). |
| SEED-V | SJTU Emotion EEG Dataset (five-class emotion recognition for 16 participants). |
| SMR BCI | Sensorimotor Rhythm Brain–Computer Interface Dataset. Longitudinal MI dataset (600 h, 62 participants). |
References
- Teplan, M. Fundamentals of EEG Measurement. Meas. Sci. Rev. 2002, 2, 1–11. Available online: https://www.measurement.sk/2002/S2/Teplan.pdf (accessed on 15 September 2025).
- Craik, A.; He, Y.; Contreras-Vidal, J.L. Deep Learning for Electroencephalogram (EEG) Classification Tasks: A Review. J. Neural Eng. 2019, 16, 031001. [Google Scholar] [CrossRef] [PubMed]
- Lin, F.; Cho, K.W.; Song, C.; Xu, W.; Jin, Z. Brain Password: A Secure and Truly Cancelable Brain Biometrics for Smart Headwear. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, Munich, Germany, 10–15 June 2018; ACM: New York, NY, USA, 2018; pp. 296–309. [Google Scholar] [CrossRef]
- Yu, Y.-C.; Wang, S.; Gabel, L.A. A Feasibility Study of Using Event-Related Potential as a Biometrics. In Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 16–20 August 2016; IEEE: New York, NY, USA, 2016; pp. 4547–4550. [Google Scholar] [CrossRef]
- Singh, A.K.; Krishnan, S. Trends in EEG Signal Feature Extraction Applications. Front. Artif. Intell. 2023, 5, 1072801. [Google Scholar] [CrossRef] [PubMed]
- Ramoser, H.; Muller-Gerking, J.; Pfurtscheller, G. Optimal Spatial Filtering of Single Trial EEG during Imagined Hand Movement. IEEE Trans. Rehabil. Eng. 2000, 8, 441–446. [Google Scholar] [CrossRef]
- Subasi, A. EEG Signal Classification Using Wavelet Feature Extraction and a Mixture of Expert Model. Expert Syst. Appl. 2007, 32, 1084–1093. [Google Scholar] [CrossRef]
- Roy, Y.; Banville, H.; Albuquerque, I.; Gramfort, A.; Falk, T.H.; Faubert, J. Deep Learning-Based Electroencephalography Analysis: A Systematic Review. J. Neural Eng. 2019, 16, 051001. [Google Scholar] [CrossRef] [PubMed]
- Takahashi, S.; Sakaguchi, Y.; Kouno, N.; Takasawa, K.; Ishizu, K.; Akagi, Y.; Aoyama, R.; Teraya, N.; Bolatkan, A.; Shinkai, N.; et al. Comparison of Vision Transformers and Convolutional Neural Networks in Medical Image Analysis: A Systematic Review. J. Med. Syst. 2024, 48, 84. [Google Scholar] [CrossRef]
- Sun, C.; Mou, C. Survey on the Research Direction of EEG-Based Signal Processing. Front. Neurosci. 2023, 17, 1203059. [Google Scholar] [CrossRef]
- Schirrmeister, R.T.; Springenberg, J.T.; Fiederer, L.D.J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard, W.; Ball, T. Deep Learning with Convolutional Neural Networks for EEG Decoding and Visualization. Hum. Brain Mapp. 2017, 38, 5391–5420. [Google Scholar] [CrossRef]
- Lawhern, V.J.; Solon, A.J.; Waytowich, N.R.; Gordon, S.M.; Hung, C.P.; Lance, B.J. EEGNet: A Compact Convolutional Neural Network for EEG-Based Brain–Computer Interfaces. J. Neural Eng. 2018, 15, 056013. [Google Scholar] [CrossRef]
- Vafaei, E.; Hosseini, M. Transformers in EEG Analysis: A Review of Architectures and Applications in Motor Imagery, Seizure, and Emotion Classification. Sensors 2025, 25, 1293. [Google Scholar] [CrossRef]
- Zhang, X.; Yao, L.; Wang, X.; Monaghan, J.; McAlpine, D.; Zhang, Y. A Survey on Deep Learning-Based Non-Invasive Brain Signals: Recent Advances and New Frontiers. J. Neural Eng. 2021, 18, 031002. [Google Scholar] [CrossRef]
- Fu, J. A Comparison of CNN and Transformer in Continual Learning. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2023. Available online: https://kth.diva-portal.org/smash/get/diva2:1820229/FULLTEXT01.pdf (accessed on 16 September 2025).
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Available online: https://ora.ox.ac.uk/objects/uuid:60713f18-a6d1-4d97-8f45-b60ad8aebbce (accessed on 27 September 2025).
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 5987–5995. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. Available online: http://proceedings.mlr.press/v97/tan19a.html (accessed on 16 September 2025).
- Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollar, P. Designing Network Design Spaces. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 10425–10433. [Google Scholar] [CrossRef]
- Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 11966–11976. [Google Scholar] [CrossRef]
- Chen, G.; Zhang, X.; Zhang, J.; Li, F.; Duan, S. A Novel Brain-Computer Interface Based on Audio-Assisted Visual Evoked EEG and Spatial-Temporal Attention CNN. Front. Neurorobot. 2022, 16, 995552. [Google Scholar] [CrossRef] [PubMed]
- Mai, N.-D.; Hoang Long, N.M.; Chung, W.-Y. 1D-CNN-Based BCI System for Detecting Emotional States Using a Wireless and Wearable 8-Channel Custom-Designed EEG Headset. In Proceedings of the 2021 IEEE International Conference on Flexible and Printable Sensors and Systems (FLEPS), Manchester, UK, 20–23 June 2021; IEEE: New York, NY, USA, 2021; pp. 1–4. [Google Scholar] [CrossRef]
- Huang, J.; Wang, C.; Zhao, W.; Grau, A.; Xue, X.; Zhang, F. LTDNet-EEG: A Lightweight Network of Portable/Wearable Devices for Real-Time EEG Signal Denoising. IEEE Trans. Consum. Electron. 2024, 70, 5561–5575. [Google Scholar] [CrossRef]
- Borra, D.; Magosso, E. Deep Learning-Based EEG Analysis: Investigating P3 ERP Components. J. Integr. Neurosci. 2021, 20, 791–811. [Google Scholar] [CrossRef]
- Ramakrishnan, K.; Groen, I.I.A.; Smeulders, A.W.M.; Scholte, H.S.; Ghebreab, S. Deep Learning for EEG-Based Brain Mapping. bioRxiv 2017, 178541. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Available online: https://papers.nips.cc/paper/7181-attention-is-all-you-need (accessed on 23 September 2025).
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar] [CrossRef]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 17 September 2025).
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. Available online: https://jmlr.org/papers/v21/20-074.html (accessed on 22 September 2025).
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021; Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 19 September 2025).
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113. Available online: https://jmlr.org/papers/volume24/22-1144/22-1144.pdf (accessed on 21 September 2025).
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. Meta AI. 2023. Available online: https://ai.meta.com/research/publications/llama-open-and-efficient-foundation-language-models (accessed on 25 September 2025).
- Google Developers/Google DeepMind. Introducing Gemini: Google’s Most Capable AI Model Yet. Google Blog, 6 December 2023. Available online: https://blog.google/technology/ai/google-gemini-ai/ (accessed on 20 September 2025).
- Arjun, A.; Rajpoot, A.S.; Raveendranatha, P.M. Introducing Attention Mechanism for EEG Signals: Emotion Recognition with Vision Transformers. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Guadalajara, Mexico, 1–5 November 2021; IEEE: New York, NY, USA, 2021; pp. 5723–5726. [Google Scholar] [CrossRef]
- Song, Y.; Jia, X.; Yang, L.; Xie, L. Transformer-Based Spatial-Temporal Feature Learning for EEG Decoding. arXiv 2021, arXiv:2106.11170. [Google Scholar] [CrossRef]
- Tao, Y.; Sun, T.; Muhamed, A.; Genc, S.; Jackson, D.; Arsanjani, A.; Yaddanapudi, S.; Li, L.; Kumar, P. Gated Transformer for Decoding Human Brain EEG Signals. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Guadalajara, Mexico, 1–5 November 2021; IEEE: New York, NY, USA, 2021; pp. 125–130. [Google Scholar] [CrossRef]
- Siddhad, G.; Gupta, A.; Dogra, D.P.; Roy, P.P. Efficacy of Transformer Networks for Classification of EEG Data. Biomed. Signal Process. Control 2024, 87, 105488. [Google Scholar] [CrossRef]
- Du, Y.; Xu, Y.; Wang, X.; Liu, L.; Ma, P. EEG Temporal–Spatial Transformer for Person Identification. Sci. Rep. 2022, 12, 14378. [Google Scholar] [CrossRef]
- Zeynali, M.; Seyedarabi, H.; Afrouzian, R. Classification of EEG Signals Using Transformer Based Deep Learning and Ensemble Models. Biomed. Signal Process. Control 2023, 86, 105130. [Google Scholar] [CrossRef]
- Omair, A.; Saif-ur-Rehman, M.; Metzler, M.; Glasmachers, T.; Iossifidis, I.; Klaes, C. GET: A Generative EEG Transformer for Continuous Context-Based Neural Signals. arXiv 2024, arXiv:2406.03115. [Google Scholar] [CrossRef]
- Lim, J.-H.; Kuo, P.-C. EEGTrans: Transformer-Driven Generative Models for EEG Synthesis. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025; Available online: https://openreview.net/forum?id=ydw2l8zgUB (accessed on 18 September 2025).
- Wang, G.; Liu, W.; He, Y.; Xu, C.; Ma, L.; Li, H. EEGPT: Pretrained Transformer for Universal and Reliable EEG Representation Learning. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; pp. 1–12. [Google Scholar]
- Saeidi, M.; Karwowski, W.; Farahani, F.V.; Fiok, K.; Taiar, R.; Hancock, P.A.; Al-Juaid, A. Neural Decoding of EEG Signals with Machine Learning: A Systematic Review. Brain Sci. 2021, 11, 1525. [Google Scholar] [CrossRef] [PubMed]
- Li, X.; Zhang, Y.; Tiwari, P.; Song, D.; Hu, B.; Yang, M.; Zhao, Z.; Kumar, N.; Marttinen, P. EEG Based Emotion Recognition: A Tutorial and Review. ACM Comput. Surv. 2023, 55, 1–57. [Google Scholar] [CrossRef]
- Prabowo, D.W.; Nugroho, H.A.; Setiawan, N.A.; Debayle, J. A Systematic Literature Review of Emotion Recognition Using EEG Signals. Cogn. Syst. Res. 2023, 82, 101152. [Google Scholar] [CrossRef]
- Vempati, R.; Sharma, L.D. A Systematic Review on Automated Human Emotion Recognition Using Electroencephalogram Signals and Artificial Intelligence. Results Eng. 2023, 18, 101027. [Google Scholar] [CrossRef]
- Mohammed, S.A.; Jasim, S.S.; Thamir, B.A.; Alabdel Abass, A. A Survey on EEG Signal Analysis Using Machine Learning. UTJES 2025, 15, 89–106. [Google Scholar] [CrossRef]
- Gatfan, K.S. A Review on Deep Learning for Electroencephalogram Signal Classification. J. Al-Qadisiyah Comput. Sci. Math. 2024, 16, 137–151. [Google Scholar] [CrossRef]
- Suhaimi, N.S.; Mountstephens, J.; Teo, J. EEG-Based Emotion Recognition: A State-of-the-Art Review of Current Trends and Opportunities. Comput. Intell. Neurosci. 2020, 2020, 1–19. [Google Scholar] [CrossRef] [PubMed]
- Rahman, M.M.; Sarkar, A.K.; Hossain, M.A.; Hossain, M.S.; Islam, M.R.; Hossain, M.B.; Quinn, J.M.W.; Moni, M.A. Recognition of Human Emotions Using EEG Signals: A Review. Comput. Biol. Med. 2021, 136, 104696. [Google Scholar] [CrossRef] [PubMed]
- Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion Recognition and Artificial Intelligence: A Systematic Review (2014–2023) and Research Recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
- Jafari, M.; Shoeibi, A.; Khodatars, M.; Bagherzadeh, S.; Shalbaf, A.; García, D.L.; Gorriz, J.M.; Acharya, U.R. Emotion Recognition in EEG Signals Using Deep Learning Methods: A Review. Comput. Biol. Med. 2023, 165, 107450. [Google Scholar] [CrossRef] [PubMed]
- Ma, W.; Zheng, Y.; Li, T.; Li, Z.; Li, Y.; Wang, L. A Comprehensive Review of Deep Learning in EEG-Based Emotion Recognition: Classifications, Trends, and Practical Implications. PeerJ Comput. Sci. 2024, 10, e2065. [Google Scholar] [CrossRef]
- Gkintoni, E.; Aroutzidis, A.; Antonopoulou, H.; Halkiopoulos, C. From Neural Networks to Emotional Networks: A Systematic Review of EEG-Based Emotion Recognition in Cognitive Neuroscience and Real-World Applications. Brain Sci. 2025, 15, 220. [Google Scholar] [CrossRef] [PubMed]
- Al-Saegh, A.; Dawwd, S.A.; Abdul-Jabbar, J.M. Deep Learning for Motor Imagery EEG-Based Classification: A Review. Biomed. Signal Process. Control 2021, 63, 102172. [Google Scholar] [CrossRef]
- Ko, W.; Jeon, E.; Jeong, S.; Phyo, J.; Suk, H.-I. A Survey on Deep Learning-Based Short/Zero-Calibration Approaches for EEG-Based Brain–Computer Interfaces. Front. Hum. Neurosci. 2021, 15, 643386. [Google Scholar] [CrossRef]
- Pawan; Dhiman, R. Machine Learning Techniques for Electroencephalogram Based Brain-Computer Interface: A Systematic Literature Review. Meas. Sens. 2023, 28, 100823. [Google Scholar] [CrossRef]
- Saibene, A.; Ghaemi, H.; Dagdevir, E. Deep Learning in Motor Imagery EEG Signal Decoding: A Systematic Review. Neurocomputing 2024, 610, 128577. [Google Scholar] [CrossRef]
- Moreno-Castelblanco, S.R.; Vélez-Guerrero, M.A.; Callejas-Cuervo, M. Artificial Intelligence Approaches for EEG Signal Acquisition and Processing in Lower-Limb Motor Imagery: A Systematic Review. Sensors 2025, 25, 5030. [Google Scholar] [CrossRef]
- Wang, X.; Liesaputra, V.; Liu, Z.; Wang, Y.; Huang, Z. An In-Depth Survey on Deep Learning-Based Motor Imagery Electroencephalogram (EEG) Classification. Artif. Intell. Med. 2024, 147, 102738. [Google Scholar] [CrossRef]
- Hassan, J.; Reza, S.; Ahmed, S.U.; Anik, N.H.; Khan, M.O. EEG Workload Estimation and Classification: A Systematic Review. J. Neural Eng. 2025, 22, 051003. [Google Scholar] [CrossRef] [PubMed]
- de Bardeci, M.; Ip, C.T.; Olbrich, S. Deep Learning Applied to Electroencephalogram Data in Mental Disorders: A Systematic Review. Biol. Psychol. 2021, 162, 108117. [Google Scholar] [CrossRef]
- Nwagu, C.; AlSlaity, A.; Orji, R. EEG-Based Brain-Computer Interactions in Immersive Virtual and Augmented Reality: A Systematic Review. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–33. [Google Scholar] [CrossRef]
- Dadebayev, D.; Goh, W.W.; Tan, E.X. EEG-Based Emotion Recognition: Review of Commercial EEG Devices and Machine Learning Techniques. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 4385–4401. [Google Scholar] [CrossRef]
- Klepl, D.; Wu, M.; He, F. Graph Neural Network-Based EEG Classification: A Survey. arXiv 2023, arXiv:2310.02152. [Google Scholar] [CrossRef]
- Ma, L.; Minett, J.W.; Blu, T.; Wang, W.S.-Y. Resting State EEG-Based Biometrics for Individual Identification Using Convolutional Neural Networks. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; IEEE: New York, NY, USA, 2015; pp. 2848–2851. [Google Scholar] [CrossRef]
- Mao, Z.; Yao, W.X.; Huang, Y. EEG-Based Biometric Identification with Deep Learning. In Proceedings of the 2017 8th International IEEE/EMBS Conference on Neural Engineering (NER), Shanghai, China, 25–28 May 2017; IEEE: New York, NY, USA, 2017; pp. 609–612. [Google Scholar] [CrossRef]
- Schons, T.; Moreira, G.J.P.; Silva, P.H.L.; Coelho, V.N.; Luz, E.J.S. Convolutional Network for EEG-Based Biometric. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications; Mendoza, M., Velastín, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 10657, pp. 601–608. [Google Scholar] [CrossRef]
- Di, Y.; An, X.; Liu, S.; He, F.; Ming, D. Using Convolutional Neural Networks for Identification Based on EEG Signals. In Proceedings of the 2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 25–26 August 2018; IEEE: New York, NY, USA, 2018; pp. 119–122. [Google Scholar] [CrossRef]
- Zhang, F.Q.; Mao, Z.J.; Huang, Y.F.; Xu, L.; Ding, G.Y. Deep Learning Models for EEG-Based Rapid Serial Visual Presentation Event Classification. J. Inf. Hiding Multimed. Signal Process. 2018, 9, 177–187. [Google Scholar]
- Gonzalez, P.A.; Katsigiannis, S.; Ramzan, N.; Tolson, D.; Arevalillo-Herráez, M. ES1D: A Deep Network for EEG-Based Subject Identification. In Proceedings of the 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), Washington, DC, USA, 23–25 October 2017; IEEE: New York, NY, USA, 2017; pp. 81–85. [Google Scholar] [CrossRef]
- Waytowich, N.; Lawhern, V.J.; Garcia, O.; Cummings, J.; Faller, J.; Sajda, P.; Vettel, J.M. Compact Convolutional Neural Networks for Classification of Asynchronous Steady-State Visual Evoked Potentials. J. Neural Eng. 2018, 15, 066031. [Google Scholar] [CrossRef]
- Yu, T.; Wei, C.-S.; Chiang, K.-J.; Nakanishi, M.; Jung, T.-P. EEG-Based User Authentication Using a Convolutional Neural Network. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; IEEE: New York, NY, USA, 2019; pp. 1011–1014. [Google Scholar] [CrossRef]
- Lai, C.Q.; Ibrahim, H.; Abdullah, M.Z.; Abdullah, J.M.; Suandi, S.A.; Azman, A. Arrangements of Resting State Electroencephalography as the Input to Convolutional Neural Network for Biometric Identification. Comput. Intell. Neurosci. 2019, 2019, 1–10. [Google Scholar] [CrossRef] [PubMed]
- Wang, M.; El-Fiqi, H.; Hu, J.; Abbass, H.A. Convolutional Neural Networks Using Dynamic Functional Connectivity for EEG-Based Person Identification in Diverse Human States. IEEE Trans. Inform. Forensic Secur. 2019, 14, 3259–3272. [Google Scholar] [CrossRef]
- Wang, M.; Hu, J.; Abbass, H. Stable EEG Biometrics Using Convolutional Neural Networks and Functional Connectivity. Aust. J. Intell. Inf. Process. Syst. 2019, 15, 19–26. [Google Scholar]
- Zhang, R.; Zeng, Y.; Tong, L.; Shu, J.; Lu, R.; Li, Z.; Yang, K.; Yan, B. EEG Identity Authentication in Multi-Domain Features: A Multi-Scale 3D-CNN Approach. Front. Neurorobot. 2022, 16, 901765. [Google Scholar] [CrossRef]
- Das, R.; Maiorana, E.; Campisi, P. Visually Evoked Potential for EEG Biometrics Using Convolutional Neural Network. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; IEEE: New York, NY, USA, 2017; pp. 951–955. [Google Scholar] [CrossRef]
- Cecotti, H. Convolutional Neural Networks for Event-Related Potential Detection: Impact of the Architecture. In Proceedings of the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Seogwipo, Republic of Korea, 11–15 July 2017; IEEE: New York, NY, USA, 2017; pp. 2031–2034. [Google Scholar] [CrossRef]
- Cecotti, H.; Jha, G. 3D Convolutional Neural Networks for Event-Related Potential Detection. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; IEEE: New York, NY, USA, 2019; pp. 4160–4163. [Google Scholar] [CrossRef]
- Chen, J.X.; Mao, Z.J.; Yao, W.X.; Huang, Y.F. EEG-Based Biometric Identification with Convolutional Neural Network. Multimed. Tools Appl. 2020, 79, 10655–10675. [Google Scholar] [CrossRef]
- Salami, A.; Andreu-Perez, J.; Gillmeister, H. EEG-ITNet: An Explainable Inception Temporal Convolutional Network for Motor Imagery Classification. IEEE Access 2022, 10, 36672–36685. [Google Scholar] [CrossRef]
- Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
- Riyad, M.; Khalil, M.; Adib, A. Incep-EEGNet: A ConvNet for Motor Imagery Decoding. In Image and Signal Processing; El Moataz, A., Mammass, D., Mansouri, A., Nouboud, F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12119, pp. 103–111. [Google Scholar] [CrossRef]
- Liu, X.; Shen, Y.; Liu, J.; Yang, J.; Xiong, P.; Lin, F. Parallel Spatial–Temporal Self-Attention CNN-Based Motor Imagery Classification for BCI. Front. Neurosci. 2020, 14, 587520. [Google Scholar] [CrossRef]
- Zhu, Y.; Peng, Y.; Song, Y.; Ozawa, K.; Kong, W. RAMST-CNN: A Residual and Multiscale Spatio-Temporal Convolution Neural Network for Personal Identification with EEG. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2021, E104.A, 563–571. [Google Scholar] [CrossRef]
- Lakhan, P.; Banluesombatkul, N.; Sricom, N.; Sawangjai, P.; Sangnark, S.; Yagi, T.; Wilaiprasitporn, T.; Saengmolee, W.; Limpiti, T. EEG-BBNet: A Hybrid Framework for Brain Biometric Using Graph Connectivity. IEEE Sens. Lett. 2025, 9, 1–4. [Google Scholar] [CrossRef]
- Ding, Y.; Robinson, N.; Zhang, S.; Zeng, Q.; Guan, C. TSception: Capturing Temporal Dynamics and Spatial Asymmetry from EEG for Emotion Recognition. IEEE Trans. Affect. Comput. 2023, 14, 2238–2250. [Google Scholar] [CrossRef]
- Salimi, N.; Barlow, M.; Lakshika, E. Towards Potential of N-back Task as Protocol and EEGNet for the EEG-based Biometric. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, ACT, Australia, 1–4 December 2020; pp. 1718–1724. [Google Scholar] [CrossRef]
- Ingolfsson, T.M.; Hersche, M.; Wang, X.; Kobayashi, N.; Cavigelli, L.; Benini, L. EEG-TCNet: An Accurate Temporal Convolutional Network for Embedded Motor-Imagery Brain–Machine Interfaces. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; IEEE: New York, NY, USA, 2020; pp. 2958–2965. [Google Scholar] [CrossRef]
- Kasim, Ö.; Tosun, M. Biometric Authentication from Photic Stimulated EEG Records. Appl. Artif. Intell. 2021, 35, 1407–1419. [Google Scholar] [CrossRef]
- Wu, B.; Meng, W.; Chiu, W.-Y. Towards Enhanced EEG-Based Authentication with Motor Imagery Brain-Computer Interface. In Proceedings of the 38th Annual Computer Security Applications Conference, Austin, TX, USA, 5–9 December 2022; ACM: New York, NY, USA, 2022; pp. 799–812. [Google Scholar] [CrossRef]
- Altuwaijri, G.A.; Muhammad, G.; Altaheri, H.; Alsulaiman, M. A Multi-Branch Convolutional Neural Network with Squeeze-and-Excitation Attention Blocks for EEG-Based Motor Imagery Signals Classification. Diagnostics 2022, 12, 995. [Google Scholar] [CrossRef]
- Autthasan, P.; Chaisaen, R.; Sudhawiyangkul, T.; Rangpong, P.; Kiatthaveephong, S.; Dilokthanakul, N.; Bhakdisongkhram, G.; Phan, H.; Guan, C.; Wilaiprasitporn, T. MIN2Net: End-to-End Multi-Task Learning for Subject-Independent Motor Imagery EEG Classification. IEEE Trans. Biomed. Eng. 2022, 69, 2105–2118. [Google Scholar] [CrossRef] [PubMed]
- Bidgoly, A.J.; Bidgoly, H.J.; Arezoumand, Z. Towards a Universal and Privacy Preserving EEG-Based Authentication System. Sci. Rep. 2022, 12, 2531. [Google Scholar] [CrossRef] [PubMed]
- Alsumari, W.; Hussain, M.; Alshehri, L.; Aboalsamh, H.A. EEG-Based Person Identification and Authentication Using Deep Convolutional Neural Network. Axioms 2023, 12, 74. [Google Scholar] [CrossRef]
- Yap, H.Y.; Choo, Y.-H.; Mohd Yusoh, Z.I.; Khoh, W.H. An Evaluation of Transfer Learning Models in EEG-Based Authentication. Brain Inf. 2023, 10, 19. [Google Scholar] [CrossRef]
- Chen, X.; Teng, X.; Chen, H.; Pan, Y.; Geyer, P. Toward Reliable Signals Decoding for Electroencephalogram: A Benchmark Study to EEGNeX. Biomed. Signal Process. Control 2024, 87, 105475. [Google Scholar] [CrossRef]
- Shakir, A.M.; Bidgoly, A.J. Task-Independent EEG-Based Authentication. J. Tianjin Univ. Sci. Technol. 2024, 57, 1–14. [Google Scholar] [CrossRef]
- Wu, Q.; Zeng, Y.; Zhang, C.; Tong, L.; Yan, B. An EEG-Based Person Authentication System with Open-Set Capability Combining Eye Blinking Signals. Sensors 2018, 18, 335. [Google Scholar] [CrossRef]
- Ozdenizci, O.; Wang, Y.; Koike-Akino, T.; Erdogmus, D. Adversarial Deep Learning in EEG Biometrics. IEEE Signal Process. Lett. 2019, 26, 710–714. [Google Scholar] [CrossRef]
- Musallam, Y.K.; AlFassam, N.I.; Muhammad, G.; Amin, S.U.; Alsulaiman, M.; Abdul, W.; Altaheri, H.; Bencherif, M.A.; Algabri, M. Electroencephalography-Based Motor Imagery Classification Using Temporal Convolutional Network Fusion. Biomed. Signal Process. Control 2021, 69, 102826. [Google Scholar] [CrossRef]
- Mane, R.; Chew, E.; Chua, K.; Ang, K.K.; Robinson, N.; Vinod, A.P.; Lee, S.-W.; Guan, C. FBCNet: A Multi-View Convolutional Neural Network for Brain-Computer Interface. arXiv 2021, arXiv:2104.01233. [Google Scholar] [CrossRef]
- Hu, F.; Wang, F.; Bi, J.; An, Z.; Chen, C.; Qu, G.; Han, S. HASTF: A Hybrid Attention Spatio-Temporal Feature Fusion Network for EEG Emotion Recognition. Front. Neurosci. 2024, 18, 1479570. [Google Scholar] [CrossRef] [PubMed]
- Muna, U.M.; Shawon, M.M.H.; Jobayer, M.; Akter, S.; Sabuj, S.R. SSTAF: Spatial-Spectral-Temporal Attention Fusion Transformer for Motor Imagery Classification. arXiv 2025, arXiv:2504.13220. [Google Scholar] [CrossRef]
- Wei, C.; Zhou, G. EEG Emotion Recognition Based on Attention Mechanism Fusion Transformer Network. In Proceedings of the 2024 11th International Conference on Biomedical and Bioinformatics Engineering, Osaka, Japan, 8–11 November 2024; ACM: New York, NY, USA, 2024; pp. 146–150. [Google Scholar] [CrossRef]
- Ghous, G.; Najam, S.; Alshehri, M.; Alshahrani, A.; AlQahtani, Y.; Jalal, A.; Liu, H. Attention-Driven Emotion Recognition in EEG: A Transformer-Based Approach with Cross-Dataset Fine-Tuning. IEEE Access 2025, 13, 69369–69394. [Google Scholar] [CrossRef]
- Wei, Y.; Liu, Y.; Li, C.; Cheng, J.; Song, R.; Chen, X. TC-Net: A Transformer Capsule Network for EEG-Based Emotion Recognition. Comput. Biol. Med. 2023, 152, 106463. [Google Scholar] [CrossRef]
- Sun, J.; Xie, J.; Zhou, H. EEG Classification with Transformer-Based Models. In Proceedings of the 2021 IEEE 3rd Global Conference on Life Sciences and Technologies (LifeTech), Nara, Japan, 9–11 March 2021; IEEE: New York, NY, USA, 2021; pp. 92–93. [Google Scholar] [CrossRef]
- Omair, A.; Saif-ur-Rehman, M.; Glasmachers, T.; Iossifidis, I.; Klaes, C. ConTraNet: A Hybrid Network for Improving the Classification of EEG and EMG Signals with Limited Training Data. Comput. Biol. Med. 2024, 168, 107649. [Google Scholar] [CrossRef]
- Wan, Z.; Li, M.; Liu, S.; Huang, J.; Tan, H.; Duan, W. EEGformer: A Transformer–Based Brain Activity Classification Method Using EEG Signal. Front. Neurosci. 2023, 17, 1148855. [Google Scholar] [CrossRef]
- Ma, Y.; Song, Y.; Gao, F. A Novel Hybrid CNN-Transformer Model for EEG Motor Imagery Classification. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–8. [Google Scholar] [CrossRef]
- Zhao, W.; Jiang, X.; Zhang, B.; Xiao, S.; Weng, S. CTNet: A Convolutional Transformer Network for EEG-Based Motor Imagery Classification. Sci. Rep. 2024, 14, 20237. [Google Scholar] [CrossRef]
- Liu, R.; Chao, Y.; Ma, X.; Sha, X.; Sun, L.; Li, S.; Chang, S. ERTNet: An Interpretable Transformer-Based Framework for EEG Emotion Recognition. Front. Neurosci. 2024, 18, 1320645. [Google Scholar] [CrossRef] [PubMed]
- Li, H.; Zhang, H.; Chen, Y. Dual-TSST: A Dual-Branch Temporal-Spectral-Spatial Transformer Model for EEG Decoding. arXiv 2024, arXiv:2409.03251. [Google Scholar] [CrossRef]
- Xie, J.; Zhang, J.; Sun, J.; Ma, Z.; Qin, L.; Li, G.; Zhou, H.; Zhan, Y. A Transformer-Based Approach Combining Deep Learning Network and Spatial-Temporal Information for Raw EEG Classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 2126–2136. [Google Scholar] [CrossRef]
- Si, X.; Huang, D.; Sun, Y.; Huang, S.; Huang, H.; Ming, D. Transformer-Based Ensemble Deep Learning Model for EEG-Based Emotion Recognition. Brain Sci. Adv. 2023, 9, 210–223. [Google Scholar] [CrossRef]
- Yao, X.; Li, T.; Ding, P.; Wang, F.; Zhao, L.; Gong, A.; Nan, W.; Fu, Y. Emotion Classification Based on Transformer and CNN for EEG Spatial–Temporal Feature Learning. Brain Sci. 2024, 14, 268. [Google Scholar] [CrossRef] [PubMed]
- Lu, W.; Xia, L.; Tan, T.P.; Ma, H. CIT-EmotionNet: Convolution Interactive Transformer Network for EEG Emotion Recognition. PeerJ Comput. Sci. 2024, 10, e2610. [Google Scholar] [CrossRef] [PubMed]
- Bagchi, S.; Bathula, D.R. EEG-ConvTransformer for Single-Trial EEG-Based Visual Stimulus Classification. Pattern Recognit. 2022, 129, 108757. [Google Scholar] [CrossRef]
- Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 710–719. [Google Scholar] [CrossRef]
- Si, X.; Huang, D.; Liang, Z.; Sun, Y.; Huang, H.; Liu, Q.; Yang, Z.; Ming, D. Temporal Aware Mixed Attention-Based Convolution and Transformer Network for Cross-Subject EEG Emotion Recognition. Comput. Biol. Med. 2024, 181, 108973. [Google Scholar] [CrossRef]
- Gong, L.; Li, M.; Zhang, T.; Chen, W. EEG Emotion Recognition Using Attention-Based Convolutional Transformer Neural Network. Biomed. Signal Process. Control 2023, 84, 104835. [Google Scholar] [CrossRef]
- Altaheri, H.; Muhammad, G.; Alsulaiman, M. Physics-Informed Attention Temporal Convolutional Network for EEG-Based Motor Imagery Classification. IEEE Trans. Ind. Inf. 2023, 19, 2249–2258. [Google Scholar] [CrossRef]
- Nguyen, A.H.P.; Oyefisayo, O.; Pfeffer, M.A.; Ling, S.H. EEG-TCNTransformer: A Temporal Convolutional Transformer for Motor Imagery Brain–Computer Interfaces. Signals 2024, 5, 605–632. [Google Scholar] [CrossRef]
- Cheng, Z.; Bu, X.; Wang, Q.; Yang, T.; Tu, J. EEG-Based Emotion Recognition Using Multi-Scale Dynamic CNN and Gated Transformer. Sci. Rep. 2024, 14, 31319. [Google Scholar] [CrossRef] [PubMed]
- Kostas, D.; Aroca-Ouellette, S.; Rudzicz, F. BENDR: Using Transformers and a Contrastive Self-Supervised Learning Task to Learn from Massive Amounts of EEG Data. Front. Hum. Neurosci. 2021, 15, 653659. [Google Scholar] [CrossRef]
- Yang, R.; Modesitt, E. ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data. arXiv 2023, arXiv:2308.00454. [Google Scholar] [CrossRef]
- Jiang, W.; Zhao, L.; Lu, B. LaBraM: Large Brain Model for Learning Generic Representations with Tremendous EEG Data. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024; pp. 1–22. Available online: https://proceedings.iclr.cc/paper_files/paper/2024/file/47393e8594c82ce8fd83adc672cf9872-Paper-Conference.pdf (accessed on 23 September 2025).
- Li, W.; Zhou, N.; Qu, X. Enhancing Eye-Tracking Performance Through Multi-Task Learning Transformer. In Augmented Cognition; Schmorrow, D.D., Fidopiastis, C.M., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2024; Volume 14695, pp. 31–46. [Google Scholar] [CrossRef]
- Zhang, X.; Yao, L.; Kanhere, S.S.; Liu, Y.; Gu, T.; Chen, K. MindID: Person Identification from Brain Waves through Attention-Based Recurrent Neural Network. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–23. [Google Scholar] [CrossRef]
- Zhang, X.; Yao, L.; Huang, C.; Gu, T.; Yang, Z.; Liu, Y. DeepKey: A Multimodal Biometric Authentication System via Deep Decoding Gaits and Brainwaves. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–24. [Google Scholar] [CrossRef]
- Balcı, F. DM-EEGID: EEG-Based Biometric Authentication System Using Hybrid Attention-Based LSTM and MLP Algorithm. Trait. Du Signal 2023, 40, 65–79. [Google Scholar] [CrossRef]
- Wilaiprasitporn, T.; Ditthapron, A.; Matchaparn, K.; Tongbuasirilai, T.; Banluesombatkul, N.; Chuangsuwanich, E. Affective EEG-Based Person Identification Using the Deep Learning Approach. IEEE Trans. Cogn. Dev. Syst. 2020, 12, 486–496. [Google Scholar] [CrossRef]
- Sun, Y.; Lo, F.P.-W.; Lo, B. EEG-Based User Identification System Using 1D-Convolutional Long Short-Term Memory Neural Networks. Expert Syst. Appl. 2019, 125, 259–267. [Google Scholar] [CrossRef]
- Chakravarthi, B.; Ng, S.-C.; Ezilarasan, M.R.; Leung, M.-F. EEG-Based Emotion Recognition Using Hybrid CNN and LSTM Classification. Front. Comput. Neurosci. 2022, 16, 1019776. [Google Scholar] [CrossRef] [PubMed]
- Puengdang, S.; Tuarob, S.; Sattabongkot, T.; Sakboonyarat, B. EEG-Based Person Authentication Method Using Deep Learning with Visual Stimulation. In Proceedings of the 2019 11th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, 23–26 January 2019; IEEE: New York, NY, USA, 2019; pp. 6–10. [Google Scholar] [CrossRef]
- Zheng, X.; Cao, Z.; Bai, Q. An Evoked Potential-Guided Deep Learning Brain Representation for Visual Classification. In Neural Information Processing; Yang, H., Pasupa, K., Leung, A.C.-S., Kwok, J.T., Chan, J.H., King, I., Eds.; Communications in Computer and Information Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 1333, pp. 54–61. [Google Scholar] [CrossRef]
- Jin, X.; Tang, J.; Kong, X.; Peng, Y.; Cao, J.; Zhao, Q.; Kong, W. CTNN: A Convolutional Tensor-Train Neural Network for Multi-Task Brainprint Recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 103–112. [Google Scholar] [CrossRef] [PubMed]
- Kumar, P.; Saini, R.; Kaur, B.; Roy, P.P.; Scheme, E. Fusion of Neuro-Signals and Dynamic Signatures for Person Authentication. Sensors 2019, 19, 4641. [Google Scholar] [CrossRef] [PubMed]
- Chakladar, D.D.; Kumar, P.; Roy, P.P.; Dogra, D.P.; Scheme, E.; Chang, V. A Multimodal-Siamese Neural Network (mSNN) for Person Verification Using Signatures and EEG. Inf. Fusion 2021, 71, 17–27. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.