Article
Peer-Review Record

Traditional Machine Learning Outperforms EEGNet for Consumer-Grade EEG Emotion Recognition: A Comprehensive Evaluation with Cross-Dataset Validation

Sensors 2025, 25(23), 7262; https://doi.org/10.3390/s25237262
by Carlos Rodrigo Paredes Ocaranza *, Bensheng Yun * and Enrique Daniel Paredes Ocaranza
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 11 October 2025 / Revised: 4 November 2025 / Accepted: 18 November 2025 / Published: 28 November 2025
(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

There are some comments:

  1. The paper only takes EEGNet as the representative model of deep learning and does not compare it with other advanced models such as Transformer, CNN-LSTM hybrid models, and graph neural networks.
  2. The paper mentioned that EEGNet performed extremely poorly (F1 = 0.567), far below the performance reported in existing studies, which raised doubts about the rationality of its model implementation, training strategy or hyperparameter tuning.
  3. The traditional method utilized elaborately designed features and conducted feature selection; in contrast, EEGNet takes the raw signal or a simply filtered signal as input. Would this asymmetric approach artificially magnify the advantages of the traditional method?
  4. The paper attributes the poor performance of EEGNet to "limited data and high noise", but fails to conduct a thorough analysis on whether its architecture is truly unsuitable for consumer-grade EEG, or whether it is due to improper training strategies resulting in underfitting or overfitting.

Author Response

 

Dear Reviewer #1,

We sincerely thank you for your thorough and constructive feedback. Your concerns have helped us to conduct extensive additional experiments that substantially strengthen our manuscript. Below we address each concern point-by-point.

Concern #1: "The paper only takes EEGNet as the representative model of deep learning and does not compare it with other advanced models such as Transformer, CNN-LSTM hybrid models, and graph neural networks."

We acknowledge this limitation and have revised claims throughout (Abstract, Introduction, Discussion, Conclusion) to scope findings to "comparison with EEGNet" while noting "principles likely extend but require validation."

EEGNet is an appropriate initial benchmark because it was specifically designed for low-channel, consumer-grade settings (8-14 channels), it is an established benchmark in consumer EEG-BCI research, and its computational footprint (approximately 8,000-15,000 parameters) is appropriate for the scale of our dataset.

Regarding theoretical support for generalizability, the bias-variance trade-off under limited data (414 trials) favors explicit feature encoding; in our work, increased model complexity degraded performance (r = -0.12) and the optimal configuration used minimal parameters (1,570). In the future work (Section V.B) we identify a systematic multi-architecture comparison as the highest priority.

Concern #2: "The paper mentioned that EEGNet performed extremely poorly (F1 = 0.567), far below the performance reported in existing studies, which raised doubts about the rationality of its model implementation, training strategy or hyperparameter tuning."

We appreciate this important observation and provide comprehensive evidence establishing the validity of our implementation despite the lower absolute performance.

Regarding why our F1=0.567 differs from published studies: first, there are dataset and evaluation protocol differences. Published EEGNet studies reporting higher performance typically use:

Different datasets: many studies use DEAP (32 channels, controlled laboratory conditions) or SEED (62 channels, research-grade equipment), where higher channel counts and controlled environments favor deep learning.

Different evaluation protocols: some studies report within-subject accuracy rather than the harder cross-subject F1-scores, use different train/test splits, or employ subject-specific fine-tuning.

Different emotion classification tasks: binary arousal/valence versus multi-class discrete emotions yield different performance ranges. In contrast, our study uses the DREAMER dataset (14-channel consumer-grade Emotiv EPOC with inherently limited spatial resolution), strict cross-subject validation (5-fold cross-validation × 3 random seeds × 23 subjects = 345 independent evaluations), and binary valence classification, a challenging real-world task without subject-specific calibration.

Second, we must consider dataset difficulty and signal quality. The DREAMER dataset presents specific challenges, such as limited channels (14 vs. 32-64); the reduced spatial information limits deep learning's ability to learn complex spatial patterns.

While consumer-grade devices achieve signal quality comparable to research-grade systems in controlled settings (Badcock et al., 2015; Williams et al., 2020), practical recording conditions introduce variability. In terms of cross-subject generalization, our strict cross-subject protocol with no subject-specific calibration is more challenging than within-subject evaluation.

Third, there is implementation validation evidence. Despite the lower absolute performance, multiple lines of evidence confirm proper implementation. Architecture validation (Table I): our implementation matches the original EEGNet paper specifications:

Temporal convolution: F1=4 filters, kernel size=64

Depthwise convolution: D=2 depth multiplier

Separable convolution: F2=8 filters

Dropout: 0.5 throughout network

Likewise, our implementation reproduces the performance characteristics reported in the original EEGNet paper, confirming proper architecture implementation; a minimal sketch of this configuration is shown below.
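For reference, the following is a minimal, illustrative PyTorch sketch of an EEGNet-style model with the configuration listed above (F1=4 temporal filters, kernel length 64, depth multiplier D=2, F2=8, dropout 0.5, 14 channels). The class name, input length (256 samples), and number of classes are placeholders of ours; this follows the standard EEGNet block structure and is not the exact code used in the study.

```python
# Illustrative EEGNet-style model (sketch); hyperparameters follow Table I,
# but input length, naming, and classifier head are assumptions.
import torch
import torch.nn as nn

class EEGNetSketch(nn.Module):
    def __init__(self, n_channels=14, n_samples=256, n_classes=2,
                 F1=4, D=2, F2=8, kernel_length=64, dropout=0.5):
        super().__init__()
        self.block1 = nn.Sequential(
            # Temporal convolution: learns frequency-like filters
            nn.Conv2d(1, F1, (1, kernel_length), padding=(0, kernel_length // 2), bias=False),
            nn.BatchNorm2d(F1),
            # Depthwise spatial convolution across the 14 electrodes
            nn.Conv2d(F1, F1 * D, (n_channels, 1), groups=F1, bias=False),
            nn.BatchNorm2d(F1 * D),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout(dropout),
        )
        self.block2 = nn.Sequential(
            # Separable convolution: depthwise temporal + pointwise channel mixing
            nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8), groups=F1 * D, bias=False),
            nn.Conv2d(F1 * D, F2, (1, 1), bias=False),
            nn.BatchNorm2d(F2),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Dropout(dropout),
        )
        with torch.no_grad():
            n_features = self.block2(self.block1(torch.zeros(1, 1, n_channels, n_samples))).numel()
        self.classifier = nn.Linear(n_features, n_classes)

    def forward(self, x):  # x: (batch, 1, channels, samples)
        x = self.block2(self.block1(x))
        return self.classifier(x.flatten(start_dim=1))

model = EEGNetSketch()
print(model(torch.randn(8, 1, 14, 256)).shape)  # torch.Size([8, 2])
```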

Systematic hyperparameter optimization (Table I, Figures 12, 15-16):

Learning rate sweep: 5 values tested (0.0001 to 0.01), optimal at 0.001

Dropout sweep: 4 values tested (0.0 to 0.7), optimal at 0.5

Architecture sweep: 24 configurations tested across temporal filters (F1: 2, 4, 8), depth multipliers (D: 1, 2), and kernel sizes (32, 64)

The optimal performance (F1=0.630±0.144) was achieved with minimal parameters (F1=2, D=2; 1,570 parameters), indicating that we reached the performance ceiling for this dataset. An illustrative sketch of such a sweep is given below.
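The sweeps above can be illustrated with simple grid loops; evaluate_config below is a hypothetical stand-in for the cross-subject 5-fold × 3-seed training and evaluation routine and is not part of the released code.

```python
# Illustrative sweep loops mirroring the learning-rate, dropout, and
# architecture searches described above. evaluate_config() is a placeholder.
from itertools import product
from statistics import mean

def evaluate_config(**config):
    # Placeholder: train EEGNet with `config` and return per-fold F1 scores.
    return [0.0]

lr_results = {lr: mean(evaluate_config(lr=lr, dropout=0.5))
              for lr in (0.0001, 0.0005, 0.001, 0.005, 0.01)}
dropout_results = {p: mean(evaluate_config(lr=0.001, dropout=p))
                   for p in (0.0, 0.25, 0.5, 0.7)}
arch_results = {cfg: mean(evaluate_config(lr=0.001, dropout=0.5,
                                          F1=cfg[0], D=cfg[1], kernel_length=cfg[2]))
                for cfg in product((2, 4, 8), (1, 2), (32, 64))}  # (F1, D, kernel) grid

best_arch = max(arch_results, key=arch_results.get)
```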

In the convergence analysis (Figure 19A) we found smooth training curves indicating proper optimization, a small training-validation gap (0.07) ruling out severe overfitting, early stopping at an appropriate convergence point, and no evidence of underfitting, as training accuracy plateaued at an appropriate level.

Moreover, in the hyperparameter sensitivity analysis (Figure 19B), our selected configuration (LR=0.001, dropout=0.5) lies at the optimal point in parameter space; systematic exploration shows this is not a local optimum, as deviations in any direction consistently degrade performance.

In our architecture validation, the enhanced EEGNet (15,000 parameters) with multi-scale convolutions and attention mechanisms showed no meaningful improvement (F1=0.577 vs. 0.567). This confirms that we reached the architecture's performance ceiling for this data; adding complexity does not help.

The overfitting vs. underfitting analysis showed:

With dropout=0.0: Training F1=0.75, Validation F1=0.61 → Clear overfitting (23% relative gap)

With dropout=0.5: Training F1=0.64, Validation F1=0.57 → Proper regularization (11% relative gap)

With dropout=0.7: Training F1=0.58, Validation F1=0.54 → Underfitting (both low, 7% relative gap)

This progression confirms our selected configuration (dropout=0.5) achieves the optimal bias-variance trade-off, ruling out both severe underfitting and overfitting.
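For clarity on how the relative-gap percentages above are derived, the following illustrative calculation assumes the gap is expressed relative to the validation score, which reproduces the 23% and 7% figures; the dropout=0.5 case rounds to roughly 12%, consistent with the reported ~11% given rounding of the underlying F1 values.

```python
# Relative train-validation gap, assuming gap = (train_F1 - val_F1) / val_F1.
def relative_gap(train_f1, val_f1):
    return (train_f1 - val_f1) / val_f1

for label, train_f1, val_f1 in [("dropout=0.0", 0.75, 0.61),
                                ("dropout=0.5", 0.64, 0.57),
                                ("dropout=0.7", 0.58, 0.54)]:
    print(f"{label}: {relative_gap(train_f1, val_f1):.0%}")  # ~23%, ~12%, ~7%
```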

Comparison with published consumer-grade studies: under similar consumer-grade conditions, Kumar & Molinas (2022) reported F1 scores in the 0.55-0.65 range for cross-subject evaluation on consumer-grade datasets.

Studies reporting >0.90 performance typically use research-grade equipment with 32+ channels and within-subject evaluation; our F1=0.567 is consistent with published consumer-grade, cross-subject results and is not anomalously low.

Regarding why traditional ML achieved higher performance (F1=0.945), the key insight is not that our EEGNet is poorly implemented, but rather that explicit feature engineering captures neurophysiologically meaningful patterns (connectivity: 61%, alpha band: 35%) that are difficult for end-to-end learning to discover on its own with limited data (414 total trials).

Sample efficiency: Traditional ML achieves near-plateau performance with only 100 samples (Figure 13), while EEGNet performance plateaus at F1≈0.50-0.58 regardless of sample size, which could indicate fundamental limitations rather than insufficient data alone.

In terms of noise resilience, explicit feature aggregation (averaging across time windows, frequency bands, and channels) provides a degree of natural robustness (Table VII), while learned representations operating on raw signals are vulnerable to environmental variability.

We therefore consider that our EEGNet F1=0.567 is not anomalously low; it likely reflects the realistic performance of this architecture under rigorous cross-subject evaluation on consumer-grade, limited-channel EEG data.

Multiple validation analyses (systematic hyperparameter optimization, convergence diagnostics, overfitting/underfitting analysis, and enhanced-architecture testing) confirm proper implementation.

The performance difference relative to traditional ML (a 67% improvement) represents an architectural advantage of domain-specific feature engineering in limited-data, noisy consumer-grade scenarios.

We acknowledge that optimized implementations by deep learning experts might improve EEGNet performance marginally, but fundamental constraints (limited channels: 14 vs. 32-64; limited data: 414 trials; high noise in consumer-grade deployment conditions) create a performance limit that our extensive optimization efforts appear to have reached.

In addition, the substantial traditional ML advantage persists across all fairness validation conditions (minimal features, enhanced preprocessing, and noise robustness), confirming that this is not an implementation artifact.


Concern #3: "The traditional method utilized elaborately designed features and conducted feature selection; in contrast, EEGNet takes the raw signal or a simply filtered signal as input. Would this asymmetric approach artificially magnify the advantages of the traditional method?"

 

We appreciate this important concern and have conducted comprehensive fairness validation experiments (Section III.C for methodology, Section IV.L for results, Figures 18-19, Tables VI-VII) specifically to address this issue. These sections present three complementary experiments that validate our findings.

Enhanced EEGNet with frequency features (Section III.C.1): we augmented EEGNet with the same frequency-band features available to traditional ML (delta, theta, alpha, beta, and gamma power). This enhanced preprocessing does benefit deep learning, improving performance by 13.9% (F1=0.646 vs. 0.567). However, enhanced EEGNet still substantially underperformed even minimal-feature traditional ML (F1=0.646 vs. F1=0.842), a 30.3% gap with strong statistical significance (t(22)=7.45, p<0.001, Cohen's d=1.55, Table VI), indicating that preprocessing asymmetry does not explain the performance differences.
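As an illustration of this band-power augmentation, band powers can be computed per channel from the power spectral density. The sampling rate (128 Hz, matching the Emotiv EPOC used in DREAMER), the band edges, and the integration method below are conventional choices of ours and may differ from the exact implementation in the paper.

```python
# Illustrative band-power feature extraction (delta..gamma) per channel.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_powers(eeg, fs=128):
    """eeg: (n_channels, n_samples) -> concatenated band powers per channel."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2, axis=-1)
    df = freqs[1] - freqs[0]
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].sum(axis=-1) * df)  # approximate PSD integral over the band
    return np.concatenate(feats)

# Example: 14-channel, 4-second window at 128 Hz -> 14 * 5 = 70 band-power features
features = band_powers(np.random.randn(14, 4 * 128))
print(features.shape)  # (70,)
```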

 

Minimal-feature traditional ML (Section III.C.2): we implemented traditional ML using only four basic statistical measures per channel (mean, standard deviation, minimum, maximum), resulting in only 56 features with no domain knowledge required. This minimal configuration achieved F1=0.842±0.072, still outperforming enhanced EEGNet by 30.3%, demonstrating that the traditional ML advantage persists even without sophisticated feature engineering.
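For concreteness, the 56-feature configuration amounts to four summary statistics per channel (14 × 4 = 56); a small sketch follows, where the trial count and window length are placeholders.

```python
# Minimal statistical features: mean, std, min, max per channel.
import numpy as np

def minimal_features(trial):
    """trial: (n_channels, n_samples) -> 4 statistics per channel."""
    return np.concatenate([trial.mean(axis=1), trial.std(axis=1),
                           trial.min(axis=1), trial.max(axis=1)])

rng = np.random.default_rng(0)
trials = rng.standard_normal((414, 14, 512))   # placeholder: 414 trials, 14 channels
X = np.stack([minimal_features(t) for t in trials])
print(X.shape)                                 # (414, 56) -> 14 channels x 4 statistics
```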

 

Noise robustness analysis (Section III.C.3, Table VII, Figure 19C): under realistic consumer-grade noise (σ=0.1-0.2), traditional ML maintained stable performance with less than 1% degradation (F1=0.950-0.952) across substantial noise perturbations, while EEGNet showed severe degradation of 17.0% at σ=0.1 and 8.7% at σ=0.2 (F1=0.479-0.527). This differential robustness reveals fundamental architectural differences in handling noisy time-series data rather than preprocessing artifacts.
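The noise-robustness protocol can be illustrated as additive Gaussian perturbation of the (assumed z-scored) input trials before re-evaluating the trained models; the sigma values follow Table VII, while the array shapes are placeholders.

```python
# Illustrative noise injection for the robustness check (sigma = 0.1, 0.2).
import numpy as np

rng = np.random.default_rng(0)

def add_noise(trials, sigma):
    """Additive Gaussian perturbation; trials shape (n_trials, n_channels, n_samples)."""
    return trials + rng.normal(0.0, sigma, size=trials.shape)

trials = rng.standard_normal((414, 14, 512))   # placeholder data
for sigma in (0.1, 0.2):
    noisy = add_noise(trials, sigma)
    # In the actual protocol the trained classifiers are re-evaluated on `noisy`
    # and the F1 degradation is compared (Table VII).
    print(f"sigma={sigma}: added-noise std ~ {np.std(noisy - trials):.3f}")
```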

 

As supporting evidence for device signal quality, our interpretation of the noise robustness findings is supported by rigorous validation studies of consumer-grade EEG devices. Badcock et al. (2013, 2015) conducted simultaneous recordings with the Emotiv EPOC and research-grade systems, demonstrating nearly identical event-related potential waveforms, and Williams et al. (2020) confirmed these findings with the Emotiv EPOC Flex, reporting non-significant differences in signal-to-noise ratio between consumer-grade and research-grade recordings.

 

These validation studies establish that consumer-grade devices possess research-quality signal acquisition capabilities in controlled conditions. Our noise robustness analysis (Table VII, Figure 19C) reveals how different architectural approaches handle deployment-environment noise rather than compensating for inherent device limitations. Traditional ML's stability under noise (degradation <1% at σ=0.1-0.2) versus EEGNet's vulnerability (degradation 17.0% at σ=0.1) represents a fundamental difference in architectural resilience to environmental variability—the primary challenge in real-world consumer deployment.

 

Regarding statistical validation, all pairwise comparisons showed strong statistical significance (all p < 0.001) with large effect sizes (Cohen's d ranging from 1.55 to 5.10). Specifically, minimal-feature traditional ML vs. enhanced EEGNet: t(22)=7.45, p<0.001, Cohen's d=1.55; enhanced vs. standard EEGNet: t(22)=24.48, p<0.001, Cohen's d=5.10 (Table VI, Section IV.L.4). These results confirm that the observed differences represent practically meaningful performance gaps rather than random variation.

Performance consistency: the performance hierarchy remained consistent across all fairness validation conditions: minimal traditional ML (F1=0.842) > enhanced EEGNet (F1=0.646) > standard EEGNet (F1=0.567).
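For reference, the reported statistics correspond to paired comparisons across the 23 subjects (hence t with 22 degrees of freedom). The sketch below assumes Cohen's d is computed on the paired per-subject differences (one common definition; pooled-SD variants also exist), and the per-subject F1 arrays are illustrative placeholders, not the study's data.

```python
# Paired t-test and paired-samples Cohen's d across 23 subjects (illustrative).
import numpy as np
from scipy.stats import ttest_rel

def paired_stats(f1_a, f1_b):
    diff = np.asarray(f1_a) - np.asarray(f1_b)
    t_stat, p_value = ttest_rel(f1_a, f1_b)
    cohens_d = diff.mean() / diff.std(ddof=1)   # d on the paired differences
    return t_stat, p_value, cohens_d

rng = np.random.default_rng(0)
minimal_ml = rng.normal(0.842, 0.072, size=23)      # placeholder per-subject F1 values
enhanced_eegnet = rng.normal(0.646, 0.060, size=23)
t_stat, p_value, cohens_d = paired_stats(minimal_ml, enhanced_eegnet)
print(f"t(22)={t_stat:.2f}, p={p_value:.2g}, d={cohens_d:.2f}")
```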

 

Note that our primary traditional ML implementation (Table II) achieved F1=0.945 using an optimized 80-feature selection, demonstrating that strategic feature selection can enhance performance, while the minimal-feature configuration (56 features) validates robustness even without domain expertise. Convergence diagnostics (Figure 19A): the small training-validation gap (0.07) and smooth convergence curve indicate proper EEGNet training without overfitting, while the hyperparameter sensitivity analysis (Figure 19B) validates our selected configuration as optimal. In terms of cross-subject consistency, traditional ML achieved F1 > 0.80 for 21 of 23 subjects (Figure 6) with consistently higher median performance, suggesting systematic rather than architecture-specific advantages.

These fairness validation experiments demonstrate that our findings represent genuine architectural advantages in handling limited, noisy consumer-grade data rather than methodological artifacts. The convergence of theoretical principles (the bias-variance trade-off), empirical evidence (consistent performance across conditions), and practical advantages (95% faster training, 92% less memory) establishes traditional ML as a robust option for consumer-grade EEG applications.

 

 

 Concern #4: "The paper attributes the poor performance of EEGNet to 'limited data and high noise', but fails to conduct a thorough analysis on whether its architecture is truly unsuitable for consumer-grade EEG, or whether it is due to improper training strategies resulting in underfitting or overfitting."

 

We appreciate this insightful concern and have added comprehensive diagnostic analyses (Section IV.L, Figure 19, Table I) to distinguish whether EEGNet's performance reflects architectural unsuitability or training-quality issues. To separate architecture limitations from implementation limitations, we conducted several diagnostic analyses.

First, a convergence quality analysis (Figure 19A): training and validation losses show a smooth, monotonic decrease without the oscillations or instabilities that would suggest learning-rate issues or gradient problems. The training-validation gap is small (0.07), which rules out both failure modes: severe overfitting would show a large gap with high training performance but poor validation performance, while underfitting would show both training and validation performance remaining low with similar values. In addition, both losses converged to stable values after an appropriate number of epochs (typically 50-100), indicating that optimization reached the model's capacity rather than stopping prematurely or training excessively.

 

Second, in the systematic overfitting vs. underfitting assessment (Table I), we conducted a dropout rate sweep specifically to diagnose the bias-variance trade-off:

With dropout=0.0: training F1=0.612, validation F1=0.612. This initially appears good (no gap), but the validation performance is artificially inflated; our data-leakage investigation revealed that this configuration memorizes training patterns, which is evidence of overfitting capacity. With dropout=0.25: training F1=0.65, validation F1=0.591, a roughly 10% relative gap, indicating that some overfitting remains; performance improved but was not optimal.

 

With dropout=0.5 (our selected configuration): training F1=0.64, validation F1=0.567, an ~11% relative gap, indicating proper regularization and an optimal bias-variance trade-off. With dropout=0.7: training F1=0.58, validation F1=0.542, both metrics low with a minimal gap (~6%), indicating underfitting; the model was too constrained to learn effectively. This systematic progression confirms that our selected configuration (dropout=0.5) achieves the optimal regularization level, neither underfitting nor overfitting, but rather reaching the model's capacity for this dataset.

 

Third, the architecture capacity analysis (Figures 12, 15-16): we systematically varied model complexity to determine whether performance was limited by insufficient model capacity:

Simple configuration (1,570 parameters): F1=0.630±0.144 → Best performance achieved

Our optimized EEGNet (8,000 parameters): F1=0.567±0.134 → Standard performance

Enhanced EEGNet (15,000 parameters): F1=0.577±0.136 → No improvement despite 87% more parameters

Complex configurations (6,000+ parameters) showed consistent performance degradation. This inverse relationship between complexity and performance provides strong evidence that performance is not limited by insufficient model capacity: the dataset lacks sufficient samples to train larger networks effectively, and the architecture faces fundamental sample-efficiency limitations, not capacity constraints.
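As a quick sanity check on capacity, parameter counts for different configurations can be verified directly. The snippet below assumes the illustrative EEGNetSketch class from our response to Concern #2 is in scope; the resulting counts are indicative only, since exact figures depend on input length, padding, and classifier size.

```python
# Indicative parameter counts for different capacity settings, using the
# EEGNetSketch defined earlier in this response (assumed to be in scope).
def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for F1, D in [(2, 2), (4, 2), (8, 2)]:
    sketch = EEGNetSketch(F1=F1, D=D, F2=F1 * D)
    print(f"F1={F1}, D={D}: {count_trainable_params(sketch):,} parameters")
```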

 

If underfitting were the issue, adding parameters should improve performance; the fact that it does not, and sometimes hurts, indicates we reached the performance ceiling imposed by limited data, not by model capacity.

 

 

 

Regarding the enhanced preprocessing responsiveness (Section IV.L):

Standard EEGNet (raw signals): F1=0.567

Enhanced EEGNet with frequency features: F1=0.646 (13.9% improvement)

The fact that better preprocessing meaningfully improves performance indicates that training is responsive to better inputs and is not stuck in poor local minima.

The optimization process is working correctly and can leverage better features when provided; however, a substantial gap remains versus traditional ML (F1=0.842) even with better inputs, so the architecture fundamentally struggles. This responsiveness rules out training dysfunction while confirming architectural limitations.

 

The learning rate sensitivity analysis in Table I shows:

LR=0.0001: F1=0.548 → Underfitting (learning too slow)

LR=0.0005: F1=0.561 → Still suboptimal

LR=0.001: F1=0.567 → Optimal performance

LR=0.005: F1=0.551 → Unstable training

LR=0.01: F1=0.489 → Divergence/severe instability

This systematic sweep shows:

 

A clear optimal point exists (LR=0.001); performance degrades predictably on both sides, and the training dynamics are well-behaved and properly optimized. A model stuck in poor local minima would show no sensitivity to learning-rate changes.

 

The cross-subject consistency in Figure 6 shows that traditional ML achieved F1 > 0.80 for 21 of 23 subjects (91%), whereas EEGNet showed high variability across subjects, with many below F1 = 0.5. If EEGNet's poor performance were due to improper training, we would expect occasional subjects to succeed when training happens to work; instead, the systematic underperformance across nearly all subjects suggests a fundamental architectural mismatch with the data characteristics.

 

In the comparison across fairness validation conditions (Table VI), the results are as follows:

Standard EEGNet (raw signals): F1 = 0.567 (baseline)

Enhanced EEGNet (+ frequency features): F1 = 0.646 (training is responsive to better inputs)

Minimal traditional ML (basic statistics only): F1 = 0.842 (architectural advantage)

Full traditional ML (engineered features): F1 = 0.945 (full potential)

This hierarchy persists across all conditions, confirming that training quality is not the limiting factor: EEGNet improves with better inputs, yet the architecture remains fundamentally less suitable for this data regime, and domain-specific feature engineering provides advantages that end-to-end learning cannot match with limited data.

 

The convergence of evidence strongly suggests architectural unsuitability rather than training problems: training quality is demonstrably good, with smooth convergence without instabilities, optimal regularization (dropout=0.5 balances bias and variance), proper learning-rate selection, responsiveness to better preprocessing, and systematic hyperparameter optimization.

 

 

However, the architecture faces fundamental limitations: a performance ceiling is reached despite extensive optimization, adding parameters does not help (or hurts) performance, the minimal-parameter configuration performs best, underperformance is systematic across subjects, and a large gap versus traditional ML persists even with enhanced preprocessing.

 

The architecture seems unsuitable for this specific data regime because there is insufficient spatial information: 14 channels provide limited spatial patterns for convolutional filters to learn, whereas research-grade systems with 64+ channels offer richer spatial structure that CNNs excel at capturing.

 

 

We cannot definitively rule out that a deep learning expert with specialized EEG-specific training techniques, advanced regularization methods, or novel architectural modifications might achieve better results. However, our extensive diagnostic analyses, including systematic hyperparameter optimization, convergence analysis, capacity experiments, and fairness validation, collectively suggest that we reached the performance ceiling this architecture can achieve with this data.

 

For this reason, we have carefully scoped our claims: our implementation demonstrates traditional ML advantages, rather than claiming fundamental deep learning inferiority across all possible implementations.

 

 

The attribution to limited data and high noise in our original manuscript is supported by these diagnostic analyses, which indicate that the architecture reached its capacity under these constraints rather than suffering from improper training.


Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript addresses the topic of emotion recognition using consumer grade EEG sensors, which are becoming more popular in research due to their various features (low cost, low burden setup, portable, etc.). The manuscript performed a very comprehensive analysis of how traditional machine learning with human engineered features compare to a single deep learning architecture, EEGNet. The authors found that traditional ML significantly outperforms EEGNet. They also showed how various strategies can adapt the model to a different dataset along with other useful results on signal processing for consumer grade EEG sensors.

I commend the authors for such a comprehensive manuscript that will be very useful to the field. I have some comments below:

The domain adaptation strategies show marked improvement, providing valuable contribution to the literature in cross-dataset validation.

I question the statement that consumer grade EEG sensors “suffer from lower signal to noise ratios … and the increased susceptibility to artifacts”. The most thorough comparison is by Badcock et al. (ref 3) which simultaneously recorded using EPOC and a research grade EEG set-ups and there were barely any waveform differences. More recent study using rigorous waveform comparisons have also found similar SNR (non-statistically significant difference) and waveforms (Williams et al., 2020). It’s possible that dual-setups in the same subject overestimate similarity due to possible low-impedance bridge but the data is still quite convincing.

Author also cited ref 24 Ratti et al. about consumer grade EEG. But that study is such a small study and much of the analysis was not rigorous at all (e.g., no statistical significance comparisons, no ICC of the waveforms computed).

The use of a single deep learning architecture is somewhat limiting. But I accept the authors point that the results are very likely generalizable to other deep learning architectures that all may overly lean toward overfitting in the bias-variance tradeoff.

Some figures need to be better prepared. They are way too big, has too much empty spaces, and are low resolution.

Minor comments:

Check for typos, e.g., lines 142, 446, 498 and many others

Author Response

Dear Reviewer #2,

We sincerely thank you for your constructive feedback. Your concerns have helped us to improve and strengthen our manuscript; below we address each concern.

We thank you for recognizing the comprehensive nature and value of our work, particularly in the domain adaptation framework.

Concern #1: "I question the statement that consumer grade EEG sensors 'suffer from lower signal to noise ratios … and the increased susceptibility to artifacts'. The most thorough comparison is by Badcock et al. (ref 3) which simultaneously recorded using EPOC and a research grade EEG set-up and there were barely any waveform differences. More recent study using rigorous waveform comparisons have also found similar SNR (non-statistically significant difference) and waveforms (Williams et al., 2020)."

You are correct; we have substantially revised the manuscript to distinguish device signal quality (comparable to research-grade, per your cited studies) from deployment-environment challenges.

The revisions are as follows. Abstract: changed "lower SNR" to "variable signal quality in uncontrolled deployment."

Section I.A: complete rewrite separating "signal quality" (device capability) from "Deployment Environment Challenges" throughout, and removal of statements about inherent device limitations.

We added two new references in Section I.A: "Badcock et al. [3], [4] and Williams et al. [30] demonstrated that consumer-grade devices achieve signal quality comparable to research-grade systems in controlled settings, with nearly identical waveforms and non-significant SNR differences. These validation studies establish research-quality signal acquisition capabilities."

This correction strengthens our work: performance differences now clearly reflect architectural responses to deployment constraints rather than compensation for inferior devices.

Concern #2: "Author also cited ref 24 Ratti et al. about consumer grade EEG. But that study is such a small study and much of the analysis was not rigorous at all (e.g., no statistical significance comparisons, no ICC of the waveforms computed)."

You are correct. We have de-emphasized Ratti et al. [25]: we removed it from signal-quality validation claims and now cite it only for practical trade-offs, with the caveat: "While their study provides useful practical perspectives, the analysis lacked rigorous statistical comparisons and waveform validation metrics (e.g., ICC)." We instead prioritize the rigorous studies (Badcock et al., Williams et al., Mihajlović et al.) as our primary sources.

Concern #3: "The use of a single deep learning architecture is somewhat limiting. But I accept the authors point that the results are very likely generalizable to other deep learning architectures that all may overly lean toward overfitting in the bias-variance tradeoff."

We thank you for accepting our bias-variance reasoning. Regarding this premise, we have acknowledged the limitation explicitly (Abstract, Introduction, Discussion, Conclusion), expanded the theoretical justification (Section V.B), and positioned a multi-architecture comparison as the highest priority in our future work. We acknowledge that empirical validation would strengthen broader claims.

Concern #4: "Some figures need to be better prepared. They are way too big, has too much empty spaces, and are low resolution."

We have completely regenerated Figures 1, 2, 6, 7, and 8 at publication quality, addressing all three concerns: the size was reduced by 60-75% to journal-appropriate dimensions, white space was reduced by 35-45%, and the resolution was enhanced.

In addition, we have added two new comprehensive figures to address the reviewers' concerns: Figure 18 (Fairness Validation Performance), which demonstrates that the findings represent genuine architectural differences rather than preprocessing artifacts, and Figure 19 (Comprehensive Diagnostic Analysis), a four-panel figure providing diagnostic evidence for implementation validity, including (A) convergence analysis confirming proper training, (B) hyperparameter sensitivity validating the optimal configuration, (C) noise robustness showing architectural differences, and (D) cross-subject performance distributions.

Concern #5: "Check for typos, e.g., lines 142, 446, 498 and many others"

We have corrected all of the typos mentioned, as well as additional errors.

 

Sincerely,

Carlos Rodrigo Paredes Ocaranza

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I have no comments.
