1. Introduction
Electroencephalography (EEG)-based brain–computer interfaces (BCIs) have demonstrated clinical efficacy in medical diagnostics and rehabilitation [
1]. Recent advancements in analog front-end design, electrode materials, and digital signal processing have enabled miniaturized wearable EEG devices, expanding applications beyond clinical applications [
2,
3]. Learning effectiveness depends on appropriate levels of stress and mental workload, and modern personalized education requires tools to monitor these states in real time at the individual level. Learning is a complex process that relies on a delicate balance of cognition and emotion [
4]. The levels of mental workload and stress have a direct impact on the effectiveness of one’s learning process [
5,
6]. Empirical evidence indicates that a moderate level of stress and mental engagement can enhance one’s learning performance, whereas both under- and over-load can impede the acquisition and retention of knowledge [
7,
8,
9]. With the growing emphasis on self-directed and personalized learning in modern educational paradigms, there is a pressing need to monitor and regulate these mental states at the individual level [
10,
11].
Recent advancements in evaluating mental workload and stress using EEG have been significantly enhanced by deep learning architectures. By combining convolutional neural networks (CNNs), bidirectional long short-term memory (Bi-LSTM), and gated recurrent units (GRUs), one study achieved 97.60% accuracy in stress-level classification on multi-channel EEG recordings from 48 subjects [
12]. Another work reported 97.86% accuracy using an attention-based CNN–LSTM model on multimodal signals (ECG and pulse rate) from 34 subjects [
13]. For mental workload assessment, an attention-based LSTM model achieved 87.1% classification accuracy [
14]. These algorithms show promising results and substantially improve the usability of EEG for mental workload and stress assessment.
Despite these promising algorithmic results, many existing systems rely on multi-channel, scalp-mounted EEG devices that limit everyday usability. Commonly used research-grade systems include the 14-channel Emotiv EPOC headset (used to collect the STEW dataset [
15]), Neuroelectrics Enobio [
16], and OpenBCI-based configurations [
17], all of which require multiple electrodes placed on the scalp. Even single-channel EEG devices such as the NeuroSky MindWave [
18], while simpler to set up, still have an appearance and wearing style that are not ideal for daily use. Beyond EEG, recent reviews show that workload and stress can also be estimated using other physiological signals, including heart rate variability, electrodermal activity, respiration, pupil diameter, and other peripheral measures, highlighting the potential of less obtrusive sensing approaches. A recent survey on physiological sensor technologies in workload estimation emphasizes the growing importance of multimodal and wearable sensing strategies in this context [
19]. In this work, we focus specifically on EEG-based assessment, while recognizing that integrating EEG with peripheral physiological signals is an important direction for future research.
Since the introduction of ear-EEG in 2011 [
20], research has progressively demonstrated its clinical utility beyond experimental settings. Some applications focused on sleep monitoring and mental state monitoring [
21,
22], investigations have also expanded into neurodegenerative disease screening [
23]. Notably, multi-night in-ear EEG recordings have identified distinctive sleep-related EEG biomarkers associated with mild cognitive impairment (MCI) [
24]. It has several advantages over other types of wearable EEG devices such as tighter contact between electrodes and skin, and a more stable electrode positioning [
25], making ear-EEG a promising tool for long-term EEG monitoring. Furthermore, ear-EEG significantly reduces the setup time and complexity compared to the traditional scalp EEG measurements. Even users without prior experience with EEG devices can easily operate the system by placing electrodes into ears, akin to wearing common earphones. Thus, combining ear-EEG with earphone form factors can greatly increase the usability and user-friendliness of EEG systems in real-life scenarios.
In summary, prior studies have demonstrated that deep learning models applied to multi-channel EEG can achieve high accuracy for mental workload and stress classification, and that various physiological modalities can also be used for this purpose. Nevertheless, most existing systems rely on bulky, scalp-mounted EEG hardware and are evaluated primarily in controlled laboratory settings, which limits their suitability for unobtrusive, everyday use. Furthermore, the potential of compact, wearable EEG form factors for reliable workload and stress estimation remains underexplored. To address this gap, this study proposes a low-cost, scalable in-ear EEG system that adopts the form factor of consumer-grade earpieces and integrates off-the-shelf components. Unlike conventional EEG systems that follow the 10–20 layout, all electrodes in our design are placed on the ear and within the ear canal, greatly improving the portability and usability of the system. We then investigate its feasibility for predicting stress and mental workload levels during learning tasks. By leveraging recent advances in portable EEG technology and machine learning, this work aims to lay the groundwork for adaptive, closed-loop feedback systems that optimize learning outcomes by tailoring instructional interventions to individual learners’ needs. Overall, the proposed approach seeks to balance the ecological validity of real-world educational settings with the rigor of controlled laboratory experiments, thereby helping to bridge the gap between the two.
2. Materials and Methods
2.1. In-Ear EEG System Design
The in-ear EEG system used in this study was designed to resemble a neckband earphone, aiming to enhance user-friendliness and mass manufacturability, as illustrated in
Figure 1a. How the device was worn is depicted in
Figure 1b. The single-channel EEG acquisition system comprised an ADS1299 analog front-end (Texas Instruments Incorporated, Dallas, TX, USA) operating at a 500 Hz sampling rate. The digitalized signals were transmitted through a Bluetooth Low Energy (BLE) protocol using an FR5086 system-on-chip (Shanghai Frequen Microelectronics Co., Ltd., Shanghai, China). The device was powered by a 240 mAh battery that can last longer than a 4 h consecutive recording. The Printed Circuit Board (PCB), control buttons and USB-C charging port were located on the right side of the neckband, while the battery was housed on the left. The two sides were connected by electric wires running internally within the neckband. To minimize potential interference, the wire for EEG signal transmission that is placed inside the neckband was shielded from the power supply wires running alongside it. The dry electrodes were manufactured by Dongguan Reebelk Silicone Products Co., Ltd. (Dongguan, China); all electrodes were manufactured with silicone material and graphite to achieve both high conductivity and high softness for user comfort. As shown in
Figure 1c, two bias electrodes were positioned on the concha site in each ear, with the passive recording electrode in the left ear canal and the reference electrode in the right ear canal. The bias electrodes are shaped like the ear support found in sports headphones to provide the earpiece with a tighter fit and help to decrease motion artifacts, The electrodes were connected to the ADC through metal connectors on the earpiece; the structure is illustrated in
Figure 1d.
The recording and reference electrodes are connected to the differential analog negative input 1 (IN1N) and differential analog positive input 1 (IN1P) on the ADS1299; the bias electrodes are connected to the BIAS pins. The ADS1299 is subsequently connected to the BLE chip via the SPI protocol. The in-ear EEG acquisition system presented in this study was also validated and utilized in another study by Fang et al. [
22].
2.2. Experiment
The experimental protocol received approval from the Survey and Behavioral Research Ethics Committee at the Chinese University of Hong Kong (SBRE-21-0832). All subjects provided informed consent prior to participating in the study. A total of 66 subjects participated both stress and mental workload experiments.
2.2.1. The Stress Experiment
The stress experiment aimed to record EEG signals while subjects experienced distinct stress levels. Initially, 51 subjects completed two stress-related conditions: an eyes-open resting task at the beginning of the experiment, representing a relatively lower stress level, and a cold pressor task, which induced a higher stress level. Subsequently, an additional 15 subjects were recruited, for whom an eyes-closed meditation task was inserted after the eyes-open resting task and before the mental workload tasks, in order to induce the lowest stress level among all conditions. First, subjects were asked to perform a training session to familiarize themselves with the task, equipment and surroundings; the training session also helped to reduce the level of nervousness during the following eyes-open resting sessions. Then, they were asked to keep their eyes opened while staying still for two minutes to establish a baseline stress level. At last, subjects were asked to immerse one of their hands in ice-cold water for two minutes to induce a high-stress condition; this cold pressor task was also used in other studies to induce stress responses in the EEG signal [
26,
27]. The water temperature was kept at 5 °C throughout the immersion. No task randomization was performed for the stress experiment. The cold pressor task aims to induce stress. Putting it at the end of the whole experiment will add to the stress the subject already experienced with a sequence of challenging mental workload tasks and thus represent the highest stress during the whole experiment. Another set of experiments with 15 subjects was added in which the subjects were asked to perform a two-minute closed-eye meditation exercise after resting, a paced breathing exercise where the subjects were guided to control their respiration rate to be around 10 breaths/min, while maintaining a comfortable posture to engage in a low-stress state [
28]. Because increased alpha activity can represent lower stress levels [
29], this meditation task was included to emulate the lowest stress condition and for exploring the difference of stress level among three different tasks. Following the completion of each two-minute task, subjects were asked to provide a subjective stress rating on a scale from 1 to 9. This self-reported measure served as a valuable source of feedback, allowing for a more comprehensive understanding of the individual’s subjective experience of stress in relation to the objective physiological measures captured by the EEG recordings. The timeline of the study is shown in
Figure 2.
2.2.2. The Mental Workload Experiment
In the present study, we adopted the same cognitive tasks and mental workload manipulations previously employed by So et al. [
11] to investigate the feasibility of using a single-channel, in-ear EEG device for real-time monitoring of mental workload during learning activities. The four cognitive tasks employed in this study were selected to span distinct cognitive and motor domains, and each exhibits robust parametric scalability across difficulty levels, which means the task difficulty can be systematically manipulated along theoretically meaningful dimensions, enabling fine-grained assessment of mental workload variation. This multi-domain approach ensures comprehensive assessment of mental workload across fundamental cognitive processes relevant to learning and problem-solving. The arithmetic task engages numerical reasoning and computational problem-solving, activating the central executive component of working memory alongside higher-order executive functions such as attention allocation and response inhibition. Moreover, these tasks reflect fundamental cognitive and motor functions involved in everyday learning and problem-solving. For example, arithmetic and mental rotation are core components of learning in STEM education, involving numerical reasoning and spatial visualization skills. Finger tapping relates to motor skill acquisition and coordination, which are essential in many learning contexts, including musical training. Lexical decision taps into language processing, vital for reading and verbal learning. In the arithmetic task, subjects were presented with arithmetic equations on a screen and required to determine their correctness. They used the ’left’ and ’right’ arrow keys on the keyboard to indicate their answers. The mental workload levels in this task were manipulated based on the complexity of the calculations involved. The content of the task for each mental workload level was as follows. Low mental workload: Single-digit addition. Medium mental workload: Double-digit addition with carry or subtraction with borrow. High mental workload: Mixed arithmetic operations. For the finger tapping task, subjects engaged in visualmotor coordination. They were presented with a pattern on the screen and required to tap the corresponding keys (‘A’, ‘S’, ‘D’, ‘F’, ‘J’, ‘K’, ‘L’, and ‘;’) on the keyboard. The content of the task varied across different mental workload levels, as follows. Low mental workload: Involves a single hand and a single finger. Medium mental workload: Involves a single hand and multiple fingers. High mental workload: Involves two hands and multiple fingers. In the mental rotation task, subjects were presented with pairs of polyomino/polycube figures and tasked with determining whether they represented the same figure or not. The polyomino/polycube figures could either be distinct figures or the same figure viewed from different perspectives. These polyomino/polycube figures were composed of interconnected small squares/cubes. Specifically, polyomino figures that consisted of 2D squares were used in the low mental workload condition. In contrast, polycubes constructed using 3D cubes were used in the medium mental workload condition, whereas polycubes with the highest number of cubes were used in the high mental workload condition. In the lexical decision task, linguistic ability was involved. Subjects were presented with words on the screen and required to determine if the word was a real English word or a pseudo-word. The words at different mental workload levels were created by varying factors, such as the word usage, number of characters, and part of speech. The pseudo-words in this task were generated using Wuggy software v1.0.0 [
30]. Within each task, there were three levels of mental workloads: low, medium, and high. To ensure a robust and reliable assessment of each mental workload, each level within a task consisted of five separate sessions, with each session comprising 10 trials. This hierarchical structure allowed for a fine-grained analysis of the EEG data, capturing both within-subject and between-subject variability in neural responses to mental workload. The hierarchical relationship within the dataset is described in
Figure 3. The time given for each trial was 2.5 s. The time given for resting was 120 s. The maximum total time required for the experiment was 1860 s. The timeline of the whole study is shown in
Figure 4.
2.3. Data Collection
Prior to analysis, the collected data were subjected to a rigorous quality control procedure to minimize potential confounds and ensure the integrity of the results. At the subject level, data files were carefully screened based on two key criteria: signal quality and file integrity. Immediately following each subject’s experimental session, the signal quality was evaluated using a five-point scale, with 1 representing the poorest quality and 5 indicating the highest quality. This rating was assigned based on a meticulous observation of the subject’s behavior during the experiment, with particular attention paid to factors that could introduce artifacts into the EEG recording. For example, subjects who exhibited excessive movement, resulting in discernible motion artifacts, received lower scores on the quality scale. To ensure that only data of sufficient quality was included in the analysis, a signal quality threshold was established, and any subject data associated with corrupted files was excluded from further consideration. At the segment level, an additional layer of quality control was implemented to identify and retain only those signal segments that met predetermined voltage and flat signal criteria. The voltage criterion was designed to exclude segments with signal saturation, defined as any segments with a magnitude range larger than 80 μV. The flat signal criterion was designed to exclude segments that did not contain useful information, defined as any segments with a magnitude range less than 1 μV. By applying these stringent criteria, we ensured that the data used in the stress and mental workload studies consisted solely of high-quality, informative segments that could provide reliable insights into the underlying neural processes. The data screening process was the only method used for data selection throughout the study. This approach ensures data integrity and provides a reliable basis for interpreting and generalizing the findings from the stress and mental workload experiments. The stress experiment included a total of 66 subjects. To ensure the integrity and reliability of the data, a rigorous screening process was implemented through data package sequence verification and visual data inspection. Subjects whose data were contaminated with an excessive amount of artifacts were excluded from further analysis. After this comprehensive screening process, the final sample consisted of 55 subjects (17 males, age ranging from 18 to 22), whose EEG recordings were deemed to be of sufficiently high quality for inclusion in the study, including 8 subjects in the meditation. Using the same screening criteria, EEG data of 44 subjects (8 males, age ranging from 18 to 22) were deemed to be of adequate quality for inclusion in the mental workload study. The mental workload task took much longer time than the stress task; some subjects introduced a significant amount of artifacts to the data, such as motion artifacts and muscle artifacts. Those data were marked as invalid during the experiment thus only 44 were selected.
2.4. Data Processing
2.4.1. Pre-Processing
After acquiring the raw EEG data, a series of pre-processing steps were performed to reduce the noise and enhance the efficiency of feature extraction. The pre-processing pipeline began with a preliminary screening to identify and remove files with a high likelihood of containing contaminated data. Subjects with missing EEG files or incorrect data input, such as erroneous time stamps, were excluded from further analysis. This initial step served to ensure the integrity and reliability of the dataset, minimizing the impact of corrupted or incomplete recordings on subsequent processing and analysis. Following the initial screening, a high-pass filter with a 1 Hz cutoff frequency, a low-pass filter with a 49 Hz cutoff frequency, and a 50 Hz notch filter were employed to remove power-line noise and high-frequency muscle artifacts. As the experiments were conducted within a controlled environment, with the experimenter monitoring the signal during the whole experiment, a simple threshold-based method was used for eliminating data chunk contaminated with a large number of artifacts. There are two common types of artifacts in our data: movement artifacts and muscle artifacts. The typical amplitude of the collected EEG data after filter is within ±20 μV; signal chunks exceeding this threshold are considered as contaminated, as shown in
Figure 5. To prepare the data for analysis, the pre-processed EEG signals were segmented differently for the stress and mental workload studies. In the stress study, the signals were divided into 3 s segments, providing a temporal resolution suitable for capturing the dynamic changes in stress levels. In contrast, for the mental workload study, each trial was treated as a single segment, preserving the integrity of the task-related neural activity. A critical consideration in the stress study was the imbalance in the number of signal segments obtained from different tasks. This imbalance stemmed from the varying number of subjects participating in each task, with only 429 segments from the meditation task, 1258 segments from the eyes-open task, and 737 segments from the cold pressor. If left unaddressed, this imbalance could introduce significant biases and compromise the validity of classification tasks performed on the original data proportions. To mitigate this issue, a class balancing technique was implemented prior to training the stress classification models. This involved randomly sampling from the larger classes to achieve an equal number of samples in each class. By ensuring a balanced representation of the different stress conditions, the class balancing step aimed to prevent the classifiers from being unduly influenced by the more prevalent classes, thus improving the generalizability and robustness of the models. It is important to note that class balancing was not performed for the regression tasks in the mental workload study. The nature of regression analysis, which aims to capture the continuous relationship between EEG features and mental workload levels, does not require the same balancing considerations as classification tasks, where discrete class labels are assigned.
2.4.2. Time Series Features and Labeling
Global time series features were extracted from the raw EEG signal for the classification process. There are three main types of features utilized: statistical features, Hjorth features and frequency domain features. Each category of features captures distinct aspects of the EEG signal, providing complementary information for characterizing the underlying neural activity. Statistical features, which are not domain-specific, are commonly employed to describe the dispersion and shape of data distributions. These features include standard measures such as standard deviation, mean, skewness, and kurtosis. By quantifying the central tendency, variability, and higher-order moments of the EEG signal, statistical features offer a general assessment of the signal’s properties in the time domain. Hjorth features, on the other hand, are features of a times series bridging the gap between the time domain and the frequency domain. They can be derived from the power spectrum of the time series [
31]. The Hjorth features used are Hjorth complexity, mobility and activity. These features provide insights into the signal’s complexity, the rate of change, and the overall power, respectively, capturing both temporal and spectral characteristics of the EEG signal. Frequency domain features are specifically tailored to the analysis of EEG signals. In this study, the EEG signal is decomposed into five distinct frequency bands: delta (0.1–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta (12–32 Hz), and gamma (32–45 Hz). These frequency bands are widely recognized in the EEG literature and are associated with various cognitive and neurophysiological processes. By examining the power and other characteristics of the EEG signal within each frequency band, valuable information can be extracted for labeling and further analysis. Additionally, the spectral edge frequency, which represents the frequency below which a specified percentage of the total power is contained, is also considered as a feature in this study. The full list of the above-mentioned features can be found in
Appendix A. To generate the ground truth for supervised learning, we assigned labels to the EEG feature vectors based on the experimental conditions and subjective reports. For stress classification, data segments were labeled according to the task being performed. For binary classification, segments recorded during the meditation task were labeled as low stress, while segments from the cold pressor task were labeled as high stress. For stress regression, The target variable for the regression models was the subjective stress rating (scale 1–9) provided by the participant immediately after each task. All EEG feature vectors extracted from a specific task block were assigned the single subjective rating given for that block. For mental workload classification, labels were determined by the predefined difficulty levels of the cognitive tasks.
2.4.3. Model
To address the classification tasks in this study, we evaluated several well-established machine learning models that balance computational efficiency with classification performance. For mental workload classification, we selected support vector machines (SVMs) [
11], Random Forest [
32,
33], artificial neural networks (ANNs) [
34,
35], and logistic regression. Given the moderate complexity of the task and the constraints of deploying models on resource-limited wearable devices, we deliberately excluded deep learning-based approaches. Logistic regression was included as a baseline model to establish a reference for performance comparison. A preliminary two-class classification was conducted to distinguish between the lowest and highest levels of task difficulty as a proof of concept. Among the tested models, the ANN achieved the highest classification performance. For example, in stress classification task, the ANN achieved an accuracy of 0.78 ± 0.05, while SVMs achieved 0.75 ± 0.06, logistic regression achieved 0.72 ± 0.07 and Random Forest achieved 0.76 ± 0.05. The better performance of the ANN over linear models and tree-based models suggests that stress classification benefits from learning smooth non-linear decision boundaries across the EEG features. The feature space includes interactions between frequency bands (e.g., alpha/beta ratios) that may be better captured by the continuous activation functions in ANNs. Furthermore, the ANN can be further quantized and pruned for deployment on low-power devices.
For the regression task predicting continuous stress levels, we investigated multiple deep learning architectures given the increased complexity compared to binary classification. We used simple linear regression as a baseline and evaluated additional architectures beyond the ANN used in classification, namely convolutional neural networks (CNNs) and long short-term memory (LSTM) networks [
36,
37,
38]. After a comprehensive performance comparison across all models, we found that relatively simple ANNs with four or fewer hidden layers achieved performance comparable to or exceeding that of more complex architectures. Specifically, for the stress regression task, simple linear regression yielded R
2 < 0.7, while CNN and LSTM models demonstrated only modest improvements with an R
2 value around 0.8. Given the targeted deployment of these models on wearable devices with constrained computational resources, we selected the simple ANN as the optimal choice for the regression tasks. This decision prioritizes practical applicability over marginal performance gains, as the computational overhead and increased inference latency of complex architectures would compromise the real-time monitoring capability essential for wearable applications.
The models were trained using a cross-entropy loss function for classification tasks and mean squared error (MSE) for regression, with the Adam optimizer applied throughout. ReLU activation functions were used in all hidden layers. For the output layers, sigmoid activation was employed in binary classification, softmax for three-class classification, and a linear activation function for regression. This architecture was chosen for both classification and regression due to its balance between simplicity and effectiveness. The ANN architecture consists of a sequential network that processes input features through a series of linear transformations. The model begins with an input layer of 35 units, followed by three fully connected layers with decreasing dimensions of 20, 10, and 5 units, respectively. Each hidden layer uses ReLU activation, while the output layer consists of a single unit with a sigmoid activation function for binary classification tasks. For the three-class classification problem, the output layer contains three nodes with a softmax activation.
2.4.4. Training
Cross-validation was performed to ensure a fair evaluation of the model using different subsets of the dataset. For classification studies containing multiple subjects, group shuffle split with the subject ID as the group label was used, such that one subject could not have sessions in both the training and the validation set (i.e., a K-subject-out strategy). Otherwise, if the intra-subject variance could be detected by the model in the training phase then it would not be general enough for evaluation when predicting in the validation phase. The shuffling under this circumstance could prevent non-random assignment to the training and validation sets. For studies containing a single subject, a K-fold split on recordings was used. In both the group shuffle split and K-fold split, three-fold splits were used to perform cross-validation.
2.4.5. SHAP Feature Importance
SHAP (SHapley Additive exPlanations) was used in this study to analyze feature importance. SHAP feature selection method makes use of the Shapley value, which can help us to explain the difference between the actual prediction and the average prediction for each prediction instance in the classification tasks [
39]. Each feature in the prediction model contributes a certain amount to this difference. The Shapley value is the average marginal contribution to the above difference across all possible combinations of other features. The mathematical expression of the Shapley value of feature i is shown in Equation (
1).
where S is a subset of the features used in the model and p is the number of features. val(S) is the prediction for feature values in set S that are marginalized over features that are not included in set S.
4. Discussion
Our study examined the feasibility of using a single-channel in-ear EEG device in monitoring stress and mental workload. The result demonstrated that the model could distinguish between high and low stress with an accuracy around 77% for a cross-subject model and an average R2 of 0.76 ± 0.20 for the within-subject model. For mental workload models, it can accurately classify high and low mental workload levels with an accuracy between 70% and 80%. In the stress study, meditation, eyes-open resting and cold pressor task were conducted to induce different stress levels. However, according to the confusion matrix, as shown in
Figure 12, a three-class classification model is prone to predicting eyes-open resting as meditation and vice versa. Even though the subjective ratings showed a significance difference between meditation and eyes-open resting, a study [
44] showed that novice meditators showed a lower difference in resting and meditation compared with experienced ones. As most subjects recruited in our experiment were non-meditators, the phenomena seen in the confusion matrix is in accord with this finding. Thus, binary classifications between eyes-open resting against cold pressor task and meditation against cold pressor task were conducted instead. They both reached comparable accuracy with another study with single-channel EEG as reported in [
45]. Results are also comparable with the results from commercial products with only one channel, which have an accuracy of 0.77 [
46].
For stress regression results, a similar study conducted by [
47] achieved an R2 value of 0.92 ± 0.02 on a signal from the Fp1, Fp2, F5 and F6 channels. According to [
48], stress demonstrated functional brain asymmetry, which is difficult to record with a single-channel EEG setup, and this may lead to an inferior overall regression accuracy compared with multi-channel studies. Some subjects, such as s87 in
Figure 7d, may not be accurate due to their self-rating, where the subject rated both eyes-open resting and meditation as the same lowest stress level. Different individuals have various levels of metacognition skill, and this may affect rating abilities [
49]. This may explain a low R2 value of 0.41 compared to the average value; low accuracy from subjects like s82 might also be due to this issue. Overall, the system demonstrated its ability to monitor stress levels with good accuracy except for some individuals. Further work is needed to explore the reasons for these outliers and improvement methods. Moreover, While the cold pressor task effectively induced stress, it primarily reflects a physically driven response that may differ from the cognitive or psychosocial stressors relevant to learning contexts. Therefore, the present findings serve as a feasibility demonstration rather than a full validation across mental stress types. Moreover, as gamma-band activity can overlap with pain-related muscle artifacts, the observed features should be interpreted cautiously, and future work should validate the in-ear EEG system under purely cognitive stressors while controlling for potential EMG confounds.
For mental workload tasks, the binary prediction models for arithmetic and finger tapping tasks exhibit higher accuracy compared to those of lexical decision and mental rotation. This discrepancy may be due to the task-triggered signals not aligning well with the recording site. In a prior study, which investigated the signal correlation between EEG ear channels and EEG scalp channels [
50], it was found that in-ear EEG signals have a higher correlation with signals from adjacent temporal lobe regions than other areas. Therefore, an in-ear EEG device should most efficiently receive signals originating from the temporal lobe.
The substantially lower intra-subject variation in alpha and beta band power observed in the mental rotation task compared to arithmetic and finger tapping likely explains the reduced classification accuracy. This phenomenon reflects the general stability of alpha and beta oscillatory dynamics during mental rotation. While effective workload classification relies on detecting distinct task-dependent shifts in oscillatory amplitude, the mental rotation task was characterized by remarkably steady alpha and beta oscillations across difficulty levels. This spectral stability meant that the oscillatory power changes failed to differentiate between difficulty levels, leaving the classification model insufficient discriminative information to reliably distinguish between different levels of mental workload.
An important observation is the substantial heterogeneity in within-subject stress regression performance across participants, with R2 values ranging from 0.41 to 0.99 (mean 0.76 ± 0.20). The exceptionally high performance for subject s80 warrants a careful examination. One plausible explanation relates to motion artifacts. The cold pressor task is inherently accompanied by increased motor activity (hand immersion, body tension responses), which typically generates muscle and motion artifacts in in-ear EEG recordings. Although our threshold-based artifact rejection approach effectively removes large-amplitude artifacts, residual low-amplitude muscle artifacts just below the rejection threshold could potentially become predictive features if they systematically co-occur with the cold pressor task. To determine whether s80’s result reflects superior signal quality versus overfitting to artifact-related characteristics, future work should employ cross-frequency analysis comparing high-frequency bands across subjects. Another phenomenon is that a minority of subjects (e.g., s87) reported limited variability in stress ratings, which may have affected the interpretation of R2 values for within-subject regression models. This observation may reflect individual differences in metacognitive awareness and self-assessment capabilities rather than a systematic failure of the EEG-based prediction model.
Although single-channel EEG devices offer portability and ease of use, their adoption in clinical and real-world settings faces significant limitations. Key challenges include reduced spatial resolution and susceptibility to artifacts [
51]. The susceptibility to artifacts imposes certain constraints on the operational scope of portable EEG devices. For instance, the proposed system is not suitable for high-mobility scenarios, such as sports, where motion artifacts would heavily contaminate the signal. However, for the learning and professional working environments targeted in this study—where subjects are primarily stationary—the application proves highly feasible. Although sporadic motion artifacts are unavoidable even during sedentary behavior, the exclusion of contaminated data segments results in negligible data loss relative to the total recording duration. Consequently, this artifact rejection strategy does not compromise the system’s capacity to reliably monitor longitudinal trends in stress and mental workload. The most common single-channel EEG devices have electrodes placed on the forehead; this restricted spatial coverage makes it difficult to capture activities from parietal, temporal, and occipital regions, thus narrowing their application. The same applies to our system, where the electrodes are located near the temporal lobe; activities that are specific to other regions are hard to capture. Unlike multi-channel arrays that leverage spatial filtering to suppress noise, single-channel systems lack redundancy. Effective noise cancellation methods such as ICA-based noise removal are not applicable to single-channel signals. Other research also demonstrates that single-channel configurations miss critical neural events detectable in multi-channel setups, while algorithmic constraints limit their ability to generalize across diverse stressors [
52]. Our work also demonstrated such limitations as the accuracy of classification varied greatly for different types of tasks. Furthermore, we acknowledge that there is still room for improvement in terms of model performance. To further enhance the accuracy and robustness of our stress and mental workload classification algorithms, future work should focus on expanding the scope of this research to include a larger and more diverse sample of subjects, as well as a broader range of learning tasks. By incorporating data from a wider variety of individuals and educational contexts, we can develop models that are more generalizable and better equipped to handle the complex, multifaceted nature of real-world learning.