Classification of Targets and Distractors in an Audiovisual Attention Task Based on Electroencephalography

Within the broader context of improving interactions between artificial intelligence and humans, the question has arisen regarding whether auditory and rhythmic support could increase attention for visual stimuli that do not stand out clearly from an information stream. To this end, we designed an experiment inspired by pip-and-pop but more appropriate for eliciting attention and P3a-event-related potentials (ERPs). In this study, the aim was to distinguish between targets and distractors based on the subject’s electroencephalography (EEG) data. We achieved this objective by employing different machine learning (ML) methods for both individual-subject (IS) and cross-subject (CS) models. Finally, we investigated which EEG channels and time points were used by the model to make its predictions using saliency maps. We were able to successfully perform the aforementioned classification task for both the IS and CS scenarios, reaching classification accuracies up to 76%. In accordance with the literature, the model primarily used the parietal–occipital electrodes between 200 ms and 300 ms after the stimulus to make its prediction. The findings from this research contribute to the development of more effective P300-based brain–computer interfaces. Furthermore, they validate the EEG data collected in our experiment.


Introduction
The interaction between humans and artificial intelligence (AI) still lacks the level of engagement and synchronization that characterizes interactions between humans. The primary goal of the WithMe project (a research project funded by the Research Foundation Flanders (FWO); more information can be found at https://researchportal.be/en/project/withme-making-human-artificial-intelligence-interactionsmore-entraining-and-engaging, accessed on 1 November 2023) is to thoroughly study the processes that occur in the human brain during joint activities with another individual, such as working towards shared objectives [1]. The brain signals collected in this study were primarily indicative of attention but also of emotion and reward. The purpose of this research was to determine relevant electroencephalography (EEG) features indicative of attention using machine learning (ML).
To this end, a specific experiment was designed. Temporal audiovisual integration and the support of visual attention by sound were well demonstrated in the pip-and-pop experiment [2]. However, the pip-and-pop experiment is based on a visual search, which does not lead to a strong visually evoked potential. Moreover, as we expected that the rhythmic presentation of target stimuli also affects working memory, the task was replaced with a modified digit-span task, where five target digits had to be remembered and reported [1]. This task involves visual attention, working memory, and sequence recall. To investigate the role of attention, we directly measured brain activation by means of EEG. Specifically, event-related potentials (ERPs) have been shown to be excellent tools for studying attention [3,4]. Risto Näätänen pioneered this domain by studying the connection between ERPs and attention, which led to the discovery of the (auditory) mismatch negativity ERP [5][6][7][8]. Additionally, research has shown that the amplitude of the P300 is directly related to the amount of attentional resources available for stimulus processing [8][9][10][11]. The P300 ERP is elicited by deviant stimuli in a sequence of standard stimuli, where the deviant stimuli are in some way more relevant to the presented task [12][13][14]. In our experiment, we thus expected that the targets would elicit a P300 ERP. Research has shown that the P300 actually consists of two subcomponents: P3a and P3b [15]. P3a generally reaches its peak around 250 ms to 280 ms after a stimulus and is associated with attention-related brain activity [16]. On the other hand, the P3b peak can vary in latency, lying between 300 ms and 500 ms post-stimulus [15]. P3b is elicited by improbable events, provided that the improbable event is somehow relevant to the task at hand [17]. In our experimental setting, we expected to elicit a P3a, as the target stimuli were not scarce (approximately 50% of the stimuli were targets and 50% were distractors), and our experiment was designed to evoke attention. We did not expect to elicit a P3a for distractors, as subjects should not pay attention to them.
The goal of this work was to accurately classify whether a target or distractor stimulus was presented to the subject based on the subject's EEG data. For this purpose, we applied different existing ML methods to classify EEG data and investigated which method performed best on our specific use case. As we expected to elicit attention when a target was shown (and not when a distractor was shown), the trained ML model was effectively an attention detector. We expected the attention to manifest itself in the form of a P3a ERP, and therefore we expected that the model would base its predictions on the presence of a P3a peak. Detecting P3a signals and, more broadly, P300 signals has a wide range of applications [18,19], particularly in P300-based brain–computer interfaces (BCIs) [20], for example, in spellers [21][22][23] and intelligent home control systems [24,25]. These applications can be of great help for patients suffering from amyotrophic lateral sclerosis (ALS) or spinocerebellar ataxia, as they can enable them to communicate in a daily environment [21,23,26,27]. In the literature, a wide array of techniques has been reported for classifying and detecting P300 [28]. Some techniques rely on a data transformation and subsequently use logistic regression to classify the transformed data, for example, xDAWN + RG [29][30][31][32]. Recently, deep learning approaches, primarily based on convolutional neural networks (CNNs), for example, EEGNet [33][34][35], have also gained in popularity [36][37][38]. Finally, as EEG data are essentially heavily correlated multivariate time series, it is possible to apply standard time series classification techniques as well [39][40][41].
Building BCIs that are trained on multiple subjects and generalize well to previously unseen subjects holds significant value [42]. Indeed, BCIs often need to be retrained or at least calibrated for the end user [43], which is a costly and user-unfriendly process [44,45]. However, due to intersubject variability in EEG data, training models that generalize to multiple subjects (cross-subject (CS) models) is a harder task than training models for one subject (individual-subject (IS) models) [44][45][46]. For this reason, we also investigated the hypothesized drop in performance when transitioning from IS to CS models. Additionally, the ML models should be able to make predictions in real time, as this is essential in real-world BCI applications.
Finally, we analyzed which EEG channels and time points were used by our models to make their predictions and checked whether these align with the expected P3a attention signature. However, ML models such as CNNs are considered "black boxes", as no clear explanation exists for the decisions made by these models [47]. The rapidly emerging and improving field of explainable AI (xAI) aims to tackle these issues by providing insights into ML models' decision-making processes. Some xAI techniques that are often used to gain insights into EEG classification models are local interpretable model-agnostic explanations (LIME) [48,49], DeepLIFT [33,50,51], and saliency maps [52][53][54], among others.
In summary, we aimed to enhance the interaction between humans and AI and designed a novel experiment for this purpose. Specifically, we considered building an ML model to recognize targets shown to a subject, which equates to creating an attention detector. Such models should ideally generalize well to previously unseen subjects. The primary contributions of this study are:

• Training of state-of-the-art classification methods to accurately predict target and distractor stimuli based on EEG data.
• Analysis of the performance difference between IS and CS models.
• Investigation into which EEG channels and time points were important for the model predictions, using xAI.

Ultimately, the contributions of this research collectively advance our understanding of human-AI interaction and will aid in the development of more effective BCIs and their associated applications.
The remainder of this paper is structured as follows: Section 2.1 introduces the WithMe experiment and dataset, while Section 2.2 explains the data preprocessing routine. Section 2.3 illustrates the classification problems and provides a description of the classification methods used in this study. Section 3 presents the results and provides an in-depth analysis of the best-performing model. This section also contains an extensive discussion of the achieved results. Finally, in Section 4, we draw conclusions and provide possible directions for future research.

Materials and Methods
In this section, we describe the WithMe dataset that was analyzed using ML in this study. We then define the preprocessing steps that were applied to the EEG data. Finally, we present the classification problem and the classification methods and metrics that we used to tackle this problem.

The WithMe Experiment
A total of 42 young adults participated in the experiment (21 women, 21 men; mean age 23.64 ± 2.69 years). They were recruited through the university network and through the social and professional networks of the authors. All subjects reported having normal or corrected-to-normal vision and showed normal hearing (<25 dB hearing loss) for the frequencies used in the experiment based on a standard pure tone audiometry hearing test. To mitigate the potential influence of age and/or intelligence, the subjects' age range was limited to young adults under 30 years, and they were only accepted if they were enrolled in or had finished some form of higher education.
Before starting the experiment, subjects had to fill in a questionnaire that asked for general background information to identify some personal characteristics. For example, subjects were asked whether they had ever enrolled in some form of musical education and/or were an active musician. More details about the questionnaires and their extensive analyses are given in [1].
The experiment consisted of a modified digit-span task. A target digit was presented, followed by zero, one, or two distractor digits, then another target, and so on. One sequence of digits always consisted of five targets and five distractors, although the subjects did not know this a priori. After one sequence of targets and distractors was presented, the subject had to report all targets in the order in which they had been presented. The targets and distractors were presented as an encircled number x ∈ {0, 1, 2, . . ., 9}. Additionally, a distractor could show up as an empty circle. Target digits were colored black (rgb(0, 0, 0)), while distractor digits were displayed in dark gray (rgb(x, x, x), x ∈ [50,75]), with the exact value of x determined individually to ensure the difference between targets and distractors was just noticeable. An example sequence is shown in Table 1. In total, 30 different sequences of targets and distractors were created, which were shown to the subject under four different conditions in a pseudo-randomized order [1]. Depending on the condition, the subject received either no support (Con1), visual rhythmic support (Con2), auditory nonrhythmic support (Con3), or visual rhythmic and auditory support (Con4), as shown in Figure 1b. This added up to a total of 120 sequences shown to the subject. In conditions with auditory support (Con3 and Con4), targets were accompanied by a 500 Hz tone burst, which lasted 50 ms. In conditions with rhythmic support, targets were presented with a fixed time interval of exactly 1.25 s between them. In these rhythmic conditions, the sequence of digits was preceded by five rhythm-inducing stimuli to establish the rhythm for the subject. In Con2, this was achieved using empty black circles, while in Con4, auditory tone bursts were used to induce the rhythm. For more detailed information about the experiment, we refer the reader to the original paper that describes the experiment and behavioral analysis [1].

Table 1. An example of a sequence of stimuli shown to the subject, with the targets in black and distractors in gray. In conditions with rhythm (Con2 and Con4), this sequence was preceded by five empty circles to induce the rhythm. The subject was expected to report the target digits and ignore the distractors.

Dataset and Preprocessing
During this experiment, EEG data were sampled at 2048 Hz using the standard 64-electrode EEG 10-10 system, as shown in Figure 1a. Thereafter, standard EEG preprocessing techniques were applied. The data were re-referenced to the average of both earlobes, to a single earlobe if the other one was too noisy, or to another pair of channels if both earlobes were badly recorded or too noisy. Bad channels were identified and removed. EEG data were notch-filtered at the line frequency (50 Hz) and its multiples, after which a bandpass filter from 0.2 Hz to 100 Hz was applied. The data were split into epochs, ranging from 0.2 s prestimulus to 1.0 s poststimulus, resulting in 1.2 s epochs. Independent component analysis (ICA) was applied to the epoched data. Any components that represented artifacts were removed through visual inspection of the ICA components.
During the previous steps, some channels were marked as bad channels. Instead of dropping these channels, we chose to interpolate them using their neighboring channels, as dropping them would result in an inconsistent number of channels across sequences and subjects. The interpolation was performed using the MNE-Python package [55] (all Python packages that were used in this work can be found in Table A1, together with their version number and citation). Finally, the data were downsampled to 50 Hz, as this reduced computation time, decreased file read/write time, and saved memory, while generally leading to little or no loss of information [56]. Note, however, that by the Nyquist theorem, this limits the highest frequency that can be accurately represented to half of the sampling frequency, i.e., 25 Hz. This preprocessing routine ideally resulted in 600 target epochs, 600 distractor epochs, and 300 induction epochs for each of the 42 subjects. However, during preprocessing, some epochs were rejected for various reasons, for example, an excessive number of bad electrodes or too much noise. On average, less than 0.6% of the epochs were rejected per subject.
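The arithmetic behind these numbers can be checked with a short sketch. It assumes, based on the experiment description, 30 sequences per condition, five targets and five distractors per sequence, and five induction stimuli before each sequence in the two rhythmic conditions; all variable names are illustrative.

```python
# Sanity check of the sampling and epoch-count arithmetic (illustrative names).
ORIGINAL_RATE_HZ = 2048
DOWNSAMPLED_RATE_HZ = 50

# Nyquist limit: highest frequency representable after downsampling.
nyquist_hz = DOWNSAMPLED_RATE_HZ / 2

# Ratio between the original and downsampled rates (not an integer).
reduction_factor = ORIGINAL_RATE_HZ / DOWNSAMPLED_RATE_HZ

SEQUENCES_PER_CONDITION = 30
N_CONDITIONS = 4             # Con1 to Con4
N_RHYTHMIC_CONDITIONS = 2    # Con2 and Con4
TARGETS_PER_SEQUENCE = 5
DISTRACTORS_PER_SEQUENCE = 5
INDUCTION_STIMULI_PER_SEQUENCE = 5

n_sequences = SEQUENCES_PER_CONDITION * N_CONDITIONS
target_epochs = n_sequences * TARGETS_PER_SEQUENCE
distractor_epochs = n_sequences * DISTRACTORS_PER_SEQUENCE
induction_epochs = (
    SEQUENCES_PER_CONDITION * N_RHYTHMIC_CONDITIONS * INDUCTION_STIMULI_PER_SEQUENCE
)

print(nyquist_hz, reduction_factor)
print(n_sequences, target_epochs, distractor_epochs, induction_epochs)
```

The counts are the ideal values per subject, before any epochs are rejected.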
As mentioned in Section 1, we expected to observe a P3a ERP when subjects saw a target stimulus. The P3a ERP is characterized by a positive voltage deflection between 250 ms and 280 ms after the stimulus, although the exact timing can vary [16,57,58]. As our experiment used visual stimuli, we expected the P3a ERP to be the most pronounced in the parietal-occipital region of the brain [56]. Figure 2 shows the evoked response for one subject, averaged over all parietal-occipital electrodes, as indicated in the figure inset. We observed a clear positive deflection between 200 ms and 300 ms after the stimulus, in line with our expectations.

Classification Problem
The models trained in this study considered a two-class classification problem (target versus distractor) and took single-trial EEG epochs as input to predict a binary label. As the data were downsampled to 50 Hz, one 1.2 s epoch contained 60 time steps for each of the 64 electrodes. This means that the input was of shape (N, 64, 60), with N being the number of epochs. It is important to note that an accuracy of 100% was not attainable for this problem. Indeed, the model made a prediction based on the subject's assessment of a stimulus, and it is possible that a subject did not correctly recognize all targets and distractors. As the ground truth labels were based on the predefined labels of the experiment, it is possible that there was a slight mismatch between the labels and the subject's perceived class. Nevertheless, we assumed that this problem was rare, meaning that commonly used metrics, for example, accuracy, provided a valid interpretation.
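The input dimensions follow directly from the epoch window and the sampling rate. The sketch below assumes that an epoch covers the half-open interval from −200 ms up to, but not including, 1000 ms, which is one way to arrive at exactly 60 samples; the helper name is hypothetical.

```python
SAMPLING_RATE_HZ = 50
EPOCH_START_MS = -200   # 0.2 s prestimulus
EPOCH_END_MS = 1000     # 1.0 s poststimulus (assumed to be exclusive)
N_CHANNELS = 64

step_ms = 1000 // SAMPLING_RATE_HZ  # 20 ms between consecutive samples
sample_times_ms = list(range(EPOCH_START_MS, EPOCH_END_MS, step_ms))

def input_shape(n_epochs):
    """Shape of the classifier input for n_epochs single-trial epochs."""
    return (n_epochs, N_CHANNELS, len(sample_times_ms))

# Sample indices inside the 200-300 ms poststimulus window.
window_indices = [i for i, t in enumerate(sample_times_ms) if 200 <= t <= 300]

print(input_shape(1200))
print(window_indices)
```

Under this assumption, only six samples per channel fall inside the 200 ms to 300 ms window, which shows how coarse the time axis is after downsampling.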
Ideally, the models should be able to generalize to previously unseen subjects. To investigate this, we trained the models in two ways: models trained on individual subjects (IS models) and models trained on (nearly) all subjects (CS models). The former were evaluated using a randomly sampled test set with a standard 80% train and 20% test split, while the latter were evaluated using a leave-one-out (LOO) methodology, in which the subject that is left out of training serves as the test subject. In general, models perform better when trained and tested on individual subjects [59]. This can be attributed to the variability between subjects in the EEG data elicited by the same stimuli. However, in practice, EEG classification models should ideally extrapolate to previously unseen subjects. For example, BCIs often need to be calibrated for new end users, which usually takes 20 to 30 min [60][61][62]. Therefore, it is interesting to investigate which model architectures are best suited to build subject-independent classifiers.
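The two evaluation schemes can be sketched as follows. This is a minimal illustration, assuming that LOO means holding out one complete subject per fold and that the 80/20 split is a plain random split (whether it was stratified is not stated); the function names and the seed are hypothetical.

```python
import random

def individual_subject_split(n_epochs, test_fraction=0.2, seed=0):
    """Random 80/20 split of one subject's epoch indices (IS evaluation)."""
    indices = list(range(n_epochs))
    random.Random(seed).shuffle(indices)
    n_test = round(n_epochs * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def leave_one_subject_out(subject_ids):
    """Yield (training subjects, held-out subject) pairs (CS evaluation)."""
    for held_out in subject_ids:
        training = [s for s in subject_ids if s != held_out]
        yield training, held_out

# One subject with 1200 target and distractor epochs.
train_idx, test_idx = individual_subject_split(1200)

# 42 subjects give 42 folds, each trained on the other 41 subjects.
folds = list(leave_one_subject_out(list(range(1, 43))))

print(len(train_idx), len(test_idx), len(folds))
```

The key difference is that in the CS scheme no epoch of the test subject is ever seen during training.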

Classifiers
To solve this classification problem, we trained and evaluated different existing ML models. Different methodologies for classifying EEG data exist. For example, we can extract features from EEG data and use these extracted features as the input to a classifier. These features can, among others, be extracted from the time domain, frequency domain, or the time-frequency domain, or through methods such as principal component analysis [63,64]. Such methods are referred to as feature-based methods. Another common approach uses raw or preprocessed EEG data as the input to the classifier. In this approach, commonly referred to as end-to-end methods, the classifier itself extracts relevant features from the data during training and uses these features to classify a sample. As both methodologies are interesting approaches, we used methods belonging to both approaches. In this study, we applied four distinct classifiers and compared the results on a novel data set. An overview of the classifiers and their methodologies is presented in Table 2.
First, we applied the xDAWN pipeline, which has demonstrated significant success in several EEG classification tasks [29,30]. For example, the BCI challenge organized as part of the IEEE Neural Engineering Conference 2015 was won by an xDAWN-based approach [30]. In this study, we employed a similar approach, consisting of first estimating two sets of xDAWN spatial filters, one for each class (target and distractor) [29]. Subsequently, the grand average evoked potential of each class was filtered using the corresponding filters, after which the filtered averages were concatenated to each of the trials. Then, the covariance matrix of each resulting trial was used as a feature for the next steps in the pipeline [65,66]. The next step was to project the covariance matrices onto the tangent space using a Riemannian metric, as described in [31,32]. After these feature extraction steps, a classifier was used to make the final predictions. Based on [30,67], we used logistic regression [68]. For the remainder of this paper, we refer to this method as xDAWN + RG (xDAWN + Riemannian Geometry). The xDAWN covariance matrices and the projection onto the tangent space were computed using the PyRiemann package [67].
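For reference, the tangent-space projection is commonly written as follows in the Riemannian-geometry BCI literature. This is the standard formulation and is assumed, not confirmed, to match the one described in [31,32]. Here, C_i denotes the n × n covariance matrix of trial i, C_ref is a reference point (typically the Riemannian mean of the training covariance matrices), log is the matrix logarithm, and upper(·) vectorizes the upper triangle of a symmetric matrix, with off-diagonal entries scaled by √2 so that norms are preserved.

```latex
\mathbf{s}_i
  = \operatorname{upper}\!\left(
      \log\!\left(
        \mathbf{C}_{\mathrm{ref}}^{-1/2}\,
        \mathbf{C}_i\,
        \mathbf{C}_{\mathrm{ref}}^{-1/2}
      \right)
    \right),
\qquad
\mathbf{s}_i \in \mathbb{R}^{n(n+1)/2}
```

The resulting vectors live in a Euclidean space, which is why a standard linear classifier such as logistic regression can be applied to them.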
Table 2. Overview of the methods that were used in this study, together with their original target domain and methodology.

Method              Target Domain   Methodology
xDAWN + RG [29]     EEG             feature-based
MiniRocket [40]     time series     feature-based
Rocket [39]         time series     feature-based
EEGNet [33]         EEG             end-to-end

The second method we used was EEGNet [33]. EEGNet exhibits strong performance on a variety of EEG-based classification tasks, such as P300 ERP classification [33,53] and motor imagery classification [69]. Whereas the previous method used extracted features as the input to the classifier, EEGNet performs both the feature extraction and classification.
EEGNet is a deep learning model, more specifically, a CNN. As its name suggests, EEGNet is optimized for classifying EEG data by employing a set of specific design choices. First, it uses temporal convolutions to learn frequency filters [33]. As suggested by the authors, the length of the temporal kernel used in these convolutions is set to half the sampling rate, which allows the model to capture frequency information at frequencies of 2 Hz and higher [33]. Second, depthwise convolutions are used to learn frequency-specific spatial filters. In this context, depthwise convolutions have two main advantages. First, they noticeably reduce the number of trainable parameters, since these convolutions are not fully connected to the previous layer; instead, they are connected to each feature map individually. This induces the second, EEG-specific advantage: the model learns spatial filters for each temporal filter, which enables the efficient extraction of frequency-specific spatial filters [33]. The last convolutional part consists of a separable convolution, which is a combination of depthwise and pointwise convolution. The former learns how to summarize individual feature maps in time, while the latter learns how to optimally combine the feature maps [33]. Finally, all features are passed to a dense layer for classification. More details on the EEGNet architecture can be found in [33]. We used the standard EEGNet-8,2 layout, which means that the model learns 8 temporal filters and 2 spatial filters per temporal filter.
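The numbers in this description can be made concrete with a short calculation. The sketch below assumes the 50 Hz sampling rate and 64 channels of our data and ignores bias terms; the comparison with a fully connected convolution is a hypothetical baseline for illustration, not a layer that is part of EEGNet.

```python
SAMPLING_RATE_HZ = 50
N_CHANNELS = 64
F1 = 8   # temporal filters (the "8" in EEGNet-8,2)
D = 2    # spatial filters per temporal filter (the "2" in EEGNet-8,2)

# The temporal kernel spans half a second of data.
kernel_length = SAMPLING_RATE_HZ // 2
kernel_duration_s = kernel_length / SAMPLING_RATE_HZ

# A full oscillation period fits inside the kernel at or above this frequency.
lowest_frequency_hz = 1 / kernel_duration_s

# Feature maps produced by the depthwise spatial convolution.
n_spatial_maps = F1 * D

# Weights of the spatial layer: depthwise versus a hypothetical fully
# connected convolution producing the same number of feature maps.
depthwise_weights = F1 * D * N_CHANNELS
fully_connected_weights = F1 * (F1 * D) * N_CHANNELS

print(kernel_length, lowest_frequency_hz, n_spatial_maps)
print(depthwise_weights, fully_connected_weights)
```

In this setting, the depthwise layer needs a factor of F1 fewer weights than the fully connected alternative.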
The first two methods were designed for EEG specifically. However, since EEG data are essentially heavily correlated multivariate time series, it was interesting to study the results of a more general method designed to classify such time series. To this end, we applied random convolutional kernel transform (Rocket) [39]. Based on the success of CNNs for time series classification, Rocket uses random convolutional kernels combined with simple linear classifiers. This novel combination achieved state-of-the-art performance on the UCR time series archive using only a fraction of the computational cost of existing methods [39,70]. As a follow-up to Rocket, the authors also designed MiniRocket [40]. They claimed that MiniRocket can be trained up to 75 times faster than Rocket, while achieving nearly the same performance. MiniRocket distinguishes itself from Rocket primarily by reducing the degree of randomness that Rocket generates, resulting in MiniRocket being almost deterministic [40]. Since methods used to classify EEG data, such as EEGNet, can be very computationally expensive, it was worth exploring the effectiveness of less computationally expensive methods. We used the Rocket and MiniRocket implementations in the sktime package and combined them with the ridge regression classifier implemented in scikit-learn, as suggested by the authors [39,40,68,71].
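To illustrate the idea behind Rocket, the following is a heavily simplified, univariate sketch of a random convolutional kernel transform. It is not the sktime implementation used in this study: padding is omitted, the dilation is drawn from a small fixed set instead of an exponential scale, and all function names are hypothetical. Each random kernel contributes two features, the maximum of its output and the proportion of positive values (PPV).

```python
import random

def random_kernel(rng):
    """Sample one kernel: mean-centred Gaussian weights, a bias, a dilation."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]
    bias = rng.uniform(-1.0, 1.0)
    dilation = rng.choice([1, 2, 4])  # simplification of Rocket's sampling
    return weights, bias, dilation

def apply_kernel(series, kernel):
    """Convolve without padding and return the (max, PPV) feature pair."""
    weights, bias, dilation = kernel
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * series[start + j * dilation] for j, w in enumerate(weights))
        for start in range(len(series) - span)
    ]
    ppv = sum(o > 0 for o in outputs) / len(outputs)
    return max(outputs), ppv

def transform(series, n_kernels=100, seed=0):
    """Map one time series to 2 * n_kernels features."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_kernels):
        features.extend(apply_kernel(series, random_kernel(rng)))
    return features

# Toy series with 60 time steps, matching the epoch length used here.
series = [((i * 7) % 13) / 13 for i in range(60)]
features = transform(series, n_kernels=10)
print(len(features))
```

Because the kernels are never trained, the only fitted component is the linear classifier applied to these features, which is what keeps the computational cost low.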

Metrics
To allow the comparison of various approaches, it is essential to have predetermined performance metrics. We focused on three metrics that are widely used in the EEG classification literature: accuracy, F1 score, and area under the receiver operating characteristic curve (ROC AUC) [72]. First, the accuracy states the fraction of correctly classified samples across both classes. Second, the F1 score assesses the predictive performance of a model by calculating the harmonic mean of the precision and recall metrics. The equations used to calculate the accuracy and F1 score are given in Equations (1) and (4), respectively, where we use the following abbreviations: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Third, by plotting the true positive rate against the false positive rate for different classification thresholds, we obtained the ROC curve. The ROC AUC is defined as the area under this curve and provides a measure for how well a classifier can distinguish between true and false samples or, in our case, targets and distractors, respectively. Finally, we also assessed the required training time and model complexity of all models.

\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{recall} = \frac{TP}{TP + FN} \tag{3} \]

\[ F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4} \]
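The three metrics can be computed with a few lines of code. The following is a minimal sketch with hypothetical function names; the counts and scores in the example are made-up toy values, not results from this study. The ROC AUC is computed through its equivalent rank interpretation: the probability that a randomly chosen target receives a higher score than a randomly chosen distractor.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(labels, scores):
    """Probability that a positive outscores a negative; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy confusion counts (illustrative only).
toy_accuracy = accuracy(tp=45, tn=40, fp=10, fn=5)
toy_f1 = f1_score(tp=45, fp=10, fn=5)

# Toy labels (1 = target, 0 = distractor) and classifier scores.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

print(toy_accuracy, toy_f1, toy_auc)
```

Unlike accuracy and F1 score, the ROC AUC does not depend on a single classification threshold, which makes it a useful complement to the other two metrics.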

Individual Subject Models
The performance of the models, assessed using the metrics introduced in Section 2.3.2, is shown in Table 3 and Figure 3. Using an EEG-specific model architecture benefits the performance of IS models: xDAWN + RG and EEGNet perform equally well, and both demonstrate superior accuracy, F1 score, and area under the curve (AUC) in comparison to MiniRocket and Rocket. As expected, MiniRocket achieves slightly inferior performance compared to Rocket. However, MiniRocket trained 15 times faster on our dataset. Notably, while xDAWN + RG and EEGNet exhibit equal performance, xDAWN + RG is significantly less computationally expensive than EEGNet. On central processing units (CPUs) alone, EEGNet's training time is nine times longer. Although training can be accelerated for EEGNet using (expensive) graphics processing units (GPUs), even when using an NVIDIA GTX 1080 GPU, EEGNet still requires 2.5 times as long to train as xDAWN + RG.

Cross-Subject Models
Similar results were obtained for the CS models, where EEG-specific approaches perform better than Rocket and MiniRocket, as shown in Table 4 and Figure 4. However, in this scenario, EEGNet outperforms xDAWN + RG. We hypothesize that this can be attributed to EEGNet's added complexity and its greater number of parameters compared to xDAWN + RG. This additional capacity makes it more likely that the model learns features that extrapolate well to previously unseen subjects.

Individual-Subject Models vs. Cross-Subject Models
As we discussed in Section 2.3, we expected the performance of the IS models to be better than that of the CS models. Despite having access to a significantly larger amount of data, constructing a CS model is a considerably more challenging task. Table 5 and Figure 5 illustrate the performance disparity between the two by subtracting the CS model's performance from that of the IS model. EEGNet, MiniRocket, and Rocket exhibit similar performance for both IS and CS models. However, the xDAWN + RG model demonstrates a noticeable decrease in performance. Given the lower absolute performance of the (Mini)Rocket models compared to EEGNet and xDAWN + RG, we focus on the latter two for the remainder of this discussion. We hypothesize that the inferior performance of the CS models when using xDAWN + RG can be attributed to its simpler and lightweight nature. Furthermore, xDAWN + RG works by first calculating the evoked responses for all classes. These can differ significantly from subject to subject, both in P3a peak height and in time [27,73,74]. The convolutional nature of EEGNet likely enables it to capture the temporal dynamics of the elicited responses more effectively across different subjects. It is important to note that the CS models have access to a significantly larger corpus of training data than the IS models, which is part of the reason that they keep up reasonably well with the IS models.

Analysis of the EEGNet Cross-Subject Model
We then conducted a further investigation into the CS EEGNet model. We selected EEGNet for this analysis, as it performed the best in both the IS and CS scenarios. Furthermore, we included this analysis only for the CS models, as they are the most useful in practice due to their generalization capabilities. However, the conclusions are similar for the IS models.

Confusion Matrices
First, we investigated whether the model focused on the correct features to make a prediction. For example, it is possible that we trained a sound detector instead of a target/distractor model. Indeed, conditions Con3 and Con4 contained auditory cues for the target. Theoretically, the model could rely solely on the activation in the brain region that processes auditory stimuli and achieve acceptable performance. For example, if the model performed perfectly on Con3 and Con4 while predicting all trials belonging to Con1 and Con2 to be distractors (due to the absence of auditory stimuli), it would achieve an accuracy of approximately 75%: 100% on the half of the trials recorded in Con3 and Con4, and about 50% on the other half, where only the distractors would be labeled correctly. The confusion matrices in Figure 6 refute this hypothesis. The model performs comparably in detecting distractors under all conditions. However, the model performs slightly better at identifying targets correctly for Con3 and Con4. The accuracies for specific conditions, shown in Table 6, also reflect this. Indeed, the accuracies for conditions Con3 and Con4 are higher than the accuracies for Con1 and Con2. We hypothesize that the inclusion of auditory support causes an additional signature in the EEG data, making it easier for the model to recognize targets. Additionally, it was already confirmed through a previous analysis that the subjects were able to recall the targets better in conditions with auditory support [1].

Next, we explored the electrodes and timings that are predominantly used by our models for making predictions. Trivially, we expected that the model would not use the prestimulus (t < 0) EEG data. As deep learning methods such as EEGNet are inherently black box models, we resorted to xAI methods to obtain (interpretable) insights into the model. A possible technique is a saliency map, which is a visual representation that highlights the degree of importance of regions or features in an input sample for the model prediction [52]. To generate a saliency map, the gradient of the model output with respect to the input sample is computed using backpropagation [53]. More specifically, this process involves fixing the weights of the trained model and propagating the gradient with respect to each layer's inputs back to the first layer that receives the input data. Figure 7 shows such a saliency map. This saliency map illustrates the electrodes and timings that had the greatest average impact on the model prediction when identifying a sample as a target. It was computed by first calculating the average saliency map for each test subject individually, then normalizing these saliency maps, and ultimately taking the average across all 42 subjects. In Figure 8, the same information is repeated, displayed as a topographic map at five time points. From Figures 7 and 8, we can see that our model predominantly used the parietal-occipital electrodes and time points between 200 ms and 300 ms after the stimulus to make its prediction, which is what we expected. We also investigated the saliency maps under different conditions but noticed no significant difference between the conditions.
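The aggregation procedure can be sketched on a toy model. The example below is an illustration under several assumptions, not the procedure applied to EEGNet: a logistic model stands in for the network so that the input gradient can be written analytically instead of being obtained by backpropagation, the absolute gradient is used as the saliency value, and each subject's map is normalized by its maximum (the normalization used in this study is not specified here). All names are hypothetical.

```python
import math

def logistic_saliency(weights, bias, sample):
    """|d output / d input| for output = sigmoid(sum(weights * sample) + bias)."""
    z = bias + sum(
        w * x
        for w_row, x_row in zip(weights, sample)
        for w, x in zip(w_row, x_row)
    )
    p = 1.0 / (1.0 + math.exp(-z))
    slope = p * (1.0 - p)  # derivative of the sigmoid at z
    return [[abs(slope * w) for w in w_row] for w_row in weights]

def mean_map(maps):
    """Element-wise mean of equally shaped (channels x times) maps."""
    n_rows, n_cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(n_cols)]
        for r in range(n_rows)
    ]

def normalize(saliency):
    """Scale a map so that its largest value equals 1 (assumed scheme)."""
    peak = max(max(row) for row in saliency)
    return [[v / peak for v in row] for row in saliency] if peak > 0 else saliency

def group_saliency(trials_per_subject, weights, bias):
    """Average per subject, normalize per subject, then average over subjects."""
    subject_maps = []
    for trials in trials_per_subject:
        trial_maps = [logistic_saliency(weights, bias, t) for t in trials]
        subject_maps.append(normalize(mean_map(trial_maps)))
    return mean_map(subject_maps)

# Toy data: 2 channels x 3 time points, 2 subjects with 2 trials each.
weights = [[0.1, -0.2, 0.05], [0.0, 1.5, -0.3]]
trials_per_subject = [
    [[[0.2, 0.1, 0.0], [0.3, 0.5, 0.1]], [[0.0, 0.0, 0.1], [0.1, 0.4, 0.2]]],
    [[[0.1, 0.3, 0.2], [0.2, 0.6, 0.0]], [[0.3, 0.1, 0.1], [0.0, 0.2, 0.3]]],
]
group_map = group_saliency(trials_per_subject, weights, bias=0.0)
print(group_map)
```

Normalizing each subject's map before averaging prevents subjects with larger gradient magnitudes from dominating the group-level map.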

Conclusions and Future Work
The WithMe project has led to the collection of a large, novel EEG dataset that can be used to create ML methods to automatically detect attention using P3a ERPs in single-trial data. This is of great importance to BCIs, as they often rely on the P3a or, more broadly, the P300 ERP and have a wide range of applications.
We successfully achieved the goal of this study, which was to classify target and distractor stimuli based on the subject's EEG data. To achieve this goal, we studied four classification methods that differed significantly in origin and complexity. We investigated the performance of these methods both as IS and CS models, with the latter being the most practically relevant due to their generalization capabilities. For the IS models, xDAWN + RG and EEGNet obtained an accuracy of 76%, outperforming MiniRocket and Rocket. While EEGNet was able to obtain the same accuracy of 76% in the CS case, the accuracy of xDAWN + RG dropped to 73%. We attribute this difference to the larger complexity of EEGNet, which likely enables it to generalize better to previously unseen subjects. The drop in performance between IS and CS models was not as pronounced as we expected it to be and was even nonexistent for EEGNet. We attribute this to the fact that the CS models had approximately 42 times more training data available. The EEGNet CS model performed slightly better on samples recorded under conditions Con3 and Con4, which were the conditions that included auditory support. While EEGNet achieved the best performance overall, it also had the highest model complexity (highest number of trainable parameters) and took the longest time and most computing resources to train. However, all four models were able to make predictions in real time. This property is essential for real-world human-AI interaction experiments and applications.
Finally, the application of xAI enabled us to investigate which EEG channels and time points were used by the otherwise black-box EEGNet CS model to make its predictions. Indeed, using saliency maps, we concluded that the model primarily based its prediction on the values of the electrodes in the parietal-occipital region between 200 ms and 300 ms after the stimulus. This is in line with our hypotheses, as we expected to elicit an attention-related P3a ERP in the parietal-occipital region of the brain when the subject saw a target digit.
In conclusion, we achieved the goal of accurately classifying targets and distractors based on a subject's EEG data. At the same time, our work contributes to the development of more effective BCIs and their applications. Finally, we validated the EEG data collected in the WithMe experiment.
While this study provides valuable insights into attention detection using EEG data, it is important to acknowledge some limitations. For example, as mentioned in Section 2.3, part of the data used to train the model were labeled incorrectly, as the ground truth labels were based on the predefined labels of the experiment rather than the subject's perceived class. A possible solution is to limit the data to samples where the entire sequence was reported correctly. However, this would discard a large part of the data, which would in turn decrease the performance of the models. Alternatively, we could remove all "bad sequences", where a bad sequence is defined as a sequence in which none of the targets were remembered correctly. Such a sequence could result either from incorrectly identifying the stimuli or from a failure of working memory despite correct identification of the targets and distractors. However, the number of answers that did not include at least one of the target digits (regardless of its place in the sequence) is negligible.
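The two filtering strategies discussed above can be illustrated with a short sketch. The trial structure (a dictionary with `targets` and `reported` digit lists), the function names, and the example digits are hypothetical and do not reflect the actual data format of the experiment. A bad sequence is interpreted here as an answer that contains none of the target digits, regardless of position.

```python
def is_fully_correct(targets, reported):
    """Strict criterion: the reported sequence matches the targets exactly."""
    return list(targets) == list(reported)


def is_bad_sequence(targets, reported):
    """Lenient criterion: no target digit appears anywhere in the answer."""
    return not any(digit in reported for digit in targets)


# Hypothetical trials: target digits shown and digits reported by the subject.
trials = [
    {"targets": [3, 7, 1], "reported": [3, 7, 1]},  # fully correct
    {"targets": [4, 9, 2], "reported": [9, 4, 5]},  # partially correct
    {"targets": [8, 6, 5], "reported": [1, 2, 3]},  # bad sequence
]

# Option 1: keep only sequences that were reported entirely correctly.
strict = [t for t in trials if is_fully_correct(t["targets"], t["reported"])]

# Option 2: drop only the bad sequences and keep everything else.
lenient = [t for t in trials if not is_bad_sequence(t["targets"], t["reported"])]

print(len(strict), "of", len(trials), "sequences kept by the strict filter")
print(len(lenient), "of", len(trials), "sequences kept by the lenient filter")
```

The strict filter yields cleaner labels at the cost of discarding more data, whereas the lenient filter removes only sequences in which no target digit was reported.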
In future work, an experiment dedicated to attention should be used to circumvent the limitations regarding bad labels, as described in Section 4. This would allow for labels that exactly correspond to the subject's perception of a stimulus, which would in turn lead to more accurate attention detectors. The ultimate goal could then be to use this attention detector in a BCI to detect whether a subject paid attention. If they did not, the BCI could repeat the sequence or stimulus to make sure that the subject can act accordingly. This could also improve learning systems, that is, systems that know whether a student actually paid attention to the provided information [75,76]. Regarding the training and optimization of ML models, it would be interesting to include an exhaustive feature selection procedure to allow the ML model to focus on the most relevant features. Additionally, we want to explore other ways to enable CS generalization, for example, using transfer learning [77,78]. This could further increase the generalization performance of all methods. In particular, it has the potential to elevate the performance of lightweight models such as xDAWN + RG to that of the computationally expensive EEGNet. While this work focuses on the detection of attention using epoched EEG data, the experiment can also be used to study working memory [1]. Indeed, the complete sequence EEG data should permit an investigation of working memory and whether it is influenced by auditory and/or rhythmic support.

Figure 2. The evoked response for targets and distractors for one subject. The data were averaged over all electrodes of the parietal-occipital region in the brain, as indicated in the figure inset.

Figure 3. Violin plots of the test accuracy, F1 score, and AUC for models trained on individual subjects.

Figure 4. Violin plots of the test accuracy, F1 score, and AUC for cross-subject models.

Figure 6. Confusion matrices for the cross-subject EEGNet model, split across the four conditions defined in Figure 1b. The confusion matrices were obtained by aggregating all the test predictions of the CS models.

Figure 7. Saliency map for epochs labeled as targets using the cross-subject EEGNet model. We averaged normalized saliency maps over all 42 test subjects for the CS model.

Table 3. Classifier test performance for individual subject models, averaged across the 42 subjects. The best performances are indicated in bold.

Table 4. Classifier performance for cross-subject models. Every subject was used as a test subject once; we report the average across all test sets. The best performances are indicated in bold.

Table 5. Drop in performance, calculated by subtracting the test performance of cross-subject models from that of individual subject models. The best performances are indicated in bold.

Violin plots of the drop in performance, calculated by subtracting the test performance of models from that of individual subject models.

Table 6. The test accuracies of the CS EEGNet model for the different conditions.

Table A1. The versions of the Python packages used in the project.