Classiﬁcation of Event-Related Potentials with Regularized Spatiotemporal LCMV Beamforming

: The usability of EEG-based visual brain–computer interfaces (BCIs) based on event-related potentials (ERPs) beneﬁts from reducing the calibration time before BCI operation. Linear decoding models, such as the spatiotemporal beamformer model, yield state-of-the-art accuracy. Although the training time of this model is generally low, it can require a substantial amount of training data to reach functional performance. Hence, BCI calibration sessions should be sufﬁciently long to provide enough training data. This work introduces two regularized estimators for the beamformer weights. The ﬁrst estimator uses cross-validated L2-regularization. The second estimator exploits prior information about the structure of the EEG by assuming Kronecker–Toeplitz-structured covariance. The performances of these estimators are validated and compared with the original spatiotemporal beamformer and a Riemannian-geometry-based decoder using a BCI dataset with P300-paradigm recordings for 21 subjects. Our results show that the introduced estimators are well-conditioned in the presence of limited training data and improve ERP classiﬁcation accuracy for unseen data. Additionally, we show that structured regularization results in lower training times and memory usage, and a more interpretable classiﬁcation model.


Introduction
Brain-computer interfaces (BCIs) establish a direct communication pathway between the brain and an external device [1]. Severely disabled patients with impaired or absent communication capabilities can benefit from BCIs to restore normal functioning [2,3]. BCIs can be implemented in multiple ways, using non-invasive recording techniques such as electroencephalography (EEG) [4], magnetoencephalography (MEG) [5], functional nearinfrared spectroscopy (fNIRS) [6], and optically pumped magnetometers (OPM MEG) [7], or semi-invasive and invasive methods such as electrocorticography (ECoG) [8] or microelectrode arrays [9], which require surgery to implant a recording device. Although invasive BCIs yield the highest information transfer rate [10], non-invasive BCIs are preferable for short-term use since they are not susceptible to the risks that come with surgery. Among the non-invasive options, EEG is the most cost-effective and practical as it is not limited to the same controlled settings as MEG and OPM MEG.
In addition to the recording method, BCIs differ in the communication paradigms used for communication [4]. A popular class of BCI paradigms relies on the evocation of event-related potentials (ERPs) in the brain in response to visual, auditory, or tactile stimulation, given their low decoding cost and generally short calibration time before usage [11,12]. In this study we focused on the visual P300 oddball ERP in response to a rare but attended visual stimulus. The decoder detects whether this ERP is present to determine which stimulus the user attended. The P300 paradigm has been used extensively in BCI development and is easy to set up [13][14][15][16].
There are multiple state-of-the-art P300 classification methods, such as support vector machines (SVMs) [17], deep learning models [18,19], and Riemannian geometry classifiers [15]. Although these models often return a high classification accuracy, there is a need for lightweight models, as lightweight models lead to fast off-line analyses and can be transferred to consumer-grade hardware. When moving towards plug-and-play solutions, BCI calibration sessions should be short and model training times should be low. The spatiotemporal beamformer [20,21] belongs to this class of ERP-decoding models as it achieves state-of-the-art performance and is fast to train. Earlier work has shown that it is possible to apply the spatiotemporal beamformer to multiple time-locked visual BCI paradigms, including the P300 oddball paradigm, steady-state visually evoked potentials (SSVEPs), code-modulated visually evoked potentials (cVEPs) [22], and motion-onset visually evoked potentials (mVEPs) [23].
This work shows that the original spatiotemporal beamformer [21] can fall short in its performance when BCI calibration data are restricted. We also show that the spatiotemporal beamformer does not scale well for higher spatial and temporal resolution cases. As a response to these issues, we introduce a regularization method that exploits prior knowledge about the spatiotemporal nature of the EEG signal to improve the accuracy for settings with low data availability and to speed up the classifier training time, thereby considerably reducing memory usage. Similar structured regularization approaches have been applied to other linear ERP classifiers [24,25] and have shown significant increases in performance. Additionally, we show that regularization results in an interpretable classification model, which can aid in analyzing and developing spatiotemporal beamformerbased classifiers.

Notation
We represent matrices with cursive capital letters, vectors with bold lowercase letters, and scalars with cursive lowercase letters. The epoched EEG data with n epochs, c channels, and s samples are represented in epoch format as {X i ∈ R c×s } n i=1 or in flattened vector format by concatenating all channels for each epoch. Flattening results in {x i ∈ R cs } n i=1 such that x i = vec(X i ). The real covariance matrix of the epochs in vector format is denoted by C, and estimators thereof asĈ.

Spatiotemporal Beamforming
LCMV-beamforming was initially introduced to EEG analysis as a filter for source localization [26] to enhance the signal-to-noise ratio (SNR). Van Vliet et al. [20] first applied the spatiotemporal LCMV-beamformer as a method for the analysis of ERPs. The extension of this method to the combined spatiotemporal domain [20] and the data-driven approaches proposed by Treder et al. [27] and Wittevrongel et al. [21] allow for its application to classification problems.
For the following analyses, we assume that all EEG channels are normalized with zero mean and unit variance without loss of generality. Solving Equation (1) under the linear constraint given by Equation (2) returns the filter weights w defining the spatiotemporal LCMV-beamformer.
arg min These weights minimize the variance of the output of the filter while enhancing the signal characterized by the constraint. a = vec(A) is the data-driven activation pattern, a template of the signal of interest maximizing the difference between two classes of epochs, determined as follows: The method of Lagrange multipliers then gives the closed-form solution to the minimization problem posed by Equations (1) and (2) as: The beamformer can be applied to epochs (unseen or not) as: resulting in a scalar output y i per epoch. The linear constraint in Equation (2) ensures that the beamformer maps epochs containing a target response to a score close to one and, conversely, epochs not containing a target response to a score close to zero.

Covariance Matrix Regularization
Although the spatiotemporal beamformer, in theory, achieves optimal separation between target and non-target classes, in analogy to linear discriminant analysis [27], it does not always perform well on unseen data. The main challenge is to find a good estimator for the inverse covariance matrix C −1 since the real underlying covariance matrix generating the data is, in principle, unknown.

Empirical Covariance Estimation
Earlier spatiotemporal beamformer studies [21,22,28,29] use the empirical covariance and inverse covariance, calculated as follows: The Moore-Penrose pseudoinverse + ensures that a solution exists whenĈ emp is singular. Figure 1a,b show examples of the empirical estimators of the covariance and the inverse covariance matrices, respectively. The empirical estimator suffers from performance and stability issues if the number of epochs n used for estimation is not much larger than the number of features cs [30,31].

Shrunk Covariance Estimation
The shrinkage covariance estimator creates a better-conditioned inversion matrix problem and generally performs better when applied to unseen data. The estimators for the covariance and inverse covariance are given by: with 0 < α < 1. Analogous to L2 regularization of the beamforming problem, shrinkage reduces the ratio between the smallest and largest eigenvalues of the covariance matrix by strengthening the diagonal. Earlier work [23] applied shrinkage regularization to ERP decoding with the spatiotemporal beamformer and showed competitive performance compared to other state-of-the-art decoding techniques such as stepwise LDA or SVM. The abovementioned researchers chose the shrinkage coefficient α as a fixed hyperparameter. However, its optimal value depends on the number of training epochs, the covariance matrix's dimensionality, and the independence and variance of the data, which can vary across evaluation settings and per session. The optimal value for α can be found with a line search using crossvalidation method, but this can be a costly procedure. Methods exist to estimate an optimal shrinkage value directly from the data. Most notable among these are the Ledoit-Wolf procedure [32], the Rao-Blackwell Ledoit-Wolf method [33], and the oracle approximating shrinkage method [33]. A more recent estimation method [34] emulates a leave-one-out cross-validation (LOOCV) scheme expressed by the data-driven closed-form estimate: Herein, we opt for the LOOCV shrinkage estimator because it avoids some of the assumptions made by [32,33] and because it generalizes to structured covariance estimations, as described in Section 2.3.3.

Spatiotemporal Beamforming with Kronecker-Toeplitz-Structured Covariance
Exploiting prior knowledge about the spatiotemporal structure of the EEG signal leads to a more regularized estimator of the covariance. When viewing the example of empirical spatiotemporal EEG covariance in Figure 1a, it becomes clear that this matrix consists of a block pattern of repeated, similar matrices. Due to the multi-channel nature of the signal, we assume that the covariance of spatiotemporal EEG epochs is a Kronecker product of two smaller matrices [35][36][37], as expressed by: with ⊗ denoting the Kronecker product operator.Ŝ ∈ R c×c andT ∈ R s×s correspond to estimators of the spatial and temporal covariance of the data, respectively. Furthermore, because the temporal covariance of the EEG-signal is stationary (i.e., it is only dependent on interval length between covarying time samples) [38], it is assumed to have a Toeplitzmatrix structure:t Property 1 then leads to Equation (13) to estimate the inverse covariance.
for any non-singular matrices U and V [39].
Using Equation (14) removes the need to calculate the full, high-dimensional Kronecker productŜ + ⊗T + . Figure 1e,f show examples of the structured covariance and inverse covariance estimators, respectively, consisting of a spatial Kronecker factor (Figure 1g,h) and a temporal component (Figure 1i,j).
The Kronecker approach has shown significant performance yields in different linear spatiotemporal EEG and MEG applications [24,37,[41][42][43]. Van Vliet and Salmelin [25] applied a Kronecker-structured covariance estimator to ERP classification with linear models in a post hoc fashion. Our work goes further by embedding the Kronecker structure in the spatiotemporal beamformer training process, using a data-adaptive shrinkage method, and regularizing the covariance further by imposing a Toeplitz structure on the temporal covariance.

Kronecker-Toeplitz-Structured Covariance Estimation
The question remains how to estimateŜ andT. Although the flip-flop and non-iterative flip-flop algorithms [44][45][46] can estimate Kronecker or Kronecker-Toeplitz-structured covariances, new results show that a fixed point iteration is more efficient [47,48]. After each iteration, the spatial and temporal covariance matrices are scaled to unit variance to ensure that the fixed-point iteration converges. Finally, shrinkage can also be introduced in the fixed-point iteration to improve stability and achieve more robust regularization [42,[48][49][50].
The spatial and temporal covariance matrices are shrunk at every fixed-point iteration with shrinkage factors β k and γ k before matrix inversion in the next iteration. Combined, this leads to the iterative estimation algorithm described by the following equations: S 0 andT 0 can be initialized to any positive definite matrix. We choose to use the identity matrices I c×c and I s×s . After each iteration, all diagonals ofR k+1 are set to their mean values to ensure thatR k+1 andT k+1 are Toeplitz-structured. Xie et al. [51] show that the LOOCV estimates for the optimal values of β k+1 and γ k+1 also yield a closed-form solution for the Kronecker fixed-point-iteration algorithm: The shrinkage parameters 0 < β k+1 < 1 and 0 < γ k+1 < 1 should be re-determined after each iteration. The oracle approximation shrinkage method can also be used to determine β k+1 and γ k+1 [51,52] but performs worse for spatiotemporal EEG data since not all assumptions are met.

Dataset
We use the dataset from [21], containing P300 oddball EEG recordings of 21 healthy subjects since it is a high-quality dataset with a high number (32) of electrodes and concurrently recorded EOG responses for ocular artifact rejection. Nine targets were arranged on a monitor in front of the subject during an experimental session. The subject was asked to pay attention to a cued target for a block of stimulations. Within each block, the stimulations were organized in in 15 separate subsequent trials. A trial was defined as 9 stimulations in which each target is flashed precisely once per trial. Each target was cued four times, resulting in a dataset consisting of 36 blocks (4860 stimulations) per subject. Each stimulation corresponded to a single epoch in the preprocessed dataset. See [21] for a complete description of the dataset and the recording procedure.

Software and Preprocessing
Data processing and classifier analysis were performed in Python using Scikit-Learn (version 1.0.1) [53] and SciPy (version 1.7.1) [54]. The preprocessing pipeline was implemented using the MNE-Python toolbox (version 0.24.0) [55]. The dataset was converted to BIDS-EEG format [56] and managed and loaded with MNE-BIDS (version 0.9) [57]. The Riemannian classifier from Section 2.6.3 was implemented using pyRiemann (version 0.2.7). Statistical tests were performed in R (version 4.1.2).
The EEG recorded at 2048 Hz was re-referenced off-line to the average of the mastoids. The reference electrodes were dropped from the analysis. Data were subsequently filtered between 0.5Hz and 16Hz using forward-backward filtering with a fourth-order Butterworth IIR filter. The EEG signal was corrected for ocular artifacts using independent component analysis (ICA) by rejecting components that correlated with the bipolar EOG channels vEOG and hEOG, according to adaptive Z-score thresholding. Finally, epochs were cut from 0.1 s to 0.6 s after stimulus onset. No baseline correction was performed since this affects the temporal covariance of the data, violating the Toeplitz structure assumption [38].
2.6. Classification 2.6.1. Cross-Validation Scheme per Subject We use a variation of grouped k-fold cross-validation per subject to evaluate the classifiers. We applied four-fold cross-validation by splitting the blocks of each subject into four continuous folds. Unlike regular cross-validation, we only used a single fold to train the classifiers while using the other three folds for validation. This scheme resulted in a training set of 9 blocks of 135 epochs each. We chose this approach since we are primarily interested in the performance of the classifiers in the case of low data availability. The classification task was to determine the cued target for each block. The fraction of correctly predicted cues provided the accuracy of a classifier. Data from all trials were used in the training stage, whereas classifier validation was performed multiple times per fold, each time using an increasing amount of trials (i.e., using the first trial, using the first two trials, etc., until all 15 trials have been used). For each of the 9 stimulated targets, the averages over the corresponding epochs across the utilized trials were used to predict the cued target in that block. The target with the maximum classifier score was then chosen as the predicted cued target. Before training the classifiers, a Z-score normalization transformation was developed on the training data to scale all EEG channels to unit variance. This transformation was then applied to the validation data.

Spatiotemporal Beamformer Classifier
Before calculating the spatiotemporal beamformer (STBF), the signal was downsampled to 32 Hz or twice the low-pass frequency 16 Hz, resulting in 17 time samples between 0.1 s and 0.6 s. According to the Nyquist theorem, more samples would not contain more information; hence, the minimum temporal resolution was chosen to reduce the dimensionality of the covariance. The activation pattern is the difference between the averages of epochs in response to cued targets and the averages of those in response to non-cued targets. We constructed three variations of the spatiotemporal beamformer: STBF with empirical covariance estimation (STBF-EMP) as in Section 2.3.1, STBF with LOOCV-shrunk covariance estimation (STBF-SHRUNK) as in Section 2.3.2, and STBF with Kronecker-Toeplitz-structured covariance estimation (STBF-STRUCT) with LOOCV shrinkage for the Kronecker factors as in Section 2.3.4.

Riemannian Geometry Classifier
We opted for a Riemannian geometry-based classifier to compare our results. The Riemannian model (xDAWN+RG) uses the xDAWN spatial filter combined with Riemannian geometry in tangent space as implemented by Barachant et al. [58]. This classifier uses four xDAWN spatial filters and each epoch's empirical spatial covariance matrix. The target with the maximum score is the prediction of the cued target. xDAWN+RG was trained and validated without downsampling using epochs at the original sample rate of 2048 Hz.

Minimum Required Fixed-Point Iterations
The fixed point iteration algorithm described in Equations (15a)-(16b) is used to estimate the Kronecker-Toeplitz-structured covariance for the STBF-STRUCT classifier. Fixedpoint iteration is an iterative procedure starting from (in our case) non-informed initial guesses for the spatial and temporal covariance matrices. As a stopping criterion, one could impose a threshold on the difference in outcome of successive steps, e.g., based on the covariance norm or the classifier accuracy. However, few iterations or even just one [59] suffice to achieve satisfactory performance in practice.

Classifier Accuracy for Limited Training Data
It is of interest to keep the calibration time before BCI operation as short as possible. We mimic this problem by training the classifier with as few training epochs as possible. We evaluate the performance of all classifiers for different levels of available training data and apply the cross-validation procedure nine times (the number of blocks in the training fold) for all subjects, keeping the corresponding number of blocks in the training folds and dropping the rest. Figures 3 and A1 show each classifier's accuracy relative to the data availability. We statistically compare the two newly proposed classifiers, STBF-STRUCT and STBF-SHRUNK, for different levels of training data availability using a one-sided paired Wilcoxon rank-sum test with Holm correction for the multiple pairwise comparisons between classifiers. We performed this analysis three times: by only using the first trial of a block, by averaging epochs across the first two trials of a block, and across the first five trials of a block. Results validated on one trial are reported in Table 1, two-trial results in Table 2, and five-trial results in Table 3.   Accuracies are shown for the evaluation settings averaging over 1, 2, and 3 trials of testing stimuli. Figure A1 contains results for all numbers of trials. Although STBF-EMP is unstable when few training data are available, regularization of the covariance matrix (STBF-SHRUNK and STBF-STRUCT) drastically improves performance.
The tables show that STBF-STRUCT has a significant advantage over STBF-SHRUNK when the number of training blocks is low. This effect is present for 1-, 2-, and 5-trial evaluations. This advantage decreases (the p-value increases) when adding more training blocks. Both STBF-STRUCT and STBF-SHRUNK perform significantly better than STBF-EMP for all evaluated settings. Compared to xDAWN+RG, STBF-STRUCT also has significantly higher accuracy in almost all evaluated settings, except when using only one training block. STBF-SHRUNK does not outperform xDAWN+RG when training data are scarce but gains a significant advantage when using more training data.

Classifier Training Time
In order to evaluate the training time of the investigated classifiers, the cross-validation scheme was run four times for each subject, each time with an increasing number of EEG channels retained in the analysis, to explore the scalability of each classifier for analyses with higher spatial resolutions. The temporal resolutions were not varied, but we expect that increasing the temporal resolution has a similar effect on training time, since the training times for the STBF-based classifiers are primarily dependent on the number of parameters in their respective covariance matrix estimators, as evidenced by the complexity calculations in Section 4.2. Figure 4 shows the measured training times. These results were obtained using a laptop with an Intel ® Core™ i7-8750H CPU (Intel Corporation, Santa Clara, CA, USA) and 16 GB of RAM.  Figure 4 shows that the training time of STBF-STRUCT increased less steeply than that of STBF-SHRUNK and STBF-EMP. The training time of all three STBF-based classifiers was much lower than that of xDAWN+RG, which appears nearly constant when using 4, 8, 16, or 32 channels.

Classification Accuracy
As evidenced by Figure 3 and Tables 1-3, the regularized classifiers STBF-SHRUNK and STBF-STRUCT significantly improve the classification accuracy compared to the original STBF-EMP for all the numbers of training blocks indicated. We believe there are three reasons for this. First and foremost, the empirical covariance matrix in STBF-EMP becomes ill-conditioned when the number of available training epochs is smaller than the number of features (n < cs), rendering its inversion with the Moore-Penrose pseudoinverse unstable. This is the case for STBF-EMP when n = cs = 32 × 17 = 544, after which the accuracy of STBF-EMP starts to increase. This effect is visible in Figure 3, where the accuracy starts increasing when using more than four training blocks, amounting to 540 epochs. The noticeable dip in accuracy when using around 540 epochs can be explained by numerical effects in the pseudoinverse for very small eigenvalues [60][61][62][63]. Regularization of the covariance matrix with shrinkage ensures that the covariance matrix is non-singular and better conditioned so that it can stably be inverted. Second, covariance regularization introduces a trade-off between the variance and bias of the model [32]. Better performance on unseen data can be achieved when some model variance is traded for extra bias. Regularization reduces the extreme values present, as shown in Figure 1, resulting in a classifier with better generalization. Third, the true spatiotemporal covariance matrix may vary throughout BCI sessions, e.g., due to movement of the EEG-cap, changing impedances of electrodes, subject fatigue, the introduction of new spatiotemporal noise sources, and other possible confounds. A regularized covariance matrix should better account for changes in true covariance. Note that the LOOCV method in principle assumes that the covariances of the training data and unseen data are the same. Because the covariance might have changed for unseen data, the shrinkage estimate obtained with LOOCV is probably still an underestimation of the optimal-but unknown-shrinkage coefficient that would yield the best classification accuracy for the unseen data.
Another observation is the significantly better accuracy score of STBF-STRUCT over STBF-SHRUNK when the amount of available training data is small. This property is an attractive advantage in a BCI setting since it is desirable to keep the calibration (training) phase as short as possible without losing accuracy. The accuracy advantage of the structured estimator is a consequence of the Kronecker-Toeplitz covariance structure, which is informative for the underlying process generating the epochs, if it is assumed that the EEG signal is a linear combination of stationary activity generated by random dipoles in the brain with added noise [24,35,41]. Hence, STBF-STRUCT can utilize this prior information to better estimate the inverse covariance. The increase in accuracy for small training set sizes can also be explained by the smaller number of parameters necessary to estimate the inverse covariance (see Section 4.2), increasing the stability of matrix inversions.
When compared to the state-of-the-art xDAWN+RG classifier, we conclude that STBF-STRUCT reaches similar accuracy when using only one block of training data. The authors suspect this is due to both classifiers having insufficient training information to reach satisfactory classification accuracy. When more data are available, STBF-STRUCT reaches significantly better accuracies. Combined with the benefits laid out in Sections 4.2 and 4.3, this makes it an attractive option for ERP classification. STBF-SHRUNK does not show decisive accuracy improvements over xDAWN+RG using a few training blocks, but this improves as the training data increases.

Time and Memory Complexity
As mentioned above, inverting the full cs × cs dimensional covariance matrix to construct STBF-EMP and STBF-SHRUNK can be costly and unstable, in particular in highresolution settings with many EEG channels or time samples. Constructing the full covariance and inverse covariance matrices also requires a considerable amount of memory. The structured covariance estimator of STBF-STRUCT has two advantages here.
First, because of Properties 1 and 2 there is no need to calculate the full cs × cs symmetric covariance and inverse covariance matrices for STBF-STRUCT or keep them in memory; they can instead be replaced by two smaller symmetric matrices of dimensions c × c and s × s, respectively. Furthermore, since the temporal component of the Kronecker product is Toeplitz-structured, it only requires s parameters to estimate. Although the inverse covariance of STBF-EMP and STBF-SHRUNK is defined by Second, structured estimation has better time complexity. Covariance estimation and inversion occupy the largest part of the STBF training time. For STBF-EMP and STBF-SHRUNK, the time complexity of this process is O(nc 2 s 2 + c 3 s 3 ). Thanks to Property 1, the complexity can be reduced to O(nc 2 s 2 + c 3 + s 3 ) for the structured estimator of STBF-STRUCT. The results presented in Figure 4 confirm these calculations. It can be observed that the training time of STBF-STRUCT stays low compared to STBF-EMP and STBF-SHRUNK when dimensionality increases.
The results shown in Figure 4 also confirm that the STBF-based estimators are very fast to train compared to the state-of-the-art estimator xDAWN+RG, which confirms the results in [21]. Since the training times of all STBF-based classifiers are already of the order of tenths of seconds, the question arises as to whether the improvements achieved using the structured estimator would be relevant in practice. However, the authors believe that these results could significantly impact some use-cases of the spatiotemporal beamformer, such as high-spatial-or temporal-resolution ERP analyses. One example is single-trial ERP analysis with a high temporal resolution to extract ERP time features. Such higher-resolution analy-ses can later be incorporated into an ERP classification framework. In addition, the speed provided by structured estimation yields a faster off-line evaluation of the STBF ERP classifier, in which multiple cross-validation folds, subjects, and hyperparameter settings often need to be explored, which can quickly increase runtime. Improvements in computation speed and memory usage can remove the need for dedicated computation hardware and enable group analyses to be run on a personal computer.

Interpreting the Weights
The weight matrix of the STBF determines how each spatiotemporal feature of a given epoch should contribute to enhancing the SNR of the discriminating signal in the classification task. Alternatively, the activation pattern can be regarded as a forward EEG model of the activity, generating the discriminating signal and the weights as a backward model [60,64]. Regularization enables a researcher to interpret better the distribution of the weight over space and time after reshaping the weight vector w to its spatiotemporal matrix equivalent, W, such that vec(W) = w. Figure 5 compares the weights calculated in STBF-EMP and STBF-SHRUNK with the weights from STBF-STRUCT. Since the linear filter's noise suppression and signal amplification functions are deeply entangled, it is not necessarily true that features with a high filter weight directly correlate to features containing discriminatory information [64]. However, it is still possible to interpret the weights in terms of which features contribute most to the classification process, be it through noise suppression, signal amplification, or-most likely-a combination of both. The weights obtained by STBF-EMP seem to be randomly distributed over space and time; the regularized estimator used by STBF-SHRUNK and STBF-STRUCT reveal a more interpretable weight distribution. The STBF-SHRUNK weights show a sparse spatial distribution, whereas the STBF-STRUCT weights show a sparse distribution in both space and in time.

Oz
As expected, Figure 5b and d exhibit weight around the central and parietal regions, where the P300 ERP component is present. Especially the spatial weights of STBF-SHRUNK in Figure 5d correspond to the spatial activation pattern in Figure 5a. This is not surprising, since shrinkage transforms the covariance matrix closer to the identity matrix and assuming identity covariance in Figure 4 yields weights identical to the activation pattern (up to a scaling factor). Additionally, Figure 5b shows that weights in the baseline interval and after 0.6 s, which should contain no response information, are close to zero for the structured estimator. Meanwhile, these weights are high in the occipital region between 0.1 s and 0.2 s, containing early visual components with relatively low SNR. This high weight for the early visual components confirms the results from Treder and Blankertz [65] that state that, in addition to the P300, the early N1 and P2 ERP components are also modulated by oddball attention and contain discriminatory information between attended and nonattended stimuli.
Using an interpretable classification model has many advantages. For instance, one can use the weight matrix to determine relevant time samples and EEG channels for persubject feature selection to refine the model further. The number of channels is also an important cost factor in practical BCI applications. Determining which channels do not contribute to the classification accuracy helps to reduce the number of required electrodes. Spatially clustered weights indicate that some electrodes are not used by the classifier and can be discarded accordingly with no substantial accuracy reduction. As another example, information about the timing and spatial distribution of the discriminatory information in the response can be extracted from the weights and linked to neurophysiological hypotheses.

Conclusions
Although it is possible to regularize the spatiotemporal LCMV beamformer classifier for ERP detection through other methods, such as by employing feature selection, by adding regularizing penalties to the cost function beamforming problem, or by crafting a cleaner activation pattern, our work focused on estimation methods for spatiotemporal covariance. We introduced a covariance estimator using adaptive shrinkage and an estimator exploiting prior knowledge about the spatiotemporal nature of the EEG signal. We compared these estimators with the original spatiotemporal beamformer and a state-of-the-art method in an off-line P300 detection task. Our results show that the structured estimator performs better when training data are sparsely available and that results can be computed faster and with substantially less memory usage. Since these algorithms are not paradigm-specific, the conclusions can be generalized to other ERP-based BCI settings.
Future work should focus on introducing more robust regularization strategies using prior knowledge, such as shrinking the spatial covariance to a population mean or a previously known matrix based on sensor geometry or characterizing the temporal covariance as a wavelet or autoregressive model. More accurate results could be obtained by expressing the covariance as the sum of multiple Kronecker products to account for spatial variation in temporal covariance. It could also be interesting to explore the impact of covariance regularization on transfer learning of the STBF between subjects to alleviate calibration entirely. Finally, it could be insightful to evaluate the proposed algorithms in a real-world on-line BCI setting.   (G0A4118N, G0A4321N,  G0C1522N), and the Hercules Foundation (AKUL 043).

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of KU Leuven's university hospital (UZ Leuven) (S62547 approved 11 June 2019).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article. Source code is available at https://github.com/kul-compneuro/stbf-erp (accessed 7 March 2022).

Acknowledgments:
The authors acknowledge François Cabestaing and Hakim Si-Mohammed for their valuable input in the development of this article.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or the interpretation of data; in the writing of the manuscript, or in the decision to publish the results.