Closed-Loop Cognitive-Driven Gain Control of Competing Sounds Using Auditory Attention Decoding

Abstract: Recent advances have shown that it is possible to identify the target speaker that a listener is attending to using single-trial EEG-based auditory attention decoding (AAD). Most AAD methods have been investigated for an open-loop scenario, where AAD is performed in an offline fashion without presenting online feedback to the listener. In this work, we aim at developing a closed-loop AAD system that makes it possible to enhance a target speaker, suppress an interfering speaker, and switch attention between both speakers. To this end, we propose a cognitive-driven adaptive gain controller (AGC) based on real-time AAD. Using the EEG responses of the listener and the speech signals of both speakers, the real-time AAD generates probabilistic attention measures, based on which the attended and the unattended speaker are identified. The AGC then amplifies the identified attended speaker and attenuates the identified unattended speaker, which are presented to the listener via loudspeakers. We investigate the performance of the proposed system in terms of the decoding performance and the signal-to-interference ratio (SIR) improvement. The experimental results show that, although there is a significant delay in detecting attention switches, the proposed system is able to improve the SIR between the attended and the unattended speaker. In addition, no significant difference in decoding performance is observed between closed-loop AAD and open-loop AAD. The subjective evaluation results show that the proposed closed-loop cognitive-driven system demands a similar level of cognitive effort to follow the attended speaker, to ignore the unattended speaker, and to switch attention between both speakers compared to using open-loop AAD. Closed-loop AAD in an online fashion is therefore feasible and enables the listener to interact with the AGC.


Introduction
Hearing aids aim at restoring the normal hearing abilities by several processing steps including speech enhancement. The main objective of speech enhancement is to improve the intelligibility of the recorded microphone signals, which are often corrupted by various noise sources [1,2]. In a scenario with multiple competing speakers, the performance of many speech enhancement algorithms, for example, beamforming and blind source separation, depends on correctly identifying the target speaker, i.e., the speaker which the listener is attending to.
Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker from single-trial EEG recordings [3–18], which are noninvasive and have appropriate temporal resolution for auditory stimuli. Several single-trial EEG-based auditory attention decoding (AAD) methods have been proposed to identify the speaker that the listener is attending to, with the aim of incorporating them in real-world applications, e.g., to control a hearing aid. AAD methods aim at identifying the attended speaker by relating the EEG responses of the listener to the speech signals of the speakers. These methods are based on, for example, a least-squares cost function [3–7,11,14,16], canonical correlation analysis [8,13], a state-space model [9,18], and neural networks [12,15,17]. The least-squares-based AAD method used in References [3,5–7,11,14] aims at reconstructing the attended speech envelope from the EEG responses of the listener using a trained spatio-temporal envelope estimator. To identify the attended speaker, the reconstructed speech envelope is compared with the speech envelopes of the competing speakers using correlation coefficients. Since these correlation coefficients typically fluctuate strongly, a large correlation window of about 30 s is usually required to obtain a reliable decoding performance, which causes a large processing delay [9,18,19].
The possibility of decoding auditory attention from EEG recordings has led to an increasing research interest in the topic of incorporating AAD in a brain-computer interface for real-world applications, for example, to cognitively drive speech enhancement algorithms [19–25]. Cognitive-driven speech enhancement algorithms potentially provide the listener with the ability to selectively attend to a specific speaker. It should, however, be noted that the performance of most aforementioned AAD methods and cognitive-driven speech enhancement algorithms has been investigated for an open-loop scenario, where AAD is performed in an offline fashion without presenting online feedback to the listener. In addition, scenarios with no attention switch between speakers have typically been investigated, which is unrealistic in practice.
To investigate the performance of AAD for real-world applications, closing the loop by presenting feedback according to the AAD results in an online fashion is of crucial importance. Feedback presentation may influence the subsequent intent of the listener and the brain signals that encode that intent. In Reference [5], the feasibility of a closed-loop system based on least-squares-based AAD has been shown by presenting the AAD results as visual feedback, i.e., using different colors or a sphere with different radii. However, the feasibility of closed-loop AAD enabling the listener to interact with speech enhancement in an online fashion and allowing the listener to switch attention between speakers remains to be investigated.
In this paper, we aim at developing a closed-loop AAD system that allows to enhance a target speaker, suppress an interfering speaker and switch attention between both speakers. Specifically, we propose a cognitive-driven adaptive gain controller (AGC), which is based on real-time AAD (RAAD). The RAAD first generates correlation coefficients for both speakers from the EEG responses of the listener and the speech signals of both speakers. To this end, we adopted the least-squares-based AAD method from Reference [3], either using a small correlation window of length 0.25 s or a large correlation window of length 15 s. The fluctuating correlation coefficients are then translated into more reliable probabilistic attention measures, based on which the attended and the unattended speaker are identified. To this end, we propose an AAD algorithm either using a generalized linear model (GLM) or using a state-space model (SSM), similarly to Reference [9,18]. The AGC as an ideal speech enhancement algorithm then amplifies the identified attended speaker and attenuates the identified unattended speaker, where the gains for both speakers are based on the probabilistic attention measures. Finally, the loop of cognitive-driven gain control is closed by presenting the amplified attended speaker and the attenuated unattended speaker to the listener via loudspeakers, enabling the listener to interact with the AGC in an online fashion and switch attention between speakers. For an acoustic scenario comprising two competing speakers where one speaker is located on the left side and the other speaker is located on the right side, we investigate the decoding performance and the speech enhancement performance of the proposed closed-loop cognitive-driven gain controller system with 10 participants based on objective and subjective evaluations. 
In addition, we provide a detailed analysis and experimental comparison between the open-loop and the closed-loop AAD system using either the GLM or the SSM.
The paper is organized as follows. In Section 2, we introduce the experiment protocol used to calibrate and evaluate the proposed cognitive-driven gain controller system, describe the stimuli and the data acquisition used for the experiments, and present the proposed cognitive-driven gain controller system. In Section 3, we evaluate the decoding performance and the speech enhancement performance of the proposed cognitive-driven gain controller system. In Section 4, we discuss the experimental results in more detail, summarize the main contributions, and suggest possible topics for further research.

Experiment Protocol
In this section, we present the experiment protocol used to calibrate and evaluate the cognitive-driven gain controller system. The experiment protocol consists of a calibration phase, an open-loop AAD phase, and a closed-loop AAD phase (see Figure 1).

Calibration Phase
In the calibration phase, the cognitive-driven gain controller system was individually calibrated for each participant using the EEG responses for a scenario with two competing speakers (see Figure 2). Participants were cued by an arrow on a screen to listen attentively to one of the speakers while recording the ongoing EEG responses. Participants were also instructed to minimize eye movement and blinking, which may cause EEG artifacts. The EEG responses were recorded during four sessions, lasting 30 min in total. The first and the second session each lasted 10 min, while the third and the fourth session each lasted 5 min. For the first and the third session, the participants were cued to attend to the left speaker, whereas, for the second and the fourth session, the participants were cued to attend to the right speaker. Following each session, the participants were asked to fill out a questionnaire consisting of multiple-choice questions about the stories uttered by the speakers. There was one question per minute of each story. The questionnaire was used to check whether the participants attended to the cued speaker. After the fourth session, there was a short break. During this break, the recorded EEG responses were used to calibrate the cognitive-driven gain controller system (see Section 2.5), individualized per participant.

Open-Loop AAD Phase
In the open-loop AAD phase, the calibrated AAD algorithms were used to identify the attended and the unattended speaker without presenting feedback to the participants. The open-loop AAD phase consisted of one session lasting 10 min. During this session, participants were cued by an arrow on a screen every minute to switch attention between the competing speakers while the ongoing EEG responses were recorded. Afterwards, participants were asked to rate how much effort it took to follow the attended speaker, to ignore the unattended speaker, and to switch attention to the cued speaker on a scale from 0 to 10, with 0 being least effort and 10 being most effort. In addition, participants were asked to rate how well they understood the attended story on a scale from 0 to 10, with 0 being nothing understood and 10 being everything understood. Furthermore, the participants were asked to fill out a questionnaire consisting of multiple-choice questions about the stories uttered by the speakers. While participants were rating and answering the questionnaire, the decoding performance of several AAD algorithms (see Section 2.5.1) was evaluated using the recorded EEG responses. The following AAD algorithms were considered:
• LW-GLM: AAD algorithm using a generalized linear model (GLM) with a large correlation window (LW) of 15 s.
• LW-SSM: AAD algorithm using a state-space model (SSM) with a large correlation window (LW) of 15 s.
• SW-SSM: AAD algorithm using a state-space model with a small correlation window (SW) of 0.25 s. Using a small correlation window was motivated by the results in Reference [9], where it was shown that the state-space model is able to translate highly fluctuating coefficients of the spatio-temporal envelope estimators into reliable probabilistic attention measures.
Note that, in this paper, an AAD algorithm using a generalized linear model with a small correlation window was not considered, since initial experiments showed highly fluctuating correlation coefficients with unreliable probabilistic attention measures.

Figure 2. Overview of the proposed cognitive-driven gain controller system for a scenario with two competing speakers. The EEG responses were acquired using the EEG amplifier. The acquired EEG responses and EEG trigger markers were streamed using the gUSBamp application. Using the gUSBamp application and the LSL software package, the streamed EEG responses were forwarded to the real-time AAD (RAAD) for online decoding, to the Lab Recorder application for recording, and to the OpenViBE software for online EEG visualization. The RAAD was implemented and run using MATLAB (MATLAB 1). The RAAD identified the attended and the unattended speaker and generated their corresponding probabilistic attention measures. The generated probabilistic attention measure of the attended speaker ($\hat{p}_a$) was forwarded to the AGC using the LSL software package. Based on the probabilistic attention measure, the AGC amplified the attended speaker ($\lambda_a \hat{s}_a$) and attenuated the unattended speaker ($\lambda_u \hat{s}_u$) as acoustic stimuli. The AGC (together with trigger marker and visual stimuli) was implemented and run using MATLAB (MATLAB 2). The AAD loop was then closed by presenting the acoustic stimuli using the audio interface and two loudspeakers.

Closed-Loop AAD Phase
In the closed-loop AAD phase, the calibrated cognitive-driven gain controller system was used to identify the attended and the unattended speaker and to close the loop by presenting the amplified attended speaker and the attenuated unattended speaker in an online fashion via loudspeakers. The closed-loop AAD phase consisted of three sessions, each lasting 10 min. During each session, participants were cued by an arrow on a screen every minute to switch attention between the presented competing speakers while recording the ongoing EEG responses. To identify the attended and the unattended speaker, in each session a different AAD algorithm was used, i.e., LW-GLM, LW-SSM, and SW-SSM. These AAD algorithms were randomly assigned to the sessions for each participant. After each session, the participants were asked to fill out a questionnaire consisting of multiple-choice questions about the stories uttered by the speakers, similarly to the open-loop AAD phase. In addition, the participants were asked to rate how much effort it took to follow the attended speaker, to ignore the unattended speaker, to switch attention to the cued speaker, and how well they understood the attended story. For analyzing the experimental results, 24% of the results needed to be excluded, 5% due to a technical hardware problem when saving the results and 19% due to poor attentional performance reported by participants themselves after a few sessions (Some participants reported that they either completely lost concentration to attend to the cued speaker or they completely engaged with the story uttered by the non-cued speaker.).

Participants
Ten native German-speaking participants (aged between 22 and 31 years; 6 male, 4 female) took part in this study (participation in an EEG experiment was announced, and ten persons responded). Informed consent was obtained from all participants. All participants self-reported normal hearing and reported no past or present neurological or psychiatric conditions. The study was carried out in accordance with the Declaration of Helsinki.

Stimuli
Two German audio stories, uttered by two different male speakers, were used as the speech signals of the competing speakers. One story was from a German audio book website [26], and the other story was from a selection of audio books [27]. Before performing the experiment, participants reported no knowledge of the audio stories. Speech pauses in the audio stories that exceeded 0.5 s were shortened to 0.5 s. The audio stories were normalized to the same root-mean-square (RMS) value at a comfortable level, which was set individually for each participant. The audio stories, with no repetition, were used as the acoustic stimuli for the calibration phase, the open-loop AAD phase, and the closed-loop AAD phase. The acoustic stimuli were presented at a sampling frequency of 44,100 Hz using MATLAB (MATLAB 2 in Figure 2), a Fireface UC audio interface system (provided by RME Audio, Germany), and two loudspeakers placed at the left side (with an azimuth of −45°) and the right side (with an azimuth of 45°) at a distance of 1 m from the participants. The visual stimuli, consisting of an arrow for cueing, were presented using a monitor in front of the participants. In addition, the EEG trigger markers synchronized with the acoustic and visual stimuli were generated using the Fireface UC audio interface system and a g.TRIGbox (provided by g.tec, Austria). The presentation of the acoustic and visual stimuli and the trigger marker generation were performed using the same computer employed for the cognitive-driven gain controller system (see Figure 2 and Table 1).

Data Acquisition
Aiming at using a small number of electrodes for AAD, EEG responses were acquired using C = 16 electrodes. The electrodes were placed on the scalp area at F1, F2, FC3, FC4, FT7, FT8, Cz, C5, C6, P3, P4, P7, P8, Oz, PO3, and PO4 (see Figure 3). This electrode placement was inspired by the results in Reference [6,28], where it was shown that an electrode configuration covering the temporal, central, frontal, and parietal scalp areas yields a reliable decoding performance. The EEG responses were referenced to the P9 electrode (Since we did not observe a significant difference in AAD performance between referencing to the P9 electrode or to the nose electrode, we decided to use P9 as the fixed reference electrode for all phases.). The EEG responses were acquired using active (g.LADYbird) electrodes and a g.USBamp bio-signal amplifier (provided by g.tec, Austria). The acquired EEG responses and EEG trigger markers were streamed at a sampling frequency of 500 Hz using the gUSBamp application from the Lab Streaming Layer (LSL) software package (provided by Swartz Center for Computational Neuroscience, UCSD). Using the gUSBamp application, the streamed EEG responses were also forwarded to RAAD for online decoding, to the Lab Recorder application (provided by Swartz Center for Computational Neuroscience and Kothe) for recording, and to the OpenViBE software for online EEG visualization (see Table 1). The gUSBamp application, the OpenViBE software and the Lab Recorder application were run on the same computer employed for the cognitive-driven gain controller system (see Figure 2).

Cognitive-Driven Gain Controller System
In this section, we present the proposed cognitive-driven gain controller system consisting of RAAD and AGC (see Figure 4). Section 2.5.1 describes the RAAD, which generates probabilistic attention measures based on which the attended and the unattended speaker are identified. Section 2.5.2 describes the AGC, which amplifies the identified attended speaker and attenuates the identified unattended speaker based on the probabilistic attention measures.  The RAAD consists of three blocks (see Figure 4), i.e., pre-processing of the EEG responses and speech signals, correlation coefficient generation and AAD using either GLM or SSM.
A. Pre-Processing: The streamed EEG responses from the gUSBamp application were re-referenced to a common average reference, band-pass filtered between 0.5 Hz and 9 Hz using a fourth-order Butterworth band-pass filter, and, subsequently, downsampled to 64 Hz in an online fashion. Contrary to the online EEG pre-processing, the speech pre-processing was performed in an offline fashion, since the speech signal $s_{1,t}$ of speaker 1 and the speech signal $s_{2,t}$ of speaker 2, with $t$ the discrete time index for $t = 1 \ldots T$, are available. The envelopes of both speech signals, $e_{1,k}$ and $e_{2,k}$, with $k$ the sub-sampled time index for $k = 1 \ldots K$, were obtained using a Hilbert transform, followed by low-pass filtering at 9 Hz and downsampling to 64 Hz. The pre-processed EEG responses and the speech envelopes were then provided in an online fashion to the correlation coefficient generation block.

B. Correlation Coefficient Generation:
To generate the correlation coefficients of speaker 1 and speaker 2, we adopted the least-squares-based AAD method from Reference [3], which estimates the attended speech envelope from the EEG responses using a spatio-temporal envelope estimator trained during the calibration phase.
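(1) Training step (calibration phase): In the training step, the attended speaker is assumed to be known. The attended speech envelope is then estimated from the pre-processed EEG responses $r_{c,k}$, with $c$ the electrode index for $c = 1 \ldots C$, using a spatio-temporal envelope estimator $\mathbf{g}$ [3], i.e.,

$$\hat{e}_{a,k} = \mathbf{g}^{T} \mathbf{r}_{k}, \tag{1}$$

with

$$\mathbf{g} = \left[ \mathbf{g}_{1}^{T} \; \mathbf{g}_{2}^{T} \; \ldots \; \mathbf{g}_{C}^{T} \right]^{T}, \tag{2}$$

$$\mathbf{g}_{c} = \left[ g_{c,0} \; g_{c,1} \; \ldots \; g_{c,J-1} \right]^{T}, \tag{3}$$

$$\mathbf{r}_{k} = \left[ \mathbf{r}_{1,k}^{T} \; \mathbf{r}_{2,k}^{T} \; \ldots \; \mathbf{r}_{C,k}^{T} \right]^{T}, \tag{4}$$

$$\mathbf{r}_{c,k} = \left[ r_{c,k} \; r_{c,k+1} \; \ldots \; r_{c,k+J-1} \right]^{T}, \tag{5}$$

where $J$ denotes the number of envelope estimator coefficients per electrode. The trained envelope estimator $\mathbf{g}$ is obtained by minimizing the least-squares error between the (known) attended speech envelope $e_{a,k}$ and the reconstructed envelope $\hat{e}_{a,k}$, regularized with the squared norm of the derivative of the envelope estimator coefficients to avoid over-fitting [3,14,29], i.e.,

$$\mathbf{g} = \arg\min_{\mathbf{g}} \; \mathcal{E}\left\{ \left| e_{a,k} - \hat{e}_{a,k} \right|^{2} \right\} + \beta \, \mathbf{g}^{T} \boldsymbol{\Lambda} \mathbf{g}, \tag{6}$$

with $\mathcal{E}\{\cdot\}$ denoting the expectation operator, $\boldsymbol{\Lambda}$ denoting the derivative matrix [14], and $\beta$ denoting a regularization parameter. The solution of (6) is equal to

$$\mathbf{g} = \left( \mathbf{Q} + \beta \boldsymbol{\Lambda} \right)^{-1} \mathbf{q}, \tag{7}$$

with the correlation matrix $\mathbf{Q}$ and the cross-correlation vector $\mathbf{q}$ given by

$$\mathbf{Q} = \mathcal{E}\left\{ \mathbf{r}_{k} \mathbf{r}_{k}^{T} \right\}, \quad \mathbf{q} = \mathcal{E}\left\{ \mathbf{r}_{k} e_{a,k} \right\}. \tag{8}$$

(2) Correlation coefficient generation step (open-loop and closed-loop AAD phase): To generate the correlation coefficients of speaker 1 and speaker 2, we compute the Pearson correlation coefficients between the estimated attended envelope $\hat{e}_{a,k}$ in (1) and the speech envelopes $e_{1,k}$ and $e_{2,k}$, i.e.,

$$\rho_{1,k} = \rho\left( \mathbf{e}_{1,k}, \hat{\mathbf{e}}_{a,k} \right), \quad \rho_{2,k} = \rho\left( \mathbf{e}_{2,k}, \hat{\mathbf{e}}_{a,k} \right), \tag{9}$$

where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient and $\hat{\mathbf{e}}_{a,k}$ denotes the stacked vector of estimated attended envelopes corresponding to a correlation window of length $K_{\mathrm{COR}}$, i.e.,

$$\hat{\mathbf{e}}_{a,k} = \left[ \hat{e}_{a,k-K_{\mathrm{COR}}+1} \; \ldots \; \hat{e}_{a,k-1} \; \hat{e}_{a,k} \right]^{T}, \tag{10}$$

and $\mathbf{e}_{1,k}$ and $\mathbf{e}_{2,k}$ are defined similarly as in (10).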
In the training step, the pre-processed EEG responses obtained from the calibration phase were segmented into trials of length 15 s, shifted by 1 sample (corresponding to 1/64 s). The parameters $J$ and $\beta$ of the envelope estimator in (3) and (7) were determined for each participant using a leave-one-trial-out cross-validation approach, similarly as in Reference [3,14]. Using these parameters, a trained spatio-temporal envelope estimator $\mathbf{g}$ in (7) was then computed for each participant using all trials from the calibration phase.
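The training step above can be illustrated with a minimal NumPy sketch of a regularized least-squares envelope estimator. It is not the authors' MATLAB implementation. The function names are hypothetical, expectations are replaced by sample averages, and the derivative matrix is assumed to be a first-difference penalty applied per electrode, since the exact matrix from Reference [14] is not given in the text.

```python
import numpy as np

def lagged_eeg(eeg, n_lags):
    """Build the stacked EEG vectors: row k holds n_lags samples (k .. k+n_lags-1)
    of every electrode, ordered electrode by electrode."""
    n_samples, n_channels = eeg.shape
    n_rows = n_samples - n_lags + 1
    cols = [eeg[j:j + n_rows, c] for c in range(n_channels) for j in range(n_lags)]
    return np.stack(cols, axis=1)

def derivative_matrix(n_channels, n_lags):
    """ASSUMPTION: penalty on first differences of the coefficients of each electrode."""
    d = np.diff(np.eye(n_lags), axis=0)          # (n_lags-1, n_lags) difference operator
    return np.kron(np.eye(n_channels), d.T @ d)  # block-diagonal, one block per electrode

def train_estimator(eeg, envelope, n_lags, beta):
    """Solve (Q + beta * Lambda) g = q with sample estimates of Q and q."""
    r = lagged_eeg(eeg, n_lags)
    e = envelope[:r.shape[0]]
    q_mat = r.T @ r / r.shape[0]
    q_vec = r.T @ e / r.shape[0]
    lam = derivative_matrix(eeg.shape[1], n_lags)
    return np.linalg.solve(q_mat + beta * lam, q_vec)

def reconstruct(eeg, g, n_lags):
    """Estimated attended envelope, one value per stacked EEG vector."""
    return lagged_eeg(eeg, n_lags) @ g
```

In a cross-validation loop, `n_lags` and `beta` would be swept over candidate values and the pair yielding the best held-out reconstruction kept.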
In the correlation coefficient generation step, the pre-processed EEG responses were segmented in the same way as in the training step. The correlation coefficients $\rho_{1,k}$ and $\rho_{2,k}$ in (9) were computed either using a large correlation window of length $K_{\mathrm{COR}} = 960$ samples (corresponding to 15 s) with an overlap of 959 samples or using a small correlation window of length $K_{\mathrm{COR}} = 16$ samples (corresponding to 0.25 s) with no overlap. In Reference [4,7,9,18], it has been shown that the performance of AAD algorithms is affected by fluctuations of the correlation coefficients. In this paper, we propose two methods (GLM and SSM) to translate the fluctuating correlation coefficients into more reliable probabilistic attention measures.
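The two window configurations described above can be sketched as follows. The helper name `windowed_correlations` and the `LARGE`/`SMALL` dictionaries are illustrative only, and the windows are assumed to end at the current sample, as an online system would require.

```python
import numpy as np

FS = 64  # Hz, sampling rate of the pre-processed EEG and speech envelopes

def windowed_correlations(e_hat, e_ref, k_cor, hop):
    """Pearson correlation between e_hat and e_ref over windows of k_cor samples,
    advancing by hop samples (hop=1 gives an overlap of k_cor-1 samples)."""
    rhos = []
    for end in range(k_cor, len(e_hat) + 1, hop):
        a = e_hat[end - k_cor:end]
        b = e_ref[end - k_cor:end]
        rhos.append(np.corrcoef(a, b)[0, 1])
    return np.array(rhos)

LARGE = dict(k_cor=15 * FS, hop=1)        # 960 samples, overlap of 959 samples
SMALL = dict(k_cor=FS // 4, hop=FS // 4)  # 16 samples, no overlap
```

With the large window a new coefficient is produced at every sample (64 per second), whereas the small window produces one coefficient every 0.25 s.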
C. Auditory Attention Decoding Using Generalized Linear Model: The AAD algorithm using the GLM consists of a training and a decoding step. The training step takes place during the calibration phase, whereas the decoding step takes place during the open-loop and the closed-loop AAD phase.
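(1) Training step: The correlation coefficients of speaker 1 and speaker 2 in (9) are first segmented into non-overlapping (NOL) windows of length $K_{\mathrm{NOL}}$, i.e.,

$$\boldsymbol{\rho}_{1,i} = \left[ \rho_{1,(i-1)K_{\mathrm{NOL}}+1} \; \ldots \; \rho_{1,iK_{\mathrm{NOL}}} \right]^{T}, \tag{11}$$

$$\boldsymbol{\rho}_{2,i} = \left[ \rho_{2,(i-1)K_{\mathrm{NOL}}+1} \; \ldots \; \rho_{2,iK_{\mathrm{NOL}}} \right]^{T}, \tag{12}$$

with $i$ the window index for $i = 1 \ldots I$. The mean differential correlation coefficient between speaker 1 and speaker 2 in window $i$ is computed as

$$\Delta\bar{\rho}_{i} = \frac{1}{K_{\mathrm{NOL}}} \sum_{k=(i-1)K_{\mathrm{NOL}}+1}^{iK_{\mathrm{NOL}}} \left( \rho_{1,k} - \rho_{2,k} \right). \tag{13}$$

We model the attention state $\bar{d}_{i}$ in window $i$ as a binary random variable [30], i.e.,

$$\bar{d}_{i} = \begin{cases} 1, & \text{attending to speaker 1 in window } i \\ 2, & \text{attending to speaker 2 in window } i, \end{cases} \tag{14}$$

which is assumed to follow a Bernoulli distribution with probability $\bar{p}_{i}$, i.e.,

$$P(\bar{d}_{i}) = \begin{cases} \bar{p}_{i}, & \bar{d}_{i} = 1 \\ 1 - \bar{p}_{i}, & \bar{d}_{i} = 2. \end{cases} \tag{15}$$

Using a GLM, the probability of attending to speaker 1 is then given by [31]

$$\bar{p}_{i} = P(\bar{d}_{i} = 1) = \frac{1}{1 + \exp(-\bar{z}_{i})}, \tag{16}$$

with the linear predictor $\bar{z}_{i}$, i.e.,

$$\bar{z}_{i} = \alpha_{0} + \alpha_{1} \Delta\bar{\rho}_{i}, \tag{17}$$

where $\alpha_{0}$ and $\alpha_{1}$ denote the GLM parameters. The probability of attending to speaker 1 increases monotonically from 0 to 1 for $\bar{z}_{i} \in (-\infty, \infty)$.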
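The probability mass function in (15) can be written in exponential-family form using the canonical link function $\theta_{i} = \mathrm{logit}(\bar{p}_{i}) = \bar{z}_{i}$, with $\mathrm{logit}(\bar{p}_{i}) = \log \frac{\bar{p}_{i}}{1 - \bar{p}_{i}}$, i.e., $P(\bar{d}_{i} = 1) = \exp\left(\theta_{i} - b(\theta_{i})\right)$ and $P(\bar{d}_{i} = 2) = \exp\left(-b(\theta_{i})\right)$, with $b(\theta_{i}) = \log(1 + \exp(\theta_{i}))$.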
The maximum likelihood (ML) estimate of the GLM parameters in (19) is then obtained by maximizing the log-likelihood function. This estimate can be computed, for example, by using an iteratively re-weighted least-squares algorithm and the Newton-Raphson method [32,33], with $r$ the iteration index and $(\cdot)'$ denoting the derivative operator. Algorithm 1 summarizes the GLM parameter estimation in the training step.

Algorithm 1: GLM parameter estimation (training step)
...
3: ... using (26) and (27), respectively
4: calculate $\mathbf{y}^{(r)}$ and $\mathbf{W}^{(r)}$ using (28), (29), and (25), respectively
5: update the GLM parameters $\hat{\boldsymbol{\alpha}}^{(r+1)}$ using (23)
6: end for
output: $\hat{\boldsymbol{\alpha}} = \hat{\boldsymbol{\alpha}}^{(R+1)}$

(2) Decoding step: To decode which speaker a participant is attending to in window $i$, the mean differential correlation coefficient $\Delta\bar{\rho}_{i}$ is computed using (13), based on which the linear predictor $\bar{z}_{i}$ is computed using the (trained) GLM parameters $\hat{\boldsymbol{\alpha}}$ in (17). The probability of attending to speaker 1, i.e., $P(\bar{d}_{i} = 1)$, and the probability of attending to speaker 2, i.e., $P(\bar{d}_{i} = 2)$, are then obtained using (16). Based on these probabilities, it is decided that the participant attended to speaker 1 if $P(\bar{d}_{i} = 1) > P(\bar{d}_{i} = 2)$, or attended to speaker 2 otherwise. The probabilistic attention measure of the attended speaker $\hat{p}_{a,i}$ in window $i$ is, hence, determined as

$$\hat{p}_{a,i} = \max\left\{ P(\bar{d}_{i} = 1), \; P(\bar{d}_{i} = 2) \right\}. \tag{30}$$

By construction, the probabilistic attention measure of the attended speaker $\hat{p}_{a,i}$ lies between 0.5 and 1. The probabilistic attention measure of the unattended speaker $\hat{p}_{u,i}$ is determined as $\hat{p}_{u,i} = 1 - \hat{p}_{a,i}$. The process flow of AAD using the GLM is depicted in Figure 4.
The AAD algorithm using the GLM was implemented and run using MATLAB (MATLAB 1 of RAAD in Figure 2). For the training step, Algorithm 1 was executed with $R = 30$ iterations using the correlation coefficients obtained from the calibration phase. For both the training and the decoding step, the correlation coefficients were computed using the large correlation window (i.e., $K_{\mathrm{COR}} = 960$ samples) and the mean differential correlation coefficient in (13) was computed using a window of length $K_{\mathrm{NOL}} = 16$ samples (corresponding to 0.25 s). During the decoding step, the probabilistic attention measures $\hat{p}_{a,i}$ and $\hat{p}_{u,i}$ were forwarded to the AGC using the LSL software package (see Figure 2). Each participant's own data were used for training the GLM parameters and for decoding. To evaluate the performance of the proposed LW-GLM algorithm, the decoding performance for each participant was computed as the percentage of correctly decoded NOL windows. To evaluate the delay in detecting a cued attention switch, the delay was computed as the time it takes the LW-GLM algorithm to detect an attention switch after the moment the arrow on the screen cued the participant to switch attention.
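The GLM training and decoding steps can be illustrated with a small pure-Python sketch. It fits the two logistic parameters with plain Newton-Raphson updates, which for the canonical logit link coincide with iteratively re-weighted least squares. It is not a transcription of Algorithm 1. The function names, the 0/1 label coding, and the early stop on a near-singular Hessian are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_glm(delta_rho, attends_1, n_iter=30):
    """Newton-Raphson fit of (alpha0, alpha1).
    attends_1[i] is 1 if speaker 1 was attended in window i, else 0 (label coding assumed)."""
    a0, a1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(delta_rho, attends_1):
            p = sigmoid(a0 + a1 * x)
            w = p * (1.0 - p)          # IRLS weight of this window
            g0 += y - p                # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                   # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:               # stop if the Hessian is (near) singular
            break
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def decode(delta_rho_i, a0, a1):
    """Return the identified attended speaker (1 or 2) and its attention measure."""
    p1 = sigmoid(a0 + a1 * delta_rho_i)
    attended = 1 if p1 > 1.0 - p1 else 2
    return attended, max(p1, 1.0 - p1)
```

For example, `decode(0.2, *fit_glm(xs, ys))` returns speaker 1 together with an attention measure between 0.5 and 1 whenever the fitted slope is positive.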

D. Auditory Attention Decoding Using State-Space Model:
As an alternative to the GLM, it has been proposed in Reference [9] to use an SSM to translate the absolute values of the coefficients of the spatio-temporal envelope estimator into probabilistic attention measures. Contrary to Reference [9], in this paper, we propose to use the absolute values of the correlation coefficients instead of the coefficients of the spatio-temporal envelope estimator, which need to be obtained for both the attended and the unattended speaker. Similarly to (14), we model the attention state $d_{k}$ at time instance $k$ as a binary random variable, i.e.,

$$d_{k} = \begin{cases} 1, & \text{attending to speaker 1 at time instance } k \\ 2, & \text{attending to speaker 2 at time instance } k, \end{cases}$$

which is assumed to follow a Bernoulli distribution with probability $p_{k}$. Similarly to (16), the probability of attending to speaker 1 is given by

$$p_{k} = P(d_{k} = 1) = \frac{1}{1 + \exp(-z_{k})}, \tag{34}$$

where the variable $z_{k}$ is now modeled as an autoregressive (AR) process, i.e.,

$$z_{k} = c_{0} z_{k-1} + w_{k}. \tag{35}$$

The parameter $c_{0}$ is a hyperparameter ensuring stability of the AR process, and the noise process $w_{k}$ is assumed to follow a normal distribution with variance $\eta_{k}$, i.e.,

$$w_{k} \sim \mathcal{N}(0, \eta_{k}), \tag{36}$$

$$\eta_{k} \sim \text{Inverse-Gamma}(a_{0}, b_{0}), \tag{37}$$

where $a_{0}$ and $b_{0}$ are hyperparameters. The AR model in (35) implies that the variable $z_{k}$ at time instance $k$ is predicted from $z_{k-1}$ at the previous time instance with some uncertainty, which is modeled by the noise process $w_{k}$. To relate the correlation coefficients $\rho_{1,k}$ and $\rho_{2,k}$ in (9) to the attention state $d_{k}$, we model the probability of the absolute values of the correlation coefficients, given attention to speaker 1 or speaker 2, using a log-normal distribution (please note that modeling the probabilities of the absolute values of the correlation coefficients with log-normal distributions allows for a closed-form iterative solution [9], compared to modeling the probabilities of the correlation coefficients either with normal or von Mises-Fisher distributions [30]), where $\bar{\gamma}_{a}$, $\bar{\nu}_{a}$, and $\bar{\mu}_{a}$ denote the hyperparameters of the attended log-normal distribution.
Similarly, we model the probability of the absolute values of the correlation coefficients, given no attention to speaker 1 or speaker 2, using a log-normal distribution, where $\bar{\gamma}_{u}$, $\bar{\nu}_{u}$, and $\bar{\mu}_{u}$ denote the hyperparameters of the unattended log-normal distribution. Since a small overlap between the attended and the unattended log-normal distributions is desired for a reliable decoding performance, the hyperparameters $\bar{\gamma}_{\{a,u\}}$, $\bar{\nu}_{\{a,u\}}$, and $\bar{\mu}_{\{a,u\}}$ are tuned to minimize the overlap. Aiming at estimating the probability of attending to speaker 1 and speaker 2 at time instance $k = k^{*}$ (see Figure 4), we now consider the absolute values of the correlation coefficients within a sliding window of length $K_{\mathrm{SSM}} = K_{P} + K_{A} + 1$, with $K_{P}$ and $K_{A}$ denoting the number of correlation coefficients prior to and after $k^{*}$, respectively. The set of parameters to be estimated in this window is given by

$$\boldsymbol{\Omega} = \left\{ z_{k^{*}-K_{P}:k^{*}+K_{A}}, \; \eta_{k^{*}-K_{P}:k^{*}+K_{A}}, \; \delta_{a}, \; \mu_{a}, \; \delta_{u}, \; \mu_{u} \right\}.$$

The maximum a posteriori (MAP) estimate is obtained by maximizing the log-posterior function, i.e.,

$$\hat{\boldsymbol{\Omega}} = \arg\max_{\boldsymbol{\Omega}} \; \log P\left( \boldsymbol{\Omega} \mid \phi_{1,k}, \phi_{2,k}, \; k = k^{*}-K_{P} : k^{*}+K_{A} \right), \tag{42}$$

which can be computed iteratively using the Expectation Maximization (EM) algorithm as in Reference [9,30]. Using the estimated variable $z_{k}$, the probability $p_{k} = P(d_{k} = 1)$ of attending to speaker 1 at time instance $k$ is obtained using (34). These probabilities are segmented into non-overlapping windows of length $K_{\mathrm{NOL}}$, i.e.,

$$\mathbf{p}_{i} = \left[ p_{(i-1)K_{\mathrm{NOL}}+1} \; \ldots \; p_{iK_{\mathrm{NOL}}} \right]^{T}, \tag{43}$$

and the probability of attending to speaker 1 in window $i$ is then computed as the mean of the probabilities, i.e.,

$$P(\hat{d}_{i} = 1) = \frac{1}{K_{\mathrm{NOL}}} \sum_{k=(i-1)K_{\mathrm{NOL}}+1}^{iK_{\mathrm{NOL}}} p_{k}, \tag{44}$$

with $\hat{d}_{i}$ the attention state in window $i$. The probability of attending to speaker 2 in window $i$ is computed as

$$P(\hat{d}_{i} = 2) = 1 - P(\hat{d}_{i} = 1). \tag{45}$$

Based on these probabilities, it is decided that the participant attended to speaker 1 if $P(\hat{d}_{i} = 1) > P(\hat{d}_{i} = 2)$, or attended to speaker 2 otherwise.
The probabilistic attention measure of the attended speaker $\hat{p}_{a,i}$ in window $i$ is, hence, determined as

$$\hat{p}_{a,i} = \max\left\{ P(\hat{d}_{i} = 1), \; P(\hat{d}_{i} = 2) \right\}. \tag{46}$$

The probabilistic attention measure of the unattended speaker $\hat{p}_{u,i}$ is determined as $\hat{p}_{u,i} = 1 - \hat{p}_{a,i}$. The process flow of AAD using the SSM is depicted in Figure 4. The AAD algorithm using the SSM was implemented and run using MATLAB (MATLAB 1 of RAAD in Figure 2). The hyperparameters in (35) and (37) were set to $c_{0} = 1$, $a_{0} = 2.008$, and $b_{0} = 0.2016$, similarly as in Reference [9]. The hyperparameters $\bar{\gamma}_{a}$, $\bar{\nu}_{a}$, and $\bar{\mu}_{a}$ in (39) were set by fitting a gamma and a normal distribution to the absolute values of the correlation coefficients of the (oracle) attended speaker obtained from the calibration phase. Similarly, the hyperparameters $\bar{\gamma}_{u}$, $\bar{\nu}_{u}$, and $\bar{\mu}_{u}$ in (41) were set by fitting a gamma and a normal distribution to the absolute values of the correlation coefficients of the (oracle) unattended speaker obtained from the calibration phase. The SSM parameter set $\boldsymbol{\Omega}$ was estimated using the EM algorithm as in Reference [9] with 20 iterations. On the one hand, for the LW-SSM algorithm using a large overlapping correlation window (i.e., $K_{\mathrm{COR}} = 960$ samples, 1 sample shift), a small SSM window of length $K_{\mathrm{SSM}} = 1$ sample (corresponding to 1/64 s) with $K_{P} = 0$ and $K_{A} = 0$ was used. On the other hand, for the SW-SSM algorithm using a small non-overlapping correlation window (i.e., $K_{\mathrm{COR}} = 16$ samples), a large SSM window of length $K_{\mathrm{SSM}} = 60$ samples (corresponding to 15 s) with $K_{P} = 53$ (corresponding to 13.25 s) and $K_{A} = 6$ (corresponding to 1.50 s) was used as in Reference [9]. The length of the window $K_{\mathrm{NOL}}$ in (43) was set such that both algorithms generated the probabilistic attention measure of the attended speaker $\hat{p}_{a,i}$ in (46) every 0.25 s. This means that for the LW-SSM algorithm a window of length $K_{\mathrm{NOL}} = 16$ samples was used, while, for the SW-SSM algorithm, a window of length $K_{\mathrm{NOL}} = 1$ sample was used.
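The window lengths quoted above follow from the rate at which correlation coefficients are produced, which the short arithmetic check below makes explicit. The variable names are illustrative only. The one assumption is that the "samples" of the SSM and NOL windows count correlation coefficients, which arrive at 64 Hz for LW-SSM and at 4 Hz for SW-SSM.

```python
FS_EEG = 64      # Hz, rate of the pre-processed EEG and speech envelopes
UPDATE_S = 0.25  # s, interval between successive attention measures

def seconds(n_samples, rate_hz):
    """Duration covered by n_samples values arriving at rate_hz."""
    return n_samples / rate_hz

# LW-SSM: 960-sample correlation window shifted by 1 sample -> 64 coefficients per second
lw_rate = FS_EEG / 1
lw_k_nol = round(UPDATE_S * lw_rate)

# SW-SSM: 16-sample correlation window without overlap -> 4 coefficients per second
sw_rate = FS_EEG / 16
sw_k_nol = round(UPDATE_S * sw_rate)
k_p, k_a = 53, 6
k_ssm = k_p + k_a + 1

print(lw_k_nol, sw_k_nol, k_ssm, seconds(k_ssm, sw_rate))
```

The SW-SSM look-ahead of 6 coefficients corresponds to 1.5 s at this rate, and the 53 preceding coefficients to 13.25 s.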
Each participant's own data were used for hyperparameter and parameter setting, as well as for decoding. To evaluate the performance of the proposed LW-SSM and SW-SSM algorithms, the decoding performance for each participant was computed as the percentage of correctly decoded NOL windows. To evaluate the delay to detect a cued attention switch, the delay was computed as the time it takes for the LW-SSM and SW-SSM algorithms to detect an attention switch after the moment the arrow on the screen cued the participant to switch attention.
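The two evaluation measures can be illustrated with a short Python sketch. The function names are hypothetical, the decoded and oracle label sequences are made-up values, and the delay criterion used here (the first window after the cue that is decoded as the new target) is an assumption, since the exact detection criterion is not spelled out in the text.

```python
def decoding_performance(decoded, oracle):
    """Percentage of windows whose decoded speaker matches the oracle."""
    if len(decoded) != len(oracle) or not oracle:
        raise ValueError("label sequences must be non-empty and equal length")
    correct = sum(1 for d, o in zip(decoded, oracle) if d == o)
    return 100.0 * correct / len(oracle)


def switch_detection_delay(decoded, cue_index, new_target, window_s=0.25):
    """Delay in seconds between the cue and the first window decoded as
    the new target speaker (assumed criterion).

    Returns None if the switch is never detected.
    """
    for i in range(cue_index, len(decoded)):
        if decoded[i] == new_target:
            return (i - cue_index) * window_s
    return None


if __name__ == "__main__":
    # Illustrative labels only: the cue to switch to speaker 2 occurs at window 4
    oracle = [1, 1, 1, 1, 2, 2, 2, 2]
    decoded = [1, 1, 1, 2, 1, 1, 2, 2]
    print(decoding_performance(decoded, oracle))
    print(switch_detection_delay(decoded, cue_index=4, new_target=2))
```

With one attention measure every 0.25 s, the delay resolution of this sketch is 0.25 s. A stricter criterion, e.g., requiring several consecutive windows decoded as the new target, would yield longer delays.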

Adaptive Gain Controller (AGC)
The probabilistic attention measure of the attended speaker $\hat{p}_{a,i}$ in window $i$, either obtained using the GLM in (30) or using the SSM in (46), is then used to drive the AGC (see Figure 2).
The speech signal $s_{1,t}$ of speaker 1 and the speech signal $s_{2,t}$ of speaker 2 are first segmented into non-overlapping windows of length $K_{AGC}$, yielding the speech vectors $s_{1,i}$ and $s_{2,i}$ for window $i$. Based on the AAD result for window $i$, the attended speech vector $\hat{s}_{a,i}$ and the unattended speech vector $\hat{s}_{u,i}$ are determined as

$$\begin{cases} \hat{s}_{a,i} = s_{1,i},\ \hat{s}_{u,i} = s_{2,i} & \text{if the identified attended speaker is speaker 1} \\ \hat{s}_{a,i} = s_{2,i},\ \hat{s}_{u,i} = s_{1,i} & \text{otherwise.} \end{cases}$$
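By multiplying the attended speech vector $\hat{s}_{a,i}$ with the gain $\lambda_{a,i}$ and the unattended speech vector $\hat{s}_{u,i}$ with the gain $\lambda_{u,i}$, the AGC aims to achieve a desired signal-to-interference ratio (SIR) between the identified attended and unattended speakers in window $i$. The desired SIR in window $i$ is defined as a linear function of the probabilistic attention measure $\hat{p}_{a,i}$, i.e.,

$$\mathrm{SIR}_i = (2\hat{p}_{a,i} - 1)\,\mathrm{SIR}_{\max}, \quad (50)$$

such that $\hat{p}_{a,i} = 1$ corresponds to $\mathrm{SIR}_{\max}$, i.e., the maximum desired SIR, and $\hat{p}_{a,i} = 0.5$ corresponds to SIR = 0 dB. The SIR in window $i$ at the output of the AGC is equal to

$$\mathrm{SIR}_i^{\mathrm{out}} = 10 \log_{10} \frac{\lambda_{a,i}^2 E_{a,i}}{\lambda_{u,i}^2 E_{u,i}}, \quad (51)$$

with the energy of the attended and unattended speech vector in window $i$ given by

$$E_{a,i} = \|\hat{s}_{a,i}\|_2^2, \qquad E_{u,i} = \|\hat{s}_{u,i}\|_2^2. \quad (52)$$

By setting (51) equal to the desired SIR in (50) and constraining the overall energy at the output of the AGC to be equal to the overall input energy, i.e.,

$$\lambda_{a,i}^2 E_{a,i} + \lambda_{u,i}^2 E_{u,i} = E_{a,i} + E_{u,i}, \quad (53)$$

the gains $\lambda_{u,i}$ and $\lambda_{a,i}$ can be computed as

$$\lambda_{u,i} = \sqrt{\frac{E_{a,i} + E_{u,i}}{E_{u,i}\left(1 + 10^{\mathrm{SIR}_i/10}\right)}}, \quad (54)$$

$$\lambda_{a,i} = \sqrt{10^{\mathrm{SIR}_i/10}\,\frac{E_{u,i}}{E_{a,i}}}\;\lambda_{u,i}. \quad (55)$$

To avoid annoying artefacts due to highly time-varying gains, the gains $\lambda_{u,i}$ in (54) and $\lambda_{a,i}$ in (55) are averaged over four windows, yielding the smoothed gains $\bar{\lambda}_{u,i}$ and $\bar{\lambda}_{a,i}$. The amplified attended speech vector $\tilde{s}_{a,i}$ and the attenuated unattended speech vector $\tilde{s}_{u,i}$ in window $i$ are finally obtained as $\tilde{s}_{a,i} = \bar{\lambda}_{a,i}\,\hat{s}_{a,i}$ and $\tilde{s}_{u,i} = \bar{\lambda}_{u,i}\,\hat{s}_{u,i}$. These signals are then presented to the participant via two loudspeakers. The AGC was implemented and run using MATLAB (MATLAB 2 in Figure 2). The sampling frequency of the speech signals of both speakers was equal to 44,100 Hz. The maximum desired SIR in (50) was set to 7 dB. The speech enhancement performance of the AGC was evaluated in terms of the SIR improvement $\Delta\mathrm{SIR} = \mathrm{SIR}_{\mathrm{out}} - \mathrm{SIR}_{\mathrm{in}}$, where $\mathrm{SIR}_{\mathrm{in}}$ and $\mathrm{SIR}_{\mathrm{out}}$ denote the SIR in dB at the input and at the output of the AGC, computed from the (oracle) attended and unattended speech vectors $s_{a,i}$ and $s_{u,i}$, which are defined similarly as in (47).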
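The gain computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the MATLAB implementation used in the experiments: the closed-form gains are derived here from the two conditions given in the text (desired output SIR and preserved overall energy), the smoothing is assumed to be a causal moving average over the current and the three preceding windows, and all function names and input values are hypothetical.

```python
import math


def desired_sir_db(p_att, sir_max_db=7.0):
    """Linear mapping: p_att = 0.5 gives 0 dB, p_att = 1 gives sir_max_db."""
    return (2.0 * p_att - 1.0) * sir_max_db


def agc_gains(e_att, e_unatt, sir_db):
    """Gains (lambda_a, lambda_u) that yield the desired output SIR while
    keeping the overall output energy equal to the overall input energy.

    e_att, e_unatt: energies (squared norms) of the attended and the
    unattended speech vector in the window; both must be positive.
    """
    ratio = 10.0 ** (sir_db / 10.0)  # desired SIR on a linear scale
    lam_u = math.sqrt((e_att + e_unatt) / (e_unatt * (1.0 + ratio)))
    lam_a = math.sqrt(ratio * e_unatt / e_att) * lam_u
    return lam_a, lam_u


def smooth_gains(gains, n_windows=4):
    """Causal moving average over up to n_windows most recent gains
    (assumed form of the four-window averaging)."""
    smoothed = []
    for i in range(len(gains)):
        recent = gains[max(0, i - n_windows + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


if __name__ == "__main__":
    # Illustrative energies and attention measure, not experimental data
    e_a, e_u = 2.0, 1.0
    sir = desired_sir_db(0.8)
    lam_a, lam_u = agc_gains(e_a, e_u, sir)
    out_sir = 10.0 * math.log10(lam_a ** 2 * e_a / (lam_u ** 2 * e_u))
    out_energy = lam_a ** 2 * e_a + lam_u ** 2 * e_u
    print(sir, out_sir, out_energy)
    print(smooth_gains([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Note that for an attention measure of 0.5 the desired SIR is 0 dB, so the gains equalize the energies of both speakers. Only when the input energies are already equal do both gains become 1.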

Results
In this section, we evaluate the decoding performance and the speech enhancement performance of the proposed cognitive-driven gain controller system described in the previous section. In Section 3.1, we evaluate the decoding performance of the proposed AAD algorithms for the open-loop and the closed-loop AAD phase. In Section 3.2, we evaluate the speech enhancement performance of the AGC for the closed-loop AAD phase. Finally, in Section 3.3, we compare the subjective evaluation between the open-loop and the closed-loop AAD phase.

Auditory Attention Decoding Performance
For all considered AAD algorithms (LW-GLM, LW-SSM, and SW-SSM), Figure 5 depicts the correlation coefficients $\rho_{1,k}$ and $\rho_{2,k}$ of speaker 1 and speaker 2 and the probability of attending to speaker 1 or speaker 2, i.e., $P(d_i = 1)$ or $P(d_i = 2)$, for an exemplary session from the open-loop AAD phase. It can be observed that all AAD algorithms translate the fluctuating correlation coefficients into smooth probabilistic attention measures. When using the large correlation window, i.e., LW-GLM and LW-SSM, the correlation coefficients are more discriminative and the probabilistic attention measures are more reliable with a lower variability compared to using the small correlation window, i.e., SW-SSM. This can mainly be explained by the fact that the large correlation window provides a larger amount of data from the reconstructed attended envelope and the envelopes of the speech signals compared to the small correlation window. Discriminative correlation coefficients and reliable probabilistic attention measures are essential to obtain a high decoding performance. For the considered AAD algorithms, Figure 6 depicts the decoding performance for the open-loop and the closed-loop AAD phase. It can be observed that all AAD algorithms yield a median decoding performance that is larger than chance level (50%). For the open-loop AAD phase, the LW-GLM, LW-SSM, and SW-SSM algorithms yield a median decoding performance of 65.0%, 60.5%, and 56.5%, respectively. For the closed-loop AAD phase, the LW-GLM, LW-SSM, and SW-SSM algorithms yield a median decoding performance of 67.7%, 64.2%, and 60.4%, respectively. The larger median decoding performance obtained by the LW-GLM and LW-SSM algorithms is consistent with the probabilistic attention measures in Figure 5, where due to the large correlation window more reliable probabilistic attention measures are obtained compared to the SW-SSM algorithm.
A statistical multiple comparison test (Kruskal-Wallis test followed by post-hoc Dunn and Sidak test [34]) revealed no significant difference (p > 0.05) in decoding performance between the open-loop and the closed-loop AAD phase nor between the considered AAD algorithms.

Signal-to-Interference Reduction of Adaptive Gain Controller
For the considered AAD algorithms, Figure 8 depicts the SIR improvement for the closed-loop AAD phase. It can be observed that the LW-GLM, LW-SSM, and SW-SSM algorithms yield a median SIR improvement of 1.1 dB, 1.7 dB, and 0.5 dB, respectively. The larger SIR improvement obtained by the LW-GLM and LW-SSM algorithms can be explained by the larger decoding performance compared to the SW-SSM algorithm. The larger decoding performance leads to a larger number of windows during which the attended speaker is correctly amplified and the unattended speaker is correctly attenuated. In addition, it can be observed that the SW-SSM algorithm yields an SIR improvement with a larger variability (−2.7 dB to 3.0 dB) than the LW-GLM algorithm (0.6 dB to 2.1 dB) and the LW-SSM algorithm (0.7 dB to 3.8 dB). This can be explained by the larger variability of the probabilistic attention measures obtained by the SW-SSM algorithm (see Figure 5). Due to the linear role of the probabilistic attention measure in the AGC for determining the desired SIR between the attended and the unattended speaker, as shown in (50)

Subjective Evaluation of Open-Loop and Closed-Loop AAD
For the open-loop and the closed-loop AAD phase, Figure 9 presents the perceived effort to follow the attended speaker, to ignore the unattended speaker, to switch attention between both speakers, and the level of story understanding.
In terms of the perceived effort to follow the attended speaker and to ignore the unattended speaker (Figure 9a,b), it can be observed that the lowest median effort is obtained for the open-loop AAD, while a higher median effort is required for the closed-loop AAD, especially when using the SW-SSM algorithm. This can be attributed to the negative SIR improvement in some windows (see Figure 8), where the attended speaker is wrongly attenuated and the unattended speaker is wrongly amplified. Nevertheless, a statistical multiple comparison test (Kruskal-Wallis test followed by post-hoc Dunn and Sidak test) revealed no significant difference (p > 0.05) between all considered open-loop and closed-loop AAD cases. Similarly, in terms of the effort to switch attention between both speakers (Figure 9c), a statistical multiple comparison test revealed no significant difference (p > 0.05) between all considered open-loop and closed-loop AAD cases. These results show that the proposed closed-loop cognitive-driven gain controller system demands a similar perceived effort to follow the attended speaker, to ignore the unattended speaker and to switch attention compared to the open-loop AAD system. In terms of the level of story understanding (Figure 9d), the highest median understanding level is obtained for the open-loop AAD, while a lower median understanding level is obtained for the closed-loop AAD. This is consistent with the perceived cognitive effort (Figure 9a,b,c), where the open-loop AAD demands the lowest effort, possibly resulting in more cognitive resources available for story understanding compared to the closed-loop AAD. Nevertheless, a statistical multiple comparison test revealed no significant difference (p > 0.05) between all considered open-loop and closed-loop AAD cases. Finally, Figure 10 presents the level of improvement in system usage achieved by the participants throughout the sessions of the closed-loop AAD phase.
It can be observed for all considered AAD algorithms that a significant improvement in system usage is obtained.

Discussion
The experimental results for the open-loop AAD system show that the largest median decoding performance is obtained by the LW-GLM algorithm (65%). This is in accordance with the experimental results in Reference [6], where it has been shown that open-loop AAD using a low number of electrodes with a correlation window smaller than 15 s results in a decoding performance lower than 75%. It should, however, be noted that the decoding performance in Reference [6] was obtained based on an optimal EEG electrode configuration, whereas the decoding performance reported in this paper was obtained based on a fixed EEG electrode configuration. In addition, the experimental results show that there is no significant difference in decoding performance between the open-loop and the closed-loop AAD system using the proposed AAD algorithms. This is consistent with the experimental results in Reference [5], where no significant difference in decoding performance between an open-loop and a closed-loop AAD system using visual feedback has been observed.
The experimental results show that the LW-GLM and LW-SSM algorithms using the large correlation window yield a larger median decoding performance compared to the SW-SSM algorithm using the small correlation window. The large correlation window provides a larger amount of data from the reconstructed attended envelope and the envelopes of the speech signals compared to the small correlation window, resulting in more discriminative correlation coefficients, more reliable probabilistic attention measures and a larger decoding performance. This is in accordance with the experimental results in References [4,6], where it has been shown that a larger correlation window results in a larger decoding performance. In addition, the experimental results show that the LW-GLM algorithm yields a larger median decoding performance than the LW-SSM algorithm. This may be explained by the fact that the LW-GLM algorithm infers the probabilistic attention measures based on the mean differential correlation coefficients rather than the absolute value of the correlation coefficients, hence providing a larger dynamic range including positive and negative values.
In conclusion, the results demonstrate the feasibility of closed-loop AAD in an online fashion, enabling the listener to interact with an adaptive gain controller (as an ideal speech enhancement algorithm) for a scenario with two competing speakers. On the one hand, the closed-loop cognitive-driven gain controller system improves the SIR between the attended and the unattended speaker. This may make it easier to follow the attended speaker, ignore the unattended speaker and switch attention between both speakers, resulting in a lower cognitive effort compared to open-loop AAD. On the other hand, the closed-loop cognitive-driven gain controller system introduces a significant delay to detect attention switches, which causes the attended speaker to be wrongly attenuated and the unattended speaker to be wrongly amplified for some time. This may make it more difficult to follow the attended speaker and ignore the unattended speaker, resulting in a higher cognitive effort compared to open-loop AAD. Nevertheless, the subjective evaluation results indicate that overall the closed-loop cognitive-driven gain controller system demands a similar effort as open-loop AAD.
A delay to detect attention switches significantly influences the performance of the closed-loop cognitive-driven gain controller system. Recently, methods that are able to decode auditory attention with low delay have been proposed, e.g., based on a state-space model [9,18], neural networks [12,18,35], and common spatial patterns [36]. Therefore, investigating the potential of fast AAD methods for a closed-loop cognitive-driven gain controller system to detect attention switches could be interesting as future work.
While the closed-loop AAD experiments were performed without incorporating a practicing phase for the participants, the subjective evaluation results suggest that a significant improvement in system usage was obtained throughout the closed-loop AAD experiment. Future work could, therefore, investigate the impact of incorporating a practicing phase on the decoding and the speech enhancement performance of the cognitive-driven gain controller system. This practicing phase could simply be an extended version of the closed-loop phase with many sessions where the participants can gather enough experience to fully master (i.e., find an intelligent way to control) the cognitive-driven gain controller system.
Although the application of the proposed cognitive-driven gain controller system was limited to acoustic scenarios with two competing speakers, it was shown in Reference [10] that open-loop AAD is feasible for an acoustic scenario with four competing speakers when using perfectly separated clean speech signals for decoding. In addition, the evaluation of the proposed system was limited to acoustic scenarios with non-moving speakers and with speakers located on opposite sides of the listener, whereas in real-world conditions speakers may move and may be located on the same side of the listener. Therefore, it would certainly be interesting as future work to investigate the performance of (an extension of) the proposed cognitive-driven speech enhancement system for more realistic acoustic scenarios.
The application of the proposed cognitive-driven gain controller system is obviously not limited to hearing devices. The system could also be used, e.g., for virtual reality (VR) that simulates a remote environment, e.g., for entertainment, training and medicine. It could be used to adapt the simulated world based on the auditory attention of the VR user.

Conclusions
In this paper, we proposed a closed-loop gain system which cognitively steers an adaptive gain controller based on real-time AAD for a scenario with two competing speakers. The real-time AAD infers the probabilistic attention measures of the attended and the unattended speaker from EEG recordings of the listener and the speech signals of both speakers. Based on these probabilistic attention measures, the adaptive gain controller amplifies the identified attended speaker and attenuates the identified unattended speaker. The loop of cognitive-driven gain control is then closed by presenting the amplified attended speaker and the attenuated unattended speaker via loudspeakers. The experimental results demonstrate the feasibility of the proposed closed-loop cognitive-driven gain controller system (both using AAD algorithms based on GLM and SSM), enabling the listener to interact with the system in real-time. Although there is a significant delay to detect attention switches, which causes the attended speaker to be wrongly attenuated and the unattended speaker to be wrongly amplified for some time, the proposed closed-loop system is able to improve the SIR between the attended and the unattended speaker. Moreover, the subjective evaluation results show that the proposed closed-loop cognitive-driven system demands a similar perceived level of cognitive effort to follow the attended speaker, to ignore the unattended speaker, and to switch attention between both speakers compared to open-loop AAD. With this work, an attempt was made to bring closed-loop cognitive-driven speech enhancement closer to real-world applications.