1. Introduction
Alzheimer’s disease (AD) is the most common type of dementia among the elderly; its socioeconomic cost is substantial and expected to increase. It is characterized by progressive and irreversible cognitive deterioration with memory loss, impairments in judgment and language, and other cognitive deficits and behavioral symptoms that eventually become severe enough to limit an individual’s ability to carry out the professional, social or family activities of daily life. As the disease progresses, patients develop increasingly severe disabilities, in the end becoming completely dependent on others. An early and accurate diagnosis of AD would be of much help to patients and their families, both in facilitating planning for the future and in beginning early treatment of the symptoms of the disease.
A diagnosis of AD requires, on the one hand, confirmation of the presence of a progressive dementia syndrome and, on the other hand, the exclusion of other potential causes of dementia as demonstrated by the patient’s clinical history. An unambiguous diagnosis of AD is considered to require a post-mortem analysis demonstrating the typical AD pathological changes in brain tissue [1,2,3]. The clinical hallmark of the earliest manifestations of AD is episodic memory impairment. At the time of clinical presentation, other cognitive deficits are usually already present in the patient’s language, executive functions, orientation, perceptual abilities and constructional skills. Associated behavioral and psychological symptoms include apathy, irritability, depression, anxiety, delusions, hallucinations, disinhibition, aggression, aberrant motor behavior, as well as changes in eating or sleeping patterns [4,5]. While the presence of these symptoms is indicative of AD, reaching a reliable diagnosis in some cases requires expensive and invasive diagnostic tests such as computed tomography (CT), magnetic resonance imaging (MRI) and/or lumbar puncture.
In order to develop a system for early diagnosis of AD, the potential of a recording technique known as electroencephalography (EEG) has been investigated. EEG consists of recording brain-related electrical potentials using electrodes attached to the scalp [6]. EEG activity is commonly divided into specific frequency bands: 0.1 Hz–4 Hz (δ), 4 Hz–8 Hz (θ), 8 Hz–13 Hz (α), 13 Hz–30 Hz (β) and 30 Hz–100 Hz (γ) [6]. A large number of studies have analyzed the measurable changes that AD causes in the EEG; reviews of these studies can be found in [7,8,9]. Three major perturbations have been reported in the EEG: (i) a power increase of the δ and θ rhythms and a power decrease of the posterior α and/or β rhythms in AD patients (also known as EEG slowing); (ii) EEG activity of AD patients appears more regular than that of healthy subjects (corresponding to a reduced complexity of the EEG signals of AD patients); and (iii) frequency-dependent abnormalities in EEG synchrony [7,8,10]. Despite being easily reproducible, these markers have limited sensitivity, which is largely due to the poor signal-to-noise ratio of EEG signals.
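The frequency bands above can be quantified with standard spectral estimation. Below is a minimal sketch of relative band-power extraction; the band edges follow the conventional δ–γ split quoted above, while the function and variable names are illustrative assumptions, not code from the study.

```python
import numpy as np
from scipy.signal import welch

# Conventional EEG band edges in Hz (delta through gamma, as in the text).
BANDS = {"delta": (0.1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 100)}

def relative_band_powers(x, fs):
    """Share of total spectral power falling in each band (Welch PSD)."""
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), 2 * fs))
    total = psd.sum()
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}

# Example: a 10 Hz oscillation (alpha band) dominating a noisy trace.
rng = np.random.default_rng(0)
fs = 256
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.2 * rng.standard_normal(len(t))
powers = relative_band_powers(x, fs)
print(max(powers, key=powers.get))  # → alpha
```

An "EEG slowing" marker would then compare, e.g., the δ + θ share against the α + β share across groups.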
EEG recordings are indeed usually corrupted by spurious extra-cerebral artifacts, which must be rejected or cleaned up by the practitioner. These artifacts cause a substantial degradation of signal quality. In previous work we presented several methodologies to improve the quality of EEG data in patients with AD using blind source separation (BSS) [11,12]. BSS appears to be more suitable for artifact rejection than adaptive filtering and regression [7]. However, BSS is not fully automatic: one needs to visually inspect the components extracted by BSS and decide which components to remove; this time-consuming process is not suitable for routine clinical EEG [13]. Furthermore, visual inspection is subjective [14], which limits the reliability of BSS.
Since manual screening of human EEGs is inherently error-prone and may introduce experimental bias, automatic artifact detection is an important issue. It is most certainly one of the keys to achieving reliable diagnostics and obtaining results useful for clinical purposes [7]. Automatic artifact detection would consequently be the best guarantee of objective and clean results. Unfortunately, automatic detection is fairly difficult to perform, owing to the lack of reliable markers of EEG artifacts. A given set of markers could be evaluated using either simulated EEG recordings or real EEG recordings. In the first case, the content of the signals is well defined; however, one cannot guarantee that the signals investigated are comparable with real signals. In the second case, one cannot guarantee that the artifacts are well identified, since EEG studies usually reach fairly poor inter-expert agreement. In the present investigation, we chose the second option. We present here an investigation of commonly used markers of EEG artifacts, and we assess the possibility of automatically cleaning a database of EEG recordings taken from AD patients and healthy age-matched controls. The data are first decomposed using BSS, after which artifacted sources are rejected. To reduce the risks arising from poor inter-expert agreement, the data were screened not by a single expert but by three independent experts, who located artifacts within the sources. Because of the importance of not rejecting genuine EEG data, the approach is conservative: as a rule, data are not eliminated unless clearly identified as an artifact. We then assess the reliability of the markers by comparing them with the resulting human labeling of the sources (sources identified as artifacts versus sources identified as clean EEG).
2. Results
We first compared the number of rejected sources from the control and Alzheimer groups (Table 1) with a Wilcoxon rank-sum test, using a Bonferroni-corrected significance threshold (equivalent to p = 0.05 without correction). None of the experts showed a significant difference of treatment between the groups (the smallest p-value was 0.06).
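The group comparison above can be sketched as follows. The per-subject counts and the number of simultaneous comparisons are made-up illustrations; only the procedure (rank-sum test against a Bonferroni-corrected threshold) mirrors the text.

```python
from scipy.stats import ranksums

# Hypothetical numbers of rejected sources per subject (illustrative only).
control_counts   = [2, 1, 3, 2, 2, 1, 4, 2, 3, 1]
alzheimer_counts = [3, 2, 4, 3, 3, 2, 5, 3, 4, 2]

n_tests = 4                        # assumed number of simultaneous comparisons
alpha_corrected = 0.05 / n_tests   # Bonferroni-corrected threshold

# Two-sided Wilcoxon rank-sum test between the two groups.
stat, p = ranksums(alzheimer_counts, control_counts)
print(round(p, 3), p < alpha_corrected)
```

The Bonferroni correction simply divides the nominal threshold by the number of tests, so each individual comparison must pass a stricter bar.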
Table 1. Statistical comparison of the number of sources rejected in the group of 24 control subjects versus the group of 17 patients suffering from Alzheimer’s disease (two-sided Wilcoxon rank-sum test) for each expert considered independently, and for all experts aggregated. Rows indicate the number of rejected sources for AD patients (mean and standard deviation), the number of rejected sources for control subjects, p-values (p < 0.05 indicates a significant difference) and the Wilcoxon z-score statistic.

| | Expert #1 | Expert #2 | Expert #3 | All |
|---|---|---|---|---|
| N_AD | | | | |
| N_CTR | | | | |
| p | | | | |
| z | | | | |
We then compared the power of the four features (Table 2) with a Wilcoxon rank-sum test, using a Bonferroni-corrected significance threshold (equivalent to p = 0.05 without correction). Whether on patients suffering from Alzheimer’s disease, on control subjects, or when aggregating all subjects, sample entropy always appears as the most powerful feature, with the strongest z-score and the lowest p-value. All the features are significant, except for K, which is non-significant for the control group. Group effects comparing rejected sources between the Alzheimer and control groups, and clean sources between the Alzheimer and control groups, were systematically improved by artifact rejection. In particular, the cleaned sources showed highly significant differences for the zero-crossing and kurtosis measures.
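Of the four features, sample entropy is the least standard to compute. A minimal sketch of Richman and Moorman's SampEn with the usual defaults (m = 2, r = 0.2·std) is shown below; the function and variable names are ours, not the study's code.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """SampEn = -log(A/B), where B and A count m- and (m+1)-length
    template pairs matching within tolerance r (Chebyshev distance)."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)
    n = len(x)

    def matches(length):
        # Templates of the given length, self-matches excluded.
        t = np.array([x[i:i + length] for i in range(n - m)])
        total = 0
        for i in range(len(t) - 1):
            total += np.sum(np.max(np.abs(t[i + 1:] - t[i]), axis=1) <= r)
        return total

    b, a = matches(m), matches(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else float("inf")

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 8 * np.pi, 500))   # predictable, artifact-like
noisy = rng.standard_normal(500)                   # irregular, EEG-like
print(sample_entropy(regular) < sample_entropy(noisy))  # → True
```

The example illustrates the direction reported in the results: more regular, predictable signals yield lower sample entropy.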
Table 2. Statistical comparison of the features for artifact sources versus clean sources (two-sided Wilcoxon rank-sum test) for the group of 17 patients suffering from Alzheimer’s disease, for the group of 24 control subjects, for all 41 subjects aggregated, and group effects comparing rejected sources between the Alzheimer and control groups and clean sources between the Alzheimer and control groups. All measures from the three experts were aggregated for this test. Rows indicate p-values and the Wilcoxon z-score statistic. Gray background indicates the most powerful feature (sample entropy).

| | | SEnt | FD | K | Z |
|---|---|---|---|---|---|
| Alzheimer | p | | | | |
| | z | | | | |
| Control | p | | | | |
| | z | | | | |
| All | p | | | | |
| | z | | | | |
| Rejected sources | p | | | | |
| | z | | | | |
| Cleaned sources | p | | | | |
| | z | | | | |
We evaluated the capability of these methods to automatically detect EEG artifacts using a classification approach (Table 3). A multilayer perceptron was trained on half of the database and tested on the held-out samples. The purpose was to distinguish rejected sources from non-artifacted sources. As can be seen, the classifier generalizes this classification, although its performance is moderate.
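The classification set-up above can be sketched as follows, assuming each source is summarized by the four markers (SEnt, FD, K, Z). The feature values below are synthetic stand-ins whose directions follow the results (lower SEnt, FD and Z, higher K for artifacts); they are not the study's data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Synthetic per-source feature vectors: [SEnt, FD, K, Z].
rng = np.random.default_rng(0)
clean    = rng.normal([2.0, 1.9, 0.0, 0.50], 0.3, size=(60, 4))
artifact = rng.normal([1.2, 1.6, 4.0, 0.35], 0.3, size=(60, 4))
X = np.vstack([clean, artifact])
y = np.array([0] * 60 + [1] * 60)   # 1 = artifact

# 2-fold cross-validation: train on one half, test on the other.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)
print(round(scores.mean(), 2))
```

On real sources the class overlap is much larger, which is why the paper reports only moderate accuracy.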
Knowing that experts do not always agree on source selection, we controlled for possible expert effects. To that end, we again compared the power of the four features (Table 4) with a Wilcoxon rank-sum test, and the group effect of differences between experts with a Kruskal-Wallis test, using a Bonferroni-corrected significance threshold. Whatever the expert, sample entropy always appears as the most powerful feature, with the strongest z-score and the lowest p-value. All the features are significant, except for FD and K, which are non-significant for experts #1 and #3 and for all experts, respectively. There was no group effect of differences between experts.
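The expert-effect check can be sketched with SciPy's Kruskal-Wallis test. The values below are illustrative draws from a common distribution, standing in for one marker's values across the three experts' selections; a large p-value would indicate no expert effect.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical marker values (e.g. SEnt) for sources rejected by each expert.
rng = np.random.default_rng(0)
expert1 = rng.normal(1.0, 0.2, 30)
expert2 = rng.normal(1.0, 0.2, 30)
expert3 = rng.normal(1.0, 0.2, 30)

# Kruskal-Wallis H-test: non-parametric one-way comparison across groups.
h, p = kruskal(expert1, expert2, expert3)
print(round(p, 3))
```

Like the rank-sum test, this test makes no normality assumption, which suits the heavy-tailed distributions typical of artifact markers.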
Table 3. Automatic detection of artifact sources versus clean sources (multilayer perceptron with 2-fold cross-validation) for each expert considered independently and for all experts aggregated. Classification was performed for sources from the group of 17 patients suffering from Alzheimer’s disease, and for sources from the group of 24 control subjects. Rows indicate the mean and standard deviation of the classification rate over 1000 classification attempts.

| | Control Subjects | Alzheimer Patients |
|---|---|---|
| Expert #1 | | |
| Expert #2 | | |
| Expert #3 | | |
| All | | |
Table 4. Statistical comparison of the features for artifact sources versus clean sources (two-sided Wilcoxon rank-sum test) for each expert considered independently, and group effects comparing rejected sources between experts and clean sources between experts (Kruskal-Wallis test). All subjects were aggregated for this test. Rows indicate p-values and the Wilcoxon z-score or Kruskal-Wallis χ² statistics. Gray background indicates the most powerful feature (sample entropy).

| | | SEnt | FD | K | Z |
|---|---|---|---|---|---|
| Expert #1 | p | | | | |
| | z | | | | |
| Expert #2 | p | | | | |
| | z | | | | |
| Expert #3 | p | | | | |
| | z | | | | |
| Rejected sources | p | | | | |
| | χ² | | | | |
| Cleaned sources | p | | | | |
| | χ² | | | | |
Most of the artifacts identified by the experts are typical of EEG signals: eye blinks, eye movements, EKG, and other unidentified artifacts (with unclear patterns). The differences between the four markers for rejected and non-rejected sources are illustrated with typical examples in Figure 1. For the rejected sources, the kurtosis K is higher, whereas the zero-crossing rate Z, sample entropy and fractal dimension are lower than those of non-rejected sources.
When comparing the control and patient groups before and after the procedure, the leave-one-out root mean square error (RMSE) of validation dropped from 0.32 to 0.28 (the training RMSE dropped from 0.30 to 0.26). Classification of EEG relative powers was thus more efficient after the artifact cleaning procedure than before.
Figure 1. Illustration of the four markers for non-rejected (top) and rejected (bottom) sources.
3. Discussion
We investigated markers that could be used to automate the three source-removal criteria described in Section 4.3, in order to avoid manual screening and reduce human error. The artifact cleaning procedure led to an improved detection of AD, with a systematic improvement of the group differences before and after cleaning (see Table 2) and a classification error dropping from 32% to 28%. The first and second criteria are easy to automate because they relate directly to the EEG signals on the electrodes (first criterion) and to the estimated mixing matrix (second criterion). The first rule, “Source of abnormally high amplitude (≥100 µV)”, is easy to implement by thresholding the backpropagated sources and eliminating those with peaks above 100 µV. The second rule concerns the detection of isolated sources on the scalp: “Abnormal scalp distribution of the reconstructed channels (only a few electrodes contribute to the source, with an isolated topography)”. To simplify the detection of an isolated IC on the scalp, we can use the information given by the inverse of the de-mixing matrix obtained after decomposition by the EWASOBI algorithm. Inverting this matrix yields an estimated mixing matrix, which allows us to find the back-projection of each source onto the original data (electrode domain). We can then calculate the energy that each electrode contributes to the source, and label the source as an artifact if the percentage of energy from one (or very few) electrode(s) exceeds a pre-fixed threshold (for example, 50% of the total energy).
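The energy-thresholding rule just described can be sketched as follows, under stated assumptions: `W` is the de-mixing matrix returned by BSS (e.g. EWASOBI), so `A = inv(W)` is the estimated mixing matrix and column j of `A` holds source j's projection weights onto the electrodes. The function name and toy matrix are ours.

```python
import numpy as np

def isolated_sources(W, threshold=0.5):
    """Indices of sources for which a single electrode carries more than
    `threshold` of the source's total back-projected energy."""
    A = np.linalg.inv(W)                   # estimated mixing matrix
    energy = A ** 2                        # per-electrode energy of each source
    share = energy / energy.sum(axis=0)    # each electrode's fraction, per source
    return np.where(share.max(axis=0) > threshold)[0]

# Toy mixing: source 2 projects almost entirely onto electrode 2.
A = np.array([[0.6, 0.5, 0.05],
              [0.5, 0.6, 0.05],
              [0.4, 0.4, 0.99]])
W = np.linalg.inv(A)
print(isolated_sources(W))   # → [2]
```

Squaring the mixing coefficients converts projection weights into energies, so the 50% criterion reads directly off the column-normalized matrix.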
The third rule, “Abnormal wave shape (drifts, eye blinks, sharp waves, etc.)”, is a real challenge, and was the main object of our investigation. Our working hypothesis was that this rule could be implemented based on the statistical properties of the time series. Note, however, that we do not show statistics for the third rule alone, but for the global combination of the three rules: indeed, almost all the rejected sources were selected by the experts through the combined application of the three rules. For instance, a source with sufficiently high amplitude and a sufficiently abnormal shape was rejected by most experts, whereas a source with slightly high amplitude and a normal shape may not be rejected by all three experts (it is on such sources that the experts do not always reach a consensus). We investigated in this manuscript four potential statistical markers to characterize the time series, which could provide information about potentially abnormal shapes in the EEG sources. Our observations are congruent with the existing literature. Kurtosis K is higher for the artifacted sources: their distributions are farther from Gaussianity than those of non-rejected sources [15]. Although this effect is well known, kurtosis was the poorest marker in our study, confirming previous results of Delorme et al. [15]. Sample entropy is lower for rejected sources, owing to the increased predictability of repetitive artifact patterns. This result is congruent with the literature: the expected SEnt of artifacts is lower, because their patterns are more regular and predictable in comparison with neural activity [16]. Similarly, FD is lower for artifacted sources, in accordance with previous publications: clean EEG traces are typically characterized by a flatter and more spread-out spectrum, with higher FD [17,18], especially for ocular artifacts [19]. The zero-crossing rate Z is lower for the artifacted sources. Unfortunately, although several studies report the use of zero-crossings for evaluating artifacts in EEG signals, to the best of our knowledge none of them reported whether this measure increases or decreases. We can nevertheless conjecture that our observation is valid, and that it can be explained by the presence of low-frequency perturbations forming blocks in the presence of EEG artifacts (which is compatible with the effects observed with SEnt and FD).
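Fractal dimension can be estimated in several ways; a minimal sketch of Higuchi's method (one common choice for EEG, not necessarily the one used in the study) is shown below. The function name and `kmax` default are ours.

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Higuchi FD: slope of log L(k) vs log(1/k), L(k) = mean curve length."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean_lengths = []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):                      # k decimated sub-series
            idx = np.arange(m, n, k)
            n_diff = len(idx) - 1
            if n_diff < 1:
                continue
            curve = np.abs(np.diff(x[idx])).sum()
            # Higuchi normalization: rescale by (N-1)/(n_diff*k), then 1/k.
            lengths.append(curve * (n - 1) / (n_diff * k * k))
        mean_lengths.append(np.mean(lengths))
    k_vals = np.arange(1, kmax + 1)
    slope, _ = np.polyfit(np.log(1.0 / k_vals), np.log(mean_lengths), 1)
    return slope

rng = np.random.default_rng(0)
smooth = np.sin(np.linspace(0, 8 * np.pi, 1000))   # regular, drift-like
rough = rng.standard_normal(1000)                  # irregular, EEG-like
print(higuchi_fd(smooth) < higuchi_fd(rough))      # → True
```

White noise approaches FD ≈ 2 while a smooth oscillation stays near 1, consistent with the lower FD reported for artifacted sources.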
Automatic classification of the sources yields an accuracy of ~65% on the validation set, depending on the expert involved in source selection. Although this result is not sufficient to guarantee efficient fully automatic rejection, it clearly demonstrates the potential of these measures for semi-automatic rejection. Indeed, by adjusting the classification threshold, one can obtain a higher sensitivity at the cost of a lower specificity (a sensitivity of 80.0% for a specificity of 26.0% using a linear classifier). In other words, the classifier can automatically detect a subset of suspicious sources, and thereby alleviate the task of the expert, who only needs to remove the non-artifacted sources from this selection. Given that the expert labeling is itself error-prone, these results are encouraging.
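The threshold trade-off described above can be sketched on synthetic marker data (the 80%/26% operating point quoted in the text is a property of the paper's data and is not reproduced here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two overlapping 2-D feature clouds: clean sources vs. artifacted sources.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)   # 1 = artifact

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

for thr in (0.5, 0.2):     # default threshold, then a lowered one
    pred = proba >= thr
    sensitivity = pred[y == 1].mean()      # artifacts correctly flagged
    specificity = (~pred[y == 0]).mean()   # clean sources correctly kept
    print(f"thr={thr}: sens={sensitivity:.2f}, spec={specificity:.2f}")
```

Lowering the decision threshold can only add positive predictions, so sensitivity never decreases and specificity never increases; the semi-automatic workflow exploits exactly this one-sided trade-off.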