Manual Correction of Voxel Misclassifications in Mesiotemporal Structures Does Not Alter Brain–Behavioral Results in an Episodic Memory Task

Voxel-based morphometry (VBM) is an established method for assessing grey matter volumes across the brain. The quality of magnetic resonance imaging (MRI) and the chosen data preprocessing steps can affect the outcome of VBM analyses. We recognized a lack of publicly available and commonly used protocols, which indicates that standardized and optimized preprocessing protocols are needed. This paper focuses on the time- and resource-consuming manual correction of misclassifications of grey matter voxels in cortical structures important in Alzheimer’s dementia. A total of 126 individuals, including 63 patients with very early Alzheimer’s disease and 63 cognitively normal participants, received thorough neuropsychological testing and 3-Tesla MRI. Automated preprocessing of T1 MPRAGE images was performed, and misclassifications of grey matter voxels were manually identified and corrected. In a second run, the manual correction step was skipped. Multiple regression analyses using DARTEL in SPM8 were then conducted with the manually corrected and uncorrected sample, respectively. Manual correction of voxel misclassifications did not have a major impact on the correlation between episodic memory performance and structural brain imaging results. We conclude that, although performing all preprocessing steps remains the gold standard, skipping manual correction of voxel misclassifications is permitted when investigating populations on the Alzheimer’s disease spectrum.


Introduction
Structural magnetic resonance imaging (MRI) has become a standard tool in the diagnostic process of Alzheimer's disease (AD) [1]. Because cortical atrophy already appears in the early stages of AD [2,3], voxel-based morphometry (VBM) is an established method for detecting the particular neuropathological patterns of AD [4,5]. However, the quality of MRI preprocessing varies greatly between scientists, clinicians, and research sites. The direct impact of these differences in handling on statistical outcomes is not yet clear. In this study, we investigated the impact of manual correction of grey and white matter voxel misclassifications on brain-behavioral results.
Dysfunction of episodic memory is one of the main characteristics of early AD. Performance on verbal episodic memory tests, such as encoding and recall of word lists [6][7][8], is associated with reduced volume of the hippocampal and entorhinal cortex in patients with mild cognitive impairment (MCI) and Alzheimer's dementia [9,10]. We searched for standardized preprocessing protocols in VBM analysis by contacting twelve institutions, including four clinics. We knew from personal contact, or assumed on the basis of their publications or the information on their webpages, that they apply VBM techniques. We aimed to include different countries and contacted institutions located in Switzerland, England, the United States of America, Canada, France, Germany, the Czech Republic, and Italy to ask whether they apply standard written preprocessing protocols to MRI scans before images are released for clinical or scientific interpretation. Figure 1 gives a summary of the responses we received.
To our knowledge and in line with the responses we received, there is no available literature addressing the handling of misclassifications and no existing standard protocol for preprocessing steps before VBM analysis. Manual preprocessing steps are very time-consuming. As measured by a researcher with little experience in preprocessing MRI data (FH), the manual correction of voxel misclassifications (i.e., reviewing each slice separately) can take up to 3 h per participant. In clinical settings or for large research samples, this effort is not feasible. Thus, our aim for this study was to optimize the VBM preprocessing protocol without losing meaningful information.
We were particularly interested in finding out whether manual correction of grey/white matter voxel misclassifications substantially increases the quality of studies that address only cortical regions. We therefore investigated the direct impact of manual correction of grey/white matter voxel misclassifications on brain-behavioral results in cortical structures important for AD diagnosis, such as structures of the parahippocampal gyrus or hippocampal formation.

Figure 1. Flow diagram representing type of feedback received from universities and clinics regarding standardized preprocessing protocols in voxel-based morphometry analysis (number of answers in brackets). Institutions found and contacted that apply VBM (12): no answer (6); unspecific answer: nobody in charge (2), will get back to us but we never received a specific answer (1); specific answer (3): do not use a specific protocol (2), use a specific protocol but do not perform any manual corrections (1).

Participants
We included baseline data from a longitudinal study of 126 native Swiss-German or German speaking adults with complete neuropsychological and MRI data sets. All participants were recruited from the Memory Clinic FELIX PLATTER, Basel, Switzerland. Written informed consent was obtained from all individuals prior to participation. The study was approved by the local ethics committee (EKNZ: Ethikkommission Nordwest-und Zentralschweiz).
The sample consisted of 63 individuals with very early AD (AD group), of whom 35 (16 male, 19 female; mean age: 73.80 ± 9.17 (SD) years; range: 51-90 years; mean Mini-Mental State Examination (MMSE) score: 28.37) were diagnosed with amnestic mild cognitive impairment (aMCI) due to AD according to DSM-IV [38] and Winblad et al. [39] criteria. Twenty-eight individuals were diagnosed with early Alzheimer's dementia (11 male, 17 female; mean age: 78.00 ± 5.30 (SD) years; range: 64-87 years; mean MMSE score: 26.43) according to NINCDS-ADRDA [40] and DSM-IV criteria [38]. In addition to neuropsychological testing including informant questionnaires, patients received medical screening, gait analyses, and magnetic resonance imaging. Some patients had additional positron emission tomography scans and/or investigation of cerebrospinal fluid measures. They were diagnosed in an interdisciplinary consensus conference at the Memory Clinic FELIX PLATTER. Additionally, 63 cognitively normal participants (NC group; 38 male, 25 female; mean age: 74.54 ± 6.53 (SD) years; range: 60-90 years; mean MMSE score: 29.25) were selected and matched to the AD group with regard to age and education (both p > 0.30, Table 1). They had undergone medical screening and extensive neuropsychological testing to ensure that they were cognitively healthy (i.e., neurologically and psychiatrically). Sociodemographic characteristics and MMSE scores for NC and AD participants are shown in Table 1. As expected, the two groups differed significantly in their MMSE score only (t(86) = 6.20, p = 1.89 × 10^−8).
MRI scanning was performed on a 3-Tesla scanner (MAGNETOM Verio, Siemens, Erlangen, Germany) at the University Hospital Basel, Switzerland (T1-weighted 3D magnetization-prepared rapid acquisition gradient echo (MPRAGE); 12-channel head coil; inversion time = 1000 ms; repetition time = 2000 ms; echo time = 3.37 ms; flip angle = 8°; field of view = 256 × 256 mm; acquisition matrix = 256 × 256; voxel size = 1 mm isotropic).

Preprocessing of Structural MRI
Preprocessing of the T1 MPRAGE images was performed using the DARTEL method [30] in the SPM8 software (Wellcome Institute of Cognitive Neurology, www.fil.ion.ucl.ac.uk (accessed on 19 October 2021)) implemented in MATLAB 2010 (Mathworks Inc., Sherborn, MA, USA). Images were reoriented manually by setting the anterior commissure at the origin of the three-dimensional Montreal Neurological Institute (MNI) coordinate system. After automatic segmentation of MRI images into grey matter, white matter, and cerebrospinal fluid volumes, misclassifications of grey matter were manually identified and corrected on each slice. The MPRAGE images were then segmented again, masking the misclassifications as described elsewhere [41]. The resulting images were co-registered to the DARTEL template, normalized to MNI space, modulated, and smoothed with an 8 mm FWHM Gaussian kernel.
For the uncorrected MRI data set, steps were analogous with the exception that manual identification and correction of misclassifications was skipped.
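The smoothing step can be illustrated with a minimal Python sketch. This is not the SPM8 implementation; it only shows how an 8 mm FWHM kernel translates into a Gaussian standard deviation in voxel units. The function names and the default voxel size of 1 mm are assumptions made for illustration, since the voxel size of the normalized images is not stated here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def fwhm_to_sigma(fwhm_mm, voxel_size_mm):
    """Convert a Gaussian FWHM in mm to a standard deviation in voxel units."""
    # FWHM = 2 * sqrt(2 * ln 2) * sigma for a Gaussian kernel
    sigma_mm = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return sigma_mm / voxel_size_mm


def smooth_volume(volume, fwhm_mm=8.0, voxel_size_mm=1.0):
    """Smooth a 3D volume with an isotropic Gaussian kernel.

    voxel_size_mm is an assumed value; set it to the voxel size of the
    normalized grey matter images actually being smoothed.
    """
    sigma_vox = fwhm_to_sigma(fwhm_mm, voxel_size_mm)
    # mode="constant" pads with zeros outside the volume
    return gaussian_filter(volume, sigma=sigma_vox, mode="constant")
```

For an 8 mm kernel on 1 mm voxels, the standard deviation is roughly 3.4 voxels, so each smoothed voxel pools information from a neighborhood that is large relative to a single misclassified voxel.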

California Verbal Learning Test
In the course of neuropsychological testing, all participants completed the German version of the California Verbal Learning Test (CVLT) [13,14] to assess verbal episodic memory. A list of 16 nouns (List A) was read aloud to the participants during 5 trials. Each trial was followed by an immediate free recall. After completion of all trials, an interference list (List B) was presented and recalled, followed by a short-delay free recall and cued recall of List A. Finally, a long-delay free recall and cued recall of the words from List A as well as a recognition task were conducted.
Two neuropsychological measures were used for the analyses: the sum of recalled words from trials 1-5 of List A (CVLT_1-5; immediate recall), and the number of recalled words in the delayed free recall of List A (CVLT_LDFR; long-delay recall). Age-, sex-, and education-corrected z-scores were calculated according to a normative sample [42,43]. Because we only had normed data for age up to 88 years, three participants (one NC and two aMCI participants) had to be classified one year younger than their actual age to calculate the z-score.

Statistical Analyses
Socio-demographic data were analyzed with R version 3.3.3 [44] (https://www.R-project.org/ (accessed on 19 October 2021)). Voxel-based whole-brain correlations were conducted separately for the corrected and uncorrected data set, using CVLT immediate and delayed free recall as variables of interest (covariate: TIV). We performed group-independent (i.e., over all participants) multiple regressions in SPM12 (Wellcome Institute of Cognitive Neurology, www.fil.ion.ucl.ac.uk (accessed on 19 October 2021)) implemented in MATLAB R2010b (Mathworks Inc., Sherborn, MA, USA). Significant voxels were identified using a threshold of p < 0.001 for the correlation with CVLT performance and a cluster size of more than 10 voxels.
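The logic of this analysis can be sketched in Python as a voxel-wise multiple regression followed by a height and extent threshold. This is an illustrative approximation under stated assumptions, not the SPM routine: the function name, the one-sided test for a positive association, the mean-centering of TIV, and the 18-connectivity used to define clusters are all assumptions introduced here.

```python
import numpy as np
from scipy import ndimage, stats


def voxelwise_regression(gm, score, tiv, p_thresh=0.001, min_cluster=10):
    """Regress grey matter on a memory score with TIV as covariate.

    gm: array (n_subjects, x, y, z) of smoothed grey matter maps.
    score, tiv: arrays of length n_subjects.
    Returns the t-map for the score regressor and a boolean mask of voxels
    that pass p < p_thresh and lie in clusters larger than min_cluster.
    """
    n = gm.shape[0]
    vol_shape = gm.shape[1:]
    Y = gm.reshape(n, -1)  # one column per voxel

    # Design matrix: intercept, variable of interest, covariate (assumed centered)
    X = np.column_stack([np.ones(n), score, tiv - np.mean(tiv)])
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ beta
    dof = n - X.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof

    # t-statistic for the score regressor (contrast [0, 1, 0])
    c = np.array([0.0, 1.0, 0.0])
    var_c = c @ np.linalg.inv(X.T @ X) @ c
    t = (c @ beta) / np.sqrt(sigma2 * var_c)

    # One-sided p-value: positive association (assumption)
    p = stats.t.sf(t, dof)
    sig = (p < p_thresh).reshape(vol_shape)

    # Extent threshold: label clusters with 18-connectivity (assumption)
    structure = ndimage.generate_binary_structure(3, 2)
    labels, _ = ndimage.label(sig, structure=structure)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # label 0 is background
    keep = sizes[labels] > min_cluster
    return t.reshape(vol_shape), keep
```

Running the same function once on the corrected and once on the uncorrected grey matter maps yields two masks that can then be compared voxel by voxel.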

Voxel-Based Morphometry
Brain-behavioral analyses showed significant correlations between immediate recall CVLT_1-5 scores and grey matter volume for the manually corrected as well as for the uncorrected MRI data. Table 2a illustrates that one single cluster in the middle occipital gyrus (−37, −87, 38) appeared in only one of the two analyses. Peak voxels with no distinction between corrected and uncorrected data were located in hippocampus, precuneus, temporal gyrus, and occipital gyrus (all p < 0.007; Table 2a, Figure 2).
Table 2. Grey matter voxels with significant correlation with (a) CVLT_1-5 and (b) CVLT_LDFR performance and their anatomical locations (regions determined using the MRIcron aal template) for the corrected and uncorrected data set, respectively. Where coordinates differ between the two data sets, the coordinates for the uncorrected data set are given in brackets. Missing values in cluster size and p value represent a voxel that appeared significant only in the corrected or uncorrected sample. Anatomical regions with an asterisk (*) are undefined voxels in the aal template, and the regions are defined according to neighboring voxels.
For the CVLT_LDFR scores, we found significant correlations with voxels in the hippocampus, amygdala, precuneus, parietal gyrus, frontal gyrus, and occipital gyrus. As illustrated in Table 2b, results differed in two clusters in the inferior parietal gyrus (−24, −66, 44) and the middle occipital gyrus (−30, −89, 22) (all p < 0.046; Figure 3).
Cluster differences of corrected and uncorrected MRI data are shown in Figure 4.
Displayed is the overlap image, as well as separate images for the corrected (green) and uncorrected (red) data.
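The comparison behind such an overlap image can be sketched as a voxel-wise comparison of two binary significance masks. The following Python example is only an illustration: the function name is hypothetical, and the Dice coefficient is added here as a convenient summary that is not reported in this study.

```python
import numpy as np


def compare_masks(corrected, uncorrected):
    """Count voxels shared by, or unique to, two binary significance masks."""
    corrected = np.asarray(corrected, dtype=bool)
    uncorrected = np.asarray(uncorrected, dtype=bool)

    overlap = corrected & uncorrected
    only_corrected = corrected & ~uncorrected
    only_uncorrected = uncorrected & ~corrected

    # Dice coefficient: 2 * |A and B| / (|A| + |B|); undefined if both are empty
    total = int(corrected.sum()) + int(uncorrected.sum())
    dice = 2.0 * int(overlap.sum()) / total if total else float("nan")

    return {
        "overlap": int(overlap.sum()),
        "only_corrected": int(only_corrected.sum()),
        "only_uncorrected": int(only_uncorrected.sum()),
        "dice": dice,
    }
```

Voxels counted as unique to one mask correspond to the regions where the corrected and uncorrected analyses disagree, for example at cluster borders.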

Discussion
Comparing studies that use quantitative MRI measures in AD populations is difficult, as no common preprocessing protocol is available. The methodological decisions are highly variable [46,47], and the reporting in scientific papers is not always sufficiently transparent [48]. We asked a total of twelve research facilities and clinics from eight different countries whether or not they apply standardized MRI preprocessing protocols and/or perform manual correction before running VBM analyses. Feedback was very scarce, such that reliable interpretation is difficult. However, the responses suggest that most facilities do not use standardized MRI preprocessing protocols and that even fewer perform manual correction. There is a need for standardized and optimized (i.e., easy to use and highly efficient) preprocessing protocols for MRI scans to enhance the interpretation and comparison of VBM results across different studies and improve clinical practicability. A first step in this direction is to evaluate the effect on brain-behavioral results if manual voxel corrections are omitted.
As expected and reported by many other authors before, we found that NCs performed significantly better than the AD group in the immediate as well as the delayed free recall of the CVLT. This indicates that the sample used fulfills the requirements for our study (i.e., the sample enables us to investigate cortical structures).
In the immediate as well as the delayed free recall condition, VBM analyses revealed significant associations with five clusters that showed identical peak voxels for manually corrected and uncorrected data. In each condition, the peak voxel location of a sixth cluster differed only in the x-axis by 1 mm between the corrected and uncorrected data. Furthermore, one cluster showed up only in the corrected data of the immediate free recall condition (middle occipital gyrus, −37, −87, 38). In the delayed free recall condition, one cluster was significant in the corrected data only (inferior parietal gyrus, −24, −66, 44) and another one in the uncorrected data only (middle occipital gyrus, −30, −89, 22). We consider these differences between the manually corrected and uncorrected data to be only marginal. Significant clusters were located in regions expected to be involved in episodic memory, such as the hippocampus, amygdala, and precuneus. Figure 4 shows that results between corrected and uncorrected MRI data differed mainly at the borders of identified clusters, and there was no specific pattern regarding the varying cluster sizes of manually corrected versus uncorrected data (see Table 2). For some regions, significant clusters comprise more voxels in the corrected data, while others contain more voxels in the uncorrected data (clusters differ in a range from 105 to 2972 voxels). One possible reason for this observation is that the accuracy of automated voxel allocation to grey and white matter differs depending on brain region and diagnosis. In normal aging, for example, studies reported preservation of grey matter volume in structures such as the amygdala and hippocampus [49][50][51]. These are precisely the regions showing more voxels in the uncorrected data. Ashburner and Friston [20] noted that voxels sometimes contain a mixture of tissues and usually appear as grey matter when located at the interface between white matter and ventricles.
As we combined images from the NC and AD groups, signal intensity could have affected the primary assignment of voxels to grey or white matter and thus might have influenced the manual misclassification correction.
Multiple regression analysis was performed by combining the two groups NC and AD to illustrate a continuous range of behavioral functioning and neuroanatomic structural patterns. As we investigated MRI images from very early AD patients, our findings can only be generalized very cautiously to AD patients in later stages of the disease. However, our results suggest that manual correction of grey/white matter voxel misclassifications can be skipped without losing meaningful information, specifically when investigating cortical regions in very early AD subjects. Considering that manual correction is subjective and that even experienced raters usually do not make identical modifications, it might be advisable to avoid the error-prone manual correction and save time by using automated approaches, which would enhance comparability between studies.
Methodological aspects tend to influence VBM results and their reproducibility [23,24,28,33,47]. Our results are restricted to preprocessing protocols using DARTEL in SPM8 (Wellcome Institute of Cognitive Neurology, www.fil.ion.ucl.ac.uk (accessed on 19 October 2021)). Callaert et al. [31] discussed the advantage of normalization and segmentation using SPM8 plus DARTEL over the SPM5/SPM8 algorithm. The updated segmentation procedure implemented in the SPM8 release determines additional tissue classes to reduce misclassification and takes into account the problem of suboptimal probability maps and age effects on grey matter estimates. In combination with DARTEL, errors can be reduced by using a template optimized for the population under study [30,52]. In the case of AD, SPM8 plus DARTEL is highly recommended as it has been shown to display better results in populations with deviant anatomy than conventional VBM by SPM5/SPM8 [31,53]. Therefore, our finding that manually corrected and uncorrected data did not differ in any specific pattern may be explained to some extent by the better performance of the DARTEL approach over older versions.
Standardized guidelines for structural MRI preprocessing, especially for the handling of voxel misclassifications, are missing not only for AD populations but also for other diseases. Possibly, our approach can serve as a template for the development of preprocessing protocols in other conditions. Ideally, the same standard structural MRI preprocessing protocol can be applied to diverse conditions, e.g., for patients with AD, chronic pain, spinal degeneration, or drug addiction. However, one hurdle to establishing an approach that is valid across heterogeneous conditions may be that the brain regions involved vary between diseases. These regions can be differently vulnerable to the occurrence of voxel misclassifications, which can influence the results. It is important to note that our results can only be transferred with great caution to other populations, brain regions, resolutions, or studies with different methodological approaches. We used the CVLT, an episodic memory task, to measure behavior. A generalization to behavior per se is not possible. Furthermore, validation of the results by replication in a larger sample is recommended.

Conclusions
Manual correction of grey/white matter voxel misclassifications in structural MRI did not have considerable impact on brain-behavior correlations for episodic memory in AD patients using DARTEL in SPM8. Skipping manual correction of voxel misclassifications is therefore legitimate when specifically investigating cortical regions in AD populations by using a 3-Tesla MRI scanner, although performing all preprocessing steps remains the gold standard. Importantly, the presented results are restricted to AD patients in an early stage. A follow-up study investigating how manual correction of voxel misclassifications influences results in later stages (i.e., when atrophy is more prominent) should be performed. We assume that our study is meaningful, as we noticed that many institutions use no standardized MRI preprocessing protocols and/or do not perform any manual correction of voxel misclassifications. The lack of transparency regarding the preprocessing protocols used hampers the comparison and reproducibility of study results. Therefore, public guidelines for efficient data preprocessing are needed.

Funding:
Parts of the baseline data from a longitudinal project funded by the Swiss National Science Foundation (Ambizione fellowship PZ00P1_126493) were used in this project. Beyond this, the work presented here did not receive any external funding.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the local Ethics Committee EKNZ (Ethikkommission Nordwest-und Zentralschweiz; protocol code 257/10, approval date 2 November 2010).

Informed Consent Statement:
Written informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data are not publicly available due to privacy restrictions. Data requests can be sent to the corresponding author.