Comparison of Inter-Method Agreement and Reliability for Automatic Brain Volumetry Using Three Different Clinically Available Software Packages

Background and Objectives: No comparative study has evaluated the inter-method agreement and reliability between Heuron AD and other clinically available brain volumetric software packages. Hence, we aimed to investigate the inter-method agreement and reliability of three clinically available brain volumetric software packages: FreeSurfer (FS), NeuroQuant® (NQ), and Heuron AD (HAD). Materials and Methods: In this study, we retrospectively included 78 patients who underwent conventional three-dimensional (3D) T1-weighted imaging (T1WI) to evaluate their memory impairment, including 21 with normal objective cognitive function, 24 with mild cognitive impairment, and 33 with Alzheimer’s disease (AD). All 3D T1WI scans were analyzed using three different volumetric software packages. Repeated-measures analysis of variance, intraclass correlation coefficient, effect size measurements, and Bland–Altman analysis were used to evaluate the inter-method agreement and reliability. Results: The measured volumes demonstrated substantial to almost perfect agreement for most brain regions bilaterally, except for the bilateral globi pallidi. However, the volumes measured using the three software packages showed significant mean differences for most brain regions, with consistent systematic biases and wide limits of agreement in the Bland–Altman analyses. The pallidum showed the largest effect size in the comparisons between NQ and FS (5.20–6.93) and between NQ and HAD (2.01–6.17), while the cortical gray matter showed the largest effect size in the comparisons between FS and HAD (0.79–1.91). These differences and variations between the software packages were also observed in the subset analyses of 45 patients without AD and 33 patients with AD. Conclusions: Despite their favorable reliability, the software-based brain volume measurements showed significant differences and systematic biases in most regions. Thus, these volumetric measurements should be interpreted based on the type of volumetric software used, particularly for smaller structures. Moreover, users should consider the replaceability-related limitations when using these packages in real-world practice.


Introduction
Magnetic resonance imaging (MRI)-based brain volumetry is increasingly being applied for assessing a wide range of neurological diseases, specifically neurodegenerative diseases such as Alzheimer's disease (AD), in clinical practice [1]. In contrast to the visual assessment of brain volume changes, the quantitative assessment of brain volume serves as a valid biomarker of the clinical state and disease progression by providing reliable and robust inferences regarding the underlying disease-related mechanisms [1,2].
Various volumetric software packages, regardless of their commercialization status, have been developed and used for clinical and research purposes. These software packages automatically measure regional brain volumes or cortical thicknesses and have become simpler and more intuitive to use over time [2,3]. With the increasing use of volumetric software, several studies have compared NeuroQuant® (NQ) and FreeSurfer (FS) as representative products for brain volumetry [4-7]. NQ has several advantages over FS, such as shorter total processing time, user-friendly workflow, and direct interaction with the PACS server. For example, the server can be set up to interface directly with the MRI scanner, allowing automatic sending of information for processing. In addition, NQ has an integrated normal control database that is better suited for routine clinical applications [8]. On the other hand, FS is freely available, adaptable, and commonly used in research settings as users can manually edit any errors detected in the automated region of interest (ROI) and perform additional volumetric analyses through further manual segmentation of the ROI [9]. In previous studies, significant differences have been observed in certain volume measurements between the two methods, as NQ provided larger volumes of brain regions than FS for large structures such as the intracranial volume, forebrain parenchyma, lateral ventricles, and cerebellum [4-7,9]. In real-world practice, automated brain volumetry improves diagnostic accuracy across various fields, such as clinical psychiatry and neurology, where diagnosis often relies on subjective self-reports and test results [10,11]. However, several limitations are associated with potential errors that occur during the quantitative analyses of brain volumes. These limitations can stem from physical constraints, the lack of large-scale normal data, and pathophysiological constraints [12]. Reproducibility poses another challenge in the clinical application of different volumetric software packages for interpreting the measured brain volumes. Although the results of recent studies revealed good-to-excellent correlations between different volumetric software packages, significant differences were observed between the measured brain volumes. This underscores the need for careful attention during interpretation [3,7,12-14].
Over the past few decades, beginning in the early 2000s, various brain volumetry software packages with unique characteristics have been rapidly developed by multiple international vendors to compensate for the shortcomings of existing software products. Heuron AD (HAD) is a deep learning-based volumetric software package developed by Heuron Co., Ltd. and recently approved by the Ministry of Food and Drug Safety (MFDS, or Korea Food and Drug Administration [K-FDA]). HAD employs a segmentation model that uses deep neural networks. This software package provides information on neurodegeneration by comparing age-adjusted volume and cortical thickness measurements with clinically normative data, and indicates the presence and location of brain atrophy. Additionally, HAD includes a longitudinal analysis function that allows the analysis of repeated MRI scans to measure and monitor the changes in cortical thickness over time. At our institution, HAD, which was provided to us for a limited time as a product demo, was used for brain volume analysis. This prompted us to compare the brain volume analysis results of the same patients obtained from FS and NQ with those from HAD. To date, no comparative study has evaluated the inter-method agreement and reliability between HAD and other clinically available software packages. We hypothesized that any differences or systematic bias in volume measurements in certain brain structures between the different software packages could warrant caution regarding the reliability of automated brain segment analysis and the interpretation of the results. Thus, in this study, we aimed to evaluate the inter-method agreement and reliability of three clinically available software packages: the established research software, FS; the commonly used commercial software, NQ; and the recently developed commercial software, HAD.

Patients
This retrospective study involving human participants was conducted in accordance with the ethical standards of the Institutional Research Committee and the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. This study was approved by the Institutional Review Board of Gyeongsang National University Changwon Hospital (Approved Protocol Code: GNUCH 2022-11-026; Approval Date: 14 December 2022). The requirement for informed consent was waived owing to the retrospective nature of the study. Patient records and information were anonymized and de-identified before data analysis.
We searched the picture archiving and communication systems and electronic medical records of patients who underwent brain MRI, including conventional 3D T1-weighted imaging (T1WI), for the assessment of memory impairment. The characteristics of the study population are listed in Table 1. A total of 78 patients (52 women and 26 men; age range: 21-88 years; mean age: 66.2 ± 17.4 years) were included in this study. Among the 78 patients with subjective cognitive impairment (i.e., memory impairment), 45 were categorized into the non-AD group (normal objective cognitive function [21/45, 46.7%] or mild cognitive impairment [MCI; 24/45, 53.3%]), while the remaining 33 were categorized into the AD group. The diagnoses of MCI and AD were clinically determined by three dementia specialists (two neurologists and one psychiatrist) using the following neuropsychiatric evaluation tools: Mini-Mental State Examination, Clinical Dementia Rating scale, Seoul Neuropsychological Screening Battery, Consortium to Establish a Registry for Alzheimer's Disease, Diagnostic and Statistical Manual of Mental Disorders (5th edition) criteria [15], and the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association criteria [16].

Image Post-Processing Volumetric Procedures
Following the visual inspection of the scans by a faculty neuroradiologist (H.J.B., with 13 years of post-training experience) to identify the presence of artifacts that could affect post-processing, the raw Digital Imaging and Communications in Medicine (DICOM) data were submitted and analyzed using three different processing pipelines: NQ (CorTechs Labs, San Diego, CA, USA), FS (Harvard University, Boston, MA, USA), and HAD (Heuron Co., Ltd., Seoul, Republic of Korea). Two faculty neuroradiologists (H.J.B., with 13 years of post-training experience, and Y.J.H., with 8 years of post-training experience) performed the automated analyses using NQ and HAD in all patients, and a software engineer jointly executed the automated analyses using FS in all patients. The software packages used in this study provided the total intracranial volume and the volumes of the cortical gray matter (GM), cerebral white matter (WM), hippocampus, amygdala, caudate nucleus, putamen, pallidum, thalamus, and cerebellum.
NQ is the first FDA-approved volumetry software package and a standalone, fully automatic processing pipeline. In NQ, the brain is inflated to a spherical shape and mapped to a common spherical space using the Talairach atlas coordinates. The segmented brain regions are then identified, and the brain is deflated to its original shape. The volume of each brain region is corrected for head size differences by normalizing it to the intracranial volume (ICV), and the resulting output is expressed as a percentage. The results are compared with the data from healthy controls, which have been stored in the NQ database.
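As a minimal illustration of the head-size correction described above, the following Python sketch expresses a regional volume as a percentage of the ICV. The function name and the example values are hypothetical; they are not taken from the NQ software or from the study data.

```python
def percent_of_icv(region_volume_cm3, icv_cm3):
    """Return a regional volume as a percentage of intracranial volume."""
    if icv_cm3 <= 0:
        raise ValueError("ICV must be positive")
    return 100.0 * region_volume_cm3 / icv_cm3


# Hypothetical values for illustration only (not study data).
hippocampus_cm3 = 3.5
icv_cm3 = 1400.0
print(f"{percent_of_icv(hippocampus_cm3, icv_cm3):.3f}% of ICV")
```

Expressing volumes relative to the ICV allows patients with different head sizes to be compared against the same normative range.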
FS uses a template-driven approach for volumetric and surface-based segmentation, as described in previous studies [9,16-18]. All data were batch-processed using an Intel i7-10700 central processing unit (CPU) running VirtualBox (CentOS 7), and the data were initially processed using the recon-all command to produce a full segmentation.
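For orientation, a FreeSurfer run of this kind could be scripted as sketched below in Python. The sketch only assembles the recon-all command line and does not execute it. The subject identifier and input path are hypothetical placeholders, and because the exact options used in this study are not reported, the use of the standard -i, -s, and -all options is an assumption.

```python
import shlex


def build_recon_all_command(subject_id, input_image):
    """Assemble a recon-all call that runs the full processing stream.

    -i   : input image (e.g., a DICOM file of the series or a NIfTI volume)
    -s   : subject identifier used to name the output directory
    -all : run all reconstruction stages
    """
    return ["recon-all", "-i", input_image, "-s", subject_id, "-all"]


# Hypothetical subject ID and path, for illustration only.
cmd = build_recon_all_command("sub-001", "/data/sub-001/T1.nii.gz")
print(shlex.join(cmd))

# To execute on a machine with FreeSurfer installed and configured:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when such a call is repeated over many patients in a batch.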
HAD provides information about brain atrophy. The software divides the brain region, calculates the volume and cortical thickness, and compares the calculated results with normative data to provide a brain atrophy index for users. The segmentation engine introduced a deep learning architecture to segment the entire brain into 98 ROIs using a state-of-the-art (SOTA) parcellation model consisting of three fully convolutional neural networks (FCNNs) and an aggregation layer. Each FCNN computes the geometrical features in the axial, coronal, and sagittal slices. The features from the three FCNNs are aggregated by the final layer to create a parcellation mask. All the training data for the parcellation model were manually annotated by expert neurologists.
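The multi-view design described above can be illustrated with a toy Python sketch. The internal details of HAD are not reported here, so the following is only a stand-in under stated assumptions: simple averaging of per-voxel class probabilities replaces the learned aggregation layer, and all names, classes, and numbers are hypothetical.

```python
def aggregate_views(axial, coronal, sagittal):
    """Fuse per-voxel class probabilities from three views into a label mask.

    Each argument is a list with one entry per voxel; each entry is a list of
    class probabilities. Averaging is an assumed stand-in for the learned
    aggregation layer, not the method used by HAD.
    """
    mask = []
    for p_ax, p_co, p_sa in zip(axial, coronal, sagittal):
        fused = [(a + c + s) / 3.0 for a, c, s in zip(p_ax, p_co, p_sa)]
        # Assign the voxel to the class with the highest fused probability.
        mask.append(max(range(len(fused)), key=fused.__getitem__))
    return mask


# Two voxels, three hypothetical classes (0 = background, 1 = putamen, 2 = pallidum).
axial = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]]
coronal = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
sagittal = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
print(aggregate_views(axial, coronal, sagittal))
```

In the second voxel, the axial view alone would favor class 1, whereas the fused result favors class 2; this is the kind of disagreement between views that an aggregation step is meant to resolve.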

Statistical Analysis
Normality of data was assessed using the Kolmogorov-Smirnov test, and data were expressed as the mean ± standard deviation. Repeated-measures analysis of variance (ANOVA), followed by post-hoc tests with Bonferroni correction for multiple comparisons, was performed to assess differences in the mean volume measurements between NQ, FS, and HAD. The inter-method agreement across the three software packages was assessed using the intraclass correlation coefficient (ICC) values, which were interpreted as follows: 0.01-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; and 0.81-1.00, almost perfect [18]. The effect sizes were used to evaluate the inter-software agreements across the three software packages in measuring the volume using the following equation: effect size = mean difference/pooled standard deviation [5,19]. The effect sizes were categorized as follows: <0.20, negligible; 0.20-0.49, small; 0.50-0.79, medium; and ≥0.80, large [20]. Bland-Altman plots were generated, and the mean bias and 95% limits of agreement (LOA) were obtained for each comparison. All statistical analyses were performed using statistical software packages (SPSS, version 26.0, IBM, Armonk, NY, USA; MedCalc, version 19.8, MedCalc Software, Mariakerke, Belgium), and a p value of <0.05 (two-sided) was considered to indicate significance.
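The agreement measures described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the SPSS or MedCalc procedures used in the study. The ICC model is not specified in the text, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), is an assumption, and the example volumes are hypothetical.

```python
import math
import statistics


def cohen_d(a, b):
    """Effect size = mean difference / pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def effect_size_label(d):
    """Map |d| to the categories listed in the text."""
    d = abs(d)
    if d < 0.20:
        return "negligible"
    if d < 0.50:
        return "small"
    if d < 0.80:
        return "medium"
    return "large"


def bland_altman(a, b):
    """Return (mean bias, lower 95% LOA, upper 95% LOA) for paired data."""
    if len(a) != len(b):
        raise ValueError("paired measurements must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_2_1(rows):
    """ICC(2,1), an assumed model: rows[i][j] = subject i measured by method j."""
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


# Hypothetical paired volumes (cm^3) from two methods, for illustration only.
method_a = [4.1, 3.8, 4.5, 3.9, 4.2]
method_b = [3.6, 3.5, 4.0, 3.3, 3.9]

d = cohen_d(method_a, method_b)
bias, lower, upper = bland_altman(method_a, method_b)
icc = icc_2_1([list(pair) for pair in zip(method_a, method_b)])
print(f"d = {d:.2f} ({effect_size_label(d)})")
print(f"bias = {bias:.2f}, 95% LOA = [{lower:.2f}, {upper:.2f}]")
print(f"ICC(2,1) = {icc:.2f}")
```

Because the 95% LOA are defined as the mean bias ± 1.96 standard deviations of the paired differences, they are always symmetric around the mean bias, which provides a quick consistency check on any reported interval.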

Comparison of Total ICV
A significant difference was observed in the total ICV between the three software packages (Table 2). The total ICV obtained using NQ was the largest among the measures obtained from all patients (NQ: 1427.37 ± 152.94 cm³, FS: 1414.41 ± 142 cm³, HAD: 1381.46 ± 140.69 cm³) (p < 0.0001). However, no significant difference was observed between NQ and FS in all patients (p = 0.101). All total ICVs showed almost perfect agreement (0.926-0.967), with a negligible-to-small effect size between the three software packages. In the Bland-Altman analysis, the mean bias and 95% LOA between NQ and HAD were the greatest across all the software comparisons (45.90 cm³ [−104.97 cm³ to 196.78 cm³]) (Supplementary Material Table S1).

Comparison of the Measured Volumes of Segmented Brain Regions
NQ showed the largest measured volume in the cortical gray matter (GM), cerebral white matter (WM), putamen, thalamus, and cerebellum. However, NQ showed the smallest measured volume in the pallidum. HAD showed the largest measured volume in the hippocampus, amygdala, and caudate.
According to the repeated-measures ANOVA, the three software packages showed significant differences in the measured volume for most brain regions (Table 3). The measured volumes of most brain regions were significantly different between NQ and FS, although the two software packages showed almost perfect agreement in most regions (Table 4). With regard to the effect size, the pallidum showed the largest effect size in both hemispheres (Table 5 and Figure 1). The comparison between NQ and HAD also showed significant differences in volumetric measurements for most regions, except for the amygdala (Lt., p = 0.634). NQ and HAD showed substantial to almost perfect agreement for all individual regions, except for both pallidi (ICC: 0.37-0.56), and the pallidum also showed the largest effect size in both hemispheres (Figure 1C,D).
Comparison between FS and HAD showed no significant differences in the volumetric measurements of the deep GM. FS and HAD revealed substantial to almost perfect agreement for all individual regions, in contrast to the results between NQ and HAD. The largest effect size was observed in cortical GM (Lt. = 0.97; Rt. = 0.93).
The results of the Bland-Altman analysis of all software comparisons are summarized in the Supplementary Material Table S2. The mean bias and 95% LOA of the cortical GM were the greatest among the segmented brain regions in the comparison between NQ and HAD and between FS and HAD. However, the mean bias and 95% LOA for cerebral WM were the greatest in the comparison between NQ and FS.

Results of Subgroup Analyses by Presence or Absence of AD
In the subgroup analyses (non-AD vs. AD), the total ICV also showed significant differences between the three software packages (Table 2). The total ICV obtained using NQ was the largest in non-AD patients (NQ: 1473.69 ± 158.79 cm³, FS: 1446.49 ± 144.85 cm³, and HAD: 1420.02 ± 139.01 cm³; all p < 0.0001). By contrast, the total ICV obtained using FS was the largest in AD patients (FS: 1370.66 ± 127.97 cm³, NQ: 1364.21 ± 120.35 cm³, HAD: 1328.89 ± 126.94 cm³; p < 0.0001). However, no significant difference was found between NQ and FS in AD patients (p = 1.000). In addition, the total ICV showed almost perfect agreement (0.894-0.992), with a negligible-to-small effect size, regardless of the disease status or type of software compared. For mean bias with 95% LOA, the values between NQ and HAD were the greatest among all the software comparisons: 53.67 cm³ [−127.12 cm³ to 234.46 cm³] in non-AD patients and 35.32 cm³ [−59.88 cm³ to 130.53 cm³] in AD patients (Supplementary Material Table S1).
The comparison between NQ and HAD also showed significant differences in the volumetric measurements of most regions, except the amygdala (Lt.: p = 1.000 in non-AD patients vs. p = 0.136 in AD patients; Rt.: p = 1.000 in AD patients) and right caudate nucleus (p = 1.000 in AD patients). NQ and HAD showed substantial to almost perfect agreement for all individual regions, except for both pallidi (ICC, 0.38-0.61) and hippocampi (ICC, Lt. = 0.600, Rt. = 0.523 in non-AD patients). Regardless of AD status, the pallidum showed the largest effect size in both hemispheres (d = 2.01-6.17).
The comparison between FS and HAD showed no significant differences in volumetric measurements of the deep GM (the left pallidum and both thalami in non-AD patients; and the right hippocampus, left caudate, left pallidum, both putamina, and both thalami in patients with AD). FS and HAD revealed substantial to almost perfect agreement for all individual regions, except for both pallidi (ICC, 0.23-0.92). The largest effect size was observed in the cortical GM of patients with AD (Lt., 1.91; Rt., 1.83) and in the right putamen (1.61) of patients without AD.
Supplementary Material Table S2 summarizes the results of the Bland-Altman analysis for all software comparisons. The mean bias and 95% LOA for the cortical GM were the greatest among the segmented brain regions in the comparison between NQ and HAD and between FS and HAD in both subgroups. However, the mean bias and 95% LOA for the cerebral WM were the greatest in the comparison between NQ and FS, except in the left hemisphere in the non-AD group.

Discussion
In this study, we compared the inter-method agreement and reliability between three clinically available brain volumetry software packages: FS, NQ, and HAD. We found substantial to almost perfect agreement for most segmented brain regions, except for the pallidi. However, the volume measurements for most segmented regions showed significant differences and moderate or large effect sizes across the three volumetric software packages. In particular, both pallidi showed the largest effect size in the comparison between NQ and FS and between NQ and HAD. Meanwhile, the cortical GM showed the largest effect size in the comparison between FS and HAD. In the current study, the favorable inter-method agreement for most segmented brain regions suggested that each software package provided good qualitative information on the brain structure. However, the significant differences and systematic biases in the majority of brain volume measurements, likely stemming from procedural variations in each method, can raise doubts about the reliability of automated brain analysis as a quantitative tool for routine clinical practice.
Various volumetric software packages have recently become clinically available for automatic segmentation in clinical settings. Several previous studies have extensively explored the inter-method reliability for various volumetric software, especially those comparing FS and NQ [5-7,12,13,21]. These studies reported good-to-excellent correlations in the volumetric results across various brain regions. However, they also observed significant overall differences in the mean volumes for segmented brain regions, which was consistent with the results of the present study [5-7,21]. Interestingly, they observed larger volumes when using NQ compared with FS in most brain regions, which was similar to the results of this study, except for both hippocampi, left amygdala, and pallidi. However, to the best of our knowledge, no clinical studies have explored HAD, as it was developed more recently. Although NQ provided the largest volume measurements for most brain regions, HAD showed the largest volume measurements in the hippocampus, amygdala, and caudate. Significant mean differences, as shown in our study, have also been previously reported in the volume measurements of most brain regions obtained using different software packages [12,22]. Although Inbrain® (MIDAS Information Technology Co., Ltd., Seongnam, Republic of Korea) and NQ showed good-to-excellent inter-method reliability for all brain regions, they also showed significantly different volume measurements with large effect sizes [13]. Another study [23] also demonstrated good-to-excellent inter-method reliability and correlation between vendor-provided volumetry software and NQ for most brain structures. However, a significant difference was found in the measured volumes, except for the right hippocampus. This variation may be attributed to the differences in volumetric results obtained using various software packages based on different atlases [23,24]. Furthermore, image noise and heterogeneity in intensity could introduce errors in quantitative measurements and affect the performance of the volumetry software [22].
In this study, NQ showed a significantly smaller pallidal volume than FS and HAD. Consistent with our findings, previous studies have also shown a stronger correlation between the measurements of large structures (such as the ICV) and a lower correlation between the measurements of small and deeper structures in NQ compared with other available volumetric software packages [5,6,12,23]. Among these small structures, the pallidum showed the lowest correlation across the volumetric software packages and the largest effect size among the segmented regions. Previous studies have proposed two main explanations for the inconsistency in pallidal volume measurements. First, the accurate segmentation of the pallidum from the adjacent WM is challenging owing to its T1 signal intensity, which is influenced by the higher myelination content of the pallidum. This challenge could be addressed by including the adjacent WM and putamen in the calculation [5,12,25]. Second, metal deposition associated with the aging or degeneration processes in the pallidum may affect the T1 relaxation time, thereby impacting software-based volume measurement [26]. However, another previous study demonstrated good reproducibility of pallidum measurements between FS and Inbrain® owing to their similar segmentation method [27]. Furthermore, our analysis revealed that the measured volumes of the pallidum, putamen, and thalamus with HAD were closer to the FS values than the NQ values, regardless of statistical significance. In addition, no significant differences and a higher ICC were observed in the measurements of both pallidi between FS and HAD than between NQ and HAD or between NQ and FS in the present study.
The volume of the hippocampus is considered an important biomarker of AD [28] and has predictive value for the conversion of MCI to AD in clinical practice [7]. Although the prognostic value of the hippocampal measurements was not investigated in this study, we measured and compared the volume of this structure obtained using the three software packages and also conducted group comparisons of these values in relation to the presence of AD. The largest mean hippocampal volume was obtained with HAD, followed by FS and NQ, regardless of the disease status. However, the hippocampal volumes in the AD group were consistently smaller than those in the non-AD group, regardless of the volumetric software package.
This study had several limitations. First, we retrospectively evaluated a relatively small and heterogeneous cohort of patients from a single institution, introducing a potential selection bias. Second, we classified patients with MCI into a non-AD group and dichotomized the study population according to the presence of AD for group comparison owing to the small number of patients enrolled. Thus, additional studies with a larger sample size are warranted to validate our results regarding the comparison of three clinical groups using three or more volumetric software packages, as the early detection of MCI holds clinical significance as a prodromal state rather than an overt state of AD. Third, we only evaluated the inter-method reliability using a single MR scanner with a homogeneous protocol in a single institution. Therefore, multicenter studies using different scanning environments and protocols are required to confirm our results. Fourth, age could potentially influence the volumes in various brain regions; however, we were unable to obtain age-adjusted values using all software packages. Finally, although FS has shown accuracy and reliability comparable to those of manual segmentation performed by experts in previous studies, the lack of a reference standard for true brain volumes of the various anatomic regions remains a limitation [29-31].

Conclusions
In conclusion, we compared the measured volumes of various brain regions using three clinically available volumetric software packages, FS, NQ, and HAD. We observed substantial to almost perfect agreement between the software packages. However, significant differences were observed in the mean volumes for most brain regions and consistent systematic biases with wide LOA between the three software packages in both AD and non-AD groups, potentially limiting reproducibility. Unfortunately, no objective gold standard has been established for measuring brain segment volume, making it challenging to determine which software is closest to reality. Previous studies have used various methods and tools to measure brain volumes, further complicating comparisons. Similar to previous studies, our results are unsuitable for determining which software package is superior for evaluating patient conditions for clinical and research purposes. However, our findings underscore the importance of interpreting the volumetric measurements obtained using different software packages cautiously in real-world practice. All users, including clinicians and researchers, should be aware of these inherent limitations related to the replaceability of various volumetric software packages when using them in clinical settings, e.g., in tracking changes for longitudinal analyses.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/medicina60050727/s1, Table S1: Comparison of the mean bias and 95% limits of agreement (LOA) for total intracranial volume between the three volumetry software packages; Table S2: Comparison of the mean bias and 95% limits of agreement (LOA) for measured volumes in each brain region between the three volumetry software packages.

Informed Consent Statement:
The requirement for written informed consent was waived by our institutional review boards.

Figure 1. Representative color-coded axial MR images at the level of the basal ganglia. An axial T1-weighted image (A) is shown at the basal ganglia level with color-coded images of the FS, NQ, and HAD. In these representative images, the pallidum appears smaller in NQ (C) than in FS (B) or HAD (D). The pallidum is indicated with asterisks. FS, FreeSurfer; HAD, Heuron AD; NQ, NeuroQuant.

Author Contributions:
Conceptualization, H.J.B.; methodology, K.H.C. and Y.J.H.; validation, H.J.B. and J.Y.J.; formal analysis, K.H.C., Y.J.H. and J.-H.K.; writing-original draft preparation, K.H.C. and Y.J.H.; writing-review and editing, H.J.B. All authors have read and agreed to the published version of the manuscript.

Funding:
This study received no external funding.

Institutional Review Board Statement:
The study involved human participants and was conducted in conformity with ethical and humane principles of research according to the ethical guidelines of the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. The Institutional Review Board of Gyeongsang National University Changwon Hospital approved this study (Approved Protocol Code: GNUCH 2022-11-026; Approval Date: 2022-12-14).

Table 1. Demographic data of the study participants.

Table 2. Comparison of total intracranial volume.

Table 3. Comparison of measured volumes across the three volumetry software packages.

Table 4. Comparison of the inter-method reliability of volumetric measurements across the three volumetry software packages.

Table 5. Comparison of the effect size across the three volumetry software packages.