Effects of Patchwise Sampling Strategy on Three-Dimensional Convolutional Neural Network-Based Alzheimer’s Disease Classification

In recent years, the rapid development of artificial intelligence has promoted the widespread application of convolutional neural networks (CNNs) in neuroimaging analysis. Although three-dimensional (3D) CNNs can utilize the spatial information in 3D volumes, there are still challenges related to high-dimensional features and potential overfitting. To overcome these problems, patch-based CNNs have been used, which are beneficial for model generalization. However, it is unclear how the choice of patchwise sampling strategy affects the performance of Alzheimer’s disease (AD) classification. To this end, the present work investigates the impact of the patchwise sampling strategy on 3D CNN-based AD classification. A 3D framework cascading two stages of subnetworks was used for AD classification: the patch-level subnetworks learned feature representations from local image patches, and the subject-level subnetwork combined the discriminative feature representations from all patch-level subnetworks to generate a classification score at the subject level. Experiments were conducted to determine the effect of patch partitioning methods, the effect of patch size, and the interaction between patch size and training set size on AD classification. With the same data size and identical network structure, the 3D CNN model trained with 48 × 48 × 48 cubic image patches showed the best performance in AD classification (ACC = 89.6%). The model trained with hippocampus-centered, region of interest (ROI)-based image patches showed suboptimal performance. If the pathological features are concentrated only in some regions affected by the disease, empirically predefined ROI patches might be the right choice. The better performance of cubic image patches compared with cuboid image patches is likely related to the pathological distribution of AD. The image patch size and training sample size together have a complex influence on the classification performance.
The size of the image patches should be determined based on the size of the training sample to compensate for noisy labels and the curse of dimensionality. The conclusions of the present study can serve as a reference for researchers who wish to develop a superior 3D patch-based CNN model with an appropriate patch sampling strategy.


Introduction
The study of the human brain using neuroimaging technologies (generally including magnetic resonance imaging (MRI) and positron emission tomography (PET)) helps the discovery of brain abnormalities in structure and function [1]. Machine learning-based diagnostic image analysis has been widely applied to assist physicians in achieving efficiency and diagnostic accuracy in clinical practice [2,3]. In general, a machine learning algorithm, which typically employs voxel-wise or regional neuroimaging data as the input features, is used to learn the feature patterns of brain diseases [4]. However, traditional machine-learning methods are laborious and rely on well-designed, handcrafted features. The hippocampus region is most commonly selected as an ROI in AD studies [24,25]. By using the hippocampus as an ROI image patch, Liu et al. [26] combined a 3D UNet and a 3D DenseNet to learn the features and realized both hippocampus segmentation and disease classification. Huang et al. [27] proposed two 3D VGG-like CNNs to integrate the hippocampus features from both MRI and PET.
Previous studies have applied the image-patch method when investigating various diseases of the nervous system. However, the way in which the choice of patchwise sampling strategy affects 3D CNN-based classification performance remains unknown. In this study, we attempted to investigate the effect of patch sampling strategy on classification performance by classifying AD and cognitively normal (CN) brain images.

Dataset
Data employed in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/) accessed on 1 January 2022. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and AD.
High-resolution brain structural MRI (sMRI) scans were collected at multiple ADNI sites using a 1.5-T system from GE Healthcare, Philips Medical Systems, or Siemens Healthcare, depending on the scanning site. T1-weighted volumetric MP-RAGE data were collected for each subject, and the raw DICOM images were downloaded from the public ADNI site (http://www.loni.ucla.edu/ADNI/Data/index.shtml (accessed on 1 January 2022)). Parameter values vary by study scanning site and can be accessed at http://www.loni.ucla.edu/ADNI/Research/Cores/ (accessed on 1 January 2022). Only baseline scans were used for the study. The population in this study included ADNI-1 participants enrolled in the CN or AD cohorts, including 187 AD patients and 229 CNs at baseline. Demographic details of the two groups are provided in Table 1, including age, gender, years of education, and mini-mental state examination (MMSE).

Image Preprocessing
The downloaded data were first converted from DICOM to Neuroimaging Informatics Technology Initiative (NIfTI) format by using MRIcron software (http://people.cas.sc.edu/rorden/mricron/index.html (accessed on 1 January 2022)). Images were manually reoriented to place their native-space origin at the anterior commissure. Images were then preprocessed by using the Computational Anatomy Toolbox (CAT12) (http://www.neuro.uni-jena.de/cat/ (accessed on 1 January 2022)), an extended toolbox of SPM12 [27], with default settings. The preprocessing pipeline included realignment, skull stripping, and segmentation into gray matter and white matter; finally, the segmented gray matter images were spatially normalized into the Montreal Neurological Institute (MNI) space by using diffeomorphic anatomical registration through exponentiated Lie algebra (DARTEL) nonlinear normalization and modulated to preserve volume information. The modulated and warped 3D gray matter density maps (GMDMs) were smoothed by using a 3D Gaussian kernel of 2 mm full width at half maximum. The GMDMs had a dimensionality of 121 × 145 × 121 in the voxel space, with a voxel size of 1.5 × 1.5 × 1.5 mm³. The background voxels increased the computational complexity of the model but did not contribute to the classification performance. Thus, we established a new bounding box with dimensions of 91 × 115 × 91 (voxel size of 1.5 × 1.5 × 1.5 mm³), which removed most of the background voxels. The complete preprocessing pipeline is summarized in Figure 1.
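The background-removal step above can be sketched in a few lines of NumPy. This is a minimal illustration only: the paper specifies the 91 × 115 × 91 bounding box but not its placement, so centering the box on the volume is our assumption.

```python
import numpy as np

def crop_to_bounding_box(gmdm, target=(91, 115, 91)):
    """Crop a 121 x 145 x 121 GMDM to the 91 x 115 x 91 bounding box.
    NOTE: centering the box is an assumption; the paper gives only its size."""
    starts = [(s - t) // 2 for s, t in zip(gmdm.shape, target)]
    slices = tuple(slice(b, b + t) for b, t in zip(starts, target))
    return gmdm[slices]

gmdm = np.random.rand(121, 145, 121).astype(np.float32)
cropped = crop_to_bounding_box(gmdm)  # shape (91, 115, 91)
```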

Patch Extraction
The patchwise sampling strategy involved the following three partition methods for the whole brain images.
(1). Cubic image patches: Twelve 48 × 48 × 48 local image patches, which were partially overlapped, were sampled to cover the whole brain, as shown in Figure 2a.
(2). Cuboid image patches: Six 91 × 25 × 91 local image patches, which were also partially overlapped, were sampled along the coronal axis, as shown in Figure 2b.
(3). ROI patches: Two 64 × 64 × 64 image patches were sampled to cover the left (or right) hippocampus with certain margins, as shown in Figure 2c.
In addition, the whole brain image was sectioned into cubic patches of different sizes (eight 64 × 64 × 64, twenty-eight 32 × 32 × 32, and seventy-two 24 × 24 × 24 image patches) from the GMDMs in an effort to determine how the patch size influenced the model's performance.
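The overlapping cubic partition can be sketched as follows (a minimal NumPy example). The 2 × 3 × 2 grid of evenly spaced start offsets is our assumption for tiling a 91 × 115 × 91 volume with twelve overlapping 48 × 48 × 48 patches; the paper does not report the exact offsets.

```python
import numpy as np

def patch_starts(dim, patch, n):
    # Evenly spaced start indices so that n overlapping patches cover the axis.
    return np.linspace(0, dim - patch, n).round().astype(int)

def extract_cubic_patches(volume, patch=48, grid=(2, 3, 2)):
    """Partially overlapping cubic patches covering the whole volume.
    The 2x3x2 grid (12 patches) is an assumed layout for illustration."""
    xs = patch_starts(volume.shape[0], patch, grid[0])
    ys = patch_starts(volume.shape[1], patch, grid[1])
    zs = patch_starts(volume.shape[2], patch, grid[2])
    patches = [volume[x:x + patch, y:y + patch, z:z + patch]
               for x in xs for y in ys for z in zs]
    return np.stack(patches)

vol = np.zeros((91, 115, 91), dtype=np.float32)
patches = extract_cubic_patches(vol)  # shape (12, 48, 48, 48)
```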

Network Architecture
Subject-Level CNNs
Subject-level CNNs (Table 2) had VGGNet-like 3D CNN structures. The CNN comprised four convolutional layers with 8, 16, 32, and 64 channels, respectively. All convolutional layers had a 3 × 3 × 3 kernel and a unit stride with zero-padding, followed by L2 regularization and rectified linear unit (ReLU) activation. Each convolutional layer was followed by a max-pooling layer. The first max-pooling layer had a size of 3 × 3 × 3 with a stride of 3; the other three max-pooling layers had a size of 2 × 2 × 2 with a stride of 2. As the tail of the CNN model, the numbers of neurons in the three fully connected (FC) layers were 1024, 128, and 2, respectively. The last FC layer determined the final probability score with a Softmax activation. We applied dropout to the first two FC layers to avoid overfitting. To conveniently compare the performance of the tested patch-level approaches with the subject-level approach, the 3D subject-level CNN was treated as the baseline model. This baseline model was trained by using a grid-search technique to find the optimal combination of hyperparameters (learning rate, batch size, dropout ratio, and number of epochs) for this architecture. The hyperparameter ranges were 1 × 10⁻⁸ to 1 × 10⁻² for the learning rate, 12 to 48 for the batch size, 0.3 to 0.8 for the dropout ratio, and 100 to 1000 for the number of epochs. In the grid search, fivefold cross-validation (CV) was performed on the training set. While changing the value of each hyperparameter, the mean accuracy (ACC) was calculated, and the hyperparameter value that maximized the mean ACC was used. The loss function was binary cross-entropy. The learned hyperparameters were as follows: the Adam optimizer had a learning rate of 1 × 10⁻⁴, the batch size was 24, the dropout ratio was 0.5, and the number of epochs was set to 300.

Table 2. CNN architecture of subject-level CNN (baseline model).
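Under our reading of Table 2, a minimal TensorFlow/Keras sketch of this baseline architecture could look like the following. The L2 strength is not reported in the paper and is an assumption; we use the two-class categorical form of cross-entropy, which is equivalent to the paper's binary cross-entropy over one-hot labels.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_subject_level_cnn(input_shape=(91, 115, 91, 1), l2=1e-4, dropout=0.5):
    """VGG-like 3D baseline CNN; l2 strength is an assumption (not in the paper)."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for i, filters in enumerate([8, 16, 32, 64]):
        # 3x3x3 conv, unit stride, zero-padding, ReLU, L2 regularization.
        x = layers.Conv3D(filters, 3, strides=1, padding="same",
                          activation="relu",
                          kernel_regularizer=regularizers.l2(l2))(x)
        # First pooling layer is 3x3x3 / stride 3; the rest 2x2x2 / stride 2.
        if i == 0:
            x = layers.MaxPool3D(pool_size=3, strides=3)(x)
        else:
            x = layers.MaxPool3D(pool_size=2, strides=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(dropout)(x)          # dropout on the first two FC layers
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```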

Image Patch-Level CNNs
The image patch-level classification framework (Figure 3) adopted a set of local image patches as the inputs. It was based on a cascaded CNN consisting of two components: patch-level subnetworks and a subject-level subnetwork. Briefly, the patch-level subnetworks were used to generate feature representations and classification scores for these image patches. Then, all the learned feature representations were integrated and processed by the subject-level subnetwork to obtain the classification result.



Patch-level subnetworks
For the patch-level subnetworks, the model architecture was basically identical to the baseline model, although the convolution kernel of the first max-pooling layer was altered to 2 × 2 × 2, and the stride was set to 2. The initial learning rate, batch size, and dropout ratio of the patch-level networks were kept the same as in the baseline model. The epoch number was set to 200 because the patch-level networks converged faster.

Subject-level subnetwork
The convolutional layers in the patch-level subnetworks worked as local feature extractors that combined low-level features into high-level features. The subject-level subnetwork comprised three FC layers, which were utilized for integrating 3D information; dropout was included to prevent overfitting of the training model, and the dropout ratio was set at 0.5. Deep features from Conv4 learned by the patch-level subnetworks were concatenated and fed into the three FC layers for subject-level classification. The numbers of neurons in the three FC layers were 2048, 512, and 2, respectively. ReLU activation and L2 regularization were added to FC1 and FC2. The last FC layer used Softmax activation to generate a subject-level classification score.
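The feature-fusion step described above can be sketched in Keras as follows. This is an illustrative assumption-laden sketch, not the authors' code: each `feature_extractor` is taken to be a pretrained model mapping one image patch to its Conv4 feature map, and the L2 strength is assumed.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cascaded_model(feature_extractors, l2=1e-4, dropout=0.5):
    """Cascade frozen patch-level feature extractors into the subject-level
    subnetwork with three FC layers of 2048, 512, and 2 neurons."""
    inputs, features = [], []
    for extractor in feature_extractors:
        extractor.trainable = False        # lock the pretrained conv layers
        inp = tf.keras.Input(shape=extractor.input_shape[1:])
        features.append(layers.Flatten()(extractor(inp)))
        inputs.append(inp)
    x = layers.Concatenate()(features)     # integrate all Conv4 patch features
    x = layers.Dense(2048, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```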

Experiments and Implementation
The data utilized in the experiment were randomly divided into training (70%), validation (10%), and testing (20%) sets, and an undersampling technique was adopted to overcome a class imbalance problem present in the testing set. During subdivision, the class distributions of those datasets were kept the same as the original class distribution. The mean voxel-wise absolute intensity differences between all GMDMs of the CN class in the testing set and the GMDMs of the AD class in the testing set were computed. For each example in the AD class, the CN instance with the minimal mean voxel-wise absolute intensity difference from it was selected, and undersampling was implemented until the class distribution in the testing set was balanced. The process of random division and undersampling was repeated 20 times, and the experimental results were averaged over the 20 tests.
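The CN-selection step can be sketched as a greedy nearest-neighbor matching (a minimal NumPy example; the paper does not state the matching order, so first-come greedy pairing without replacement is our assumption):

```python
import numpy as np

def undersample_cn(ad_maps, cn_maps):
    """For each AD test example, keep the unused CN map with the minimal mean
    voxel-wise absolute intensity difference, until the testing set is balanced.
    Greedy matching order is an assumption for illustration."""
    available = list(range(len(cn_maps)))
    kept = []
    for ad in ad_maps:
        diffs = [np.mean(np.abs(ad - cn_maps[i])) for i in available]
        kept.append(available.pop(int(np.argmin(diffs))))
    return kept
```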
All classification models used Python 3.7 as the programming language, Tensorflow 2.0 as the deep learning algorithm programming framework, and the models' training and testing were performed on a workstation equipped with an NVIDIA GeForce GTX 1080 GPU in a Windows environment.
Each patch-level subnetwork was trained separately, and the network weights were randomly initialized. In the training process of the subject-level subnetwork, we locked all the convolutional layers of the pretrained patch-level subnetworks. In the cost function calculation, balanced class weights were used so that classes were weighted inversely proportionally to their frequency in the training set. We adopted an early stopping strategy that stopped training when the validation metric showed no improvement for 20 consecutive epochs.
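The balanced class weights follow the standard inverse-frequency heuristic (the same rule as scikit-learn's "balanced" mode); a minimal sketch:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Class weights inversely proportional to class frequency:
    weight_c = n_samples / (n_classes * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

For a training set with 150 CN (label 0) and 50 AD (label 1) samples, the minority class receives three times the weight of the majority class.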
Five evaluation indices, ACC, sensitivity (SEN), specificity (SPE), F1-score, and AUC, were used in this study. TP, TN, FP, and FN denote the quantities of true positives, true negatives, false positives, and false negatives, respectively. The calculation formulas are as follows:

ACC = (TP + TN) / (TP + TN + FP + FN)
SEN = TP / (TP + FN)
SPE = TN / (TN + FP)
F1-score = 2TP / (2TP + FP + FN)

To check the statistical differences among the ACC values for all the models, we computed a one-way repeated-measures ANOVA with Tukey post hoc analysis for multiple comparisons. Greenhouse-Geisser sphericity correction was applied if Mauchly's test of sphericity indicated a violation of the sphericity assumption for the repeated-measures ANOVA tests.
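The four count-based indices can be computed directly from the confusion-matrix entries; a small self-contained sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """ACC, SEN, SPE, and F1-score from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)              # sensitivity (recall of the AD class)
    spe = tn / (tn + fp)              # specificity (recall of the CN class)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {"ACC": acc, "SEN": sen, "SPE": spe, "F1": f1}
```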

The Influence of Partition Methods
Classification performance was compared between the CNNs trained with the three partition methods and the baseline model in the AD vs. CN classification task (Table 3). All four classifiers performed well, with ACC exceeding 85%. The one-way repeated-measures ANOVA indicated that ACC was significantly affected by the partition methods (one-way ANOVA: F = 8.247, p < 0.001). The model trained with cubic (48 × 48 × 48) image patches showed the best performance in all five evaluation indices, with an ACC of 89.6%, indicating a 2.2% improvement over the baseline model. A post hoc test revealed that the model with 48 × 48 × 48 image patches had a statistically higher ACC than the other models (p < 0.01). Although a 91 × 25 × 91 image patch occupies a similar volume to a 48 × 48 × 48 image patch, the model trained with this patch shape did not achieve the same performance (ACC = 86.8%). For ROI-based image patches extracted with the hippocampus as the central region, the performance achieved (ACC = 87.6%) was comparable to that of whole-brain image patches while using only two 64 × 64 × 64 patches (left hippocampus and right hippocampus).

The Influence of Image Patch Size
The size of the patches plays an important role in the patchwise sampling strategy. We started the experiments with patches of size 24 × 24 × 24 and found that gradually increasing the size of the patches up to 64 × 64 × 64 resulted in improved performance (Table 4). The 48 × 48 × 48 image patches achieved the best performance (ACC = 89.6%). The 24 × 24 × 24 image patches were comparable to the 32 × 32 × 32 image patches in terms of ACC. There were significant differences in classification performance among the models trained with image patches of different sizes.

The Relationship between Image Patch Size and Training Sample Size
The image patch size and the training sample size show a complex relationship with model performance. When the training set was reduced by half, the model became more susceptible to overfitting, and all patch-level models and the baseline model experienced performance degradation (Table 5). Comparing the performances for the complete training set and half of the training set, the degradation was smallest for the 24 × 24 × 24 image patches, which maintained an ACC of 87.1%, and largest for the 48 × 48 × 48 image patches, whose ACC dropped by 4.1%. The AUC of the models with 32 × 32 × 32 and 64 × 64 × 64 image patches and the baseline model decreased by about 3%. After halving the training sample, there was no significant difference in ACC between the models (one-way ANOVA: F = 1.825, p = 0.136).

Table 5. Results of models based on different image patch sizes in AD vs. CN classification after halving the training sample size (mean ± standard deviation).

Discussion
In general, if the whole brain's information is used as the input with a sufficient training sample size, the 3D network with the greatest depth and width will exhibit superior performance [28]. In neuroimaging-based studies, there are normally a limited number (e.g., hundreds) of subjects with a very high number (millions) of feature dimensions, which greatly increases the risk of overfitting. It is anticipated that an optimized patchwise sampling strategy may provide a clue for improving the AD classification model; however, this has not been verified. In this study, we investigated the effect of a patchwise sampling strategy on the performance of a 3D CNN model for AD classification.

ROI Patches
Empirically defined brain regions contain features with greater discriminability. For example, the temporal lobe cortex, the amygdaloid nucleus, and the hippocampus are among the most severely affected regions in AD. Among them, hippocampal atrophy is implicated in both memory and learning. In this study, the diagnostic model that relied on predetermined hippocampal ROIs demonstrated a suboptimal effect when compared with the other models. The interpretation of the results is simple and intuitive. First, although the hippocampus is the most important region in AD, the hippocampus ROI cannot cover all possible pathological features of the whole brain; some potentially important brain regions affected by AD could be ignored. Secondly, diagnostic models based on empirically predetermined ROIs are influenced by the heterogeneity of the disease. AD is a heterogeneous disease [29], and the hippocampal area is well preserved in some AD patients [30]. For those patients, the hippocampus ROIs do not represent a good choice for image patches.

The Effect of Patch Shape
When compared with ROI patches, patches generated from random partitioning attempt to learn local-to-global feature representations for whole-brain MRI. Although cubic image patches and cuboid image patches contain basically the same quantity of voxels, their classification capacities are quite different; more specifically, cubic image patches exhibit better classification performance. The influence of the patchwise sampling strategy on the classification performance may be due to the regional distribution of the pathology in AD. In a study on AD diagnosis, Liu et al. [31] adopted a data-driven learning approach to discover disease-related anatomical landmarks. They found that many landmarks with high discriminative capability were close to each other, and those landmarks were concentrated in certain AD-related brain regions. When the volume of an image patch is fixed, a cubic shape minimizes the average Euclidean distance between any two points in the patch. For example, over 10 million Monte Carlo runs, we found that the average distance between any two points in a 48 × 48 × 48 cube was 47.63 mm, whereas the average distance between two points in a 91 × 25 × 91 cuboid was approximately 73.25 mm. This means that, for patches of the same volume, the labels of the cuboid division might be noisier than those of the cubic division. However, it must be acknowledged that the better performance of cubic patches observed in this study might not be maintained for other brain diseases.
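The cube-versus-cuboid distance comparison can be reproduced with a few lines of stdlib Python (using far fewer samples than the 10 million runs above, for speed; 1.5 mm voxels give a 72 mm cube for the 48³ patch and a 136.5 × 37.5 × 136.5 mm box for the 91 × 25 × 91 patch):

```python
import math
import random

def mean_pairwise_distance(dims_mm, n=200_000, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between two points
    drawn uniformly at random from a box with the given edge lengths (in mm)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        p = [rng.uniform(0.0, d) for d in dims_mm]
        q = [rng.uniform(0.0, d) for d in dims_mm]
        total += math.dist(p, q)
    return total / n

cube_mean = mean_pairwise_distance((72.0, 72.0, 72.0))       # ~47.6 mm
cuboid_mean = mean_pairwise_distance((136.5, 37.5, 136.5))   # ~73 mm
```

The cube estimate agrees with the known closed form for a cube of side s (Robbins constant, about 0.6617 s).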

The Relationship between Image Patch Size and Training Sample Size
Three-dimensional CNNs have the potential to retrace the success story of 2D CNNs, but they have two major drawbacks: high computational cost and the curse of dimensionality. As the problem dimensionality increases, more data are required to create a model that meets the problem requirements and criteria. For most neuroimaging studies, it is very hard to acquire a large neuroimaging dataset. One option to decrease the effect of the curse of dimensionality is to use small image patches. When an image patch is too small, however, the labels are noisy; when an image patch is too large and the number of training samples is insufficient, the CNN easily suffers from the curse of dimensionality. In terms of the input patches, several different sizes were tested in this study. Unsurprisingly, the patches of medium size (48 × 48 × 48) achieved the best performance for the whole training set. Furthermore, the performance of the model was also evaluated when the training data were reduced by half. The model's performance degraded no matter what size of image patch was used as the input, and the degradation was smallest for the smallest image patches (24 × 24 × 24). The image patch size should therefore be reduced to mitigate the curse of dimensionality when the training sample is small.

Performance Comparison
Although the goal of this study was not to develop a superior model, it is intended to provide guidance to researchers who wish to develop a superior 3D patch-based CNN model with an appropriate patch sampling strategy. Table 6 shows a comparison between this study and state-of-the-art approaches. Our model is comparable to state-of-the-art approaches, especially considering that we used a more rigorous experimental design, employed an undersampling strategy to balance classes in the testing set, and selected only CN subjects that are difficult to distinguish from AD.

Limitations
It must be acknowledged that the present study has a number of limitations. First, some image patches have limited discriminative power, which means that selecting appropriate image patches may boost the performance of the model; therefore, some of the conclusions drawn in this study may need to be revisited under a network pruning strategy. Secondly, the batch size controls the number of samples propagating through CNNs during training, and in the current implementation, the training batch size is fixed. Tuning the batch size affects the training loss curve and computational efficiency; as image patches get smaller, the batch size during training can be increased, which could make the training more efficient. Thirdly, given that the structural changes caused by AD vary with the severity of the disease, it would be useful to boost the model's performance by extending the method to multiscale image patches. Fourth, because many researchers [17,22,27] prefer the relatively old but structurally simple and highly efficient VGG architecture to implement their patch-based 3D CNNs, we investigated VGG-like 3D CNNs in this study. The experimental results can be theoretically supported by the 2D theoretical framework [19], but whether they can be extrapolated to more complex 3D models requires further experimental validation. Fifth, building a classification model that remains robust on new and unseen data highlights the need for more public datasets for AD research.

Conclusions
By mining the AD database with an optimal CNN model, a valuable computer-aided AD diagnosis system becomes promising and feasible for future clinical use. In this study, we investigated the effect of a patchwise sampling strategy for 3D CNN-based AD classification. When the pathological features are concentrated in specific brain regions, the empirically predetermined ROI patches are optimal, which can be verified by studying other homogeneous diseases in the future. The experimental results show that the cubic image patches perform better than the cuboid image patches in classifying AD, which is probably related to the regional distribution of pathology in AD. In this study, the 48 × 48 × 48 image patches performed best for the entire training set, and the 24 × 24 × 24 image patches performed best for one half of the training set. The size of the image patch should be determined based on the size of the training sample to compensate for noisy labels and the curse of dimensionality.
Author Contributions: Conceptualization, L.L. and S.W.; methodology, L.L. and X.S.; validation, X.X.; formal analysis, X.S.; writing, X.S.; supervision, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement: The dataset is owned by a third-party organization, the Alzheimer's Disease Neuroimaging Initiative (ADNI). Data are publicly and freely available from http://adni.loni.usc.edu/data-samples/access-data/ (accessed on 1 January 2022) via the Institutional Data Access/Ethics Committee (contact via http://adni.loni.usc.edu/data-samples/access-data/ (accessed on 1 January 2022)) upon sending a request that includes the proposed analysis and the named lead.