1. Introduction
Defining an adequate unit size is a crucial step in brain imaging analysis, where data is inherently complex, high-dimensional, and computationally intensive. Unit size refers to the spatial resolution at which the brain is analyzed, ranging from individual voxels to aggregated regions such as regions of interest (ROIs). The choice of unit size has substantial implications for both the interpretability of results and the computational feasibility of downstream analyses.
In this paper, we investigate a resolution-free metric for data aggregation (combining data points) in brain imaging, which condenses large three-dimensional voxel data into more manageable unit sizes. Maintaining the distributional properties of the data during aggregation or fission [1] is not the focus of this paper. Rather, we aim to propose a relevant criterion for adjusting resolution to determine an appropriate unit size for subsequent analysis, such as clustering. Ultimately, this approach requires no prior knowledge of brain structures and may therefore be suitable for initial exploratory analysis.
While ROI-based approaches efficiently reduce dimensionality by averaging signals within predefined brain regions [2], they require prior knowledge and may obscure localized effects. This limitation makes them less suitable for initial exploratory analyses, where a more flexible, data-driven approach is needed. Voxel-based methods, such as voxel-based morphometry (VBM) [3], provide fine-grained spatial resolution, enabling detailed anatomical analysis. However, they often suffer from high computational costs and increased susceptibility to noise. Grid-based methods, by contrast, segment the brain into uniform cubic regions, offering a straightforward approach to defining unit sizes. While fine-grained grids preserve detailed information, they significantly increase the number of regions, making clustering, machine learning, and statistical modeling computationally expensive and memory-intensive. Conversely, overly large grids obscure finer-scale features, reducing sensitivity to localized effects.
An optimal unit size must balance the trade-off between preserving relevant signal patterns and maintaining computational efficiency. Thus, we propose two essential conditions for defining an adequate unit size in brain imaging:
- i. Minimizing signal dilution for detecting meaningful regional differences: The unit size should be small enough to capture localized biological variations that distinguish different phenotypic traits or pathological conditions. Overly large units may obscure these critical differences, reducing the ability to accurately characterize relevant biological patterns.
- ii. Optimizing computational efficiency: The unit size should be large enough to prevent unnecessary computational overhead. Brain imaging datasets, such as those from functional magnetic resonance imaging (fMRI) or diffusion tensor imaging (DTI), often contain millions of data points. Aggregating adjacent voxels into larger units can reduce the computational burden while preserving data fidelity.
To illustrate the importance of computational efficiency, consider the $k$-means clustering algorithm, commonly used in brain imaging for segmentation and pattern identification. Each iteration of $k$-means requires $O(nkd)$ computations, where $n$ is the number of data points, $k$ is the number of clusters, and $d$ is the data dimensionality. Given the NP-hard nature of $k$-means [4], the problem becomes computationally intractable as $n$ increases. Doubling the number of units approximately doubles the computational time, making unit size a critical factor in large-scale imaging studies. Careful aggregation of voxels into appropriate units can significantly improve computational efficiency without losing meaningful information.
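As a concrete illustration (a minimal sketch we add here; the voxel count, block size, and cluster count are arbitrary assumptions, not values from the paper), aggregating consecutive voxels shrinks $n$ and, with it, the runtime of scikit-learn's k-means:

```python
# Minimal sketch: how aggregating voxels reduces k-means cost.
# Assumes scikit-learn; array shapes and block size are illustrative.
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_voxels, d = 240_000, 3          # illustrative voxel count and feature count
X = rng.normal(size=(n_voxels, d))

# Aggregate b consecutive voxels into one unit by averaging.
b = 8
X_agg = X[: (n_voxels // b) * b].reshape(-1, b, d).mean(axis=1)

for name, data in [("voxel-level", X), ("aggregated", X_agg)]:
    t0 = time.perf_counter()
    KMeans(n_clusters=5, n_init=3, random_state=0).fit(data)
    print(f"{name}: n={len(data)}, {time.perf_counter() - t0:.2f}s")
```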
Another key challenge in brain imaging is the need for a scale- and sample-size-independent metric for evaluating unit size. Such a metric should remain invariant to the merging or splitting of subunits and be adaptable to variations in spatial resolution due to differences in imaging modalities or preprocessing pipelines. At the same time, it must be sensitive enough to indicate when unit size is optimal for capturing relevant neural activity or structural characteristics.
In this paper, we propose a novel framework for determining a desirable unit size that balances the detection of meaningful patterns relevant to the disease of interest against computational efficiency. We evaluate the Calinski–Harabasz index [5] as a metric for guiding data aggregation in brain imaging analysis, focusing in particular on its stability with respect to varying sample sizes. Our goal is to develop a preprocessing framework that simplifies high-dimensional brain imaging data by determining proper unit sizes, reducing the computational burden while preserving essential information for downstream analyses. By selecting an appropriate unit size, we seek to enhance the efficiency and interpretability of brain imaging studies, ultimately facilitating deeper insights into neural function at meaningful spatial scales.
2. Methods
Suppose we have groups of interest, and our objective is to identify brain regions that most clearly distinguish among them within a supervised framework. For instance, we may be interested in identifying brain regions that differentiate case and control groups. To achieve this, we use a clustering approach to uncover regions most strongly associated with group-level differences. Minimizing signal dilution in this context translates to determining an optimal unit size that effectively captures differences between groups. However, as data aggregation is performed, the dataset size continuously changes. Therefore, it is crucial to establish a robust metric that is minimally influenced by changes in sample size.
Herein, we propose the Calinski–Harabasz index (CHI) as such a metric. In its original role, the index provides a robust way to quantify the separation between clusters while accounting for within-cluster variance, making it a valuable metric for evaluating clustering performance [5]. We provide a detailed explanation and demonstrate analytically that, in the absence of true group differences, the CHI remains independent of sample size. Conversely, when genuine group differences exist, the CHI clearly and distinctly highlights these differences at or around appropriate unit sizes.
Let $x_{ij}$ represent the $j$-th data point in group $i$, where each data point is a $p$-dimensional vector with covariance matrix $\Sigma$. The full data set is represented by $X$ of dimension $n \times p$. We express the transposed data set as $X^{\top} = (X_1^{\top}, \ldots, X_K^{\top})$, where $X_i$ is the data in group $i$ ($i = 1, \ldots, K$). Let $C_i$ denote the set of data points in group $i$, with cardinality $n_i$. Thus, the total number of data points in the data set is $n = \sum_{i=1}^{K} n_i$. The mean vector of group $i$ is $\bar{x}_i = X_i^{\top} \mathbf{1}_{n_i} / n_i$, where $\mathbf{1}_{n_i}$ is the $n_i$-dimensional all-ones vector. Similarly, the overall mean vector of the data set is $\bar{x} = X^{\top} \mathbf{1}_{n} / n$. The CHI is expressed as the ratio
$$\mathrm{CHI} = \frac{\sum_{i=1}^{K} n_i \left\| \bar{x}_i - \bar{x} \right\|^2 / (K-1)}{\sum_{i=1}^{K} \sum_{j \in C_i} \left\| x_{ij} - \bar{x}_i \right\|^2 / (n-K)},$$
where $\| \cdot \|$ denotes the Euclidean $L_2$ norm. Alternatively, the CHI is expressed as
$$\mathrm{CHI} = \frac{\operatorname{tr}(B) / (K-1)}{\operatorname{tr}(W) / (n-K)},$$
where $B = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{\top}$, $W = \sum_{i=1}^{K} \sum_{j \in C_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^{\top}$, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Since the CHI is defined using the traces of scatter matrices, it is well known to be invariant under orthogonal transformations of the data. We state this property in Lemma 1; a proof is given in Appendix A.
Lemma 1. The Calinski–Harabasz index value remains unchanged under any orthogonal transformation of the data.
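As a numerical companion to the definition and Lemma 1 (our own sketch; the group sizes and dimension are illustrative), the following code computes the CHI from the scatter-matrix form above, checks it against scikit-learn's calinski_harabasz_score, and confirms invariance under a random orthogonal transformation:

```python
# Numerical check of the CHI definition and of Lemma 1 (sketch;
# group sizes and dimension are illustrative assumptions).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
K, n_i, p = 3, 50, 4
X = np.vstack([rng.normal(loc=i, size=(n_i, p)) for i in range(K)])
labels = np.repeat(np.arange(K), n_i)
n = len(X)

# CHI from the scatter-matrix form: [tr(B)/(K-1)] / [tr(W)/(n-K)].
xbar = X.mean(axis=0)
tr_B = sum(n_i * np.sum((X[labels == i].mean(axis=0) - xbar) ** 2)
           for i in range(K))
tr_W = sum(np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
           for i in range(K))
chi = (tr_B / (K - 1)) / (tr_W / (n - K))
assert np.isclose(chi, calinski_harabasz_score(X, labels))

# Lemma 1: the CHI is unchanged by an orthogonal transformation of the data.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
assert np.isclose(chi, calinski_harabasz_score(X @ Q, labels))
```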
For the data aggregation, we consider the following setup. Aggregated data for group $i$ are denoted as $Y_i = (y_{i1}, \ldots, y_{i m_i})^{\top}$, where $y_{il}$ is $p$-dimensional and $b$ is the number of original data points aggregated, without replacement, into a single data point, i.e., $y_{il} = \frac{1}{b} \sum_{j \in S_{il}} x_{ij}$, where $S_{il}$, $l = 1, \ldots, m_i$, is a subset of $C_i$ with cardinality $b$. The total number of aggregated data points across all groups is given by $m = \sum_{i=1}^{K} m_i$. At the element level, $y_{il}$ is expressed as $y_{il} = (y_{il1}, \ldots, y_{ilp})^{\top}$ with $y_{ilv} = \frac{1}{b} \sum_{j \in S_{il}} x_{ijv}$. Technically, the subset $S_{il}$ can be chosen randomly from the index set $C_i$, but, for simplicity, we consider aggregations carried out over consecutive indices. If $n_i$ is not divisible by $b$, only the last element is formed by fewer than $b$ original data points. Without loss of generality, we assume that $n_i$ is divisible by $b$, so that all aggregated data points comprise $b$ original data points. Consequently, the variance of an aggregated data point is $\Sigma / b$. For the subsequent argument, we assume that $x_{ij}$ has a multivariate normal distribution with mean $\mu_i$, for all $i$ and $j$. Now, we state the following result.
Theorem 1. For independent observations following a multivariate normal distribution $N_p(\mu_i, \Sigma_i)$, suppose that $\mu_i = \mu$ and $\Sigma_i = \Sigma$ for all $i$. Then, the $r$-th moment of the CHI, for a positive integer $r$, is asymptotically independent of the sample size.
The proof is given in Appendix A. Theorem 1 states that the rate of convergence of the moments of the CHI is independent of the sample size under data aggregation, provided all groups have the same distribution; that is, aggregating the data will not change the CHI's distribution asymptotically. Regarding the use of the CHI, it is important to note that the assumption of independence between observations is not strictly required. In practice, this assumption is often unrealistic, particularly in structured or correlated data settings. However, our theorem demonstrates that the CHI remains relatively stable and less sensitive to variations in sample size, suggesting its usefulness for comparisons across different sample-size conditions.
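The following sketch (our own construction with illustrative sizes, not the paper's code) illustrates this stability empirically: with two identically distributed groups, the average CHI stays near 1 across aggregation levels $b$:

```python
# Empirical look at Theorem 1's message: under identical group
# distributions, the average CHI barely moves as consecutive points
# are aggregated (sketch with illustrative sizes).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(4)
n_i, p, K = 2400, 3, 2
for b in [1, 2, 4, 8, 16]:
    vals = []
    for _ in range(200):
        X = rng.normal(size=(K * n_i, p))       # both groups share N(0, I)
        Y = X.reshape(-1, b, p).mean(axis=1)    # aggregate b consecutive points
        labels = np.repeat(np.arange(K), n_i // b)
        vals.append(calinski_harabasz_score(Y, labels))
    print(b, np.mean(vals))                     # stays near 1 for every b
```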
Now, with the original data, consider only one variate $v$ among the $p$ variates, $v \in \{1, \ldots, p\}$. We define
$$\mathrm{CHI}_v = \frac{\sum_{i=1}^{K} n_i (\bar{x}_{iv} - \bar{x}_v)^2 / (K-1)}{\hat{\sigma}_v^2},$$
where $\hat{\sigma}_v^2 = \sum_{i=1}^{K} \sum_{j \in C_i} (x_{ijv} - \bar{x}_{iv})^2 / (n-K)$ is the unbiased estimator of $\sigma_v^2$. Suppose that group $i^{*}$ among the $K$ groups consists of a mixture of two distributions, i.e., $x_{i^{*}j} \sim N_p(\mu_{i^{*}}, \Sigma)$ for $j \in C_{i^{*}}^{(1)}$ and $x_{i^{*}j} \sim N_p(\mu_{i^{*}} + \delta, \Sigma)$ for $j \in C_{i^{*}}^{(2)}$, where $\delta \neq 0$. Under these conditions, we have the following result.
Theorem 2. Consider the same distributional assumptions as in Theorem 1, except that the data or image of group $i^{*}$ consists of the two mixed distributions. Then, for sufficiently large $n$, the expectation of $\mathrm{CHI}_v$ is strictly smaller than it would be in the absence of the mixture.
Theorem 2 can be proven readily from standard ANOVA results [6]. Specifically, when a group of the data consists of mixed distributions, the estimator $\hat{\sigma}_v^2$ is based on a reduced, or underspecified, model. In that circumstance, the underspecified model yields an inflated variance estimate, which leads to the result stated in Theorem 2, using the fact that the numerator and the denominator of the CHI are independent. The detailed proof is given in Appendix A. Now, the following corollary extends the result in Theorem 2 to the general $p$-variate case.
Corollary 1. Consider the same distributional assumptions as in Theorem 1, except that the data or image of group $i^{*}$ consists of the two mixed distributions. Then, for sufficiently large $n$, the expectation of the CHI is strictly smaller than it would be in the absence of the mixture.
The proof is given in Appendix A. Corollary 1 indicates that the CHI statistic yields smaller values in the presence of mixed distributions than when these mixtures are absent. Data aggregation can increase the presence of mixed distributions, as illustrated in Figure 1. Data initially consisting of two distributions become a mixture of several distributions after aggregation, causing a decrease in the CHI statistic due to model underspecification. On the other hand, as aggregation progresses, the number of mixed distributions may gradually decline because of the reduced number of data points.
Conversely, if the data undergoes splitting starting from the optimal unit size, the number of data points increases, preserving the two distinct distributions. This splitting process leads to a sharp increase in the CHI values. The simulation in the next section demonstrates this phenomenon, showing a sharp rise or “elbow point” in the CHI values around the optimal unit size.
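Before turning to the full simulation, the mechanism behind Corollary 1 can be illustrated numerically (a sketch of our own; the mixture components and sizes are illustrative choices, constructed so the group mean is unchanged while the within-group variance inflates):

```python
# Illustration of Corollary 1's message: with the group mean held fixed,
# mixing two shifted components within one group inflates the within-group
# variance and deflates the CHI (sketch; sizes are illustrative).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(5)
n_i, p = 2000, 3
labels = np.repeat([0, 1], n_i)
g0 = rng.normal(0.0, 1.0, (n_i, p))

# No mixture: group 1 drawn from a single shifted distribution, mean 1.
pure = np.vstack([g0, rng.normal(1.0, 1.0, (n_i, p))])

# Mixture: group 1 is half N(0, I) and half N(2, I) -- group mean still 1.
mixed = np.vstack([g0,
                   rng.normal(0.0, 1.0, (n_i // 2, p)),
                   rng.normal(2.0, 1.0, (n_i // 2, p))])

print(calinski_harabasz_score(pure, labels))    # larger
print(calinski_harabasz_score(mixed, labels))   # smaller: inflated variance
```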
3. Simulation
We investigate the behavior of the CHI under data aggregation through simulation. In this simulation, we compare two groups of synthetic 3D image data (a control image and a case image), where the control and case images differ not at the pixel level but at the level of certain blocks of uniformly sized squares. Both image groups consist of synthetic image data containing 3600 pixels arranged in a $60 \times 60$ grid. Every pixel contains three features ($p = 3$), independently generated from a multivariate normal distribution. Initially, both the control and case images are independently generated with identical distributions. The mean vector is set to $(1, 1, 2)$ and the covariance matrix is constructed with a correlation of 0.2 between features and a diagonal standard deviation of 1. Contamination is then introduced into the case image as follows. In the first scenario, certain blocks within the case image are replaced with data generated from a shifted mean vector $(2, 2, 3)$. The contaminated regions consist of $4 \times 4$ pixel blocks, resulting in 225 candidate blocks for contamination. From these 225 blocks, 10 to 30% are randomly selected for contamination. This localized disturbance creates differences between the control and case images at the block level.
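A sketch of this data-generating process, reconstructed from the description above (variable names are ours, and the 20% contamination rate is one value from the stated 10 to 30% range), might look as follows:

```python
# Sketch of the first-scenario synthetic images: a clean control image and
# a case image with randomly contaminated 4x4 blocks (names are ours).
import numpy as np

rng = np.random.default_rng(3)
side, block, p = 60, 4, 3                       # 60x60 grid, 4x4 blocks
mu0, mu1 = np.array([1.0, 1.0, 2.0]), np.array([2.0, 2.0, 3.0])
cov = np.full((p, p), 0.2)                      # correlation 0.2, sd 1
np.fill_diagonal(cov, 1.0)

def make_image(mu):
    # One (60, 60, 3) image of i.i.d. multivariate normal pixels.
    return rng.multivariate_normal(mu, cov, size=(side, side))

control = make_image(mu0)
case = make_image(mu0)

# Contaminate a random 20% of the 15x15 = 225 candidate blocks.
n_blocks = side // block
idx = rng.choice(n_blocks**2, size=int(0.2 * n_blocks**2), replace=False)
for k in idx:
    r, c = divmod(k, n_blocks)
    case[r*block:(r+1)*block, c*block:(c+1)*block] = \
        rng.multivariate_normal(mu1, cov, size=(block, block))
```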
The visualization (
Figure 2) extracts the first feature from each image and displays it using a grayscale gradient, with dark pixels indicating low values and bright pixels indicating high values. In the figure, the control image appears as a relatively uniform grayscale texture with random noise, while the case image shows some brighter patches corresponding to contaminated blocks.
We perform an analysis by dividing each image into square blocks of varying side lengths: 1, 2, 4, 5, 10, 15, and 20 pixels. For each block size, we calculate the mean values of the features within each block, resulting in reduced data matrices for both image groups. We then compute the CHI for these aggregated blocks to assess how contamination affects the CHI at various aggregation levels.
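The block-aggregation analysis can be sketched as follows (again our own reconstruction, reusing the control and case arrays from the previous sketch); an elbow in the printed CHI values is expected to appear near block size 4:

```python
# Sketch of the block-aggregation analysis: average features within each
# non-overlapping block, pool control and case blocks, and score their
# separation with the CHI. Reuses `control` and `case` from above.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def block_means(img, s):
    """Average an (H, W, p) image over non-overlapping s x s blocks."""
    h, w, p = img.shape
    return img.reshape(h // s, s, w // s, s, p).mean(axis=(1, 3)).reshape(-1, p)

for s in [1, 2, 4, 5, 10, 15, 20]:
    A, B = block_means(control, s), block_means(case, s)
    X = np.vstack([A, B])
    labels = np.r_[np.zeros(len(A)), np.ones(len(B))]
    print(s, calinski_harabasz_score(X, labels))
```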
The changes in the CHI values resulting from varying levels of data aggregation are displayed in Figure 3A, based on 100 simulations for each plot. Figure 3A displays the CHI value changes across different contamination rates, with the contamination unit size fixed at $4 \times 4$. When there is no difference in distributions between the control and case images (i.e., in the absence of contamination), the CHI values remain relatively low, with no notable patterns observed. When contaminated blocks are present in the case images, the figure clearly shows a distinctive pattern, where the CHI values increase substantially during data splitting (moving from right to left on the x-axis in Figure 3A). Clear elbows appear at 4 on the x-axis, even with relatively low contamination rates (e.g., 10%). When the level of contamination is higher, the overall patterns remain similar, except that the elbow points become more pronounced.
As another scenario, we also vary the unit size of contamination; i.e., the side lengths of the square contamination units are set to 1, 5, 10, and 20, while the overall contamination rate remains fixed at 30%. Figure 3B shows the CHI value changes across different contamination unit sizes. Clear elbows appear at the respective contamination unit sizes. Notably, when the unit size is large, there is a gradual increase without a distinctive elbow, as explained under Corollary 1.
These simulations are based on assumptions that closely align with those of ANOVA. For more realistic conditions, we also consider scenarios that violate these assumptions. Specifically, we investigate changes in CHI values under a setting similar to that in Figure 3A, except that the variance between the two groups differs (Figure A1 in Appendix B) or adjacent observations exhibit spatial correlation (Figure A2 in Appendix B). We observe that distinctive elbows still appear at the expected locations, though they are less pronounced, demonstrating some robustness of the results under violations of the ANOVA assumptions. Notably, when spatial correlation exists, the plot may show gradual increases as the unit size decreases, even in the absence of contamination (Figure A2). This indicates that the gradual increase of the CHI with decreasing unit size can produce false signals when background spatial correlation exists.
The simulation supports our theoretical findings, demonstrating that units of contamination clearly manifest as elbow points. It provides insights into how contamination and block-level data structure influence the variability of the CHI, highlighting the detectability of proper unit sizes in image data.
5. Conclusions
In this study, we proposed a data aggregation approach that combines voxel data into manageable units to improve computational efficiency, while preserving data utility relevant to group-level differences. Our approach provides a straightforward guideline for selecting an upper threshold of unit sizes appropriate for clustering and other analyses.
We emphasize the importance of a scale- and sample-size-independent metric for evaluating unit size. Such a metric must remain invariant to changes in resolution or sample size, allowing consistent comparisons across datasets with varying spatial properties. We demonstrated this principle through simulations of image data aggregation, assessing how the CHI responds to aggregation across different block sizes. Our findings reveal that, in the absence of contamination, the CHI values remained stable without showing distinct patterns. However, when contaminated regions were introduced, the CHI values showed a clear elbow point at the contamination unit size.
Applying this method to DTI data, we found some consistency in the cluster definitions between the unit size selected by the CHI and a smaller unit size; however, the age relevance of the identified clusters remained inconclusive at the unit size indicated by the CHI values. This reveals some limitations in real-world data analysis, particularly when the signal from the DTI metrics is weak. While the unit size determination provides a useful guideline for balancing resolution and computational efficiency, finer-grained analysis may still be worthwhile, as it can yield more detailed and interpretable insights into specific brain regions.
In conclusion, this study presents a method for defining and evaluating unit size in brain imaging analysis, with the goal of enhancing computational efficiency and scalability. Our approach provides a practical guideline: the unit size can be selected below the elbow point indicated by the CHI, balancing detail against practical implementation. By proposing data aggregation guided by the CHI, we offer a foundation for effective image analysis, which may support future methodological developments in the investigation of neural and structural brain characteristics.