1. Introduction
Defining an adequate unit size is a crucial step in brain imaging analysis, where data is inherently complex, high-dimensional, and computationally intensive. Unit size refers to the spatial resolution at which the brain is analyzed, ranging from individual voxels to aggregated regions such as regions of interest (ROIs). The choice of unit size has substantial implications for both the interpretability of results and the computational feasibility of downstream analyses.
In this paper, we investigate a resolution-free metric for data aggregation (combining data points) in brain imaging, which condenses large three-dimensional voxel data into more manageable unit sizes. Maintaining the distributional properties of the data during aggregation or fission [1] is not the focus of this paper. Rather, we aim to propose a relevant criterion for adjusting resolution to determine an appropriate unit size for subsequent analysis, such as clustering. Ultimately, this approach requires no prior knowledge of brain structures and may therefore be suitable for initial exploratory analysis.
While ROI-based approaches efficiently reduce dimensionality by averaging signals within predefined brain regions [2], they require prior knowledge and may obscure localized effects. This limitation makes them less suitable for initial exploratory analyses, where a more flexible, data-driven approach is needed. Voxel-based methods, such as voxel-based morphometry (VBM) [3], provide fine-grained spatial resolution, enabling detailed anatomical analysis. However, they often suffer from high computational costs and increased susceptibility to noise. Grid-based methods, by contrast, segment the brain into uniform cubic regions, offering a straightforward approach to defining unit sizes. While fine-grained grids preserve detailed information, they significantly increase the number of regions, making clustering, machine learning, and statistical modeling computationally expensive and memory-intensive. Conversely, overly large grids obscure finer-scale features, reducing sensitivity to localized effects.
An optimal unit size must balance the trade-off between preserving relevant signal patterns and maintaining computational efficiency. Thus, we propose two essential conditions for defining an adequate unit size in brain imaging:
- i. Minimizing signal dilution for detecting meaningful regional differences: The unit size should be small enough to capture localized biological variations that distinguish different phenotypic traits or pathological conditions. Overly large units may obscure these critical differences, reducing the ability to accurately characterize relevant biological patterns.
- ii. Optimizing computational efficiency: The unit size should be large enough to prevent unnecessary computational overhead. Brain imaging datasets, such as those from functional magnetic resonance imaging (fMRI) or diffusion tensor imaging (DTI), often contain millions of data points. Aggregating adjacent voxels into larger units can reduce the computational burden while preserving data fidelity.
To illustrate the importance of computational efficiency, consider the $k$-means clustering algorithm, commonly used in brain imaging for segmentation and pattern identification. Each iteration of $k$-means requires $O(nkd)$ computations, where $n$ is the number of data points, $k$ is the number of clusters, and $d$ is the data dimensionality. Given the NP-hard nature of $k$-means [4], the problem becomes computationally intractable as $n$ increases. Doubling the number of units approximately doubles the computational time, making unit size a critical factor in large-scale imaging studies. Careful aggregation of voxels into appropriate units can significantly improve computational efficiency without losing meaningful information.
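As a concrete illustration (a minimal sketch we add here; the voxel count, block size, and cluster count are arbitrary assumptions, not values from the paper), aggregating consecutive voxels shrinks $n$ and, with it, the runtime of scikit-learn's k-means:

```python
# Minimal sketch: how aggregating voxels reduces k-means cost.
# Assumes scikit-learn; array shapes and block size are illustrative.
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_voxels, d = 240_000, 3          # illustrative voxel count and feature count
X = rng.normal(size=(n_voxels, d))

# Aggregate b consecutive voxels into one unit by averaging.
b = 8
X_agg = X[: (n_voxels // b) * b].reshape(-1, b, d).mean(axis=1)

for name, data in [("voxel-level", X), ("aggregated", X_agg)]:
    t0 = time.perf_counter()
    KMeans(n_clusters=5, n_init=3, random_state=0).fit(data)
    print(f"{name}: n={len(data)}, {time.perf_counter() - t0:.2f}s")
```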
Another key challenge in brain imaging is the need for a scale- and sample-size-independent metric for evaluating unit size. Such a metric should remain invariant to the merging or splitting of subunits and be adaptable to variations in spatial resolution due to differences in imaging modalities or preprocessing pipelines. At the same time, it must be sensitive enough to indicate when unit size is optimal for capturing relevant neural activity or structural characteristics.
In this paper, we propose a novel framework for determining a desirable unit size that balances the detection of meaningful patterns relevant to the disease of interest against computational efficiency. We evaluate the Calinski–Harabasz index [5] as a metric for guiding data aggregation in brain imaging analysis, focusing in particular on its stability with respect to varying sample sizes. Our goal is to develop a preprocessing framework that simplifies high-dimensional brain imaging data by determining proper unit sizes, reducing the computational burden while preserving essential information for downstream analyses. By selecting an appropriate unit size, we seek to enhance the efficiency and interpretability of brain imaging studies, ultimately facilitating deeper insights into neural function at meaningful spatial scales.
2. Methods
Suppose we have groups of interest, and our objective is to identify brain regions that most clearly distinguish among them within a supervised framework. For instance, we may be interested in identifying brain regions that differentiate case and control groups. To achieve this, we use a clustering approach to uncover regions most strongly associated with group-level differences. Minimizing signal dilution in this context translates to determining an optimal unit size that effectively captures differences between groups. However, as data aggregation is performed, the dataset size continuously changes. Therefore, it is crucial to establish a robust metric that is minimally influenced by changes in sample size.
Herein, we propose the Calinski–Harabasz index (CHI) as such a metric. In its original role, the index provides a robust way to quantify the separation between clusters while accounting for within-cluster variance, making it a valuable metric for evaluating clustering performance [5]. We provide a detailed explanation and demonstrate analytically that, in the absence of true group differences, the CHI remains independent of sample size. Conversely, when genuine group differences exist, the CHI clearly and distinctly highlights these differences at or around appropriate unit sizes.
Let $x_{ij}$ represent the $j$-th data point in group $i$, where each data point is a $p$-dimensional vector with covariance matrix $\Sigma$. The full data set is represented by $X$ of dimension $n \times p$. We express the transposed data set as $X^{\top} = (X_1^{\top}, \ldots, X_K^{\top})$, where $X_i$ is the data in group $i$ ($i = 1, \ldots, K$). Let $C_i$ denote the set of data points in group $i$, with cardinality $n_i$. Thus, the total number of data points in the data set is $n = \sum_{i=1}^{K} n_i$. The mean vector of group $i$ is $\bar{x}_i = X_i^{\top} \mathbf{1}_{n_i} / n_i$, where $\mathbf{1}_{n_i}$ is the $n_i$-dimensional all-ones vector. Similarly, the overall mean vector of the data set is $\bar{x} = X^{\top} \mathbf{1}_{n} / n$. The CHI is expressed as the ratio
$$\mathrm{CHI} = \frac{\sum_{i=1}^{K} n_i \left\| \bar{x}_i - \bar{x} \right\|^2 / (K-1)}{\sum_{i=1}^{K} \sum_{j \in C_i} \left\| x_{ij} - \bar{x}_i \right\|^2 / (n-K)},$$
where $\| \cdot \|$ denotes the Euclidean $L_2$ norm. Alternatively, the CHI is expressed as
$$\mathrm{CHI} = \frac{\operatorname{tr}(B) / (K-1)}{\operatorname{tr}(W) / (n-K)},$$
where $B = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{\top}$, $W = \sum_{i=1}^{K} \sum_{j \in C_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^{\top}$, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Since the CHI is defined using the traces of scatter matrices, it is well known to be invariant under orthogonal transformations of the data. We state this property in Lemma 1; a proof is given in Appendix A.
Lemma 1. The Calinski–Harabasz index value remains unchanged under any orthogonal transformation of the data.
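As a numerical companion to the definition and Lemma 1 (our own sketch; the group sizes and dimension are illustrative), the following code computes the CHI from the scatter-matrix form above, checks it against scikit-learn's calinski_harabasz_score, and confirms invariance under a random orthogonal transformation:

```python
# Numerical check of the CHI definition and of Lemma 1 (sketch;
# group sizes and dimension are illustrative assumptions).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
K, n_i, p = 3, 50, 4
X = np.vstack([rng.normal(loc=i, size=(n_i, p)) for i in range(K)])
labels = np.repeat(np.arange(K), n_i)
n = len(X)

# CHI from the scatter-matrix form: [tr(B)/(K-1)] / [tr(W)/(n-K)].
xbar = X.mean(axis=0)
tr_B = sum(n_i * np.sum((X[labels == i].mean(axis=0) - xbar) ** 2)
           for i in range(K))
tr_W = sum(np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
           for i in range(K))
chi = (tr_B / (K - 1)) / (tr_W / (n - K))
assert np.isclose(chi, calinski_harabasz_score(X, labels))

# Lemma 1: the CHI is unchanged by an orthogonal transformation of the data.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
assert np.isclose(chi, calinski_harabasz_score(X @ Q, labels))
```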
For the data aggregation, we consider the following setup. Aggregated data for group $i$ are denoted as $Y_i = (y_{i1}, \ldots, y_{i m_i})^{\top}$, where $y_{il}$ is $p$-dimensional and $b$ is the number of original data points aggregated, without replacement, into a single data point, i.e., $y_{il} = \frac{1}{b} \sum_{j \in S_{il}} x_{ij}$, where $S_{il}$, $l = 1, \ldots, m_i$, is a subset of $C_i$ with cardinality $b$. The total number of aggregated data points across all groups is given by $m = \sum_{i=1}^{K} m_i$. At the element level, $y_{il}$ is expressed as $y_{il} = (y_{il1}, \ldots, y_{ilp})^{\top}$ with $y_{ilv} = \frac{1}{b} \sum_{j \in S_{il}} x_{ijv}$. Technically, the subset $S_{il}$ can be chosen randomly from the index set $C_i$, but, for simplicity, we consider aggregations carried out over consecutive indices. If $n_i$ is not divisible by $b$, only the last element is formed by fewer than $b$ original data points. Without loss of generality, we assume that $n_i$ is divisible by $b$, so that all aggregated data points comprise $b$ original data points. Consequently, the variance of an aggregated data point is $\Sigma / b$. For the subsequent argument, we assume that $x_{ij}$ has a multivariate normal distribution with mean $\mu_i$, for all $i$ and $j$. Now, we state the following result.
Theorem 1. For independent observations following a multivariate normal distribution $N_p(\mu_i, \Sigma_i)$, suppose that $\mu_i = \mu$ and $\Sigma_i = \Sigma$ for all $i$. Then, the $r$-th moment of the CHI, for a positive integer $r$, is asymptotically independent of the sample size.
The proof is given in Appendix A. Theorem 1 states that the rate of convergence of the moments of the CHI is independent of the sample size under data aggregation, provided all groups have the same distribution; that is, aggregating the data will not change the CHI's distribution asymptotically. Regarding the use of the CHI, it is important to note that the assumption of independence between observations is not strictly required. In practice, this assumption is often unrealistic, particularly in structured or correlated data settings. However, our theorem demonstrates that the CHI remains relatively stable and less sensitive to variations in sample size, suggesting its usefulness for comparisons across different sample-size conditions.
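The following sketch (our own construction with illustrative sizes, not the paper's code) illustrates this stability empirically: with two identically distributed groups, the average CHI stays near 1 across aggregation levels $b$:

```python
# Empirical look at Theorem 1's message: under identical group
# distributions, the average CHI barely moves as consecutive points
# are aggregated (sketch with illustrative sizes).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(4)
n_i, p, K = 2400, 3, 2
for b in [1, 2, 4, 8, 16]:
    vals = []
    for _ in range(200):
        X = rng.normal(size=(K * n_i, p))       # both groups share N(0, I)
        Y = X.reshape(-1, b, p).mean(axis=1)    # aggregate b consecutive points
        labels = np.repeat(np.arange(K), n_i // b)
        vals.append(calinski_harabasz_score(Y, labels))
    print(b, np.mean(vals))                     # stays near 1 for every b
```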
Now, with the original data, consider only one variate $v$ among the $p$ variates, $v \in \{1, \ldots, p\}$. We define
$$\mathrm{CHI}_v = \frac{\sum_{i=1}^{K} n_i (\bar{x}_{iv} - \bar{x}_v)^2 / (K-1)}{\hat{\sigma}_v^2},$$
where $\hat{\sigma}_v^2 = \sum_{i=1}^{K} \sum_{j \in C_i} (x_{ijv} - \bar{x}_{iv})^2 / (n-K)$ is the unbiased estimator of $\sigma_v^2$. Suppose that group $i^{*}$ among the $K$ groups consists of a mixture of two distributions, i.e., $x_{i^{*}j} \sim N_p(\mu_{i^{*}}, \Sigma)$ for $j \in C_{i^{*}}^{(1)}$ and $x_{i^{*}j} \sim N_p(\mu_{i^{*}} + \delta, \Sigma)$ for $j \in C_{i^{*}}^{(2)}$, where $\delta \neq 0$. Under these conditions, we have the following result.
Theorem 2. Consider the same distributional assumptions as in Theorem 1, except that the data or image of group $i^{*}$ consists of the two mixed distributions. Then, for sufficiently large $n$, the expectation of $\mathrm{CHI}_v$ is strictly smaller than it would be in the absence of the mixture.
Theorem 2 can be proven readily from standard ANOVA results [6]. Specifically, when a group of the data consists of mixed distributions, the estimator $\hat{\sigma}_v^2$ is based on a reduced, or underspecified, model. In that circumstance, the underspecified model yields an inflated variance estimate, which leads to the result stated in Theorem 2, using the fact that the numerator and the denominator of the CHI are independent. The detailed proof is given in Appendix A. Now, the following corollary extends the result in Theorem 2 to the general $p$-variate case.
Corollary 1. Consider the same distributional assumptions as in Theorem 1, except that the data or image of group $i^{*}$ consists of the two mixed distributions. Then, for sufficiently large $n$, the expectation of the CHI is strictly smaller than it would be in the absence of the mixture.
The proof is given in Appendix A. Corollary 1 indicates that the CHI statistic yields smaller values in the presence of mixed distributions than when these mixtures are absent. Data aggregation can increase the presence of mixed distributions, as illustrated in Figure 1. Data initially consisting of two distributions become a mixture of several distributions after aggregation, causing a decrease in the CHI statistic due to model underspecification. On the other hand, as aggregation progresses, the number of mixed distributions may gradually decline because of the reduced number of data points.
Conversely, if the data undergoes splitting starting from the optimal unit size, the number of data points increases, preserving the two distinct distributions. This splitting process leads to a sharp increase in the CHI values. The simulation in the next section demonstrates this phenomenon, showing a sharp rise or “elbow point” in the CHI values around the optimal unit size.
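Before turning to the full simulation, the mechanism behind Corollary 1 can be illustrated numerically (a sketch of our own; the mixture components and sizes are illustrative choices, constructed so the group mean is unchanged while the within-group variance inflates):

```python
# Illustration of Corollary 1's message: with the group mean held fixed,
# mixing two shifted components within one group inflates the within-group
# variance and deflates the CHI (sketch; sizes are illustrative).
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(5)
n_i, p = 2000, 3
labels = np.repeat([0, 1], n_i)
g0 = rng.normal(0.0, 1.0, (n_i, p))

# No mixture: group 1 drawn from a single shifted distribution, mean 1.
pure = np.vstack([g0, rng.normal(1.0, 1.0, (n_i, p))])

# Mixture: group 1 is half N(0, I) and half N(2, I) -- group mean still 1.
mixed = np.vstack([g0,
                   rng.normal(0.0, 1.0, (n_i // 2, p)),
                   rng.normal(2.0, 1.0, (n_i // 2, p))])

print(calinski_harabasz_score(pure, labels))    # larger
print(calinski_harabasz_score(mixed, labels))   # smaller: inflated variance
```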
3. Simulation
We investigate the behavior of the CHI under data aggregation through simulation. In this simulation, we compare two groups of synthetic 3D image data (a control image and a case image), where the control and case images differ not at the pixel level but at the level of certain blocks of uniformly sized squares. Both image groups consist of synthetic image data containing 3600 pixels arranged in a $60 \times 60$ grid. Every pixel contains three features ($p = 3$), independently generated from a multivariate normal distribution. Initially, both the control and case images are independently generated with identical distributions. The mean vector is set to $(1, 1, 2)$ and the covariance matrix is constructed with a correlation of 0.2 between features and a diagonal standard deviation of 1. Contamination is then introduced into the case image as follows. In the first scenario, certain blocks within the case image are replaced with data generated from a shifted mean vector $(2, 2, 3)$. The contaminated regions consist of $4 \times 4$ pixel blocks, resulting in 225 candidate blocks for contamination. From these 225 blocks, 10 to 30% are randomly selected for contamination. This localized disturbance creates differences between the control and case images at the block level.
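A sketch of this data-generating process, reconstructed from the description above (variable names are ours, and the 20% contamination rate is one value from the stated 10 to 30% range), might look as follows:

```python
# Sketch of the first-scenario synthetic images: a clean control image and
# a case image with randomly contaminated 4x4 blocks (names are ours).
import numpy as np

rng = np.random.default_rng(3)
side, block, p = 60, 4, 3                       # 60x60 grid, 4x4 blocks
mu0, mu1 = np.array([1.0, 1.0, 2.0]), np.array([2.0, 2.0, 3.0])
cov = np.full((p, p), 0.2)                      # correlation 0.2, sd 1
np.fill_diagonal(cov, 1.0)

def make_image(mu):
    # One (60, 60, 3) image of i.i.d. multivariate normal pixels.
    return rng.multivariate_normal(mu, cov, size=(side, side))

control = make_image(mu0)
case = make_image(mu0)

# Contaminate a random 20% of the 15x15 = 225 candidate blocks.
n_blocks = side // block
idx = rng.choice(n_blocks**2, size=int(0.2 * n_blocks**2), replace=False)
for k in idx:
    r, c = divmod(k, n_blocks)
    case[r*block:(r+1)*block, c*block:(c+1)*block] = \
        rng.multivariate_normal(mu1, cov, size=(block, block))
```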
The visualization (
Figure 2) extracts the first feature from each image and displays it using a grayscale gradient, with dark pixels indicating low values and bright pixels indicating high values. In the figure, the control image appears as a relatively uniform grayscale texture with random noise, while the case image shows some brighter patches corresponding to contaminated blocks.
We perform an analysis by dividing each image into square blocks of varying side lengths: 1, 2, 4, 5, 10, 15, and 20 pixels. For each block size, we calculate the mean values of the features within each block, resulting in reduced data matrices for both image groups. We then compute the CHI for these aggregated blocks to assess how contamination affects the CHI at various aggregation levels.
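The block-aggregation analysis can be sketched as follows (again our own reconstruction, reusing the control and case arrays from the previous sketch); an elbow in the printed CHI values is expected to appear near block size 4:

```python
# Sketch of the block-aggregation analysis: average features within each
# non-overlapping block, pool control and case blocks, and score their
# separation with the CHI. Reuses `control` and `case` from above.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def block_means(img, s):
    """Average an (H, W, p) image over non-overlapping s x s blocks."""
    h, w, p = img.shape
    return img.reshape(h // s, s, w // s, s, p).mean(axis=(1, 3)).reshape(-1, p)

for s in [1, 2, 4, 5, 10, 15, 20]:
    A, B = block_means(control, s), block_means(case, s)
    X = np.vstack([A, B])
    labels = np.r_[np.zeros(len(A)), np.ones(len(B))]
    print(s, calinski_harabasz_score(X, labels))
```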
The changes in the CHI values resulting from varying levels of data aggregation are displayed in Figure 3A, based on 100 simulations for each plot. Figure 3A displays the CHI value changes across different contamination rates, with the contamination unit size fixed at $4 \times 4$. When there is no difference in distributions between the control and case images (i.e., in the absence of contamination), the CHI values remain relatively low, with no notable patterns observed. When contaminated blocks are present in the case images, the figure clearly shows a distinctive pattern, where the CHI values increase substantially during data splitting (moving from right to left on the x-axis in Figure 3A). Clear elbows appear at 4 on the x-axis, even with relatively low contamination rates (e.g., 10%). When the level of contamination is higher, the overall patterns remain similar, except that the elbow points become more pronounced.
As another scenario, we also vary the unit size of contamination; i.e., the side lengths of the square contamination units are set to 1, 5, 10, and 20, while the overall contamination rate remains fixed at 30%. Figure 3B shows the CHI value changes across different contamination unit sizes. Clear elbows appear at the respective contamination unit sizes. Notably, when the unit size is large, there is a gradual increase without a distinctive elbow, as explained under Corollary 1.
These simulations are based on assumptions that closely align with those of ANOVA. For more realistic conditions, we also consider scenarios that violate these assumptions. Specifically, we investigate changes in CHI values under a setting similar to that in Figure 3A, except that the variance between the two groups differs (Figure A1 in Appendix B) or adjacent observations exhibit spatial correlation (Figure A2 in Appendix B). We observe that distinctive elbows still appear at the expected locations, though they are less pronounced, demonstrating some robustness of the results under violations of the ANOVA assumptions. Notably, when spatial correlation exists, the plot may show gradual increases as the unit size decreases, even in the absence of contamination (Figure A2). This indicates that the gradual increase of the CHI with decreasing unit size can produce false signals when background spatial correlation exists.
The simulation supports our theoretical findings, demonstrating that units of contamination clearly manifest as elbow points. It provides insights into how contamination and block-level data structure influence the variability of the CHI, highlighting the detectability of proper unit sizes in image data.
5. Conclusions
In this study, we proposed a data aggregation approach that combines voxel data into manageable units to improve computational efficiency, while preserving data utility relevant to group-level differences. Our approach provides a straightforward guideline for selecting an upper threshold of unit sizes appropriate for clustering and other analyses.
We emphasize the importance of a scale- and sample-size-independent metric for evaluating unit size. Such a metric must remain invariant to changes in resolution or sample size, allowing consistent comparisons across datasets with varying spatial properties. We demonstrated this principle through simulations of image data aggregation, assessing how the CHI responds to aggregation across different block sizes. Our findings reveal that, in the absence of contamination, the CHI values remained stable without showing distinct patterns. However, when contaminated regions were introduced, the CHI values showed a clear elbow point at the contamination unit size.
Applying this method to DTI data, we found some consistency in the cluster definitions between the unit size selected by the CHI and a smaller unit size; however, the age relevance of the identified clusters remained inconclusive at the unit size indicated by the CHI values. This reveals some limitations in real-world data analysis, particularly when the signal from the DTI metrics is weak. While the unit size determination provides a useful guideline for balancing resolution and computational efficiency, finer-grained analysis may still be worthwhile, as it can yield more detailed and interpretable insights into specific brain regions.
In conclusion, this study presents a method for defining and evaluating unit size in brain imaging analysis, with the goal of enhancing computational efficiency and scalability. Our approach provides a practical guideline: the unit size can be selected below the elbow point indicated by the CHI, balancing detail against practical implementation. By proposing data aggregation guided by the CHI, we offer a foundation for effective image analysis, which may support future methodological developments in the investigation of neural and structural brain characteristics.