Age Estimation from fMRI Data Using Recurrent Neural Network

Finding a biomarker that indicates the subject’s age is one of the most important topics in biology. Several recent studies tried to extract a biomarker from brain imaging data including fMRI data. However, most of them focused on MRI data, which do not provide dynamics and lack attempts to apply recently proposed deep learning models. We propose a deep neural network model that estimates the age of a subject from fMRI images using a recurrent neural network (RNN), more precisely, a gated recurrent unit (GRU). However, applying neural networks is not trivial due to the high dimensional nature of fMRI data. In this work, we propose a novel preprocessing technique using the Automated Anatomical Labeling (AAL) atlas, which significantly reduces the input dimension. The proposed dimension reduction technique allows us to train our model with 640 training and validation samples from different projects under mean squared error (MSE). Finally, we obtain the correlation value of 0.905 between the predicted age and the actual age on 155 test samples. The proposed model estimates the age within the range of ±12 on most of the test samples. Our model is written in Python and is freely available for download.

The relationship between the brain and age through neuroimages has been explored [32][33][34][35][36]. Franke et al. [37] showed the relationship between the human brain and age through MRI data, which is a static image of the brain. The authors used principal component analysis to reduce the size of the MRI data and performed regression analysis by training the relevance vector machine (RVM) to estimate the age of the subject. Recently, deep learning networks have been used to analyze MRI images. Huang et al. used the VGGNet [38] to extract features from MRI images [39]. Qi et al. proposed a 3D convolutional neural network (CNN) for the 3D level of MRI images [40]. Jiang et al. conducted a comparative study of T1-weighted brain images at different parts (gray matter, white matter, and cerebrospinal fluid) through a DenseNet [41] based on transfer learning [42]. However, the authors used a still image of the brain at rest and did not consider the brain function network dynamics over time.
Brain function network dynamics suggest that human brain activity is not only an activity of a single functional brain area but also a time-varying interaction between multiple areas [43,44]. Furthermore, fMRI is more sensitive to detecting aging-related neurodegenerative diseases [45]. Thus, an age estimation based on fMRI has an advantage over structural MRI. Yao et al. showed the relationship between the entropy of the brain and age [46] including time-varying dynamics. The authors first divided the fMRI image into 90 regions using the Automated Anatomical Labeling atlas (AAL) [47], then transformed the fMRI image to time series data. Finally, they extracted correlation coefficients among all the pairs of the time series and computed the entropy of the empirical distribution of coefficients. However, the correlation between the entropy and the actual age is not strong, and it is hard to interpret the meaning of the entropy of coefficients.
Another area related to time series data is natural language processing (NLP). Recurrent neural networks (RNN) [48], long short-term memory (LSTM) [49], and gated recurrent units (GRU) [50] dominated the field due to their structure being similar to time series. Later, Transformer [51], based on a simple and clear attention structure [52,53], was proposed, which showed more concise and efficient performance than traditional deep networks. Transformer achieves better results due to its ability to capture long-term dependencies.
In this paper, we propose a deep neural network (DNN) model to estimate the age from fMRI. Since fMRI data can be viewed as a time series of 3D images, it is natural to use network models from NLP, such as recurrent neural networks (RNN) or Transformer. The proposed model is based on GRU since fMRI data have weaker long-term dependencies.
We would like to emphasize that applying an existing DNN model is not trivial, mainly due to the data shortage issue. The cost of collecting fMRI data is high, and it is hard to obtain a sufficient amount of data to train the DNN model. Moreover, the dimension of fMRI data is huge (e.g., 64 × 64 × 34 × 100) compared to the number of trainable data. To overcome the data shortage, we propose a new dimension reduction technique for fMRI data. More precisely, we use fMRI data from 795 publicly available samples. We first apply the clustering algorithm to the data to reduce the dimension. This provides time series data that are a 94 dimension vector at each repetition time point. Then, we train DNN models that estimate the age, where the input is preprocessed fMRI data. We compare the proposed model with other DNN models including vanilla RNN, Transformer, and multilayer perceptron (MLP). The correlation coefficient between the estimated age and the actual age is r = 0.905, which outperforms the other DNN models.
The paper is organized as follows. In Section 2, we provide detailed information about fMRI samples and introduce the preprocessing step that reduces the input dimension. The proposed recurrent neural network-based model is described in Section 3. We provide an experimental setup and results in Section 4. In Section 5, we investigate the overall contribution of components to the proposed model, and we conclude the paper in Section 6.

Data Acquisition
We collect a total of 1450 human brain fMRI samples from 26 publicly available projects, where informed consent was obtained from all subjects. Defective samples are removed including samples with no age information and samples with incomplete data. Finally, we use a total of 795 samples of fMRI data consisting of 437 women and 358 men. The range of ages is from 10 to 80. Table 1 provides the statistics of data. SALD project has 369 samples from the Southwest University Adult Lifespan Dataset (SALD) [54]. The sample includes 229 women and 140 men, who are students and residents from the Southwest University of China. All other projects are from the 1000 Functional Connectomes Project (http://fcon1000.projects.nitrc.org, accessed on 20 November 2021), with a total of 426 brain samples, including 208 women and 218 men.
Since the data are obtained from various projects, some inconsistencies are hard to be normalized. Specifically, the repetition times (TR) vary from 1 to 3 s. Among 795 samples, 629 samples have a TR of 2 s, and the majority of the remaining samples have TRs close to 2 s. Since most of them have similar TRs, we ignore the impact of TR differences in this work.
For the positional variation, we normalize the data, which is described in the following section.

Data Preprocessing
The fMRI scanner often performs environmental adaptation at the beginning of the scan. It is also unstable during the adaptation, which may provide inconsistent data. To avoid this issue, we discard the earliest part of fMRI data (the first 10 repetition time points).
Since we collect the data from various projects, each project may have its own setups or imaging parameters. In other words, each project may have its own bias, and therefore we need to remove the bias from each project. Moreover, each subject may have different sizes of brains, and the brain positions might vary while scanning. These variations in the sample can impact our model since the size of the dataset is not large enough. To resolve the issue, we apply a normalization technique for the individual data.
First, we use the FMRIB software library (FSL) to preprocess fMRI data. FSL is a widely used tool for brain imaging data [55,56]. More precisely, we employ the FEAT toolbox in the FSL software library to normalize the fMRI data. To remove the bias of the project, we extract imaging parameters from each project and center the images of the project with corresponding parameters.
The second part of the preprocessing is the registration of the normalized image. The registration step clusters the image to several regions accordingly and provides a "mask" that indicates the coverage of regions. Among various registration methods, we use AAL2 from the Montreal Institute of Neurology (MNI) brain space Automated Anatomical Labeling atlas (AAL) [57]. FMRIB Linear Image Registration Tool (FLIRT) is used for registration, which is an automation tool for brain image registration in the FSL software library. This registration step divides the brain into 94 regions of interest (ROI). Note that the above steps do not reduce the dimension of data and the original data can be recovered by reversing the process. An example of normalization and registration of fMRI images is shown in Figure 1. After the registration step, we average the voxel values of each region to reduce the dimension of data. Thus, after the preprocessing, we obtain a time series data X of length t. More precisely, let C be the fMRI sample and f pre be the preprocessing procedure including normalization, registration, and averaging. Then, we obtain the time series data X = (x 1 , x 2 , · · · , x t ) = f pre (C) ∈ R 94×t , where the preprocessed brain image at the i-th repetition time point X i ∈ R 94 is a 94-dimensional vector. Note that fMRI data from different projects may have different repetition time points. The whole preprocessing steps are depicted in Figure 2. Data preprocessing procedures. We use FEAT to normalize the fMRI data (4D data), and then use FLIRT to register the data. Finally, we average voxel values for each 94 brain region.

Preliminaries
Since the preprocessed data is a time series that reflects brain function network dynamics, it is natural to consider the neural network (NN) models designed for sequence data, such as recurrent neural networks (RNN) and Transformer. RNN recursively summarizes the sequential data and analyzes its dynamic behavior [48]. However, RNN has vanishing and exploding gradient issues. Long short-term memory (LSTM) introduced gates to solve this issue [49,58]. The gated recurrent unit (GRU) is a variant of the LSTM network, which has a simpler and more effective structure [50,59]. We propose a GRU-based model for age estimation. We also provide several other DNN models for baseline. In Section 4, we provide a comparison between the proposed model and the other baseline models.

Proposed Model
The proposed GRU-based model takes an input X, where the last GRU is followed by fully connected (FC) layers, including a batch normalization (BN) layer and ReLU activation. We describe the structure of the proposed model in Figure 3a. The model hyperparameters, including the number of layers and hidden dimensions, are optimized as described in Section 4.  Other Baseline Models Another estimator we considered is a Transformer, which shows extraordinary performances in NLP such as translation. The Transformer generally includes two parts. The "encoder" takes an input sequence then maps to a vector where the "decoder" outputs a sequence based on the encoded vector. Since we want to estimate age (real number), we are only interested in the Transformer encoder. Similar to Transformer-based models in NLP, we first map the preprocessed data X to a vector using positional encoding. Then, we apply multiple Transformer encoder layers that consist of self-attention and feed-forward layers. The final fully connected layers estimate the age. The structure of Transformer-based model is described in Figure 3a.
We also considered the other RNN-based models, including vanilla RNN and LSTM, where vanilla RNN and LSTM also have similar structures to the GRU-based model.
Finally, we constructed a multilayer perceptron (MLP) that consists of FC layers only. We treat the preprocessed fMRI data as data with two dimensions: time series and brain regions. We constructed a multi-layer model learning data with the same structure. Each layer contains an FC layer, a batch normalization (BN) layer, a ReLU activation, and a dropout [60].

Model Parameters
Hyperparameters of models are optimized via cross-validation as described in Section 4.2. The proposed GRU-based model has three internal layers where each recurrent unit has 300 hidden states, and the width of the final FC layer is 300. The Transformer network consists of three Transformer encoder layers, where each layer consists of two sub-layers. The first is the multi-head self-attention mechanism: we set the number of heads to 2. The second is a position-wise feed-forward layer that sets its network dimension to 8 times the input dimension. We optimize other RNN-based models (vanilla RNN and LSTM) and achieve the same hyperparameters as the GRU-based model. The MLP network consists of three FC layers. Standard dropout probability p = 0.5 is used in RNN-and MLP-based models. In the Transformer model, the dropout probability of p = 0.1 is used in the positional encoding and the encoder layer.

Training
The data are divided into three independent sets: training, validation, and test sets. We first divide all the data into two parts, with a ratio of 8 to 2 as shown in Table 2. In all, 20% of the total data, namely 155 samples, are assigned for the test. The remaining 80% of the total data, namely 640 samples, are used for training and validation. We use k-fold cross-validation with k = 10 due to an insufficient amount of data. The data except for the test set (total 640 data points) are divided into 10 subsets, one subset is used as the validation dataset and the other 9 subsets are used as the training dataset. This procedure is repeated 10 times, then the average performance is computed. We choose the model hyperparameters that provide the best average performance.
For model training, we use the Adam optimizer with learning rate attenuation. The initial learning rate value is set to 0.001 for the GRU model and 0.0001 for the Transformer model. The learning rate is attenuated by 0.99 every 100 steps. In our experiment, the training batch size is set to 144 where the number of epochs is 10,000. The mean squared error (MSE) is employed to compute the loss while training where (X (1) , y (1) ), . . . , (X (n) , y (n) ) denotes the preprocessed data sample and label pair, and f NN (·) denotes a proposed neural network. Thus, the training is to minimize MSE.

Implementations
Our neural network model and data preprocessing procedures are implemented in Python, which is freely available for download at https://github.com/gyfbianhuanyun/ brain-data-with-age (accessed on 20 November 2021). The machine used to perform the experiments has the following specifications: 8 GB RAM, Intel(R) Core(TM) i5-8400 @2.80GHz CPU processor, and Nvidia Geforce GTX 1050 graphics card. Table 3 shows the performance of the proposed model and other baseline models on the test dataset (155 data points). It provides the correlation between estimated age and the actual age. In addition to MSE, we compute the mean absolute error (MAE) to measure the accuracy of age estimation

Results
We also provide the intraclass correlation coefficient (ICC) to evaluate the model's performance. The ICC evaluates the consistency of the same number of measurement results measured by multiple observers. Here, we take the actual age and the predicted age as two measurement results and judge the prediction model's performance by the degree of consistency.
The GRU model shows the best performance while the Transformer model has comparable results. Figure 4a shows the performance of the GRU model where the x-axis shows the age of the subject and the y-axis represents the estimated age by the proposed model. The gender of the subjects are indicated where blue dots represent male subjects, and red triangles represent female subjects. It can be seen that more than 80% of the estimations are within the ±12 range of the actual age. The correlation coefficient between the estimation and the actual age is r = 0.905.   The training results using the Transformer model are shown in Figure 4b. These results will be discussed in Section 4.5. Since most previous works [37,39,40,42] focus on structural MRI input and codes are not available online, we compare our results with Yao et al. [46], which computes a correlation between entropy and actual age based on fMRI data samples. Figure 4c shows the relation between entropy and age. The correlation coefficient between the entropy and the age is r = 0.292, which is weaker than the proposed estimation model.
In addition, we provide Bland-Altman plots to visualize the results. The Bland-Altman plot shows the consistency between two different measurements. Here, we use the actual age of the test data sets as a known measurement and the predicted age as another measurement value. Figure 4d,e show Bland-Altman plots of the GRU model and Transformer model. What needs attention here is the processing of the entropy results. Since the entropy value is not an age estimation, we first find a linear regression model that estimates an age from the entropy value. Figure 4f shows a Bland-Altman plot of estimated age from entropy and the actual age.

Age Bias
As we can see in Figure 4a, our model tends to underestimate the age of subjects. This is mainly due to a bias of our training dataset where we have more young subjects in the training set, as we showed in Table 1. More precisely, the number of subjects younger than 30 is 256, which is 32.20% of the dataset, while the number of subjects older than 60 is 144, which is 18.11% of the dataset. Figure 4a shows the gender difference of our model. For female subjects, the estimation is more accurate for samples of those aged around 20, but the estimation becomes less accurate when the age of a sample is more than 60. On the other hand, for male subjects, the estimation loss is consistent throughout the ages. Note that Yao et al. [46] also reported the gender difference. The authors mentioned that the functional entropy increases with age where the mean entropy of female subjects increases slower than male subjects.

Ablation Study
We also trained the GRU-based model with variations.

Effect of Brain Regions
In order to analyze the influence of different brain regions on the results, we extracted the parts with left or right hemispheres in the brain region named in the AAL2 atlas and conducted the training. Table 3 shows the training results. We found that under the same conditions, it is beneficial to fully utilize all the available brain regions to estimate the age of subjects. Interestingly, we observed that the data from the left hemisphere have a higher correlation with age.

Effect of Dropout
As we mentioned in the above section, we employ the dropout technique to prevent overfitting. Figure 5a,d show the estimation result of the exact same model without dropout. It can be seen that some estimations deviate from the actual ages, where the proposed method with dropout shows a robust estimation. The correlation coefficient without dropout is r = 0.849, which is smaller than the correlation coefficient r = 0.905 of the proposed model in the presence of dropout. This clearly shows that the dropout suppresses the generalization error, especially in our case where the size of the dataset is small.

Choice of Loss Function
Since the 1 loss function and the MSE ( 2 ) loss function are relatively common loss calculation methods, in this section, we investigate how the choice of loss function affects the result. Figure 5b,e show the estimation result when the 1 loss function is employed. The estimation with 1 loss is less accurate than the model with the MSE loss function. This is because the MSE loss penalizes the large errors more while the 1 loss function computes the absolute deviation. In particular, we can see some outliers when we train the model with the 1 loss function. Note that the correlation coefficient of the model is r = 0.749.

Effect of Scan Time
To see the effect of time dynamics on brain imaging data, we train the model with samples that are fixed length time series. Figure 5c,f show the estimated result when the model uses the first 100 repetition time points only. In this case, the model correlation coefficient is r = 0.889, which is lower than the result of using all times data. This shows that the brain function network dynamics are indeed helpful in age estimation. . The x-axis is the average of each sample's actual and predicted age, and the y-axis is the difference between the exact age and the predicted age of the sample. The solid line is the average of the difference. The dashed line is the upper and lower limits of 95% agreement, i.e., ±1.96 standard deviations.

Conclusions
To estimate the age from fMRI images, we proposed the deep learning model based on GRU and Transformer. The preprocessing technique is also provided to reduce the data dimension. Since we have only 795 pieces of data, we use cross-validation to determine hyperparameters during training. Despite the lack of data, our model provides a reasonable estimation of age compared to the previous entropy-based method. Throughout the research, we found a relationship between brain activity and age, which can be extended to other brain disease research.
Limitations: Since the proposed GRU-based model takes a preprocessed input, the region extraction method is crucial. The proposed framework is based on the AAL atlas, an anatomical (structural) parcellation. We would like to mention that other representations such as functionally-defined brain regions of interest [61][62][63] might be more sensitive to analyzing fMRI. However, our framework is universal in the sense that we can apply the same normalization technique and GRU-based model even with functionally-defined regions of interest. Another limitation is the diversity of data samples obtained from different projects. Although we normalized samples during preprocessing, this may cause inter-project errors. In future work, we plan to collect samples from controlled experiments. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Data Availability Statement: All 26 projects are from the 1000 Functional Connectomes Project (http://fcon1000.projects.nitrc.org, accessed on 20 November 2021). Furthermore, SALD project is from the Southwest University Adult Lifespan Dataset (SALD) (http://fcon_1000.projects.nitrc.org/ indi/retro/sald.html, accessed on 20 November 2021) [54].