Brain Decoding of Multiple Subjects for Estimating Visual Information Based on a Probabilistic Generative Model

Brain decoding is the process of decoding human cognitive content from brain activity. However, improving the accuracy of brain decoding remains difficult owing to characteristics unique to the brain, such as the small sample size and high dimensionality of brain activity data. This paper therefore proposes a method that effectively uses multi-subject brain activities to improve brain decoding accuracy. Specifically, we distinguish between shared information common to multi-subject brain activities and individual information specific to each subject's brain activities, and use both types of information to decode human visual cognition. Both types of information are extracted as features belonging to a latent space using a probabilistic generative model. In the experiment, a publicly available dataset with five subjects was used, and estimation accuracy was validated on the basis of a confidence score ranging from 0 to 1, where larger values indicate superiority. The proposed method achieved a confidence score of 0.867 for the best subject and an average of 0.813 across the five subjects, the best results among the compared methods. The experimental results show that the proposed method decodes visual cognition more accurately than existing methods, which do not distinguish shared information from individual information.


Introduction
Brain decoding, which estimates human cognition from brain activity, has been actively studied, and recent progress has been made in measuring human brain activity. Several measuring methods have been used in this regard, including the implantable microelectrode array (MEA) [1] and noninvasive methods such as near-infrared spectroscopy [2], electroencephalography [3], functional magnetic resonance imaging (fMRI) [4][5][6][7][8][9][10], and magnetoencephalography (MEG) [11,12]. MEA is an invasive measurement method whose merit is robustness to noise during brain activity measurement; its demerit is that it requires implanting microelectrodes in a subject's body, which imposes a significant physical burden on the subject. Therefore, noninvasive methods such as fMRI and MEG, which are less likely to directly harm a subject, are more widely used than invasive methods. fMRI in particular is frequently used because it can measure brain activity with high spatial resolution. Compared with MEA, the demerits of fMRI are its sensitivity to noise and the large size of the measurement equipment. MEG is superior to fMRI in terms of temporal resolution and has reasonable spatial resolution [13,14].
Several positive results have been reported using machine learning methods to analyze brain activity obtained with these measuring methods. For example, methods have been proposed for analyzing emotion from brain activity [15][16][17]. Other techniques generate image captions or reconstruct an image from brain activity measured while a subject views the image [9,[18][19][20][21][22]. Image reconstruction has also been attempted from brain activity measured during mental imagery [20]. Researchers believe that advances in machine-learning-based brain decoding will help reveal the mechanisms of the human brain. Such revelations are expected to contribute to a society in which everyone lives comfortably by enabling effective devices that use brain activity as input. For example, a brain-machine interface (BMI) aims to let humans directly operate and communicate with external machines without physical movement, which can assist the daily lives of people with disabilities [23][24][25].
The estimation of visual perception from fMRI data, which takes advantage of its excellent spatial resolution, has been actively researched in the field of brain decoding [6][7][8][26]. fMRI data vary depending on the viewed object [6]. Previous studies [7,26] analyzed visual perception using classical methods, such as a support vector machine [27] and a Gabor wavelet filter [28]. A relationship exists between visual features extracted by a convolutional neural network (CNN) [29] and fMRI data obtained while seeing an object [8,30]. This relationship suggests that a CNN mimics the visual perception system of the human brain and that visual features extracted by a CNN are essential for estimating visual perception. Previous studies have attempted to estimate the CNN-based visual features of images from fMRI data collected while subjects viewed those images [8,[31][32][33][34]. In one such study [8], the authors constructed a linear-regression decoder that learns the relationship between each subject's fMRI data and the visual features of a seen image, and the decoder estimates the visual features of the seen image from the fMRI data. Although this decoder can estimate visual features, its accuracy strongly depends on the size of the fMRI training set. However, measuring fMRI data requires a subject to lie in a closed and narrow space for a long period, so preparing a large amount of fMRI data places a psychological and time burden on the subject.
Estimating visual perception from a limited amount of data therefore remains challenging. Some studies [31,32] have used multi-subject fMRI data obtained while multiple subjects viewed the same image. These methods construct a latent space to extract common features (shared features) from multi-subject fMRI data. Because they are based on generative models, they can train stably from multiple inputs even when the training data are limited. They emphasized shared features as the information common to multi-subject fMRI data. However, they did not consider individual features, i.e., the subject-specific information in single-subject fMRI data on which other researchers [8,35] have focused. Each type of feature carries different information, so combining shared and individual features may improve accuracy through their complementary expressive abilities.
In this study, we propose a novel method for estimating the visual features of a seen image from multi-subject fMRI data. To improve estimation accuracy, we focus on both multi-subject and single-subject fMRI data. We construct a latent space from the fMRI data and extract shared and individual features using a generative model. The generative model assumes distributions over the extracted features and can therefore extract effective features from a limited amount of data. We then train decoders that estimate visual features from the extracted shared and individual features. This allows the use of both common cognitive information from multi-subject fMRI data and subject-specific cognitive information from single-subject fMRI data.
The remainder of this paper is organized as follows. In Section 2, we explain the proposed estimation method using multi-subject fMRI data. Section 3 presents the experimental results using an fMRI dataset when multiple subjects see an image. Finally, Section 4 presents the conclusion.

Estimation of Visual Features of Seen Image Using Shared and Individual Features
In this section, we explain the proposed method. An overview of the training phase is illustrated in Figure 1. We constructed a probabilistic generative model (PGM) to extract shared features from multiple subjects and individual features from single subjects. In addition, a visual decoder was constructed to estimate visual features from both extracted feature types. An overview of the test phase is shown in Figure 2. We extracted shared and individual features from single-subject fMRI data using the PGM and estimated the visual features of a seen image using the trained visual decoder; the trained PGM can extract shared features from single-subject fMRI data alone. The training and test phases are explained in Sections 2.1 and 2.2, respectively. A list of the variables used is presented in Appendix A.

Training Phase: Construction of PGM and Visual Decoder
The training phase consisted of two steps. In the first step, we constructed the PGM to extract shared and individual features separately from the fMRI data. This model extracts features that are robust to the noise in fMRI data. In the second step, we trained a visual decoder to estimate the visual features of a seen image. The visual decoder transforms shared and individual features into visual features using projection matrices.
Figure 2. Overview of the test phase in the proposed method. The PGM corresponding to each feature was used to extract features from the target subject's fMRI data. The trained visual decoder can estimate visual features using shared and individual features, and this scheme realizes our approach.

Step 1: Construction of PGM
In step 1, we constructed the PGM for extracting shared and individual features from the fMRI data B_i = [b_{i,1}, ..., b_{i,N}] ∈ R^{d_i × N} (i = 1, ..., J; here, J represents the number of subjects, d_i denotes the dimension of the ith subject's fMRI data, and N represents the number of training fMRI samples, each corresponding to a seen image). First, we describe the scheme for extracting the shared features C = [c_1, ..., c_N] ∈ R^{d_com × N} (d_com being the dimension of the shared features). The procedure is summarized in Algorithm 1. A Gaussian distribution is introduced as the prior of the shared features C, and the model parameters are obtained from the following minimization problem:

min_{P_i, ρ_i^2, Σ_c} − Σ_{n=1}^{N} ln p(b_n),    (1)

where P_i ∈ R^{d_i × d_com} denotes the projection matrix that relates the fMRI data B_i and the shared features C, and I represents the identity matrix. The prior distribution of the shared features c_n (n = 1, ..., N) and the conditional Gaussian distribution p(b_{i,n} | c_n) are given as follows:

p(c_n) = N(c_n | 0, Σ_c),    (2)
p(b_{i,n} | c_n) = N(b_{i,n} | P_i c_n + μ_i, ρ_i^2 I),    (3)

where Σ_c ∈ R^{d_com × d_com} denotes the covariance matrix of the shared features c_n, μ_i = (1/N) Σ_{n=1}^{N} b_{i,n} ∈ R^{d_i} represents the mean of the fMRI data B_i, and ρ_i^2 represents the variance of B_i. The fMRI data of all subjects are combined under the assumption that the subjects see the same image, yielding the multi-subject fMRI data b_n = [b_{1,n}^T, ..., b_{J,n}^T]^T corresponding to one image. The marginal probability distribution p(b_n) and the generative form of b_n are represented as follows:

p(b_n) = N(b_n | μ, P Σ_c P^T + Ψ),    (4)
b_n = P c_n + μ + ε,    (5)

where P = [P_1^T, ..., P_J^T]^T, μ = [μ_1^T, ..., μ_J^T]^T, and Ψ = diag(ρ_1^2 I, ..., ρ_J^2 I) ∈ R^{d_all × d_all} are the combined parameters (d_all = Σ_{i=1}^{J} d_i), and ε ∼ N(0, Ψ) is an error term. To calculate the marginal probability distribution in Equation (4), we define the joint distribution of the shared features c_n and the fMRI data b_n and take its logarithm.
The mean and covariance matrix of p(b_n) can be obtained by identifying the second-order and first-order terms in the exponent of the joint Gaussian distribution [36].

We introduce the expectation-maximization (EM) algorithm [37] to update the model parameters P_i, ρ_i^2, and Σ_c. In the expectation step, the posterior distribution p(c_n | b_n) is calculated. It follows a Gaussian distribution, and we can analytically calculate the expected value E_{c|b}[c_n] and the variance Var_{c|b}[c] as follows:

E_{c|b}[c_n] = Σ_c P^T (P Σ_c P^T + Ψ)^{-1} (b_n − μ),
Var_{c|b}[c] = Σ_c − Σ_c P^T (P Σ_c P^T + Ψ)^{-1} P Σ_c.

These quantities are calculated using the joint distribution of the shared features c_n and the fMRI data b_n, similarly to the calculation of Equation (4); here, however, the joint distribution is defined on the basis of the marginal probability distribution p(b_n). Taking the logarithm of the joint distribution, we identify E_{c|b}[c_n] and Var_{c|b}[c] from the second-order and first-order terms of the exponent [36]. Furthermore, the second-order moment E_{c|b}[c_n c_n^T] is calculated as follows:

E_{c|b}[c_n c_n^T] = Var_{c|b}[c] + E_{c|b}[c_n] E_{c|b}[c_n]^T.

In the maximization step, the parameters P_i, ρ_i^2, and Σ_c are updated to maximize the expected value R(θ, θ_old). Note that θ_old is the parameter fixed in the expectation step. The expected value R(θ, θ_old) is expressed as follows:

R(θ, θ_old) = Σ_{n=1}^{N} E_{c|b}[ln p(b_n, c_n | θ)].

In the above equation, the maximization is performed with respect to θ; terms that depend only on θ_old can be regarded as constants and are excluded from R in the maximization step. R(θ, θ_old) is maximized by setting its partial derivatives with respect to the parameters {P_i, ρ_i^2, Σ_c} to zero. The updated parameters P_i^new, ρ_i^{2,new}, and Σ_c^new are defined as follows:

P_i^new = [Σ_{n=1}^{N} (b_{i,n} − μ_i) E_{c|b}[c_n]^T] [Σ_{n=1}^{N} E_{c|b}[c_n c_n^T]]^{-1},
ρ_i^{2,new} = (1 / (N d_i)) Σ_{n=1}^{N} {‖b_{i,n} − μ_i − P_i^new E_{c|b}[c_n]‖^2 + tr(P_i^new Var_{c|b}[c] (P_i^new)^T)},
Σ_c^new = (1/N) Σ_{n=1}^{N} E_{c|b}[c_n c_n^T].

In addition, the shared features of each subject can be extracted as follows:

c_{i,n} = Σ_c P_i^T (P_i Σ_c P_i^T + ρ_i^2 I)^{-1} (b_{i,n} − μ_i),

where c_{i,n} denotes the shared features of the ith subject.
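To make the EM procedure above concrete, the following is a minimal NumPy sketch of fitting the shared-feature model, not the authors' implementation; the function name `fit_shared_pgm` and all variable names are ours, and the E-step and M-step follow the updates described in the text.

```python
import numpy as np

def fit_shared_pgm(B_list, d_com, n_iter=10, seed=0):
    """EM fit of the shared-feature model b_{i,n} = P_i c_n + mu_i + eps.

    B_list: list of (d_i, N) arrays, one per subject (columns are samples).
    Returns per-subject projections P_i, noise variances rho2_i, Sigma_c,
    and the per-subject means mu_i.
    """
    rng = np.random.default_rng(seed)
    J, N = len(B_list), B_list[0].shape[1]
    mus = [B.mean(axis=1, keepdims=True) for B in B_list]
    Bc = [B - mu for B, mu in zip(B_list, mus)]          # centred data
    Ps = [rng.standard_normal((B.shape[0], d_com)) * 0.1 for B in B_list]
    rho2 = [1.0] * J
    Sigma_c = np.eye(d_com)
    b = np.vstack(Bc)                                    # stacked (d_all, N)
    for _ in range(n_iter):
        P = np.vstack(Ps)                                # (d_all, d_com)
        Psi = np.diag(np.concatenate(
            [np.full(B.shape[0], r) for B, r in zip(B_list, rho2)]))
        # E-step: Gaussian posterior of c_n given the stacked b_n
        S = P @ Sigma_c @ P.T + Psi
        K = Sigma_c @ P.T @ np.linalg.inv(S)             # (d_com, d_all)
        Ec = K @ b                                       # E[c_n] as columns
        Vc = Sigma_c - K @ P @ Sigma_c                   # Var[c]
        Ecc = N * Vc + Ec @ Ec.T                         # sum_n E[c_n c_n^T]
        # M-step: closed-form updates per subject
        for i in range(J):
            Ps[i] = (Bc[i] @ Ec.T) @ np.linalg.inv(Ecc)
            resid = Bc[i] - Ps[i] @ Ec
            d_i = Bc[i].shape[0]
            rho2[i] = (np.sum(resid ** 2) +
                       N * np.trace(Ps[i] @ Vc @ Ps[i].T)) / (N * d_i)
        Sigma_c = Ecc / N
    return Ps, rho2, Sigma_c, mus
```

Note that the scale of Σ_c and P is only jointly identifiable, so the absolute magnitudes of the returned parameters are arbitrary; the posterior means E[c_n] are the quantities used downstream.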
We can thus extract shared features following a Gaussian distribution, and the extracted features are effective for estimating visual features owing to their robustness to noise. Similarly, we used the PGM to extract the individual features H_i = [h_{i,1}, ..., h_{i,N}] ∈ R^{d_ind × N} from the fMRI data B_i (d_ind being the dimension of the individual features). We again introduce a Gaussian distribution as the prior of the individual features H_i in the corresponding minimization problem. The prior distribution of the individual features h_{i,n} and the conditional Gaussian distribution p(b_{i,n} | h_{i,n}) are given as follows:

p(h_{i,n}) = N(h_{i,n} | 0, Σ_{h_i}),
p(b_{i,n} | h_{i,n}) = N(b_{i,n} | P'_i h_{i,n} + μ_i, ρ'_i^2 I),

where P'_i ∈ R^{d_i × d_ind} denotes the projection matrix that relates the fMRI data B_i and the individual features H_i, Σ_{h_i} ∈ R^{d_ind × d_ind} denotes the covariance matrix of h_{i,n}, and ρ'_i^2 denotes the variance of B_i. The marginal probability distribution p(b_{i,n}) and the generative form of b_{i,n} are represented as follows:

p(b_{i,n}) = N(b_{i,n} | μ_i, P'_i Σ_{h_i} P'_i^T + Ψ_i),
b_{i,n} = P'_i h_{i,n} + μ_i + ε_i,

where ε_i ∼ N(0, Ψ_i) represents an error term and Ψ_i = ρ'_i^2 I ∈ R^{d_i × d_i}. We can extract the individual features h_{i,n} by following the same EM calculation steps as for the shared features c_n; the obvious difference is that, unlike for the shared features, the parameters need not be combined across multiple subjects. Finally, the individual features h_{i,n} can be calculated using the following equation:

h_{i,n} = Σ_{h_i} P'_i^T (P'_i Σ_{h_i} P'_i^T + ρ'_i^2 I)^{-1} (b_{i,n} − μ_i).
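The single-subject individual-feature model has the same form as probabilistic PCA. As an illustration (not the paper's EM fit, which places a full covariance on the latent features), the sketch below uses the classical closed-form maximum-likelihood solution for an isotropic latent prior; the function name `fit_ppca` and all variable names are ours.

```python
import numpy as np

def fit_ppca(B, d_ind):
    """Closed-form ML fit of the single-subject model b = P h + mu + eps.

    B: (d_i, N) data matrix, columns are samples.
    Returns the loading matrix P, noise variance rho2, mean mu, and the
    posterior means H of the individual features.
    """
    d, N = B.shape
    mu = B.mean(axis=1, keepdims=True)
    cov = (B - mu) @ (B - mu).T / N
    vals, vecs = np.linalg.eigh(cov)               # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]         # descending order
    rho2 = vals[d_ind:].mean()                     # noise = discarded variance
    P = vecs[:, :d_ind] * np.sqrt(np.maximum(vals[:d_ind] - rho2, 0.0))
    # posterior mean of h_n given b_n
    M = P.T @ P + rho2 * np.eye(d_ind)
    H = np.linalg.solve(M, P.T @ (B - mu))
    return P, rho2, mu, H
```

Under the isotropic-prior assumption this closed form gives the same subspace that the EM iterations converge to, which makes it a useful sanity check for an EM implementation.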
Step 2: Construction of Visual Decoder
In step 2, we trained the visual decoder that converts shared and individual features into the visual features V = [v_1, ..., v_N] ∈ R^{d_v × N} (d_v being the dimension of the visual features). We calculated the projection matrices P_com,i ∈ R^{d_v × d_com} and P_ind,i ∈ R^{d_v × d_ind} by solving the following minimization problem with respect to P_com,i and P_ind,i:

min_{P_com,i, P_ind,i} Σ_{n=1}^{N} ‖v_n − P_com,i c_{i,n} − P_ind,i h_{i,n}‖^2 + λ_com,i ‖P_com,i‖_F^2 + λ_ind,i ‖P_ind,i‖_F^2,

where λ_com,i and λ_ind,i represent regularization parameters. By setting the partial derivatives with respect to P_com,i and P_ind,i to zero, we can simply obtain the following optimal projections:

[P_com,i, P_ind,i] = V F_i^T (F_i F_i^T + Λ_i)^{-1},

where F_i = [C_i^T, H_i^T]^T stacks the shared and individual features of the ith subject and Λ_i = diag(λ_com,i I, λ_ind,i I).
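The decoder objective above is jointly quadratic in the two projection matrices, so the ridge solution can be computed in one step on the stacked features. Below is a minimal NumPy sketch under that formulation; the function name `train_visual_decoder` is ours.

```python
import numpy as np

def train_visual_decoder(V, C, H, lam_com=1.0, lam_ind=1.0):
    """Ridge solution for V ~= P_com C + P_ind H (columns are samples).

    V: (d_v, N) visual features; C: (d_com, N) shared features;
    H: (d_ind, N) individual features.  Returns (P_com, P_ind).
    """
    F = np.vstack([C, H])                            # stacked features
    d_com, d_ind = C.shape[0], H.shape[0]
    # block-diagonal ridge penalty: lam_com on C rows, lam_ind on H rows
    Lam = np.diag(np.concatenate([np.full(d_com, lam_com),
                                  np.full(d_ind, lam_ind)]))
    P = V @ F.T @ np.linalg.inv(F @ F.T + Lam)       # (d_v, d_com + d_ind)
    return P[:, :d_com], P[:, d_com:]
```

Because the penalty is block-diagonal, this joint solve returns the same fixed point that alternating updates of P_com,i and P_ind,i converge to.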

Test Phase: Estimation of Visual Features of Seen Image
The test phase consisted of two steps. In the first step, we extracted shared and individual features using the constructed PGM. In the second step, we estimated visual features from the shared and individual features using the constructed visual decoder.

Step 1: Extraction of Shared and Individual Features
We extracted the shared features c_test,i using the parameters of the trained PGM as follows:

c_test,i = Σ_c P_i^T (P_i Σ_c P_i^T + ρ_i^2 I)^{-1} (b_test,i − μ_i),

where b_test,i ∈ R^{d_i} denotes the ith subject's fMRI data in the test phase. The shared features can thus be extracted for each subject individually. Similarly, the individual features h_test,i were extracted as follows:

h_test,i = Σ_{h_i} P'_i^T (P'_i Σ_{h_i} P'_i^T + ρ'_i^2 I)^{-1} (b_test,i − μ_i),

where P'_i, Σ_{h_i}, and ρ'_i^2 are the parameters of the individual-feature PGM. Both features follow Gaussian distributions, and effective features can thus be extracted from the fMRI data.

Step 2: Estimation of Visual Features
The visual features were estimated from the shared and individual features using the trained visual decoder as follows:

v_est,i = P_com,i c_test,i + P_ind,i h_test,i,    (27)

where v_est,i denotes the visual features estimated by the visual decoder of the ith subject.
In Equation (27), visual features are estimated from shared and individual features using each projection matrix corresponding to the features.
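The two test-phase steps can be sketched as a single function: extract the posterior means of the shared and individual features from one subject's fMRI vector, then apply the trained projections. This is a minimal NumPy illustration; the function name `decode_test_sample` and all parameter names are ours.

```python
import numpy as np

def decode_test_sample(b_test, mu, P_sh, rho2_sh, Sigma_c,
                       P_in, rho2_in, Sigma_h, P_com, P_ind):
    """Test phase: extract shared/individual features from one subject's
    fMRI vector, then decode visual features with the trained decoder."""
    b = b_test - mu
    d = len(b)
    # posterior mean of the shared features given this subject alone
    S_c = P_sh @ Sigma_c @ P_sh.T + rho2_sh * np.eye(d)
    c = Sigma_c @ P_sh.T @ np.linalg.solve(S_c, b)
    # posterior mean of the individual features
    S_h = P_in @ Sigma_h @ P_in.T + rho2_in * np.eye(d)
    h = Sigma_h @ P_in.T @ np.linalg.solve(S_h, b)
    return P_com @ c + P_ind @ h                     # estimated v
```

Using `np.linalg.solve` instead of forming the d_i x d_i inverse explicitly keeps the extraction numerically stable for the roughly 4500-dimensional fMRI vectors used in the experiment.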

Experimental Results
This section presents the experimental results of the image category estimation. In Section 3.1, the datasets used in constructing the proposed method are explained. In Section 3.2, the experimental conditions are described. In Section 3.3, the comparison methods are explained. In Section 3.4, the experimental results are presented.

Dataset
In this experiment, we used the fMRI data (approximately 4500-dimensional vectors) published in a previous study [8]. The fMRI data comprise visual cortex activities of five subjects measured while they observed images, using a Siemens MAGNETOM Prisma scanner (https://www.siemens-healthineers.com/jp/magnetic-resonance-imaging/research-systems/magnetom-prisma (accessed on 10 August 2022)). To obtain the fMRI data in [8], four males and one female between the ages of 23 and 38 were chosen as subjects, and functional localizer [38][39][40] and standard retinotopy [4,41] experiments were conducted to identify the visual cortex of each subject. There were 1200 seen images from 150 categories collected in ImageNet [42] (eight images per category), and each image was paired with the corresponding fMRI data.
We performed cross-validation to examine the effectiveness of the proposed method through unbiased experiments. Because brain activity acquisition places a significant burden on the subject, preparing many samples for an fMRI dataset is difficult. Therefore, as shown in Figure 3, we divided the 1200 pairs into 900 training, 150 validation, and 150 test pairs, with all categories divided equally among the training, validation, and test data. We then applied 7-fold cross-validation to the 1050 pairs consisting of the training and test data. By interchanging the training and test data in this way, as is widely done in the machine learning field, the validity of the proposed method can be verified even with a small amount of test data.
Figure 3. Overview of the fMRI datasets of the five subjects. We divided the 1200 seen images corresponding to the measured fMRI data into 900 training, 150 test, and 150 validation images. The validation data were fixed, and 7-fold cross-validation was applied to the 1050 pairs of training and test data. For category estimation, we used candidate visual features averaged from other images belonging to the same seen category; thus, the seen images themselves were not included in the test and validation data.
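The split protocol above (fixed 150-pair validation set, 7-fold rotation of the remaining 1050 pairs into 900 training and 150 test pairs) can be sketched as follows; the function name `split_pairs` and the random shuffling are our assumptions, since the paper does not state how pairs were assigned to folds.

```python
import numpy as np

def split_pairs(n_pairs=1200, n_val=150, fold=0, n_folds=7, seed=0):
    """Fixed validation split plus k-fold rotation of the remaining pairs
    into training / test indices (a sketch of the Figure 3 protocol)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pairs)
    val = idx[:n_val]                        # fixed validation pairs
    rest = idx[n_val:]                       # remaining 1050 pairs
    folds = np.array_split(rest, n_folds)    # 7 folds of 150 pairs
    test = folds[fold]
    train = np.concatenate([f for j, f in enumerate(folds) if j != fold])
    return train, val, test
```

In the actual experiment the split would additionally be stratified so that each category appears equally in every subset, which this sketch omits.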
First, 4096-dimensional visual features were extracted from VGG19 [43], which was pre-trained on the 1000 ImageNet categories; the features were taken from the fully connected layer farther from the output. Furthermore, principal component analysis (PCA) [44] was applied to the visual features. Because the visual features are high-dimensional, we used PCA to prevent overfitting. We set the cumulative contribution ratio of PCA to 0.8 (the dimension d_v of the visual features after PCA was approximately 70).
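Selecting the number of principal components by a cumulative contribution (explained-variance) ratio can be sketched as follows; this is a generic NumPy illustration, and the function name `reduce_visual_features` is ours.

```python
import numpy as np

def reduce_visual_features(V, ratio=0.8):
    """Project row-wise features onto the fewest principal components
    whose cumulative explained-variance ratio reaches `ratio`.

    V: (N, d) matrix of visual features, one sample per row.
    Returns the reduced features, the component matrix, and the mean.
    """
    mu = V.mean(axis=0)
    U, s, Vt = np.linalg.svd(V - mu, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)            # per-component variance ratio
    k = int(np.searchsorted(np.cumsum(var), ratio)) + 1
    return (V - mu) @ Vt[:k].T, Vt[:k], mu
```

Applied to 4096-dimensional VGG19 features, a ratio of 0.8 would yield roughly the 70-dimensional d_v reported in the text (the exact value depends on the data).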

Experimental Conditions
The estimation accuracy was evaluated by image category estimation. CNNs are mostly trained for image categorization, so category estimation is an appropriate evaluation metric for the representation ability of visual features extracted from a pre-trained CNN. Some categories of the seen images in the fMRI data used in this experiment were not included in the classes of pre-trained CNN classifiers; therefore, we evaluated the estimation accuracy using visual features rather than classifier outputs. Figure 4 shows an overview of category estimation. We selected 10,000 categories from ImageNet and calculated candidate visual features as the averaged visual features of 5-10 images chosen at random from each category. Note that the 10,000 categories included the 150 image categories of the fMRI dataset used in the test phase. The correlations between the estimated visual features and the 10,000 candidate visual features were calculated and arranged in descending order, and the rank of the ground-truth (GT) category was defined as the image category rank. Finally, the confidence category score S was calculated from the image category rank G as follows:

S = (M − G) / (M − 1),

where M represents the total number of image categories; we set M to 10,000 in this experiment. The confidence category score S approaches 1 for better image category ranks G and 0 for worse ranks. The confidence category scores were averaged over the 150 test data points, the 7-fold cross-validation sets, and the five subjects, and this averaged metric was used for the experimental evaluation.
Figure 4. Overview of the scheme of category estimation. In the training phase, the relationship between a seen image and the corresponding fMRI data of each subject was learned. In the test phase, we estimated visual features from fMRI data based on the learned relationship.
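The ranking and scoring step can be sketched as follows. The linear form of the score, S = (M − G)/(M − 1), is our reconstruction: the original equation was lost, and this is the simplest form consistent with the stated endpoints (S = 1 at rank 1, S = 0 at rank M). Function names are ours.

```python
import numpy as np

def category_rank(v_est, candidates, gt_index):
    """Rank (1 = best) of the GT category when candidate features are
    sorted by correlation with the estimated features, descending."""
    corr = np.array([np.corrcoef(v_est, c)[0, 1] for c in candidates])
    order = np.argsort(-corr)
    return int(np.where(order == gt_index)[0][0]) + 1

def confidence_score(rank, M=10_000):
    """Confidence category score: 1 at rank 1, 0 at rank M (assumed
    linear form consistent with the endpoints stated in the text)."""
    return (M - rank) / (M - 1)
```

For example, an estimated feature vector whose GT category ranks 1331st out of 10,000 would receive S = (10000 − 1331) / 9999 ≈ 0.867, matching the scale of the scores reported in the abstract.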
The candidate visual features to be compared were computed from images chosen at random from ImageNet: 5-10 samples in each of the 10,000 selected categories, which included the 150 image categories belonging to the fMRI data in the test phase. We defined the candidate visual features as the visual features averaged over these images in each category and compared them with the visual features estimated from the fMRI data. Finally, we calculated the correlations between the estimated and candidate visual features, and estimation accuracy was evaluated with the seen image category as the ground truth (GT).

Comparison Methods
We compared the proposed method (hereafter denoted as PM) with several comparison methods (hereafter denoted as CMs) on the evaluation metric to validate the effectiveness of the PM. There are seven CMs: two use multi-subject fMRI data, and five use single-subject fMRI data.

Multi-subject probabilistic generative model (MSPGM):
MSPGM is based on the PGM and, like the PM, uses multi-subject fMRI data. Visual features are estimated from the shared features alone using ridge regression [45]. We set the number of latent-space dimensions to that of the PM and searched the regularization parameter of ridge regression over {0.1, 1, 10}.

Multi-view Bayesian generative model for multi-subject fMRI Data (MVBGM-MS):
MVBGM-MS [46] exhibited state-of-the-art performance in the field of brain decoding for visual cognitive contents. MVBGM-MS uses multi-subject fMRI data, and the generative model estimates visual features via the latent space. MVBGM-MS uses visual features, multi-subject fMRI data, and semantic features extracted by inputting image category names into Word2vec [47] to improve accuracy. Therefore, for a fair evaluation of the PM, we used MVBGM-MS without semantic features. We set the number of dimensions in the latent space in the same manner as in the previous study [46].

Single-subject probabilistic generative model (SSPGM):
The SSPGM method is based on the PGM and uses single-subject fMRI data. Visual features are estimated from the individual features alone using ridge regression. We set the number of latent-space dimensions to that of the PM and searched the regularization parameter of ridge regression over {0.1, 1, 10}.

Sparse linear regression (SLR):
SLR [8] is a baseline method in the field of brain decoding for visual cognitive contents that directly estimates visual features from fMRI data. Visual features were estimated using the voxels of the fMRI data most highly correlated with the features. Voxels were selected in order of decreasing correlation, and the total number of selected voxels was treated as a hyperparameter, searched over {50, 100, 200, 400, 500, 1000}.

Canonical correlation analysis (CCA):
CCA [48] is a baseline method for calculating the latent space from multi-modal features. Visual features and fMRI data are converted into features belonging to the latent space, and accuracy is evaluated in the space. We searched for {10, 20, 30, 40, 50, d v } in the number of dimensions in the latent space.

Bayesian CCA (BCCA):
The BCCA [49] method is an extension of CCA that adopts Bayesian learning. BCCA is a generative model. The latent space consists of visual features and fMRI data, and visual features can be estimated from fMRI data via the space. We searched for {10, 20, 30, 40, 50, d v } in the number of dimensions in the latent space.

Deep CCA:
The Deep CCA [50] method is also an extension of CCA that adopts deep learning. As in CCA, visual features and fMRI data are converted into features belonging to the latent space, and accuracy is evaluated in that space. We searched the number of latent-space dimensions over {10, 20, 30, 40, 50, d_v}.

Results

Table 1 shows the accuracy of category estimation for the PM and CMs. Note that the average scores over the 150 test images were calculated for each subject as the evaluation metric. These scores range from 0 to 1, with larger values indicating superiority. For the PM, MSPGM, and SSPGM, we set d_com, d_ind, and the number of EM iterations to 100, 100, and 10, respectively. In addition, in the PM's visual decoder, we searched each regularization parameter λ_com,i and λ_ind,i over {0.1, 1, 10}. In Table 1, the scores of most subjects and the five-subject averages of the PM are superior to those of MVBGM-MS and MSPGM, which are based on multi-subject fMRI data. The PM's superior scores indicate the effectiveness of distinguishing between shared and individual features in multi-subject fMRI data; notably, the PM outperformed even the state-of-the-art MVBGM-MS. The scores of all subjects and the averages of the PM are also superior to those of SSPGM, SLR, CCA, Deep CCA, and BCCA, which are based on single-subject fMRI data. SSPGM is based on the same PGM as the PM, so this comparison exhibits the effectiveness of combining shared and individual features. The comparison with SLR confirms the effectiveness of the PM in estimating the visual features of seen images from fMRI data, and its quantitative accuracy makes it reliable. Moreover, compared with CCA and Deep CCA, our method constructs a more effective latent space. In particular, the PM significantly outperformed Deep CCA, the only CM that incorporates deep learning; Deep CCA was even inferior to simple CCA in score.
Deep learning may thus be incompatible with fMRI data, for which only a small sample size is available. Furthermore, the comparison with BCCA, another generative model, confirms the superiority of our generative model for extracting shared and individual features. Figure 5 shows the qualitative evaluation of the PM against MVBGM-MS, SLR, CCA, BCCA, and Deep CCA, which are based on other studies [8,46,[48][49][50][51]. For the image categories "shirt" and "saddle", the PM achieved the best confidence category scores for most subjects. These results demonstrate the effectiveness of the PGM as a feature extractor and of the idea of using shared and individual features. Figure 6 shows the qualitative evaluation of the PM, MSPGM, and SSPGM, which are all based on our PGM. For the image category "hand calculator", the PM achieved the best confidence category scores for most subjects. However, for the image category "obelisk", MSPGM achieved the best confidence category scores for most subjects. These results indicate that although the PGM can extract valid shared features, it may fail to extract useful individual features in some cases; owing to its characteristics, the PGM is particularly suited to extracting shared features from multi-subject fMRI data. For the image categories "spectacles" and "camera tripod", no method achieved confidence category scores comparable to the quantitative evaluation in Table 1 for most subjects. These images contained multiple objects, and a subject's gaze may not have been focused on a single object during fMRI data acquisition. In addition, for the category "camera tripod", part of a human face also appears in the image, which may have affected the subjects' cognition. Category estimation may thus remain difficult for images containing multiple objects or objects unrelated to the image categories.

Conclusions and Future Work
In this article, we proposed a method for estimating visual information from multi-subject fMRI data obtained while subjects observed images. The PM estimated visual features using shared features from multi-subject fMRI data and individual features from single-subject fMRI data. PGMs were constructed for each feature type and used as effective feature extractors from fMRI data. In addition, we constructed a visual decoder that estimates visual features from the shared and individual features. The experimental results verified the effectiveness of the proposed approach. Although fMRI data tend to contain measurement noise and large individual differences compared with other biological signals, such as eye gaze, this experiment confirmed the effectiveness of combining multi-subject fMRI data. These findings validate the use of machine learning for analyzing biological activity under time and physical constraints.
Apart from increasing the sample size by expanding the fMRI dataset, using modalities other than fMRI data may also improve accuracy. In particular, introducing other information that represents an image, such as image captions, is expected to improve the results. For example, visual features could be estimated directly from fMRI data together with caption features extracted by an image captioning model, or a latent space could be constructed that combines fMRI data with visual and caption features to improve expressive ability. The human brain contains regions specialized for object recognition related to image categories as well as regions related to lower-order information, such as object color and shape [41]. Although visual features extracted by a CNN contain information specific to image category classification, they may not contain sufficient information about image color and shape. Therefore, introducing image captions, which can represent the colors and shapes in images, is considered an effective way to extract image-related information from fMRI data obtained while subjects see the images.
Data Availability Statement: A publicly available dataset was analyzed in this study. The data can be found here: https://github.com/KamitaniLab/GenericObjectDecoding (accessed on 10 August 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Table A1 presents a list of the variables used in Section 2.