Article
Peer-Review Record

Reliability and Generalizability of Similarity-Based Fusion of MEG and fMRI Data in Human Ventral and Dorsal Visual Streams

by Yalda Mohsenzadeh 1,*,†, Caitlin Mullin 1,†, Benjamin Lahner 1,2, Radoslaw Martin Cichy 3 and Aude Oliva 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 22 October 2018 / Revised: 7 January 2019 / Accepted: 13 January 2019 / Published: 10 February 2019
(This article belongs to the Special Issue Visual Perception and Its Neural Mechanisms)

Round 1

Reviewer 1 Report

Mohsenzadeh and colleagues present results from two experiments that both combined MEG and fMRI measures of neural responses to sets of visual stimuli. Using a representational similarity analysis, the authors characterized the pattern of responses to the image sets with either high temporal (MEG) or spatial (fMRI) precision. By correlating representational similarity matrices (RSMs) across time and space, they were able to characterize when in time patterns of representational similarity from MEG matched those from particular anatomical regions of interest (ROIs) as measured during fMRI.

The analyses build upon earlier work from their group using this method, in order to demonstrate reproducibility (comparing response profiles to paired image sets in experiment 1) and generalizability (comparing response patterns from experiment 1 with those from a previously published data set, obtained using different images in a separate group of subjects). The methods and results seem both interesting and robust, but there are some issues with the manuscript that require attention.
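As an illustration of the fusion computation summarized above, the following minimal Python sketch shows the core step of correlating a time-resolved MEG RDM with the RDM of one fMRI region. This is our own illustrative reconstruction, not the authors' code; all array names and shapes are assumptions.

```python
# Illustrative sketch of similarity-based MEG-fMRI fusion (not the authors' code;
# array names and shapes are hypothetical).
# meg_rdms: (n_timepoints, n_conditions, n_conditions) time-resolved MEG RDMs
# fmri_rdm: (n_conditions, n_conditions) RDM for one fMRI region of interest
import numpy as np
from scipy.stats import spearmanr

def fusion_time_course(meg_rdms, fmri_rdm):
    """Correlate the lower triangle of an fMRI ROI RDM with each MEG RDM over time."""
    tril = np.tril_indices(fmri_rdm.shape[0], k=-1)   # off-diagonal entries only
    fmri_vec = fmri_rdm[tril]
    return np.array([spearmanr(meg_rdms[t][tril], fmri_vec)[0]
                     for t in range(meg_rdms.shape[0])])

# Peaks in the resulting time course indicate WHEN the MEG representational
# geometry resembles the representational geometry of that fMRI region.
```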


Major points:

·       There does not seem to be an assessment of the statistical significance of the test-retest reliability between responses to twin sets 1 & 2 (Figures 2-4). The time course of the similarity effect is indeed closely aligned between the two, but it would seem appropriate to examine this statistically. Perhaps some kind of sliding window correlation to examine when the two similarity time courses do or do not match? This also applies to the comparisons between twin set & ImageNet set (Figures 5-7).

·       Further, are the twin-1 and twin-2 response patterns more similar than the twin vs. ImageNet sets, as might be expected? If so, what does this tell us about the generalizability of this method?

·       It's not clear enough why the authors chose to perform both "voxel-wise" (if I understand correctly, RDM was calculated per voxel via searchlight method, then averaged within an ROI) and "ROI-based" analyses (ROI time courses were averaged, then the RDM was calculated). How did they expect the results to differ between these two analyses, and what was learned by comparing them?

·       Early on, the authors could state more explicitly which aspects of the current study are novel, and how they build upon the previous work of Cichy et al (2016). After reading through the manuscript, it is clear that there are many novel contributions in the current study. But given that this is a direct extension of an earlier paper, highlighting the novelty (perhaps near the end of the introduction) would be helpful for the reader.


Minor points:

·       A bit more information is needed to understand where the twin sets of images came from. I inferred that they were from the Natural Image Statistical Toolbox (reference 36), and that twin pairs were chosen manually, but this does not seem explicit in the text. Also, a bit more should be said about which low level statistics were matched, and whether this was done on a pair-by-pair basis, or for each set as a whole.

·       It seems worth mentioning in section 2.4 that MEG data were averaged across subjects, to reduce noise, as noted in the appendix.

·       In Appendix B it mentions “To overcome computational complexity and reduce noise, trials were randomly sub-averaged in groups of 3 in Experiment 1.”  How does this sub-averaging procedure interact with the twin image analysis? Was the same averaging performed for each twin set (e.g., averaging responses for horse, flower, and sandwich in both twin sets 1 & 2)?

·       In the Results (line 114) the authors state “these results reveal that the neural responses start in occipital pole around 70-90 msec after stimulus onset, followed by neural responses in anterior direction along the ventral stream (Figure 2ac), and across the dorsal stream up to the inferior parietal cortex.” However, given their data, it would seem more accurate to describe this in terms of the spatiotemporal profile of representational similarity, rather than neural responses per se.

·       The peak latency plots (Figures 4 & 7) are somewhat difficult to interpret, because of the y-axis scaling. I would suggest scaling between ~75 - 150 ms.


Author Response


We thank Reviewer 1 for this helpful comment; Reviewer 2 raised a similar point. We thus assessed the reproducibility of the full brain 4D spatiotemporal maps (time x voxel x voxel x voxel) between Twin sets 1 and 2. For each participant, we compared the voxel-wise fMRI-MEG correlation time series of Twin set 1 with Twin set 2 by computing their Pearson correlation. This resulted in a 3D reliability (correlation) map per individual. We used permutation tests to test for significance and corrected for multiple comparisons using the cluster correction method (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). The significant clusters of the reliability map are depicted in Figure 5 of the revised manuscript (Figure 5 of this response letter, see below). We find relatively high reliability across both the ventral and the dorsal stream. The new results have been added to the Results section of the revised manuscript (Lines 145-151):

“To assess reproducibility of the full brain 4D spatiotemporal maps (time x voxel x voxel x voxel) between Twin set 1 and 2, for each participant separately, we compared the voxel-wise fMRI-MEG correlation time series of Twin set 1 with Twin set 2. This resulted in one 3D reliability (correlation) map per participant. We assessed significance by permutation tests and corrected for multiple comparisons using the cluster correction method (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of P<0.01). This revealed significant clusters in the reliability map across both the ventral and the dorsal stream (Figure 5).”

The detail of this new analysis is added to the Appendix of the revised manuscript (Lines 473-479):

“Appendix E
Reliability Maps
To assess the reproducibility of the full brain fMRI-MEG fusion method, we computed participant-specific reliability maps. For this, at each voxel we compared the MEG-fMRI correlation time series of Twin set 1 with the corresponding time series of Twin set 2 by computing Pearson’s correlations. This yielded a 3D reliability (correlation) map per individual. We determined significance with permutation tests and corrected for multiple comparisons using cluster correction.”
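To make the reliability computation concrete, a minimal Python sketch of the per-voxel Pearson correlation of two fusion time-course maps is given below. It is an illustration under assumed array names and shapes, not the analysis code used in the paper.

```python
# Minimal sketch of the voxel-wise reliability map (illustration only; array
# names, shapes and layout are assumptions, not the authors' pipeline).
# fusion_twin1, fusion_twin2: (n_x, n_y, n_z, n_timepoints) arrays holding one
# participant's fMRI-MEG correlation time series at every voxel, one per Twin set.
import numpy as np

def reliability_map(fusion_twin1, fusion_twin2):
    """Pearson-correlate the two fusion time courses at every voxel."""
    a = fusion_twin1 - fusion_twin1.mean(axis=-1, keepdims=True)
    b = fusion_twin2 - fusion_twin2.mean(axis=-1, keepdims=True)
    num = (a * b).sum(axis=-1)
    den = np.sqrt((a ** 2).sum(axis=-1) * (b ** 2).sum(axis=-1))
    return num / den   # one 3D map of correlations per participant

# Significance of the per-participant maps is then assessed with permutation
# tests and cluster correction, as described in the Appendix of the manuscript.
```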


While the above analysis tests for similarity of time courses in the 3D voxel-wise map, we further investigated whether there are any significant dissimilarities. However, statistical tests investigating the difference between correlation time courses (of the spatially restricted searchlight voxel-wise fusion analysis) in Figures 2b, 3b, 5b, and 7b revealed no significant differences (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). This is clarified in the revised manuscript, Lines 132-137:
“While the analyses reported in Figures 2b and 3b test for similarity of time courses, we further investigated whether there are any significant dissimilarities. This explicit test, however, revealed no significant differences (permutation tests, n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). This demonstrates the reliability of the fusion method in reproducing similar patterns of neural activity across similar visual experiences.”

and Lines 163-166:

“Qualitatively, Figures 6b and 7b show similar MEG-fMRI time courses for the two datasets (Twins-Set and ImageNet-Set) in EVC, ventral regions (VO and PHC) and dorsal regions (IPS0 and IPS1). Further explicit testing revealed no significant difference between the two sets (permutation tests, n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01).”

See Figure 5.


· Further, are the twin-1 and twin-2 response patterns more similar than the twin vs. ImageNet sets, as might be expected? If so, what does this tell us about the generalizability of this method?

The reviewer is correct in noting that the correlation time series for Twin sets 1 and 2 are more similar to each other than the Twin sets are to the ImageNet set. Twin sets 1 and 2 were equalized for low- and high-level visual features, while the Twin sets and the ImageNet set were not. This allowed us to assess which aspects of the representations are generalizable across distinct visual experiences. We found that despite the differences in stimulus material, the fusion method captured a consistent pattern of the spatiotemporal dynamics of vision. This is clarified in the Discussion of the revised manuscript:

Lines 313-319:
“Additionally, qualitative comparison of correlation time series in Figures 2b and 3b with 6b and 7b reveals that the response patterns between Twin set 1 and 2 are qualitatively more similar than between the Twins sets and the ImageNet set. This is most likely due to the fact that Twin set 1 and 2 had similar low- and high-level features by design, and both were different from the features of the ImageNet set. Nevertheless, the similarity in spatiotemporal dynamics in the two experiments suggests that the neural dynamics in processing natural images are similar.”


· It's not clear enough why the authors chose to perform both "voxel-wise" (if I understand correctly, RDM was calculated per voxel via searchlight method, then averaged within an ROI) and "ROI-based" analyses (ROI time courses were averaged, then the RDM was calculated). How did they expect the results to differ between these two analyses, and what was learned by comparing them?

The idea behind performing two analyses is to highlight that the results are consistent and replicable across different possible analysis pipelines that each have their respective pros and cons. We used the ROI-based fMRI-MEG fusion here because it had been established in the original publication (Cichy et al. 2014 and 2016), and using it here allowed us to compare results directly with the original manuscript. This is clarified in the revised manuscript in Lines 460-461:
“ROI-based fMRI-MEG Fusion: This analysis is the previously established way of performing ROI based fusion and is directly comparable with the original fusion studies [8,9].”


Note that the spatially restricted searchlight voxel-wise fMRI-MEG fusion method is novel and better matches the whole brain searchlight analysis, allowing us to focus on results in certain regions of the full brain analysis. Accordingly, we used this method in Figures 2, 3, 6, and 7. The motivation is explained in Lines 455-457.
“This ROI analysis method picks out the spatiotemporal dynamics in particular cortical regions as resolved with the searchlight fMRI-MEG fusion method.”


Early on, the authors could state more explicitly which aspects of the current study are novel, and how they build upon the previous work of Cichy et al (2016). After reading through the manuscript, it is clear that there are many novel contributions in the current study. But given that this is a direct extension of an earlier paper, highlighting the novelty (perhaps near the end of the introduction) would be helpful for the reader.


We have added a paragraph to the end of the Introduction (Lines 53-57) describing the results and their respective novelty.

“Results revealed that the established fusion method is reliable and generalizable within and across image sets and participant groups. In addition, novel approaches to region-of-interest based analyses validate the replicability of the spatio-temporal dynamics for highly similar visual content, demonstrating the robustness of the technique for tracing representational similarity in the brain over space and time.”


Minor points:

· A bit more information is needed to understand where the twin sets of images came from. I inferred that they were from the Natural Image Statistical Toolbox (reference 36), and that twin pairs were chosen manually, but this does not seem explicit in the text. Also, a bit more should be said about which low-level statistics were matched, and whether this was done on a pair-by-pair basis, or for each set as a whole.


The stimulus pairs were selected manually from the LaMem dataset (Khosla et al. 2015). Low-level statistics were matched using the Natural Image Statistical Toolbox (Bainbridge and Oliva, 2015), while high-level semantics (a verbal semantic description of the main object shown) were matched based on consensus of the authors. This is clarified in the Methods of the revised manuscript (Lines 70-75):

“In Experiment 1, the stimulus set consisted of twin sets of 78 real-world natural images each (156 images total) from the LaMem dataset [36]. Twin-set 1 and Twin-set 2 each contained an image identifiable by the same verbal semantic description of the main object shown (based on consensus among the authors; see Figure 1a for examples). The sets were not significantly different on a collection of low-level image statistics [37-39] (see Appendix A for more information about the stimulus set).”

The information about the stimulus set and the statistical tests are added to the revised manuscript in Appendix A and Tables S1 and S2 (Lines 352-364).


“Appendix A
Twin sets features control: As described in the Methods section, the twin sets consist of 78 pairs of natural images. The images were controlled for high-level verbal semantics by matching the verbal semantic description of the main object shown (determined by consensus among the authors). Further, the image sets were matched for low-level image features (color, luminance, brightness, and spatial frequency) using the Natural Image Statistical Toolbox [37-39]. In detail, the toolbox statistically compares several stimulus features such as spatial frequency at different energy levels (spectral energy, Table S1), color (Table S2), brightness and contrast. There were no significant differences between Twin sets 1 and 2 in any feature.”


See Table S1 and S2.


It seems worth mentioning in section 2.4 that MEG data were averaged across subjects, to reduce noise, as noted in the appendix.


We clarified the sub-averaging process in section 2.4 of the revised manuscript in Lines 88-89.

“To overcome computational complexity and reduce noise, trials per image condition were randomly sub-averaged in groups of 3 in Experiment 1, and groups of 5 in Experiment 2.”


In Appendix B it mentions “To overcome computational complexity and reduce noise, trials were randomly sub-averaged in groups of 3 in Experiment 1.” How does this sub-averaging procedure interact with the twin image analysis? Was the same averaging performed for each twin set (e.g., averaging responses for horse, flower, and sandwich in both twin sets 1 & 2)?


We see that the method description in the previous version of the manuscript was ambiguous. We never average across experimental conditions. Instead, the process of random assignment to subgroups and averaging was performed at the level of trials for each condition (i.e., image) separately. In other words, the trials for the picture of a horse were randomly assigned into groups of 3 and sub-averaged before entering the classifier. This is clarified in Appendix B of the revised manuscript (Lines 401-403).


“To overcome computational complexity and reduce noise, we subaveraged trials for each experimental condition (image separately) with random assignment of trials to bins ...”
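To make the trial-binning concrete, here is a small illustrative sketch of per-condition sub-averaging into bins of 3. It is our own example; the function and variable names are hypothetical and not taken from the authors' code.

```python
# Illustrative sketch of per-condition trial sub-averaging (bin size 3 in
# Experiment 1); function and variable names are hypothetical.
import numpy as np

def subaverage_trials(trials, bin_size=3, rng=None):
    """Randomly assign one condition's trials to bins and average each bin.

    trials: (n_trials, n_sensors, n_timepoints) MEG trials for ONE image.
    Returns (n_bins, n_sensors, n_timepoints) pseudo-trials for the classifier.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(trials.shape[0])
    n_bins = trials.shape[0] // bin_size            # leftover trials are dropped
    bins = order[:n_bins * bin_size].reshape(n_bins, bin_size)
    return trials[bins].mean(axis=1)

# Applied to each image condition separately, so trials are never averaged
# across different images (e.g., the 'horse' trials of one set stay together).
```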


· In the Results (line 114) the authors state “these results reveal that the neural responses start in occipital pole around 70-90 msec after stimulus onset, followed by neural responses in anterior direction along the ventral stream (Figure 2ac), and across the dorsal stream up to the inferior parietal cortex.” However, given their data, it would seem more accurate to describe this in terms of the spatiotemporal profile of representational similarity, rather than neural responses per se.

Agreeing with the reviewer’s comment, we now describe the results in terms of neural representations in the revised manuscript.

· The peak latency plots (Figures 4 & 7) are somewhat difficult to interpret, because of the y-axis scaling. I would suggest scaling between ~75 - 150 ms.

Following the reviewer’s suggestion, we rescaled the peak latency plots in the revised version of the manuscript.


Reviewer 2 Report

The authors of this manuscript set out to test the reproducibility of results from fusing fMRI and MEG data using two approaches. (1) In experiment 1 they included two separate but closely matched sets of images to allow for the comparison of results within subjects but across stimuli. (2) Experiment 2 was performed with a different set of images and a different set of subjects, allowing for the comparison across subject groups and across less well-matched stimulus sets.

 

1. The mission of testing the reproducibility and consistency of a newly developed method is laudable and important. The authors have chosen a sensible experimental setup for this mission. Unfortunately, in the end, the comparison of results across stimulus sets or experiments is left to the reader by eye-balling similar-looking graphs and brain maps. I strongly recommend that the authors add a quantitative comparison of the 4D maps resulting from the three setups (Exp. 1, set 1; Exp. 1, set 2; Exp. 2). This could be done, for instance, by correlating the 4D maps, similar to how voxel activation maps are correlated. Since it is not a priori clear what correlation coefficients to expect in such an analysis, bootstrapping could be used to establish a null distribution. In my view, the paper is incomplete without such a quantitative assessment of reproducibility.

 

I have a few additional, less serious questions about the methods used here.

 

2. Why is Spearman rank correlation used here? I know that the authors reproduce what has been used for RSA before. But perhaps a review paper such as this would be a good place to explain this choice. In my view, rank correlation introduces non-linearities into the analysis, which can lead to small differences in RDM entries having a large effect on R, or relatively large differences in RDM entries having no effect at all. Let’s take the following hypothetical example for RDM entries for “shoe” (correct response) with “hair brush”, “hammer”, and “icepick” (confusions):

Pattern A: 0.4, 0.11, 0.09

Pattern B: 0.3, 0.29, 0.01

Pattern C: 0.4, 0.09, 0.11

 

By just inspecting these patterns, I would argue that A has higher similarity with C than with B. This is in fact what Pearson correlation would yield: corr(A,C) = 0.99; corr(A,B) = 0.57. On the other hand, Spearman correlation yields: corr(A,C) = 0.5; corr(A,B) = 1. This illustrates the point that a small change in the pattern from A to C can have a large effect on Spearman correlation, and a large change from A to B can have no effect at all.
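For the record, these numbers can be reproduced with a short SciPy snippet (our own check, not part of the manuscript):

```python
# Reproduces the reviewer's hypothetical example (rounded values match the text).
import numpy as np
from scipy.stats import pearsonr, spearmanr

A = np.array([0.40, 0.11, 0.09])
B = np.array([0.30, 0.29, 0.01])
C = np.array([0.40, 0.09, 0.11])

print(pearsonr(A, C)[0], pearsonr(A, B)[0])    # ~0.99 and ~0.57
print(spearmanr(A, C)[0], spearmanr(A, B)[0])  # 0.5 and 1.0
```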

 

So why is Spearman R used for the fusion of the RDMs? I understand that this question wasn’t the main point of the paper, but why not include the justification in this review paper?

 

 

3. What gets averaged and why?

In Appendix C (searchlight analysis), the authors state MEG RDMs are averaged over subjects and then correlated to each individual’s fMRI RDM. Why? Why not correlate each subject’s MEG RDM to their own fMRI RDM? In appendix D (ROI-based analysis) the authors state that “FMRI RDMs were averaged over subjects and then compared with subject-specific MEG RDMs.” Why is it the other way around for the ROI analysis than for the searchlight analysis?

 

In appendix D, the authors state that RDM patterns are averaged over voxels within ROIs “to create a single RDM matrix per ROI.” In the results section (lines 120-121) they write that they “averaged the correlation values over the voxels within each ROI resulting in one correlation time series per individual.” So what was being averaged over ROIs, RDMs or correlations? Since correlation is non-linear, these two operations do not commute, and the order is important for the results.

 

4. Appendix E:

Please provide more details of the cluster analysis. Given the importance of this correction for avoiding false alarms, and given recent concerns over this type of correction in some fMRI analysis packages, it would be comforting to know that the analysis here is not prone to those problems. I don’t believe that it is, but it would be good to have this laid out more clearly.

 

5. Several minor points:

 

Line 114: should be “at the occipital pole”

 

Line 115: “in the anterior direction” (article missing)

 

Line 116: “along the dorsal stream” (instead of “across”)

 

Line 142: ditto

 

Lines 298-301:

I don’t like this last sentence. The relationship of object recognition to attention and memory has not come up earlier in the paper, nor is it the focus of this paper. Why not have a concluding sentence that actually highlights the reproducibility and therefore trustworthiness of the fusion method?

 

Line 322: “SOA”

 

Line 360: should probably be: “3 x 3 x 3 mm” (3 mm^3 would be a lot smaller!)

 

Line 364: ditto

 


Author Response


The authors of this manuscript set out to test the reproducibility of results from fusing fMRI and MEG data using two approaches. (1) In experiment 1 they included two separate but closely matched sets of images to allow for the comparison of results within subjects but across stimuli. (2) Experiment 2 was performed with a different set of images and a different set of subjects, allowing for the comparison across subject groups and across less well-matched stimulus sets.

1. The mission of testing the reproducibility and consistency of a newly developed method is laudable and important. The authors have chosen a sensible experimental setup for this mission. Unfortunately, in the end, the comparison of results across stimulus sets or experiments is left to the reader by eyeballing similar-looking graphs and brain maps. I strongly recommend that the authors add a quantitative comparison of the 4D maps resulting from the three setups (Exp. 1, set 1; Exp. 1, set 2; Exp. 2). This could be done, for instance, by correlating the 4D maps, similar to how voxel activation maps are correlated. Since it is not a priori clear what correlation coefficients to expect in such an analysis, bootstrapping could be used to establish a null distribution. In my view, the paper is incomplete without such a quantitative assessment of reproducibility.


A quantitative assessment of reproducibility is the aim of this paper, so we are thankful to the reviewer for suggesting a better way to analyze the data.


To assess the reproducibility of the full brain 4D spatiotemporal maps (time x voxel x voxel x voxel) between Twin sets 1 and 2, we compared, for each participant, the voxel-wise fMRI-MEG correlation time series of Twin set 1 with Twin set 2 by computing their Pearson correlation. This resulted in a 3D reliability (correlation) map per individual. We used permutation tests to test for significance and corrected for multiple comparisons using the cluster correction method (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). The significant clusters of the reliability map are depicted in Figure 5 of the revised manuscript (Figure 5 of this response letter, see below). We find relatively high reliability across both the ventral and the dorsal stream. The new results have been added to the Results section of the revised manuscript (Lines 145-151):


“To assess reproducibility of the full brain 4D spatiotemporal maps (time x voxel x voxel x voxel) between Twin set 1 and 2, for each participant separately, we compared the voxel-wise fMRI-MEG correlation time series of Twin set 1 with Twin set 2. This resulted in one 3D reliability (correlation) map per participant. We assessed significance by permutation tests and corrected for multiple comparisons using the cluster correction method (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). This revealed significant clusters in the reliability map across both the ventral and the dorsal stream (Figure 5).”


The detail of this new analysis is added to the Appendix of the revised manuscript (Lines 473-479):

“Appendix E
Reliability Maps
To assess the reproducibility of the full brain fMRI-MEG fusion method, we computed participant-specific reliability maps. For this, at each voxel we compared the MEG-fMRI correlation time series of Twin set 1 with the corresponding time series of Twin set 2 by computing Pearson’s correlations. This yielded a 3D reliability (correlation) map per individual. We determined significance with permutation tests and corrected for multiple comparisons using cluster correction.”


While the above analysis tests for similarity of time courses in the 3D voxel-wise map, we further investigated whether there are any significant dissimilarities. However, statistical tests investigating the difference between correlation time courses (of the spatially restricted searchlight voxel-wise fusion analysis) in Figures 2b, 3b, 5b, and 7b revealed no significant differences (n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). This is clarified in the revised manuscript, Lines 132-137:


“While the analyses reported in Figures 2b and 3b test for similarity of time courses, we further investigated whether there are any significant dissimilarities. This explicit test, however, revealed no significant differences (permutation tests, n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01). This demonstrates the reliability of the fusion method in reproducing similar patterns of neural activity across similar visual experiences.”

and Lines 163-166:


“Qualitatively, Figures 6b and 7b show similar MEG-fMRI time courses for the two datasets (Twins-Set and ImageNet-Set) in EVC, ventral regions (VO and PHC) and dorsal regions (IPS0 and IPS1). Further explicit testing revealed no significant difference between the two sets (permutation tests, n=15; cluster-definition threshold of p<0.01, and cluster size threshold of p<0.01).”


See Figure 5.


I have a few additional, less serious questions about the methods used here.

2. Why is Spearman rank correlation used here? I know that the authors reproduce what has been used for RSA before. But perhaps a review paper such as this would be a good place to explain this choice. In my view, rank correlation introduces non-linearities into the analysis, which can lead to small differences in RDM entries having a large effect on R, or relatively large differences in RDM entries having no effect at all. Let’s take the following hypothetical example for RDM entries for “shoe” (correct response) with “hair brush”, “hammer”, and “icepick” (confusions):
Pattern A: 0.4, 0.11, 0.09
Pattern B: 0.3, 0.29, 0.01
Pattern C: 0.4, 0.09, 0.11

By just inspecting these patterns, I would argue that A has higher similarity with C than with B. This is in fact what Pearson correlation would yield: corr(A,C) = 0.99; corr(A,B) = 0.57. On the other hand, Spearman correlation yields: corr(A,C) = 0.5; corr(A,B) = 1. This illustrates the point that a small change in the pattern from A to C can have a large effect on Spearman correlation, and a large change from A to B can have no effect at all.
So why is Spearman R used for the fusion of the RDMs? I understand that this question wasn’t the main point of the paper, but why not include the justification in this review paper?


The main reason to use Spearman R for the fusion of the RDMs is that the conditions for using Spearman, but not Pearson R, are met. While Pearson R assumes linearity, Spearman R assumes only monotonicity. The relationship between the correlated measures in the MEG-fMRI fusion as used here is not linear, but merely monotonic. This can be demonstrated by attending to the bounded nature of the values: MEG RDMs here are decoding accuracies (bounded between 0 and 100%), and fMRI RDMs are 1 - Pearson R (bounded between 0 and 2).

Independent of this, we in principle fully agree that Pearson correlation would be preferable, and that using Spearman correlation in cases such as the one described by the reviewer can lead to undesirable results that badly capture the true state. However, the probability of the situation described by the reviewer decreases with the number of conditions compared. The RDMs used here comprise thousands of values, such that the chances of Spearman vs. Pearson correlation yielding different results are very small. Anecdotally, when calculating MEG-fMRI fusion with Pearson correlation we observed in every case qualitatively indistinguishable results compared to Spearman-based correlation. We clarified the rationale behind the choice of Spearman’s rank correlation in Lines 444-446 of the revised manuscript.


“We used Spearman rather than Pearson correlation to compare MEG-fMRI responses because we assume that the values compared are monotonously, rather than linearly related to each other.”


3. What gets averaged and why?
In Appendix C (searchlight analysis), the authors state MEG RDMs are averaged over subjects and then correlated to each individual’s fMRI RDM. Why? Why not correlate each subject’s MEG RDM to their own fMRI RDM? In appendix D (ROI-based analysis) the authors state that “FMRI RDMs were averaged over subjects and then compared with subject-specific MEG RDMs.” Why is it the other way around for the ROI analysis than for the searchlight analysis?


Thank you for the insightful comment. Indeed, if both MEG and fMRI data have been recorded in the same set of subjects, there are three possible (and valid) forms of comparison: 1) subject-specific MEG to fMRI comparison, 2) subject-specific MEG to subject-averaged fMRI comparison, and 3) subject-specific fMRI to subject-averaged MEG comparison. Here, for the purpose of replication, we followed the pipeline used in previous work (Cichy et al. 2014, 2016). Further, please note that having the same set of subjects in both fMRI and MEG is not a requirement for successful MEG-fMRI fusion (e.g., Mohsenzadeh et al. 2018; Hebart et al. 2018).


Furthermore, in the current manuscript, to show the robustness and replicability of the method independent of specific analysis choices, we performed two ROI-based analyses: (1) in the spatially restricted searchlight voxel-wise analysis, the MEG data are averaged, which matches the whole brain searchlight analysis (in which MEG data are also averaged); (2) in the ROI-based fMRI-MEG fusion, the fMRI data are instead averaged (also see the response to the point below). Importantly, all these different analyses yield highly consistent spatiotemporal dynamics, which speaks to the robustness and replicability of the method independent of particular analysis choices.


In appendix D, the authors state that RDM patterns are averaged over voxels within ROIs “to create a single RDM matrix per ROI.” In the results section (lines 120-121) they write that they “averaged the correlation values over the voxels within each ROI resulting in one correlation time series per individual.” So, what was being averaged over ROIs, RDMs or correlations? Since correlation is non-linear, these two operations do not commute, and the order is important for the results.


We see that our description of the method was ambiguous and apologize for the confusion. We do not average the RDMs within an ROI. In the two methods used, we either compute one RDM per ROI from the pairwise comparisons of fMRI responses within that ROI (second method below), or we run a searchlight within an ROI and create voxel-specific RDMs within the ROI (first method below).

In the current version of the manuscript, in appendix D, we describe the two ROI analysis approaches with more clarity:

1-Spatially Restricted Searchlight Voxel-wise fMRI-MEG Fusion: This analysis builds on the voxel-wise full brain fMRI-MEG fusion, but is performed in spatially restricted regions (ROIs) of the brain. In detail, for every participant separately and for every voxel of the ROI, we created a searchlight centered at the voxel and calculated all pairwise (1 - Pearson’s R) comparisons between condition-specific fMRI pattern responses within the searchlight sphere. This yielded one RDM per voxel. We then compared (Spearman’s R) the subject- and voxel-specific fMRI RDMs with time-resolved, subject-averaged MEG RDMs. This resulted in a correlation time course per voxel within the ROI for each fMRI subject. We then averaged results over voxels in an ROI, resulting in one correlation time course per ROI and per fMRI subject. This ROI analysis method is directly comparable with the full brain fusion analysis and is a convenient way to quantitatively compare the spatiotemporal dynamics of the fusion movies within specific brain regions.


2-ROI-based fMRI-MEG Fusion: This analysis is the established way of performing ROI-based fusion [8,9]. For each fMRI subject independently, and for each ROI, we extracted the voxel activation patterns within the ROI and then computed the condition-specific pairwise dissimilarities (1 - Pearson’s R) between fMRI activation patterns. This yielded one RDM per subject and ROI. The ROI-specific fMRI RDMs were averaged over subjects and then compared with the subject-specific time-resolved MEG RDMs by computing their Spearman’s R correlation at every time point. This process results in one time course of MEG-fMRI fusion per ROI and per subject.
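To make the difference between the two pipelines concrete, here is a schematic Python sketch of both. It is our own illustration under stated assumptions: the helper searchlight_rdm() and all array names and shapes are hypothetical, not the authors' code.

```python
# Schematic contrast of the two ROI analyses (illustration only; searchlight_rdm()
# and all array names/shapes are assumptions rather than the authors' code).
import numpy as np
from scipy.stats import spearmanr

def fuse(fmri_rdm, meg_rdms):
    """Spearman correlation between one fMRI RDM and each time-resolved MEG RDM."""
    tril = np.tril_indices(fmri_rdm.shape[0], k=-1)
    return np.array([spearmanr(fmri_rdm[tril], rdm[tril])[0] for rdm in meg_rdms])

# 1) Spatially restricted searchlight voxel-wise fusion: per-voxel searchlight RDMs
#    are fused with subject-AVERAGED MEG RDMs; correlations are then averaged over voxels.
def searchlight_roi_fusion(roi_voxels, searchlight_rdm, meg_rdms_avg):
    per_voxel = [fuse(searchlight_rdm(v), meg_rdms_avg) for v in roi_voxels]
    return np.mean(per_voxel, axis=0)        # one time course per ROI and fMRI subject

# 2) ROI-based fusion: one RDM per ROI from its voxel patterns (1 - Pearson's R);
#    fMRI RDMs are AVERAGED over subjects, then fused with subject-specific MEG RDMs.
def roi_based_fusion(roi_patterns_per_subject, meg_rdms_subject):
    rdms = [1.0 - np.corrcoef(p) for p in roi_patterns_per_subject]  # (n_cond, n_cond) each
    fmri_rdm_group_avg = np.mean(rdms, axis=0)
    return fuse(fmri_rdm_group_avg, meg_rdms_subject)                # per ROI and subject
```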


The first method is novel and matches the full brain analysis well. Thus, we use this method in Figures 2, 3, 6, and 7, where we show the whole brain spatiotemporal maps together with the ROI time series. In lines 120-121 (of the previous version of the manuscript), we describe the first approach, the spatially restricted searchlight voxel-wise fMRI-MEG fusion. We clarified this in the revised manuscript (Lines 129-130).


“We averaged the correlation values over the voxels within each ROI, resulting in one correlation time series per individual (see Appendix D, spatially restricted searchlight voxel-wise fMRI-MEG fusion method).”


4. Appendix E:
Please provide more details of the cluster analysis. Given the importance of this correction for avoiding false alarms, and given recent concerns over this type of correction in some fMRI analysis packages, it would be comforting to know that the analysis here is not prone to those problems. I don’t believe that it is, but it would be good to have this laid out more clearly.


We used non-parametric permutation-based cluster-size inference, as also used in the original fusion studies (Cichy et al. 2014 and 2016). In detail, we first averaged the 4D subject-specific spatiotemporal maps over subjects, resulting in a mean 4D correlation map. We then constructed an empirical distribution of the null hypothesis using the voxel values of the mean correlation map in the baseline period, i.e. from -200 to 0 msec for Twin sets 1 and 2, and from -100 to 0 msec for the ImageNet set. The right-sided 99.9% threshold (i.e. p=0.001) was taken as the cluster-definition threshold. To estimate a permutation-based maximal cluster size threshold, the sign of the subject-specific data points was shuffled randomly (1000 iterations), the data were averaged over subjects, and 4D clusters were determined by spatial and temporal contiguity at the cluster-definition threshold. We then constructed an empirical distribution of the maximal cluster size over the permutation samples and determined the 99% threshold (i.e., the cluster size threshold at p=0.01). We report clusters as significant if they exceed this threshold. We clarified this procedure in the revised version of the manuscript in Appendix F (Lines 481-492).
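A simplified sketch of the sign-flip, maximal-cluster-size procedure described above is given below. It is our own illustration; names, shapes, and simplifications are assumptions (in particular, the cluster-definition threshold is passed in directly rather than estimated from the baseline period).

```python
# Simplified sketch of the sign-flip, maximal-cluster-size permutation test
# (illustration only; names, shapes and simplifications are ours).
import numpy as np
from scipy import ndimage

def max_cluster_size(mean_map, cluster_def_threshold):
    """Size of the largest supra-threshold cluster (spatio-temporal contiguity)."""
    labels, n_clusters = ndimage.label(mean_map > cluster_def_threshold)
    return 0 if n_clusters == 0 else np.bincount(labels.ravel())[1:].max()

def cluster_size_threshold(subject_maps, cluster_def_threshold,
                           n_perm=1000, alpha=0.01, rng=None):
    """Null distribution of maximal cluster sizes from random sign flips of
    subject-specific 4D maps, shape (n_subjects, x, y, z, time)."""
    rng = np.random.default_rng() if rng is None else rng
    n_sub = subject_maps.shape[0]
    null_sizes = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1, 1, 1, 1))
        permuted_mean = (signs * subject_maps).mean(axis=0)
        null_sizes[i] = max_cluster_size(permuted_mean, cluster_def_threshold)
    # Observed clusters larger than this size are reported as significant.
    return np.quantile(null_sizes, 1.0 - alpha)
```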


5. Several minor points:
Line 114: should be “at the occipital pole”
Line 115: “in the anterior direction” (article missing)
Line 116: “along the dorsal stream” (instead of “across”)
Line 142: ditto
Lines 298-301:
I don’t like this last sentence. The relationship of object recognition to attention and memory has not come up earlier in the paper, nor is it the focus of this paper. Why not have a concluding sentence that actually highlights the reproducibility and therefore trustworthiness of the fusion method?
Line 322: “SOA”
Line 360: should probably be: “3 x 3 x 3 mm” (3 mm^3 would be a lot smaller!)
Line 364: ditto

Thank you, we corrected the typos in the revised manuscript. We also revised the concluding sentence to (Lines 332-335):


“This confirmatory replication within and between image-sets and subject-groups demonstrates the replicability of the method in the study of visual processing. By establishing replicability our study increases the trustworthiness of the method, and we hope that this motivates further researchers to use the method in the study of human brain function.”

Round 2

Reviewer 1 Report

The authors have addressed all of the points raised in my original review. Below are some very minor issues I noted in the revised version:

1) Line 446 "monotonously" should be "monotonically."

2) Cluster size thresholds are reported as P < 0.01; I believe reporting at N voxels > # is more traditional.

Author Response

Response to Reviewer 1:

The authors have addressed all of the points raised in my original review. Below are some very minor issues I noted in the revised version:

We thank the reviewer for the constructive and helpful comments that greatly benefitted the manuscript. We are happy that the revised manuscript satisfied the reviewer’s concerns. Here, reviewer comments are marked in gray outline, and authors’ responses in bold face.


1) Line 446 "monotonously" should be "monotonically."

Thank you, we corrected the word in the revised manuscript.

2) Cluster size thresholds are reported as P < 0.01; I believe reporting at N voxels > # is more traditional.

Here, we used the cluster size threshold in a statistical test setting. The number of voxels is not fixed, but is inferred based on non-parametric permutation tests (Maris and Oostenveld, 2007). The method is described in detail in Appendix F of the manuscript. Therefore, we decided to report cluster size thresholds in terms of p-values (which are fixed) rather than N (which varies in this formulation).

 

Maris E, Oostenveld R. 2007. Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods 164:177–190.


Reviewer 2 Report

I am satisfied with the revision of the paper. Adding the quantitative analysis definitely made the paper stronger. 


Thank you also for your explanation of using Spearman's rank correlation. I agree that it makes little difference in practice. The reasons for not having linearity do not convince me. Both accuracy and correlation are bounded. But I'm not going to pursue this further.


Regarding the issue of averaging: I assume that you average correlations in their Fisher-z transform. Please specify in the manuscript.


Otherwise I'm happy with the revision.

Author Response

Response to Reviewer 2:

I am satisfied with the revision of the paper. Adding the quantitative analysis definitely made the paper stronger. 

Thank you also for your explanation of using Spearman's rank correlation. I agree that it makes little difference in practice. The reasons for not having linearity do not convince me. Both accuracy and correlation are bounded. But I'm not going to pursue this further.

We thank the reviewer for their constructive and helpful comments that greatly benefitted the manuscript. We are happy that the revised manuscript and explanations satisfied the reviewer’s concerns. Here, reviewer comments are marked in gray outline, and authors’ responses in bold face.

Regarding the issue of averaging: I assume that you average correlations in their Fisher-z transform. Please specify in the manuscript.

We do not apply the Fisher z-transform. The correlations involved are very small, which makes the nonlinearity introduced by the Fisher z-transform negligible.
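As a brief numerical illustration of this point (toy values chosen by us, not data from the study):

```python
# Toy illustration: for small correlations the Fisher z-transform is nearly the
# identity (arctanh(r) = r + r**3/3 + ...), so both averaging schemes agree closely.
import numpy as np

r = np.array([0.02, 0.05, -0.01, 0.03])   # hypothetical small correlations
print(r.mean())                           # 0.0225
print(np.tanh(np.arctanh(r).mean()))      # ~0.02251, differs only in the 5th decimal
```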

 

Otherwise I'm happy with the revision.

