Processing of an Audiobook in the Human Brain Is Shaped by Cultural Family Background

Perception of the same narrative can vary between individuals depending on a listener’s previous experiences. We studied whether and how cultural family background may shape the processing of an audiobook in the human brain. During functional magnetic resonance imaging (fMRI), 48 healthy volunteers from two different cultural family backgrounds listened to an audiobook depicting the intercultural social life of young adults with the respective cultural backgrounds. Shared cultural family background increased inter-subject correlation of hemodynamic activity in the left-hemispheric Heschl’s gyrus, insula, superior temporal gyrus, lingual gyrus and middle temporal gyrus, in the right-hemispheric lateral occipital and posterior cingulate cortices as well as in the bilateral middle temporal gyrus, middle occipital gyrus and precuneus. Thus, cultural family background is reflected in multiple areas of speech processing in the brain and may also modulate visual imagery. After neuroimaging, the participants listened to the narrative again and, after each passage, produced a list of words that had been on their minds when they heard the audiobook during neuroimaging. Cultural family background was reflected as semantic differences in these word lists as quantified by a word2vec-generated semantic model. Our findings may depict enhanced mutual understanding between persons who share similar cultural family backgrounds.


Introduction
People raised and living in one cultural milieu and speaking the same language largely share experiences, knowledge, beliefs, values, attitudes, and communication rules that together facilitate shared understanding and foster smooth cooperation. These similarities create "homophily"-love of the same-that has been studied in sociology for decades [1]. Neuroimaging evidence for homophily was provided only recently with findings showing that brain activity is more similar among friends-in contrast to acquaintances-when they are viewing audiovisual film clips [2]. Interest in the effects of cultural homophily Here, we utilized ISC and word2vec to investigate the influence of cultural family background on interpretation and processing of an audiobook in the human brain. Based on previous studies that have found culture-specific experiences to shape perception and cognition [3,4,24,34], we hypothesized that word2vec reveals cultural family-backgroundspecific enhancements of similarity in how different parts of the story are interpreted. Further, we expected that similar interpretations of the narrative result in higher similarity of brain hemodynamic activity in brain areas known to be involved in the processing of natural speech. These results will extend previous international/intercontinental comparative studies [3,4,24,34] by showing that even slight differences in cultural background within a modern multi-ethnic nation can modify perception. An audiobook as a stimulus allowed us to investigate neural dynamics in closer-to-real-life settings compared to previous studies in the field of cultural neuroscience that have typically used simple, controlled paradigms [3,4,24,34].

Participants
Forty-eight right-handed (Edinburgh Handedness Inventory, [35]) participants participated in the study. Among them, 24 participants had a Finnish family background and 24 participants had a Russian family background. The participants with the Finnish family background had Finnish parents and were born and raised in Finland. They were aged between 19 and 35 years (mean age: 24.7 years). For 12 of the Russian-background participants, both parents were Russian; of the remaining participants, 11 had a Russian mother and one had a Russian father. The participants with Russian family background were aged between 18 and 35 years (mean age: 24.7 years). Twenty of the Russian-background participants were born in Finland; one participant had moved to Finland at the age of eight, one at the age of one, and two at the age of three years. Ten of the participants with the Russian family background were exposed to both Finnish and Russian in the first years of their lives, thirteen started to acquire Finnish at the age of three years when they went to preschool, and one started to acquire Finnish at the age of eight years. The Finnish fluency of the Russian-background participants was evaluated with questions in Appendix A. Further, during the recruitment and experimental procedures, Finnish fluency of all participants was first-hand confirmed as all communications took place in Finnish. Finnish fluency would have been measured more accurately with a language test and more detailed questionnaires, but these were not used to keep the duration of the experiment bearable to the participants. All participants were Finnish residents (11 Russian background participants also had Russian citizenship) and had graduated from a Finnish primary school. Half of the participants in both groups were females. In both groups, one of the participants had a vocational degree and the others had a Matriculation, undergraduate or university degree. None of the participants self-reported any neurological or hearing deficits. Prior to participation, all participants gave their written voluntary informed consent. The study was approved by the Aalto University Research Ethics Committee and conducted in accordance with the Declaration of Helsinki. Note that the participants did not constitute a representative sample of Finnish or Russian family background residents of Finland, and the results can therefore not be generalized to the population level.
It should be noted that the goal of this study was not to investigate the Finnishand Russian-background populations in Finland per se. Rather, the current study is basic research that intends to answer the intriguing question of whether family cultural background can influence how continuous speech is perceived and processed in the brain. Thus, the participants were not recruited with population-sampling methods, the groups are not representative of Finnish-and Russian-background populations in Finland, and our results cannot be generalized to these populations. For example, the gender, age, and education of the participants are not representative of these populations. Instead, these factors were matched between the sample groups so that they did not confound the results of between-group comparisons.

Stimulus
During the fMRI measurements, the participants were presented with a 71 min fictional audiobook in Finnish (custom written by author IPJ; Supplementary Materials: the English-translated textual version). The audiobook was presented during ten measurements (durations: 4.8-8.4 min) between which the participant had a short break in the scanner. The narrative included episodes related to and occasionally contrasting Finnish and Russian cultures (e.g., religion, food, literature, history, and traditions). The protagonists of the narrative are two young adult friends living in Helsinki (Finland) who have a Finnish or a Russian background, respectively. In short, the Finnish man is dating a Russian-background woman, whereas the Russian man is in a relationship with a Finnish woman. The characters are described as having personality and behavioral characteristics stereotypically associated with Finns or Russians (e.g., Russian characters were more emotional and conservative than Finns), but without exaggeration to present the protagonists as realistic. Cultural differences in expressing emotions, style of thinking, behavioral patterns, values, and attitudes result in both difficulties in understanding each other and in conflict between the characters. At the end of the story, the Russian-background man ends up in a relationship with the Russian-background woman and the Finnish man with the Finnish woman. The aim of the story was that the Finnish-only background participants would identify more with the Finnish protagonists and understand the Finnish culture-specific elements, while the participants with a Russian background would identify more with the Russian protagonists and have greater understanding with the Russian-specific elements. The periods of the narrative describing interactions between protagonists were interspersed with periods without social interactions or culture-specific elements. The audiobook was recorded with a sampling frequency of 44,100 Hz in a professional recording studio. For the fMRI experiment, the auditory file of the audiobook was divided into 10 segments and, for the free association task, into 101 segments.

Experimental Procedures
The participants filled out the background questionnaire ( Table 1) as part of the behavioral questionnaires. Thereafter, participants engaged in the fMRI measurement and a magnetoencephalography (MEG) measurement with concurrent electroencephalography (EEG). The brain measurement sessions were separated by at least one month to reduce possible learning effects. The order of the neuroimaging sessions was counterbalanced across the cultural family backgrounds and genders of the participants. MEG/EEG results will be reported separately. After the brain measurements, the participants performed the implicit association test (IAT) which aims to measure automatic associations or beliefs that the participant is not aware of or willing to report [36,37]. However, given that the IAT is controversial [38][39][40], we reported IAT results only in the appendixes as tentative support of the main results of this study and our conclusions (see Appendices B and D). The participants also filled out an online free association task at home within two weeks of neuroimaging. In this task, the audiobook was presented once again to the participant in 101 segments, and at the end of each segment, the participants were instructed to type a few words that best describe what was on their mind at the end of the segment when they had heard that segment during brain measurements. The segments are indicated in the English-translated version of the audiobook in the Supplementary Materials. Boosted decision tree regression was used to study which of the background variables most accurately predicted the results of the free association experiment. The details and results of this analysis are presented in Appendices C and E.

fMRI Acquisition
The fMRI session consisted of the following 13 measurements: (1) a T1-weighted anatomical measurement, (2) a measurement comprising four repetitions of the introduction of the narrative (5.93 min), (3) the narrative presented in ten measurements (durations: 4.8-8.4 min), (4) a measurement comprising four repetitions of the introduction of the narrative (5.93 min), (5) a resting state measurement (3 min) and (6) a T2-weighted anatomical measurement. In this study, we analyzed only the narrative (i.e., ten fMRI runs) and T1-weighted anatomical images. Each functional measurement started with a period of 12.3 s and ended with a period of 15 s without auditory stimulation. The repetitions of the introduction were separated by a period of 16 s without auditory stimulation. The participant was shown a white fixation cross on a black background during the functional measurements (including resting state). Images from Helsinki were shown between the functional measurements and during the anatomical measurements. The participants had short breaks in the scanner between the measurements to avoid fatigue, eye strain, auditory strain from the repetitive noise of the scanning, and muscle strain from laying still. We also inquired about their fatigue between the measurements and offered a possibility to have a break outside of the scanner, during which they were offered refreshments. Among them, 24 participants wanted to have 1, 6 participants 2-4, and 18 participants 0 breaks outside of the scanner between measurements.
Anatomical and functional MRI data were acquired with a 3T MRI whole-body scanner (MAGNETOM Skyra, Siemens Healthcare, Erlangen, Germany) using a 32-channel receiving head coil array. The anatomical images were measured using a T1-weighted MPRAGE sequence (TR = 2530 ms, TE = 3.30 ms, field of view = 256 mm, flip angle = 7 degrees, slice thickness = 1 mm). Whole-brain fMRI data were measured using an ultra-fast simultaneous multislice (SMS) inverse imaging (InI) sequence [41]. Instead of relying only on gradient coils in spatial encoding, InI achieves spatial encoding by solving the inverse problems by utilizing the spatial information from channels in a radio-frequency coil array and gradient coils. In this study, InI-encoding direction was superior-inferior, whereas frequency and phase encoding were used to recover spatial information in anterior-posterior and left-right directions, respectively. Twenty-four axial slices (7 mm) were first collected without a gap between the slices. Thereafter, the slices were divided into two groups of 12 slices, and each of the slice groups was excited and read in 50 ms, resulting in a TR of 100 ms. Simultaneous echo refocusing [42] was used to separate adjacent slices in each group, and aliasing was further controlled with blipped controlled aliasing in parallel imaging [43]. Other measurement parameters were: TE = 27.5 ms, flip angle = 30 • , FOV = 210 × 210 × 210 mm 3 , and in-plane resolution = 5 mm × 5 mm.
Solving an inverse problem in InI reconstruction requires a sensitivity map of the channels in the coil array [41]. This information was included in a 6 s reference scan measured before each functional measurement. In this reference scan, partition-encoding steps were added after slice group excitation in an InI-encoding direction. The reference scan and functional scans were acquired with the same imaging parameters. Before each reference scan, shimming was used to minimize inhomogeneity in the magnetic field. SMSInI should, due to its faster sampling rate, enable more accurate removal of physiological artifacts compared to the echo planar imaging (EPI) sequence [44]. SMSInI has higher spatial resolution with lower signal leakage and higher time-domain signalto-noise ratio than inverse imaging without SMS, and it detects subcortical fMRI signals with similar sensitivity and localization accuracy as EPI [41]. The in-plane resolution of about 5 × 5 of SMSInI is lower compared to the resolution of about 3 × 3 mm of the EPI sequence. However, EPI data are usually smoothed spatially and, therefore, the eventual spatial resolution is typically closer to 5 × 5 mm. Further, recent studies have provided evidence that fMRI can measure faster neural oscillations than previously thought [45,46]. This evidence has challenged the assumption that decreasing TR does not provide any additional neural information since the hemodynamic response is too slow to measure neural oscillations.
During the fMRI measurements, the stimulus presentation was controlled with the Presentation software (Neurobehavioral Systems, ver. 18.1, Albany, NY, USA). The auditory stimuli were presented to the participant through earphones (Klaus A. Riederer ADU2a). The intensity level of the stimuli was adjusted on an individual participant basis at a comfortable listening level that was clearly audible above the scanner noise. The audio out of the sound card of a computer was recorded with the BIOPAC MP150 Acquisition System (BIOPAC System, Inc., Goleta, CA, USA). In the data analysis, this allowed us to determine the exact times when the auditory stimulus started. The BIOPAC MP150 system was also used to record heart rate and respiration signals during the fMRI measurement. Heart rate was measured using two BIOPAC TSD200 pulse plethysmogram transducers placed on the palmar surfaces of the participant's left and right index fingers. Respiratory movements were measured using a respiratory-effort BIOPAC TSD201 transducer attached to an elastic respiratory belt, which was placed around the participant's chest. Heart rate and respiratory signals were sampled simultaneously at 1 kHz using RSO100C and PPG100C amplifiers, respectively, and BIOPAC AcqKnowledge software (version 4.1.1).

fMRI Reconstruction and Preprocessing
The anatomical images were reconstructed using Freesurfer's automatic reconstruction tool (recon-all; http://surfer.nmr.mgh.harvard.edu/, accessed on 23 August 2018) and functional images using the regularized sensitivity encoding (SENSE) algorithm with a regularization parameter of 0.005 [47,48]. The reconstructed images were registered to the Montreal Neurological Institute 152 (MNI152) standard space template by first calculating transformation parameters from structural to standard space and from the reference scan to structural space. Thereafter, these transformations were concatenated and used to co-register functional images to the MNI152 standard space with 3 mm resolution. The co-registrations were performed by Freesurfer and FSL tools [49,50]. A period of 12.3 s of fMRI data measured before the start of the audiobook was removed from each fMRI measurement. To remove the scanner drift, the data were detrended using a Savitzky-Golay filter (order: 3, frame length: 240 s). Physiological and movement artifacts were suppressed using the MaxCorr method [51]. Specifically, from the data of each participant, we regressed out ten components that correlated maximally within the white matter and cerebrospinal fluid of that participant but were minimal in the other participants' white matter and cerebrospinal fluid. Since these components were participant-specific, they were assumed to be artifacts rather than brain activity elicited by the audiobook. Thereafter, DVARS, defined as the spatial standard deviation of successive difference images, was used to identify the data potentially affected by head motions [52]. The fMRI data were filtered between 0.08 and 4 Hz using a zero-phase filter and smoothed spatially with a 6 mm full-width-half-maximum Gaussian kernel.

Inter-Subject Correlation (ISC) Analysis of Blood-Oxygen-Level-Dependent Responses
Blood-oxygen-level-dependent (BOLD) responses were analyzed using ISC analysis as implemented in the ISC toolbox (https://www.nitrc.org/projects/isc-toolbox/ [53], accessed on 23 November 2017). First, we investigated how similar brain activity was across the whole audiobook within Finnish-and Russian-background groups by computing Brain Sci. 2022, 12, 649 7 of 19 ISC maps separately for both groups. For each voxel and for each of the 10 functional measurements, Pearson's correlation coefficients were calculated across the time courses of every participant pair within the Finnish-and Russian-background groups, resulting in 276 (n × (n -1)/2, where n = 24) ISC values per voxel for each group. To calculate the ISC values across the whole audiobook, the ISC values for each measurement were transformed into z-values using Fisher's transformation, the z-values were weighted with the length of the measurement and, thereafter, an average was calculated over the z-values within each voxel across 10 measurements. Finally, the z-values were transformed back into ISC values using the inverse z-transform.
The statistical significance of the ISC values was evaluated separately for the Finnishand Russian-background groups by a nonparametric voxel-wise resampling test to account for the temporal autocorrelations in the BOLD data [53]. In short, a null distribution was created from ISC values calculated after circularly shifting the time series for each participant by a random amount such that the time series between participants became unaligned in time. The null distribution was approximated with 0.5 million realizations randomized across voxels and time-points. The null distribution across the whole audiobook was determined by taking a weighted average over the null distributions computed for ten functional measurements. Again, the ISC values of the null distribution were transformed into z-values before averaging and the averaged z-values were transformed back into the ISC values using the inverse z-transformation. The threshold for significant ISC was determined by first computing p-values for the true realizations (i.e., the ISC values computed with non-shifted BOLD timeseries) in each voxel based on the null distribution and, thereafter, the resulting p-values were corrected for multiple comparisons using the FDR correction.
Between-group differences of Fisher's z-transformed ISC values (z-scores) were studied using a nonparametric cluster-based two-sample t-test where the statistical significance was determined based on the distribution of 5000 random permutations of z-values between the groups [54][55][56][57][58]. The null distribution was created from the maximum cluster sizes obtained by thresholding the statistical images at the cluster-defining threshold of p < 0.001 at each permutation. A corrected p-value for each suprathreshold cluster was obtained by comparing its size to the permutation distribution.

Semantic Similarity Analysis of the Self-Reported Word Lists
To examine between-group differences in word lists produced by the participants in the free association task, the word2vec skip-gram method [33] was used to transform the individual words into vectors representing the semantic content. The word2vec skip-gram model (Gensim Python Library, https://pypi.org/project/gensim/, accessed on 16 July 2018) was trained using the Finnish Internet Parsebank corpus ( [59], http://bionlp.utu. fi/finnish-internet-parsebank.html, accessed on 11 May 2018) with 500 dimensions and a window size of 10 words. The words that occurred less than 50 times in the corpus were excluded from the model training. The vector embeddings were adjusted by passing the corpus through the word2vec skip-gram model five times.
The model contained 98% of the words produced by the participants, and the rest of the words were discarded. The word lists produced by the participants were first corrected for spelling errors and stop-words were removed in the case the participant had written sentences. Thereafter, word2vec was used to map each word to a semantic vector. The semantic representation of each list was obtained by calculating the vector sum over the words in the list. The list vector was computed for 101 narrative segments. Between-participant pairwise semantic similarities were then obtained by calculating cosine distances between the list vectors separately within the Finnish-and Russian-background groups for each narrative segment (i.e., 101 t-tests).
The pairwise cosine similarities within the Finnish-background participants were compared to the pairwise cosine similarities between Russian-background participants using a nonparametric two-tailed t-test with 50,000 permutations [54,[56][57][58]. The t-tests were performed separately for each of the 101 segments. The segment-wise p-values (i.e., 101 in total) were corrected for multiple comparisons using FDR correction [60]. We also applied the two-tailed Wilcoxon signed rank test to investigate whether the number of associated words across the whole narrative (i.e., one statistical test) differed between the Finnish-and Russian-background participants.

Audiobook Listening Elicited Widespread ISC of BOLD Activity
There was significant ISC in several brain regions during audiobook listening ( Figure 1 and Table 2; activation maps can be found from Neurovault: https://neurovault.org/ collections/LINKKSTF/). Areas of ISC shared across both groups included Heschl's gyrus (HG), planum temporale (PTe), as well as superior and middle temporal gyri (STG, MTG) bilaterally, further extending to lateral aspects of occipital cortices. The Finnish-background participants exhibited more extended ISC in the lateral occipital cortex (LOC), especially in the right hemisphere and also the inferior occipitotemporal cortex (iOTC). Significant ISC was also found in both groups in the left inferior frontal gyrus (IFG) and bilateral middle frontal gyri (MFG), precuneus (PCun) and anterior cingulate cortex (ACC).

Cultural Family Background Modulates Brain Activity during Audiobook Listening
Stronger ISC was found for the Finnish rather than the Russian background participants in the left hemisphere in an area extending from the Heschl's gyrus and insula to the superior temporal gyrus, as well as in an area extending from the lingual gyrus to the middle occipital gyrus and cerebellum (Figure 2, cerebellum not shown; Table 3). In the right hemisphere, ISC was stronger for the Finnish-background participants in an area including parts of the middle temporal and middle occipital gyri, lateral occipital and posterior cingulate cortices, as well as in the cerebellum. The ISC was stronger for the Russian compared to the Finnish -background participants in left-hemisphere areas extending from Heschl's gyrus to the superior and middle temporal gyri, as well as bilaterally in posterior-inferior parts of the precuneus and anterior parts of the cuneus. No differences were found in DVARS values between the two participant groups, suggesting that the between-group differences found in the ISC values should not be related to the head movements (Finnish family background: 1.64 ± 0.46, Russian family background: 1.59 ± 0.41, t-value: 0.70, p = n.s., the nonparametric two-tailed t-test with 5000 permutations). Table 3. Differences of ISC values between Finnish-and Russian-family-background participants. Cluster size, peak coordinates, the maximum t -values and p-values of the clusters obtained in the nonparametric two-sample t-test between the two participant groups (5000 cluster-wise permutations; cluster-defining threshold, uncorrected: p < 0.001; the 0.05 family-wise error-corrected cluster size: 62 voxels). The loci of ISC were labeled according to the Harvard-Oxford Cortical Structural Atlas [61] implemented in FSL.

Cluster Label
Cluster Extent (Voxels)   [62]. Please note that the ISC difference in PCC was misleadingly projec the ventricle when visualized using Caret.  The results were visualized on the cortical surface using Caret software [62]. Please note that the ISC difference in PCC was misleadingly projected in the ventricle when visualized using Caret.

Cultural Family Background Increased Similarity of Audiobook Interpretation
We estimated similarities in interpretation of the audiobook using the free association task. Participants listened to the audiobook in 101 segments, and after each segment, they typed a list of words that described what had been on their mind at that point in the story during the brain measurement session (Figure 3; for a description of the method, see also [22]). The semantic similarity of the participants' word lists was estimated by transforming the lists into vector representations in a semantic space and by calculating the cosine similarities between the resulting semantic vectors [33]. Across the whole narrative, the Russian-background participants listed 33% more words than the Finnish-background participants (18,724 vs. 12,536 words; the two-tailed Wilcoxon signed rank test: V = 49, p < 0.01). Similarities of provided words were significantly different between groups in 56/101 segments (p < 0.05 in each segment, as assessed with the two-tailed nonparametric t-test with FDR adjustment). Out of these 56 segments, similarities were higher for the Russian-background participants in 44 segments and lower for 12 segments. There were significant between-group differences both in segments depicting interacting protagonists and in segments with descriptions of city scenery amid changing seasons. The segments with significant between-group differences are indicated in the English-translated transcription of the audiobook in the Supplementary Materials. they typed a list of words that described what had been on their mind at that point in the story during the brain measurement session (Figure 3; for a description of the method, see also [22]). The semantic similarity of the participants' word lists was estimated by transforming the lists into vector representations in a semantic space and by calculating the cosine similarities between the resulting semantic vectors [33]. Across the whole narrative, the Russian-background participants listed 33% more words than the Finnish-background participants (18,724 vs. 12,536 words; the two-tailed Wilcoxon signed rank test: V = 49, p < 0.01). Similarities of provided words were significantly different between groups in 56/101 segments (p < 0.05 in each segment, as assessed with the two-tailed nonparametric t-test with FDR adjustment). Out of these 56 segments, similarities were higher for the Russian-background participants in 44 segments and lower for 12 segments. There were significant between-group differences both in segments depicting interacting protagonists and in segments with descriptions of city scenery amid changing seasons. The segments with significant between-group differences are indicated in the English-translated transcription of the audiobook in the Supplementary material.

Discussion
In the present study, we were able to show that differences in cultural family background can be reflected as between-group differences in ISC of brain hemodynamic activity during audiobook listening in several brain areas, even when participants of both groups are fluent in Finnish (see, Section 2.1) and are Finnish residents. To our knowledge, this is the first time such an effect has been found. This is an important extension of the previous studies, the majority of which have found cross-cultural differences between residents of different continents or countries (usually Americans vs. East Asians; for reviews, see [3,4,12,13]). The audiobook allowed a more natural setting to investigate the effect of cultural background than previously used simple and strictly controlled stimuli. ISC provided complementary insights to traditional analyses since it does not require a prior model of brain responses to the stimulus and is, therefore, especially suitable for analyzing brain responses to complex naturalistic stimuli. Here, between-group differences in ISC were found in the left-hemispheric Heschl's gyrus, insula, superior temporal gyrus and lingual gyrus, right-hemispheric lateral occipital and posterior cingulate cortices as well as bilateral middle temporal gyrus, middle occipital gyrus, and precuneus ( Figure 2). These findings suggest that a shared cultural family background enhances similarity of processing of an audiobook at multiple levels of language processing.
The left superior temporal gyrus, the bilateral middle temporal gyrus, and the precuneus have previously, among other things, been associated with narrative speech comprehension and processing of semantic information ( [63]; for reviews, see [64,65]). Further, the precuneus exhibited increased ISC between participants who understood visual and auditory narratives more similarly [30]. The precuneus is also involved, for example, in the processing of paragraphs about in-group vs. out-group characters [66] and in episodic memory retrieval [67]. In light of these previous studies, the between-group ISC differences in our study seem to suggest that cultural family background can shape semantic processing. The between-group ISC differences in the left Heschl's gyrus and the surrounding areas, on the other hand, seem to suggest that cultural family background may also modulate the processing of the speech acoustics and prelexical features [16,64,68]. Here, top-down influences from the higher-level speech processing areas may play an important role [69][70][71]. Differences in ISC in Heschl's gyrus and the higher-level associative brain areas support previous studies that have found cultural differences in both lower-level (e.g., perception, attention) and higher-level processes (e.g., inferring other's emotions, contemplating the self [3,[11][12][13]34,72]). Interestingly, between-group differences in the ISC were also found in the occipital cortex. Given that visual imagery may activate the same areas as visual perception [73][74][75], the present results could reflect an effect of the cultural family background on the visual imagery elicited by the audiobook narrative.
Using word2vec and semantic similarity, we were able to quantitatively evaluate the closeness of meaning of the word associations elicited by the audiobook. The behavioral word-listing experiment disclosed that 56 out of the 101 audiobook segments elicited significantly different associations between the participants with Finnish and Russian family backgrounds (Figure 3). The semantic similarity of the associated words was higher among the Russian-background participants in 44 segments of the story compared to the Finnishbackground participants, and vice versa in 12 segments. This suggests that those story segments contained elements that elicited cultural/family background-related associations specific to each group and may have been understood more similarly within each group. These segments contained both culture-specific but also non-specific elements (see the English-translated text of the audiobook in the Supplementary Materials) suggesting that the cultural family background may also modify the way we understand concepts that are not obviously related to the culture. This latter observation may stem from cultural differences in perceptual styles as described in previous studies on object perception [11], color discrimination [76], and taste [77]. Our results are also in line with studies that have found cross-cultural differences in reading comprehension and interpretation [24,25]. For example, it has been shown that when asked to recall culturally unfamiliar texts, readers tend to distort facts and insert ideas from their own culture [78]. Overall, the Russianbackground participants produced 33% more words than the Finnish-only background participants, which could be explained by the Finno-Ugric culture having been observed to be a less talkative one [79].
This study has some limitations. First, the analyses employed in this study do not demonstrate a causal relationship between cultural background of the participants and narrative processing and interpretation. The between-group differences may be related to differences in cognitive or emotional demands during narrative listening or to viewing the story protagonists as belonging to the participant's own group vs. another cultural group. Therefore, further studies should be conducted to validate and extend these initial results that suggest that cultural background can modulate narrative processing and interpretation. For example, a follow-up study where immigrants are studied before and after immigration to another cultural environment might offer a stronger basis for inference of the causal relationship of culture, in the case of cultural immersion. Second, in the free association test, the participants were instructed to type words that were on their mind while listening to the story during neuroimaging, but it is still possible that the participants reported words that were on their mind only during the free association task. However, this was not a problem in the current study, since the fMRI data and associated words were analyzed separately. Furthermore, previous research has shown that the stimulus can serve as an effective recall cue of what participants experienced during fMRI scanning [80]. Third, having breaks outside of the scanner reduces comparability between measurements within the participant since the position of the participant's head is changed when the participant is repositioned in the scanner. However, it is unlikely that this has significantly affected our results, since ISC values were first computed separately for each measurement over which an average ISC was computed. Fourth, the sections of the audiobook that included elements from either Finnish or Russian cultures were created based on interviews with Russian-background persons. A more objective method to assess whether the sections are related more to Finnish or Russian cultures would be to ask independent raters to categorize the sections. Still, many of the Russian-background participants relayed that the stimulus realistically captured aspects of Russian culture, and there were significant differences in the free association task between the Finnish-and Russian-background participants, so here the stimulus did work as desired. Finally, to see whether the SMSInI sequence provided benefit in this study, the results should be compared to EPI data collected from the same experiment. In theory, a 100 ms TR used in this study should allow more accurate removal of physiological artifacts than the conventional 2-3 s TR of EPI, which is too low to avoid aliasing of higher frequency cardiac and respiratory cycles. However, SMSInI also introduces additional artifacts, such as susceptibility artifacts. Further, it is unclear whether fast sampling can capture higher-frequency information of neuronal activity, since the hemodynamic response is inherently slow, although there is evidence that fMRI can measure neural oscillations as fast as 0.75 Hz [45,46].
In conclusion, our results suggest that even relatively subtle between-group differences in cultural background can result in higher similarity within one group compared to the other in how natural speech is interpreted and how it is processed in multiple areas in the human brain. These effects might play an important role in enhancing mutual understanding between individuals with shared cultural backgrounds. Awareness of these effects could help overcome potential challenges in mutual understanding between individuals with different cultural backgrounds and social identities in modern multicultural societies wherein individuals from different backgrounds speak the same native language and might erroneously take highly similar understanding for granted. Enhanced awareness and scientific understanding of these differences in language use is important for avoiding obstacles in social interactions at many levels of society.  Data Availability Statement: Stimulus material and codes used in the current study are available from the corresponding author on reasonable request. Pseudonymized fMRI are available within the European Union (EU) from the corresponding author by reasonable request in respect of the privacy of the participants following the guidelines of the Data Protection Act of Finland (includes the EU's General Data Protection Regulation, GDPR). Since the facial features of the anatomical T1 MRI images are needed in the analysis of magnetoencephalography data also measured in this study, T1 images will be available within the EU by reasonable request following the guidelines of the Data Protection Act of Finland only after the end of the project when facial features of the MRI volumes can be removed.

Acknowledgments:
The calculations presented above were performed using computer resources within the Aalto University School of Science "Science-IT" project. We also want to thank Toni Auranen for excellent technical support in setting up the SMSInI sequence in the AMI centre, Julia Bethwaite for her valuable feedback on Russian cultural aspects of the story, Esko Salervo for voicefitting and performing voice-acting for the story, Jenna Kanerva and Filip Ginter at the University of Turku for development of the Finnish language word2vec model and all the participants for making this study possible.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix B. Implicit Association Test (IAT)
The IAT [36] was used to investigate whether the participants unconsciously favored Finland or Russia. The IAT comprises seven tasks. The data were analyzed from the fourth and seventh tasks, and the other tasks introduced the IAT to the participant. In the first task, images and words related to Finnish or Russian cultures appeared in the middle of the computer screen and the two categories "Finland" and "Russia" were shown in the left and right corner of the screen, respectively. The participant was instructed to categorize the presented items into the appropriate category as quickly as possible by pressing the appropriate left-hand or right-hand key. The second task was to categorize positive and negative adjectives into the categories "Good" (left corner) and "Bad" (right corner) with the same procedure as in the first task. In the third task, the previous categories were combined into two categories "Finland and good" (left corner) and "Russia and bad" (right corner), and the participant was instructed to categorize the examples from all four categories into the appropriate combined category. The fourth task was a repeat of the third task. The fifth task was similar to the first task with the exception that the position of the names of the categories was reversed (i.e., left corner: "Russia", right corner: "Finland"). The sixth task was similar as the third one, except that the pairings of the categories were opposite (i.e., left corner: "Russia and good", right corner: "Finland and bad"). The final task was a repeat of the sixth task. The task order of the IAT was counterbalanced across the family background and gender of the participants such that for one half of the participants the positions of the categories "Finland" and "Russia" were reversed in all tasks (i.e., Russia was first paired with good and Finland with bad). The participant was obligated to correct errors before proceeding, and reaction times were measured to the occurrence of the correct response. The IAT effect was calculated for each participant applying the Cohen's d to the response latencies from the fourth and seventh tasks as follows: where Fi/BAD is the mean latency for the combined category "Finland and bad", Fi/GOOD the mean latency for the combined category "Finland and good", Ru/GOOD the mean latency for the combined category "Russia and good", Ru/BAD the mean latency for the combined category "Russia and bad", std Fi/BAD−Fi/GOOD the standard deviation of the latencies for the categories "Finland and bad" and "Finland and good" and std Ru/GOOD−Ru/BAD the standard deviation of the latencies for the categories "Russia and bad" and "Russia and good". The more positive the IAT effect was, the more the participant was interpreted to prefer Finland over Russia. The more negative the IAT effect was, the more the participant was interpreted to prefer Russia over Finland. A nonparametric two-sample t-test with 5000 permutations was used to examine whether the IAT effect was different between the two groups. Wilcoxon signed rank tests were applied separately to the d-scores of the Finnish-background and Russia-background participants to examine whether the participants within either of the groups preferred Finland over Russia or vice versa. The results of the IAT are shown in Appendix D.

Appendix C. Background Variables Which Most Accurately Predict the Free Association Experiment Results
We used boosted decision tree regression [81] to test which of the background variables ( Table 1 in the article and IAT scores in Appendix D) would most accurately predict the interpretations of the narrative in the free association experiment. The boosted decision tree regression is a nonlinear regression method that also allows the importance of variables to be determined. First, we created a between-participant similarity matrix for each background question using the score differences between each participant pair. This resulted in a data matrix of 1128 values (n × (n − 1)/2, where n = 48-24 rows and 24 columns). Then, we used 10 randomized cross-validation folds and fitted the regression model for each fold using 90% of the data for training and 10% for testing. We used MATLAB's function "fitrensemble" (default parameters) and Bayesian hyperparameter optimization to find optimal tree regressors for the training data. Predicted responses for the testing data were then compared to actual values. Feature importance weights were obtained from each of the ten models. The weights correspond to the relative importance of the background variables in predicting the semantic similarities of the associated words. The mean importance values and one standard deviation are depicted in Appendix E. The correlation between real and predicted values over 10 cross-validation folds was 0.052.

Appendix D
model for each fold using 90% of the data for training and 10% for testing. We used MATLAB's function "fitrensemble" (default parameters) and Bayesian hyperparameter optimization to find optimal tree regressors for the training data. Predicted responses for the testing data were then compared to actual values. Feature importance weights were obtained from each of the ten models. The weights correspond to the relative importance of the background variables in predicting the semantic similarities of the associated words. The mean importance values and one standard deviation are depicted in Appendix E. The correlation between real and predicted values over 10 cross-validation folds was 0.052. Figure A1. Participants' preferences for Finland and Russia obtained from IAT. Bar graphs present the number of the participants with each preference level. D-score threshold values were as follows: ≤−0.65: strong preference for Russia over Finland, >−0. 65 [36,37]). Wilcoxon test showed that Finnish-background participants preferred Finland over Russia (p < 0.001, mean d-score: 0.51, std: 0.44, negative values mean preference for Russia and positive for Finland) whereas the Russian-background participants favored neither country (p = 0.69, mean d-score: −0.07, std: 0.45). Figure A2. Relative importance of background variables in predicting between-participant similarities in the free association experiment. Relative importance values are means over 10 cross-validation folds, and error bars describe the standard error of the relative importance values over 10 crossvalidation folds. 1 = Nationality (Finnish or Russian), 2 = Number of Russian parents, 3 = The level of Russian language, 4 = How often do I visit Russia, 5 = How Finnish do I feel myself, 6 = How Russian do I feel myself, 7 = How positively/negatively do I see Finns, 8 = How positively/negatively do I see Russians, 9 = IAT score. Interestingly, for the IAT scores, the question on how Russian the participants felt themselves, and the question how Finnish the participants felt themselves best explained the similarities and differences in the word listings, and for example, level of command of Russian language had a negligible effect.