Large-Scale Analysis of X Inactivation Variations between Primed and Naïve Human Embryonic Stem Cells

X chromosome inactivation is a mammalian dosage compensation mechanism, where one of two X chromosomes is randomly inactivated in female cells. Previous studies have suggested that primed human embryonic stem cells (hESCs) maintain an eroded state of the X chromosome and do not express XIST, while in naïve transition, both XIST and the eroded X chromosome are reactivated. However, the pattern of chromosome X reactivation in naïve hESCs remains mainly unknown. In this study, we examine the variations in the status of X chromosome between primed and naïve hESCs by analyzing RNA sequencing samples from different studies. We show that most samples of naïve hESCs indeed reactivate XIST and there is an increase in gene expression levels on chromosome X. However, most of the naïve samples do not fully activate chromosome X in a uniform manner and present a distinct eroded pattern, probably as a result of XIST reactivation and initiation of re-inactivation of chromosome X. This large-scale analysis provides a higher-resolution description of the changes occurring in chromosome X during primed-to-naïve transition and emphasizes the importance of taking these variations into consideration when studying X inactivation in embryonic development.


Introduction
Females inherit one X chromosome from each parent, and therefore, a dosage compensation mechanism is required to balance its expression with that of males, which inherit only one maternal X chromosome [1]. There are three main dosage compensation mechanisms in nature: decreased transcription from both X chromosomes in females, present in C. elegans [2], two-fold increase in transcription levels from the single X chromosome in males, present in D. melanogaster [3] and random X inactivation, present in mammals [4]. Random X inactivation in humans occurs at the epiblast stage of embryonic development, and is suggested to be initiated by expression of XIST, a long non-coding RNA, which coats the X undergoing inactivation [5,6].
Human embryonic stem cells (hESCs) are derived from the inner cell mass of 5-7 days old blastocysts. The cultured hESCs are also referred to as "primed" and resemble a later, post-implementation blastocyst stage [7]. hESCs are a powerful tool extensively used in studies related to developmental processes and X chromosome inactivation during development [8][9][10]. Conversely, primed hESCs do not accurately simulate the X chromosome status during embryonic development. The majority of primed hESCs maintain an eroded state, where one X is active, and the other X is partially inactive (XaXe) [11][12][13][14][15][16][17]. Previous analysis has identified the geographical patterns of the erosion and discovered the partial inactivation in the middle of chromosome X, while the edges remain active [17,18]. In addition, unlike preimplantation blastocysts, primed hESCs do not express XIST.
In contrast, cultured mouse embryonic stem cells (mESCs) can be maintained in a state similar to preimplantation blastocysts, also referred to as "naïve" state. The conversion of 2 of 11 hPSCs from primed to naïve state can be carried out in various ways. Most protocols rely on the inhibition of MEK/Erk and GSK together with addition of LIF, and either a combination of other chemical inhibitors or overexpression of different pluripotency factors [19][20][21][22][23][24][25][26][27][28][29][30][31][32]. Naïve ESCs present a genome-wide hypomethylation and were previously reported to reactivate both XIST and the eroded X chromosome, resulting in two fully activated X chromosomes (XaXa) [21,33,34]. However, no global analysis regarding the variations in chromosome X between primed and naïve cells has been conducted, and the reactivation patterns remain unknown.
In this study, we describe the first large-scale analysis of X chromosome inactivation variations between primed and naïve hESCs. By analyzing a large number of RNA sequencing samples obtained from various studies, we were able to identify alterations in the expression of XIST and X-linked genes between primed and naïve hESCs. Intriguingly, we conclude that the reactivation of the eroded chromosome X was not full and is accompanied by further inactivation in naïve samples.

XIST Expression in Primed-To-Naïve Transition
To study X inactivation variations between naïve and primed hESCs, we collected 75 samples from 11 different studies, consisting of 48 naïve samples and 27 primed samples. In these samples, we assessed three different parameters: expression of XIST gene, overall transcription in the X chromosome and allelic representation in chromosome X ( Figure 1A).  In all primed samples, XIST was not found to be expressed (TPM < 1) ( Figure 1B). Out of 9 female hESC studies and 43 naïve samples, in 6 studies (labeled as studies A-F) and in 70% of all samples, there was initiation of XIST expression. In these samples (referred as XIST + samples), TPM values of XIST varied between 1.3 and 23 ( Figure 1B,C). The naïve samples in studies G-I (referred as XIST − samples) did not show XIST expression.
The variation in XIST reactivation across the analyzed samples led us to assess whether all of the studies were able to achieve naïve hESCs. In order to evaluate the "naivety" levels of the naïve samples in each study, we correlated each sample with RNA samples of human pre-implementation day 6 and day 7 epiblasts [35], focusing on 47 pre-implantation epiblast markers, obtained from a previous analysis performed on pre-implantation scRNA-seq studies [36]. As anticipated, all naïve samples showed higher correlation levels than the primed samples from the same study, including XIST − studies ( Figure S1A). Interestingly, XIST + samples show higher correlation with pre-implantation epiblasts than XIST − samples ( Figure S1B).

Chromosome X Genes and Allelic Expression in Naïve Samples
Next, we aimed to examine whether the entire chromosome X was reactivated in the naïve samples. We compared expression of the genes on the X chromosome in each naïve sample to the median expression of the primed sample of the same study. In almost all of the studies and in 80% of the samples, the naïve samples had higher levels of expression of X-linked genes than in their primed counterpart cells (Figure 2A,B). When averaging all of the naïve samples vs. all of the primed samples, naïve samples showed significantly higher levels of expression of X-linked genes (p-value < 0.01, Figure 2C).
For the analysis of allelic expression in primed and naïve cells, we identified single nucleotide variants (SNPs) in the X-linked genes and calculated the allelic ratio (see Experimental Procedure). We addressed genes with a ratio above 0.2 as biallelic, and genes with a ratio below 0.2 as uninformative (we could not address them as monoallelic because the low ratio might derive from lack of coverage, or the absence of polymorphism in a given sample) ( Figure 1A). Naïve samples demonstrated a significantly higher allelic ratio, indicating more biallelic expression than their primed counterparts ( Figure 2D). Most of the studies and 88% of the samples showed an increase in the allelic ratio ( Figures 2E and S2A). Overall, the majority of the naïve samples displayed upregulation in the expression of Xlinked genes and in biallelic representation. Additionally, we aimed to check the expression pattern of X-linked genes in primed hESCs. We normalized the TPM values of each gene in each primed sample to the median of the male hESC samples and created a moving average plot of the median along the X chromosome ( Figure S2B), verifying that an eroded state (XaXe) was kept in the primed samples. In addition, we wanted to check whether XIST was expressed in a monoallelic or a biallelic fashion in the naïve hESCs samples. Unfortunately, only 15 samples had sufficient coverage of SNP areas (data not shown). Out of the 15 samples, 14 samples had a biallelic XIST expression, and 1 sample was borderline. Thus, XIST is expressed in a biallelic fashion in the naïve cells.
Next, we aimed to examine whether the entire chromosome X was reactivated in the naïve samples. We compared expression of the genes on the X chromosome in each naïve sample to the median expression of the primed sample of the same study. In almost all of the studies and in 80% of the samples, the naïve samples had higher levels of expression of X-linked genes than in their primed counterpart cells (Figure 2A,B). When averaging all of the naïve samples vs. all of the primed samples, naïve samples showed significantly higher levels of expression of X-linked genes (p-value < 0.01, Figure 2C). For the analysis of allelic expression in primed and naïve cells, we identified single nucleotide variants (SNPs) in the X-linked genes and calculated the allelic ratio (see Experimental Procedure). We addressed genes with a ratio above 0.2 as biallelic, and genes with a ratio below 0.2 as uninformative (we could not address them as monoallelic

Geographical Pattern of Expression of X-Linked Genes and Their Allelic Representation
Our next aim was to identify whether the activation of chromosome X is uniform in the naïve cells, or whether it is locus-specific. We thus normalized the TPM values of each gene in each naïve sample to the median of the primed samples and created a moving average plot along the X chromosome. Remarkably, the samples that activated expression of XIST in the naïve cells (XIST + samples) did not reactivate chromosome X in a uniform way. While the distal regions of chromosome X showed a higher expression, a clear reduction in expression in the middle of the chromosome (downstream to XIST gene) was observed, especially in the Xq22 area ( Figures 3A and S3A). In contrast, samples which did not express XIST did not demonstrate this specific, regional expression pattern ( Figure 3A). Cells 2022, 11, x FOR PEER REVIEW 6 of 12 In addition, we compared the levels of expression of genes located at Xq22 to all genes on chromosome X in XIST − and XIST + samples. While there was no considerable difference in the overall chromosome X gene expression levels, XIST + samples had significantly lower levels of expression in the Xq22 area ( Figure 3B). Furthermore, the allelic expression ratio in XIST + was significantly decreased in Xq22 in comparison to the allelic expression ratio in all of chromosome X, indicating less biallelic expression in this area ( Figure 3C).
A recent study has attempted to obtain random X inactivation in hESCs [24]. This study suggested that the reprogrammed naïve hESC populations are heterogeneous and, In addition, we compared the levels of expression of genes located at Xq22 to all genes on chromosome X in XIST − and XIST + samples. While there was no considerable difference in the overall chromosome X gene expression levels, XIST + samples had significantly lower levels of expression in the Xq22 area ( Figure 3B). Furthermore, the allelic expression ratio in XIST + was significantly decreased in Xq22 in comparison to the allelic expression ratio in all of chromosome X, indicating less biallelic expression in this area ( Figure 3C).
A recent study has attempted to obtain random X inactivation in hESCs [24]. This study suggested that the reprogrammed naïve hESC populations are heterogeneous and, based on several markers, divided the naïve cells into two populations: naïve HT (which will be referred to as "high-naïve") and naïve LT (which will be referred to as "low-naïve"). The heterogeneity and differences between high-and low-naïve hESCs were attributed to the incomplete blockage of autocrine FGF signaling, specifically FGF2. It was suggested that due to this incomplete blockage, low-naïve hESCs have low levels of FGF4 and high levels of FGF2, while high-naïve hESCs (which were able to maintain full blockage) had lower levels of FGF2 and higher levels of FGF4 [37,38]. Furthermore, it was shown that high-naïve hESCs were indeed able to initiate random X inactivation upon transitioning to primed state and these cells have closer resemblance to preimplantation epiblasts as compared to low-naïve hESCs.
We analyzed 20 additional RNA-seq samples derived in this study, which were composed of four different cell lines, and for each cell line, we compared the high-naïve samples to the low-naïve samples. While all naïve samples expressed XIST (TPM > 1), most highnaïve samples had higher XIST expression levels than the low-naïve samples ( Figure S3B). When studying the geographical patterns of gene expression along the X chromosome, high-naïve samples (each normalized to the median of their corresponding low-naïve samples) demonstrated a similar pattern to the XIST + samples ( Figures 3A and S3A), with a reduction in expression in the middle of the chromosome, especially around Xq22 ( Figure S3C). In addition, high-naïve samples demonstrated significantly higher correlation with the epiblast markers previously mentioned ( Figure S3D) [35,36].
Next, we calculated the ratio between FGF4 and FGF2 expression in each of the samples in our study. We found that while almost all samples showed an increase in FGF4/FGF2 ratio, as expected ( Figure S3E), XIST + samples had a relatively higher FGF4/FGF2 ratio in comparison with XIST − samples ( Figure S3F).

Discussion
Random X inactivation is a well-known dosage compensation mechanism, present in most female mammals, and initiated at early embryonic developmental stages. Embryonic stem cells derived from human blastocysts resemble post-implementation blastocysts, and usually do not express XIST and do not have fully active (XaXa) or fully inactive (XaXi) X chromosomes, but an eroded state (XaXe). Reprogramming hPSCs into a naïve state, which resembles that of pre-implementation epiblasts in various aspects/features [21,39], has been suggested to be a valuable resource in studying early developmental stages. Naïve ESCs have been previously reported to have two active X chromosomes and biallelic expression of XIST long non-coding RNA gene.
Here, we aimed to perform large-scale analysis of RNA sequencing data of primed and naïve hESCs from a variety of studies, in order to further understand the X reactivation process and associated patterns. We examined the differences between primed and naïve states in three main categories: expression of XIST gene, overall transcription of the X chromosome and chromosome X allelic representation ( Figure 1A). As expected, we discovered that most naïve hESC samples upregulated XIST expression (Figure 1B,C). When calculating the correlation of expression of epiblast markers between the naive hESC samples and pre-implementation epiblasts, all naïve samples had higher correlation than the primed samples of each study ( Figure S1A). The reactivation of XIST was most likely obtained due to the hypomethylation of its promoter, a part of the genome-wide hypomethylation in naïve hESCs [40]. However, XIST + samples had a higher naive/primed correlation-toepiblast ratio than XIST − samples. This result might indicate a higher state of naivety in XIST + samples ( Figure S1B), and this is further supported by the higher FGF4/FGF2 expression ratio found in those samples ( Figure S3F). As predicted, naïve hESCs had increased expression levels (with the exception of certain samples, which were all a part of one study) and a higher allelic ratio of X-linked genes (Figures 2A-E and S2) than their primed equivalents. The typical hypomethylated state of naïve cells is presumably the cause for the rise in X-linked gene expression levels, and thus provides further evidence for X chromosome reactivation in naïve hESCs. Surprisingly, we did not observe full reactivation of the eroded X chromosome. When comparing the naïve XIST + samples to the primed samples, reactivation was observed in the distal edges of the chromosome, while a clear reduction in expression was detected in the middle of the chromosome and specifically in the Xq22 area ( Figures 3A and S3A). These trends of expression were repeated in the analysis of high-naïve normalized to lownaïve samples, along with higher XIST expression and higher correlation with epiblasts ( Figure S3B-D). This unique pattern of erosion, which occurred in XIST + samples, together with the lower gene expression levels ( Figure 3B) and decreased biallelic representation ( Figure 3C) in this area, may suggest that chromosome X did not attain a fully activated state. Our presumption is that following the hypomethylation of the XIST promoter, XIST was activated and initiated chromosome X inactivation downstream of its location.
The inactivation localization may also be due to the high concentration of LINE-1 (L1) elements [41]. L1 elements are mammalian-specific non-LTR retrotransposons, which are mostly inactive due to truncation in the 5' end. L1 elements are relatively abundant in chromosome X (in comparison to autosomes). They are hypothesized to be "way stations" in the process of X inactivation spreading. Previous regional analyses of chromosome X revealed that the distribution of L1 elements across the chromosome is not uniform, but instead L1 elements are clustered in specific areas, among them Xq22 [42]. The clustering of L1 elements correlates with the decrease and increase in gene expression in XIST + samples, e.g., the L1-rich areas, such as Xq13-Xq21, Xq27 and especially Xq22, have lower expression levels in naïve cells. These correlations further support our hypothesis that due to the activation of XIST, X inactivation was initiated and started spreading through the L1 "way stations".
Overall, we have conducted a large-scale comparison of X-linked gene expression between primed and naïve hESCs and shed light on chromosome X dynamics in this transition. Our analysis presents a detailed, geographic profile of the X chromosome, and hints at the possibility of a specific mode of X inactivation in naïve cells, and thus provides an alternative hypothesis concerning the perception of X activation status in naïve hESCs.
When selecting samples for the analyses presented in this study, our first barrier was the sex of the hESCs; in order to study X inactivation, we needed to exclude studies performed only in male cell lines. Next, we left out certain female hESC studies due to quality and coverage issues-during our data processing, we carried out several quality checks, both for gene expression levels and for allelic expression. Samples that did not pass the quality thresholds defined in our pipeline or did not have enough coverage for the stage of SNP analysis (for the allelic expression analyses) were filtered out.
Sample information can be found in Supplementary Table S1.

Bulk RNA Gene and Allelic Expression Analysis
Most of the studies used in this article grew hESCs on top of mouse embryonic feeder cells (MEFs). It has been shown that many DNA and RNA samples derived from cells grown on mouse feeder cells are contaminated with reads originated in mouse [45]. In order to filter out the mouse reads from the RNA sequencing samples, FASTQ files were aligned to mouse GRCm38 and human h38 genomes using STAR aligner v2.7.6a [46], followed by filtering out reads that were mapped to the mouse genome using XenofilteR software [47]. On average, around 10% of the reads were filtered out from each sample. Next, the filtered BAM files underwent two procedures in parallel: derivation of count tables for gene expression analysis and creation of variant tables for allelic expression analysis. Count tables were generated using the FeatureCounts tool [48]. The variant calling procedure was performed according to the pipeline described in the Lezmi and Benvenisty paper [49], with several alterations. Briefly, using the filtered BAM files, variants were called using GATK tools: MarkDuplicates, SplitNCigarReads, AddOrReplaceReadGroups, BaseCalibrator, ApplyBQSR, and HaplotypeCaller using recommended parameters. Hard filtering of variants and exclusion of variants that did not pass the filtering and variants under 10 reads were performed using GATK's VariantFiltration and BCFtools view tools. Annotation of variants to the GRCh38 BED exome file and establishment of variant tables was carried out using BCFtools' annotate and query tools.

Gene Expression TPM Calculation and Normalization
For establishment of normalized TPM values for each gene in each sample, a matrix containing the raw data reads of all naïve samples and a matrix containing the raw data reads of all primed samples were created. TPM was calculated by dividing the reads for each gene by the gene length, followed by division by the sum of all reads and multiplying the result by 10 6 . The expression threshold for a gene was determined as TPM > 1. For moving average plotting and geographic analysis, non-expressed genes and the top 10% most highly variable genes were removed.

Pseudo-Bulk RNA Analysis of Human Epiblasts
The analysis and comparison to epiblast markers was carried out as previously described [50]. Briefly, samples from Petropoulos et al. (2016) were downloaded and aligned to h38 human genome. The 35 female cells that were determined as pre-implantation epiblasts [36] were merged to form one pseudo-bulk pre-implementation epiblast sample. The reads were normalized to TPM as mentioned above, followed by a pairwise Pearson correlation analysis of each naïve sample from each study, separately from the epiblast sample, based on 47 previously established [36] epiblast markers. Sample information can be found in Supplementary Table S1.

Quantifying Naïve: Primed Moving Average Plots
In order to quantify the variation between naïve and primed samples, a previously published method was utilized [51], with several modifications. For each study, the primed samples served as a baseline for calculating median TPM levels for each gene. Each naïve sample served as a test sample, and the TPM level of each gene in each sample was divided by the primed TPM median. The Log 2 of the fold changes was then plotted in a moving average plot, drawn across chromosome X only, with a window of 50 genes. For creating a single graph of all samples, a median of all of the samples in each study was calculated and normalized as mentioned above. Next, medians of all medians of XIST + samples and XIST − samples were created separately.

Quantifying X: Autosomes Allelic Ratio
To assess the allelic expression across the X chromosome and compare it to naïve and primed samples, the X: autosomes allelic ratio was measured, as previously described [17]. Briefly, we calculated the minor/major ratio of each SNP. For normalization of SNPs in each chromosome, we divided the sum of biallelic SNPs by the number of genes on said chromosome (aka "normalized SNPs", NS). For the X: autosome allelic ratio, we calculated the following ratio: NS (chromosome X)/mean (NS (autosomes)).

Statistical Methods
The statistical tests used in this study include Student's T-test, Mann-Whitney nonparametric test and Pearson correlation test. The relevant method is mentioned in the corresponding figure legends.