Pan-Genomic Regulation of Gene Expression in Normal and Pathological Human Placentas

In this study, we attempted to find genetic variants affecting gene expression (eQTL = expression Quantitative Trait Loci) in the human placenta in normal and pathological situations. The analysis of gene expression in placental diseases (Pre-eclampsia and Intra-Uterine Growth Restriction) is hindered by the fact that diseased placental tissue samples are generally taken at earlier gestations compared to control samples. The difference in gestational age is considered a major confounding factor in the transcriptome regulation of the placenta. To alleviate this significant problem, we propose here a novel approach to pinpoint disease-specific cis-eQTLs. By statistical correction for gestational age at sampling as well as other confounding/surrogate variables systematically searched and identified, we found 43 e-genes for which proximal SNPs influence expression level. Then, we performed the analysis again, removing the disease status from the covariates, and we identified 54 e-genes, 16 of which are identified de novo and, thus, possibly related to placental disease. We found a highly significant overlap with previous studies for the list of 43 e-genes, validating our methodology and findings. Among the 16 disease-specific e-genes, several are intrinsic to trophoblast biology and, therefore, constitute novel targets of interest to better characterize placental pathology and its varied clinical consequences. The approach that we used may also be applied to the study of other human diseases where confounding factors have hampered a better understanding of the pathology.


Introduction
Eutherian mammals made a drastic evolutionary 'choice' for survival, the one to host the embryo and to develop the fetus over a relatively extended period of time. Since the feto-placental unit is, in fact, a semi-allograft, that in all other circumstances should be rejected by the maternal immune system. This tolerance period ranges from a few days in marsupials (13 days in the Virginia opossum, where poor coping with inflammation apparently shortens the length of the feto-maternal tolerance period [1]) to 22 months in the African elephant. The pivotal organ for this heavy evolutionary option is the placenta, which manages immunotolerance, exchange of nutrients, withdrawal of waste, and progestational hormone production. The inherent complexity associated with this role of the interface makes its regular function a fragile equilibrium, which in humans evolve relatively often towards placental diseases, such as Hypertensive Disorders of Pregnancy (HDP), Intra-uterine Growth restriction (IUGR), Gestational diabetes or other diseases related to abnormal implantation [2,3]. Despite this obvious fundamental role, understanding the genetic function of the human placenta is a difficult challenge, and overall this organ is not explored as much as other organs. For instance, the Genotype Tissue Expression (GTex) project, for instance, provides a comprehensive list of expression-Quantitative Trait Loci (eQTLs), as well as splicing-QTLs (sQTLs) in 49 human tissues, that do not include the placenta [4] Recent progress using global approaches made it possible to better analyze the regulation of gene expression in the pathological placenta, such as in preeclampsia and Intra-Uterine Growth Restriction [5,6]. Genetic regulation of gene expression by genetic variants has been systematically analyzed in three recent articles that initiated deciphering the landscape of genetically controlled genes in the placenta [7][8][9]. Our data presented herein revisit the question of eQTL in the placentas and uses a novel methodological approach to identify actual differences between normal and pathological placentas often blurred by a major confounding factor, such as placental age. For this, we gave a strong weight to the use and the definition of covariates. Covariates were defined by available clinical (i.e., age of the placenta, disease status) and technical parameters (i.e., sample preparation, origins), by principal components discovered from the transcriptome analysis, and through the systematic identification of Surrogate Variables [10].
Since confounding factors are a major issue in placental genetics (especially the gestational age, the mode of delivery-C-section or Natural), we develop a novel approach that could be used to identify relevant disease-associated genetic variants (SNPs) associated with alterations of gene expression. In the first step, the QTL analysis was performed considering the disease status as a covariate, and then, in the second step, we did not correct for disease status looking for eQTLs that might be influenced by genetics and disease and being, therefore, potentially interesting in the context of placental pathologies. We surmise that the additional couple of SNP-gene found were mostly and uniquely specific to the disease. While it is not possible to affirm that the gene expression alterations are causal to the disease, they could later be evaluated in other human cohorts of placental DNAs to assess their potential predictive value heuristically.
The approach is summarized in Figure 1. This alternative analysis strategy could be applicable to a wide range of eQTL studies, especially those in which gene expression is believed to be highly influenced by clinical and technical variables and that is limited by small sample sizes, such as those including disease samples.

Summary of the Principle
In the first step, we performed the analysis including the disease status (Control, Pre-eclampsia (PE), IUGR, and PE + IUGR) as a covariate, which allowed us to identify eQTLs that influence placental gene expression independently of the disease. In order to restrict the number of input genes for the analysis, we first built a linear model using the gene expression levels for each gene in the function of the covariates, then we selected a subset of potentially relevant genes based on their residual variance (the variability of their expression level) since genes for which there is no residual variance once the covariate effects are removed are not going to reveal any detectable effect that could be partly explained by changes at the genetic level. This approach has similarly been used in a QTL research paper for genomic methylation QTL (meQTL, [11]). This step of identification of eQTLs confirmed a large part of previously identified couples eSNP-eGenes discovered in previous studies and identified new ones, presenting a strategy that can be applied with a limited sample size (<100).
In a second step, we aimed to identify genes that are modified during complicated pregnancy by the disease specifically and which present a component of genetic regulation. Thus, we removed the disease status from the group of covariates and submitted the data for analysis again in order to identify a novel list of eQTLs. This approach led to the discovery of 16 cis-eQTLs that are potentially associated with placental diseases. We surmise that these genetic factors identified may have been overlooked in previous analyses since they could have largely been confounded with the covariates. In particular, in our approach, we managed to partially account for critical confounding covariates such as placental age, which is often difficult to dissociate from placental disease. Interestingly, both approaches identified genes where we could observe statistical associations between disease status and genotype at the locus. Angers University Hospital and Cochin Hospital cohorts have been described in [12]. They included 8 control (CTRL, placentas obtained from uncomplicated pregnancies) and 13 IUGR samples from Angers, 9 CTRL, 7 PE, and 3 PE+IUGR from Cochin. Preeclampsia was defined by the presence of hypertension (systolic pressure > 140 mm Hg, diastolic pressure > 90 mm Hg) and proteinuria (>0.3 g/day). IUGR was defined as a reduction of fetal growth during gestation accompanied by Doppler abnormalities and birth weight below the 10th percentile [13,14]. The Gynecochin cohort samples were kindly shared by Dr. Louis Marcellin, and included 20 CTRL, 13 from natural delivery, and 7 from Csection. For placental sample collection relative to Angers, Cochin, and Gynecochin cohorts, sections of 1 cm3 of placental villi were dissected from four different cotyledons between the basal and chorionic plates, as previously described in [14]. St. George's Hospital samples, 6 PE placentas from C-sections, were kindly provided by Prof. Basky Thilaganathan and MD Veronica Giorgione in the context of the iPlacenta consortium. PE was defined by de presence of hypertension according to guidelines by The International Society for the Study of Hypertension in Pregnancy [15]. Placental samples were collected from the middle region of the placenta containing only villous tissue, just above the cord insertion point, then washed in PBS.

RNA and DNA Extraction
Human placental tissues were obtained from three different cotyledons of the maternal side of the placenta, washed extensively in sterile PBS, and snap frozen or in RNAlater (Invitrogen, Waltham, MA, USA), then powdered using pestle and mortar on dry ice or the hammering method in liquid nitrogen. The powdered tissues were then processed for DNA and RNA extraction. RNA was extracted with TRIzol (Invitrogen), according to manufacturer's instructions; UltraPure™ Phenol:Chloroform:Isoamyl Alcohol (25:24:1, v/v) (Invitrogen) was used in place of Chloroform. Extracted RNA was resuspended in RNAse and DNAse-free water, and integrity was assessed with Bioanalyzer Agilent 2100 nano kit.
For DNA extraction, powdered tissue was resuspended in Sample Lysis Buffer (10 mM Tris Hcl pH8, 10 mM EDTA, 50 mM NaCl, 0.5% SDS, Proteinase K (Invitrogen #25530-049) used at 1:3000) and incubated overnight in shaking water bath at 58 • C. Addition of absolute ethanol allowed precipitation of the DNA in filamentous, visible form, which was then transferred to new 1.5 mL tubes for additional washes in ethanol 70%. The pellet was left to air-dry, and DNA was resuspended in Tris-EDTA.

Transcriptomic Dataset
100 ng of RNA from the human placental samples were analyzed by ClariomD arrays (Applied Biosystems™, Affymetrix, Thermo Fisher Scientific, Montigny-le-Bretonneux, France) as described in [12]. This array measures gene expression both at the gene level as well as the exon level, providing also splicing isoform-specific data. The 66 samples were processed in three different batches, generating three sets of raw data files (.CEL). Merging datasets. To reduce batch effects in fluorescence signal due to experimental variability, the raw data (.CEL) files were processed together using the Transcriptome Analysis Console (TAC) software (Thermo Fisher Scientific), performing the default Robust Multi-Array normalization (RMA). This algorithm performs background adjustment, followed by quantile normalization and summarization [16]. The ClariomD array measures expression levels for a total of 134,748 probes, corresponding to coding and non-coding genes, 18,858 and 66,845, respectively, as well as predicted genes, pseudogenes, and small RNAs. Probe coordinates refer to the GRCh38 Genome Build. For downstream analyses, only genes with known GeneID and Description in TAC have been kept, removing transcripts identified by the database AceView, as well as transcripts on the Y and mitochondrial chromosomes; for a final total of 46,624 probes. Probes have been further filtered by mean fluorescent value (LOG2) across all samples ≥ 4.5, reducing the total number of probes to 33,988 considered to be "expressed". Transcriptomic covariates. Classical clinical and experimental variables in placental studies were used for the whole sample and included a total of 9 variables (Batch, Cohort, Group, Delivery, Maternal Age, Ethnicity, Gestational Age, Sex, and Parity). In particular, Batch refers to the experimental batch of transcriptomic data acquisition, Cohort to the different cohort of origins of the placental samples relative to the Hospital in which they had been collected, Group to the disease status, Delivery to the delivery mode, either C-section or Natural delivery, and finally Sex to the placental sex. For these 9 variables, missing values were replaced with the mean of each disease group, according to the disease group of the sample with missing value [17]. Summary statistics by disease group are listed in Table 1, the complete dataset is available in Supplementary Table S2. Principal Component Analysis (PCA) was performed in R (version 4.1.0, 2021-05-18) with the R package PCATools, on the 57 samples, for the transcriptomic dataset of 46,624 probes. The eigencorplot function within the package was used to calculate correlation coefficients between the first 10 principal components (PCs) and the known covariates; it performs a Pearson correlation, followed by F-statistic [18]. The R package SVA was used to identify potential additional sources of variation in the dataset, i.e., surrogate variables, by building the mod linear distribution with "Group" as the independent variable and setting the mod0, to the intercept [10]. For more details, see [12]. A correlation matrix was calculated in R, using as input the 9 clinical and experimental variables, the first 10 PCs of gene expression PCA, and 4 identified surrogate variables to identify the final set of variables correlated with global transcriptomic changes. A correlation coefficient cut-off threshold of |0.9| was used to remove colinear variables [19]. The final set of transcriptomic covariates includes Batch, Group, Delivery, Maternal Age, Ethnicity, Gestational Age, Sex, Parity, and Transcriptome PC1 to PC5.

Genotype Dataset
For the 66 samples, 200 ng of genomic DNA were genotyped using the Infinium OmniExpress (illumina) BeadChip, which interrogates 713,407 SNPs. The raw data files were analyzed with GenomeStudio2.0 software to retrieve the genotype dataset using the A/B allele Illumina notation. SNP coordinates used refer to the GRCh38 Genome Build. Quality control. Quality control (QC) of samples and SNPs was conducted in PLINK1.9 [20,21]. Samples were subjected to quality control, removing samples with a rate of missing genotype ≥ 2%, as well as samples that presented an excess heterozygosity as an index of contamination F < −0.05, calculated on the pruned dataset (pairwise correlations, window size 50 SNPs, sliding window of 5 positions, SNPs with r2 ≥ 0.2 removed). For SNP QC, SNPs with a rate of missing genotypes across samples≥ 1%, as well as SNPs diverging from Hardy-Weinberg equilibrium (p-value < 10 −6 ), have been removed. In total, 57 samples and 665,191 variants passed quality control. Genotype covariates. PCA was performed in PLINK1.9 on the clean dataset (57 samples, 665,191 variants), pruned as described above, retaining the first 10 principal components (GenotypePC1-PC10) as covariates for downstream eQTL analysis to summarise population stratification. Ancestry Estimation. Ancestry estimation has been performed to infer the ethnicity of placental samples from Cochin and Anger cohorts. Briefly, the genotypes from the 1K Genomes Phase III release have been downloaded from The International Genome Sample Resource, mapped to the GRCh37 genome build, and have been used for imputation of allele frequencies across populations [22,23]. The set of SNPs in common between the two datasets has been established, and the two datasets merged using PLINK1.9. PCA was performed on the merged dataset, pruned as described above, including only autosomal chromosomes, for a total of 185,249 variants. Sample clustering along PC1 and PC2 have been used to extrapolate the unknown ethnicities of the placental samples based on their relative distance with the 1K Genomes Phase III samples.

eQTL Analysis Workflow
The R package MatrixEQTL has been used to perform the eQTL analyses in R [24], by applying a linear multivariate model to test the contribution of each SNP genotype to gene expression levels. Two main approaches were pursued. The ALL COVARIATES dataset included the full set of 23 covariates (Batch, Group, Delivery Placental Sex, Gestational Age, Maternal Age, Parity, Ethnicity, the first 5 PCs from gene expression PCA, the first 10 PCs from genotype principal component analysis that summarises population stratification), allowing to identify the eSNPs correlating with changes of gene expression of the eGenes while minimizing spurious associations between SNPs and Genes [25]. The MINUS DISEASE dataset included all covariates but "Group" and Gene Expression PC5 given its strong correlation with the disease phenotype described by the "Group" variable. With this approach, the contribution of the disease variable on gene expression changes was not corrected for, allowing to investigate a set of genes for which the effects of disease on expression are important and would have otherwise excluded to the residual variance threshold, to identify SNPs that correlate with gene expression changes and could harbor interaction effects with the disease phenotype.
For both approaches, the same workflow was followed. Each gene expression distribution was first normalized by a built-in function with rank ("average" method) followed by quantile normalization to reduce the impact of outliers. The normalized gene expression dataset was then modeled by multivariate linear regression, expressing gene expression for each gene in the dataset as a function of either the ALL COVARIATES or MINUS DISEASE set of covariates. The residual variance of the model can be thought of as an indication of potential genetic influence on gene expression levels [26,27]. Using incremental thresholds of residual variance from 0.5 to 0.95 with 0.05 increments, 10 gene sets termed "RV0.05" to "RV0.95" were defined and used as input gene expression datasets in eQTL analyses.

Cis-QTL Analyses of Gene Expression Subsets
Gene subsets RV0.50 to RV0.95 and 417,114 SNPs (MAF ≥ 0.15) were analyzed for correlation with the LINEAR model parameter in MatrixEQTL [24]. We evaluated only the correlation of eGenes and eSNPs in cis (cis-eQTLs) given the limited power of our analysis and the small number of trans-QTLs previously identified in the placenta by a larger designed study [9]. To further reduce the total number of tests, the cis-distance has been set to 100 kb rather than the default 1 Mb [8]. The gene expression dataset has been normalized with MatrixEQTL prior to performing eQTL analysis, as described above. For the ALL COVARIATES Dataset analysis, 23 covariates (Batch, Group, Delivery, Maternal Age, Ethnicity, Gestational Age, Sex, Parity, first five PCs of gene expression PCA, first 10 PCs of genotype PCA) have been accounted for by the software when performing multivariate linear regression for each gene-SNP pair, while for the MINUS DISEASE dataset, 21 covariates have been included: Batch, Delivery, Maternal Age, Ethnicity, Gestational Age, Sex, Parity, first four PCs of gene expression PCA, first 10 PCs of genotype PCA. To correct for multiple testing, the in-built function in MatrixEQTL has been used that estimates FDR-adjusted p-values (q-values) for each gene-SNP pair with the Benjamini-Hochberg procedure. In this study, only SNPs with FDR ≤ 0.05 have been considered as statistically significant. The full lists of statistically significant eSNP-eGene pairs for both datasets are presented in Supplementary Tables S3 and S4.

Calculating Enrichment of Significant cis-QTLs for Each Subset
To choose the best cis-eQTL analytical design, defined by the gene expression subset used, the enrichment of significant cis-QTLs has been calculated. We define enrichment as the number of cis-eQTL per eGene. Cis-eQTL enrichment has been considered together with the total number of eGenes and eSNPs identified as a result of statistically significant cis-eQTLs (FDR ≤ 0.05) in order to find the optimal cut-off, where we see an increase of enrichment, without losing too many eGenes. For both ALL COVARIATES and MINUS DISEASE datasets, the input gene expression dataset RV0.85 was selected as optimal for the final cis-QTL analysis workflow and statistically significant cisQTLs were further characterized and are presented here. The list of input genes with relative individual expression levels as well as residual variances, are presented in Supplementary Tables S5  and S6, respectively.

Calculating Overlap with Previous Studies
The statistical significance of the overlap with previous studies was calculated with Fisher's exact test, with normal approximation, with a statistical significance threshold set at p-value ≤ 0.05 [28]. As reference total number of genes, we used N = 30,818; this number of placental expressed genes, with ≥0.1 RPKM (Read Per Kilobase of exon model per Million mapped reads), as defined in the latest high-quality study on placental transcriptome by Gong and coworkers [29]. The group carried out an RNA-sequencing analysis on 302 human placental samples, including messenger RNAs, long non-coding RNAs, as well as small and circular non-coding RNAs. This number is consistent with our study, where we observe 33,988 genes with mean fluorescent value (LOG 2 ) ≥ 4.5. The total number of genes in the human genome, even though still lacking consensus, is reported to be higher than 35000, including coding and non-coding genes, as discussed in [30].

Calculating Statistical Significance of Interaction between Best-eSNP and Disease on eGene Gene Expression
To check whether an interaction exists between the genotype at the best-eSNP and the disease status on gene expression levels of the eGene we performed a linear regression in R and considered the linear model to be significant when the model p-value ≤ 0.05, similarly the effects of each variable on the model were considered significant when the coefficient p-value in the model was ≤ 0.05. The general formula of the linear regression model was: eGene~eSNP + Group + eSNP*Group. The linear regression was performed on the residuals of the gene expression dataset after multilinear regression with the MINUS DISEASE set of covariates to correct for confounding factors (the data had also initially been normalized by ranking and quantile normalization as described above).

Transcriptome Identification of Confounding Variables
The transcriptome dataset is composed of 57 RNA samples that passed the quality controls from either the Cochin Hospital (Technologic Facility Gernom'IC), Angers University Hospital, or St George's Hospital, University of London (Table 1). The transcriptome data were obtained using the Clariom D microarray (Affymetrix), which allows to analysis of the level of mRNAs at the exon level [12].
From the transcriptomic datasets, only genes with known GeneID and Description in the TAC (Transcriptome Analysis Console, Affymetrix) were kept, removing transcripts identified by the database AceView, as well as transcripts on the Y and mitochondrial chromosomes (n = 616 and 18, respectively), for a final total of 46,624 probes. Probes were further filtered by mean fluorescent value (LOG2) across all samples ≥ 4.5, eventually reducing the total number of probes to 33,988.
A new analysis strategy of our work was the removal of the covariates affecting gene expression; while some are interesting biological factors that could deserve interest, such as maternal age for instance, we decided to consider them all as confounders to identify solely eQTL that influence placental gene expression without obvious medical consequences. There were nine variables available for each sample that was included ab initio in the analysis (Batch, Cohort (St Georges, Cochin, or Angers), Disease Group, Mode of Delivery, Maternal Age, Ethnicity, Gestational Age at Delivery, Sex, Parity). Using PCA in R (PCATools) enabled us to define principal components (PCA axes). The 10 first PCA axes captured~61% of the variability, while the two first axes, PC1 and PC2 only, concentrated 34.5% of the variability. The correlation of the PC axes with the clinical variables is represented in Figure 2A. The weights of the clinical covariates could be estimated at 7.4%, 5%, 10.7%, 5.7%, 5.8%, 14.4%, 11.1%, 23.4%, and 23.5% for Parity, Sex, Gestational age, Ethnicity, Maternal Age, Delivery Mode, Disease Group, Cohort, and Batch, respectively. The total is above 100% since there are complex correlations between the different items; for instance, disease status is correlated with gestational age. Some axes were strongly correlated with the Group (i.e., the disease status), especially PC1 and PC5, the latter one being also intimately correlated with the gestational age.
In addition, The R package SVA was used to identify possible surrogate variables (additional unknown sources of variation in the dataset). We collated all variables in a correlation matrix that was calculated in R, using as input the nine clinical and experimental variables, the first 10 PCs of gene expression PCA, and four identified surrogate variables to identify the final set of variables correlated with global transcriptomic changes. A correlation coefficient cut-off threshold of |0.9| was used to remove colinear variables [19]; see Figure 2B. For the final choice of covariates included in the eQTL analysis, we kept the five first PCs, "Batch", "Group", "Delivery", "Maternal Age", "Ethnicity", "Gestational Age" and "Sex". The covariate "Cohort" was removed since it was colinear with "Batch".

Genotyping and Population Stratification
The final genotype dataset included 665,191 variants for the 57 samples, of which we kept SNPs with a Minor Allele Frequency > 0.15 (417,114 were kept). The legislation in France forbids the collection of ethnic data. Nevertheless, with the genotype information, we were able to classify our samples using PCA alone and with reference to the 1000 genome project (Supplementary Figure S1). Seven placental samples were clearly of African origin, 42 were of European origin, and eight were mixtures between African and European backgrounds. The first 10 PCs from the genotype principal component analysis were included as covariates in the eQTL analysis to account for variations due only to population stratification.

Optimal Feature Selection for the eQTL Analysis
We first explored the effect of varying the number of input genes for the eQTL analysis to identify the optimal gene sets in order to achieve a satisfactory balance between the number of statistically significant cis-eQTLs (FDR ≤ 0.05), the number of identified eGenes and cis-eQTL enrichment (in 100 kb proximal to the gene), defined as the number of eSNPs/eGene. For this, we used the concept of residual variance (RV) to reduce the number of tested genes.
Given our limited sample size, it was our primary goal to be able to define the set of genes to be used as input in the eQTL analysis that would give us the optimal experimental design to identify significant cis-eQTLs, reducing as much as possible the number of tests, without missing out potentially relevant genes. The eQTL workflow that we used was based on multivariate linear modeling, which expresses the variance observed in the expression levels for each gene in the function of the input covariates. Therefore, the RV is a measure of potential genetic influence on gene expression levels, amongst other variables [26,27]. The summary of the threshold optimization for analysis is presented is Supplementary Table S1. While the number of significant eGenes decreased from 153 to 22, when the RV raised from 0.5 to 0.95, the number of significant SNP per gene increased from 4.00 to 8.91.
We selected an RV threshold of 0.85 for further analysis as a good compromise between having significant but relevant eQTL, albeit other thresholds could certainly be chosen. At this threshold, at least six SNP per gene were significant, meaning that the association with gene expression is not merely due to linkage disequilibrium. The analysis at this threshold rested upon 3201 genes, 417114 SNPs, a multivariate model with 23 covariates, and a window of ±100 kb around each gene, following the experimental choice of Kikas et al. [8], which justified this window by the fact that relevant regulatory SNPs are mostly in the vicinity (~100 kb) of the gene. These parameters resulted in the discovery of 279 statistically significant placental SNPs (p < 1.24 × 10 −4 , FDR < 0.05) proximal to 43 eGenes ( Figure 3, Table 2). The full list of statistically significant eGene-eSNP pairs is provided in Supplementary Table S3.
In Table 2, the novel genes discovered are presented in yellow, while the genes in bold correspond to the most stringent threshold and to a Bernouilli genome-wide correction which does not consider the possible LD relating to successive SNPs. When this is considered, the Bernouilli correction corresponds to a limited number of independent tests, leading to a threshold of~10 −4 instead of 10 −8 (blue and red line, respectively, in Figure 3A). We report 22 novel placental eQTL, while 51% of the placental eGenes identified here have been previously described as placental cis-eQTLs in the three placental studies available to date (n = seventeen for the Peng et al. dataset, n = eight for Delahaye et al., and n = five for Kikas et al.) [7][8][9]. Fisher's exact tests of the overlaps between datasets were all statistically significant with p-values < 0.05, respectively p < 5.83 × 10 −7 , p < 1.88 × 10 −6 , and p < 2.75 × 10 −8 , validating our experimental approach.
Interestingly, out of the 279 cis-QTL found, 37 corresponded to SNPs located between two genes modified at the expression level. On chromosome 6, 24 eSNPs were found to be shared between TOB2P1 and ZSCAN9 out of their 41 and 42 cis-eQTLs, respectively ( Figure 3B). On chromosome 11, two different regions of interest were identified, between LOC646029 and AQP11, that shared 10 eSNPs, out of 11 and 12 cis eQTLs, respectively. Finally, three eSNPs associated with CARD17 expression deregulations correlated also with upstream transcript CASP1P2, which had a total of 11 cis-eQTLs ( Figure 3B). These observations strongly suggest that the SNPs influence gene expression in the chromosome region, either by modifying an enhancer acting on two genes close by or by being transcribed in a non-coding RNA having a cis-regulatory effect on transcription. The x-axis expresses the genomic coordinates, while in the y-axis, the -LOG10 of the p-value, the horizontal line sets the level for FDR ≤ 0.05. The vertical lines represent the start and end sites for each of the genes. eSNPs that were found to be in common between eGenes in the same region are highlighted in yellow. Table 2. List of eGenes, with the most significant cis-eQTL for the ALL COVARIATES dataset. The table lists the total number of eGenes correlating with the identified significant cis-eQTLs; for each eGene the total number of eSNPs are listed, and the details of the Best-eSNP reported. The Best-eSNP, in this case, is defined as the most significant cis-eQTL associated with each eGene (smallest recorded q-value). Beta corresponds to the slope coefficient for the multivariate linear model and is an estimation of the eSNP effect size on gene expression variance. Novel placental eGenes are highlighted in yellow. If the eGene had been previously identified as such, the studies are specified in the "Previous study" column. Highlighted in light yellow are the genes that were not found in the previous studies.

Identifying eQTLs Involved in the Disease by a Subtraction Strategy
The same cis-eQTL analysis was performed de novo, suppressing the parameter 'Disease' from the confounding factors, as well as PC5 (Figure 2), which were very highly correlated. This allowed us to include in the analysis genes for which the disease has a strong influence on changes in gene expression levels, reducing the overall residual variance after modeling (the number of genes increased from 3201 to 3279, Figure 4). However, these genes are also believed to be the most interesting in terms of understanding pregnancy pathologies. For this reason, we aimed to investigate the contribution of genotype changes on these genes in the context of disease to be able to also detect potential genotype-disease interactions that would suggest the presence of conditional eQTLs [31] Interestingly, in this case, the number of identified eGenes was always higher than when the 'Disease' was used as a covariate. At the chosen threshold of 0.85 for the residual variance (Figure 4), the number of eGenes identified increases from 43 to 54, with the number of statistically significant (FDR ≤0.05) eSNP increasing from 279 to 307, as represented in Figure 4. The interpretation of this figure is that 16 SNPs are specific only to the disease, 38 are influenced by the covariates and the disease, while amongst the five specific to the dataset analyzed with all the covariates, four did not pass the RV threshold, MLLT10, HJURP, MIR4527, and Y_RNA. The last one, FGF19, is present in both datasets, but when the disease is not considered as a covariate, FGF19 is no longer observed as an eGene. The full list of statistically significant eGene-eSNP pairs is presented in Supplementary  The 16 eGenes that were specific to the disease are presented in Figure 5 and Table 3. Several analyses of gene ontology were carried out using these 16 eGenes (disease-specific) or the 54 eGenes (placental-specific), including the 38 genes that were also presented in the first analysis using String (https://string-db.org/, accessed on 1 July 2021). All these analyses failed to identify a significant grouping of genes, suggesting that the couples eSNP-eGene that we could identify correspond to various aspects of placental biology and physiopathology and not to several specific categories or gene ontologies. Not surprisingly, however, some of these genes have been previously identified as deregulated in pregnancy pathologies (Table 3).  -/+ stands for the reference strand of DNA on which the gene is situated in the coding orientation.
While previously identified eQTLs, such as influencing ZSCAN9 and ERAP2, have shown a strong genetic influence on their regulation, the implication of this on placental function remains elusive. In Figures 6-8, we show the levels of expression stratified according to genotype of the most significant eSNP for a subset of eGenes that presented a pronounced conditional response depending on disease group (Control vs Disease) and have been previously linked to placental function and diseases such as IL36R, FUT10, PTTG1, CBLB, and DNAJC15. Here, we plot the raw data for the expression levels; however, during analysis, the data is first normalized and then corrected for the covariates. Table 4 summarizes the results of Genotype*Disease interaction testing for each of the eGenes identified and its most significant eSNP. Interestingly, ZSCAN9, AQP11, LOC646029, FUT10, NDUFS5, and C8ORF89 showed statistically significant (p-value ≤ 0.05) interaction between the effects of disease status and the eSNP genotype on the expression of the eGene. This suggests the existence of a complex interaction between the gene and the phenotype of the disease.   Table 4. Best-eSNP-eGene pairs for the MINUS DISEASE dataset showing the summary statistics from the eQTL analysis and the results of testing the significance of the interaction between genotype and disease on gene expression. The Best-eSNP, in this case, is defined as the most significant cis-eQTL associated with each eGene (smallest recorded q-value). Beta corresponds to the slope coefficient for the multivariate linear model and is an estimation of the eSNP effect size on gene expression variance. The table shows the test p-values for each coefficient of the linear model to test the effect of genotype and disease on gene expression and their interaction, as well as the model p-value. eGenes that presented a statistically significant interaction between eSNP genotype and disease status (p-value ≤ 0.05) are highlighted in green.

Discussion
In this paper, we attempted a novel approach to uncover relevant associations between genetic variants and gene expression in the placenta from healthy and pathological pregnancies. We started by carefully identifying covariates that may have masked important results in previous studies. As observed previously, gestational age is a major covariate and is strongly associated with disease status. In PCA analysis, the fifth axis is the most strongly associated with gestational age (r = 0.49, Figure 2A), and the most strongly associated to placental disease (r = −0.47, Figure 2A). The inverse correlation between the two features is consistently seen in many transcriptomic studies comparing normal and pathological placentas. The problem is that, almost by definition, placentas from pregnancy complications diseases are usually obtained earlier in gestation than from normal pregnancies. The originality of our approach is as follows: we performed two separate searches for eQTLs, one using all covariates available (including disease status) and one without the disease status used as a covariate, assuming that the difference in the discovery will be due more or less exclusively to the disease. At the initiation of this work, we performed a thorough characterization of the covariates influencing gene expression, and while we could not fully distinguish between "disease group" and "gestational age" given their strong correlation with PC5, we could partially account for gestational age confounding by statistically correcting for PC2 which is strongly correlated (0.39, p val < 0.01), but not with "disease group".
In the first part of our results which included "disease group" as a covariate, we expanded the current knowledge of genetic modulation of gene expression in the placenta, adding to the relatively limited number of previous studies in the field [7][8][9]12]. We identified 279 placental cis-eQTLs (FDR ≤ 0.05) in 57 human placental samples, correlating with expression changes of 43 unique eGenes, among which 22 constituted novel eGenes over the previous studies. One hundred and eleven (111) of the total 279 cis-eQTLs, including the most significant eSNPs, were associated with previously described placental eGenes ZSCAN9, PSG7, TOB2P1, ERAP2, and AQP11, corroborating the validity of our analysis. Interestingly, for the first time in the placenta, we describe cis-eQTL eSNPs that correlate with neighboring transcripts, identifying potential placental regulatory hotspots. On chromosome 11, a novel eGene LOC646029 and downstream AQP11 shared 10 eSNPs, while the three eSNPs associated with CARD17 expression levels were also eSNPs for novel placental eGene CASP1P2. Similarly, 24 out of the 42 eSNPs associated with ZSCAN9 were found to correlate with changes in expression of the upstream transcript TOB2P1 on chromosome 6. CASP1P2 and TOB2P1 are both pseudogenes, which implies that they may not encompass functional open reading frames and that changes in their expression could not have an impact on placental function as proteins. However, the changes in their expression levels as well as in neighboring genes as a result of genetic changes, highlight these regions as potential placental regulatory sites. As additional evidence of the relevance of these regions to placental function, genotype changes in these loci show a statistically significant interaction with the disease status, resulting in significant changes in the expression of ZSCAN9 and LOC646029, as well as AQP11.
Kikas and coworkers extensively reviewed previous studies on genetic regulation of gene expression in the placenta, including gene-candidate approaches, as well as the three genome-wide studies mentioned above [40]. Peng and coworkers combined RNAsequencing and genotype data of 159 healthy placentas, laying the foundations for eQTL analyses in the placenta [9]. They report only the top eSNP of each eGene, therefore identifying 3218 cis-eQTLs correlating with 3218 unique transcripts and 35 trans-eQTLs (cis-distance between eSNP and eGene < 500kb) with an FDR threshold, defined by permutations, of 0.1%. This limited number of trans-QTLs, even with a sample size that is more than double one of other studies in placental eQTLs, prompted to restrict further efforts to cis-QTL identification alone. Delahaye and coworkers, expanded the repertoire of placental cis-eQTL by identifying 985 eSNPs, corresponding to 615 eGenes in 80 healthy placentas, as well as investigating methylation-QTLs in 300 placenta samples [7]. Finally, Kikas and coworkers, with a total of 40 samples, identified 199 significant cis-eQTLs (FDR < 0.05) corresponding to 63 unique genes in a window of 100 kb around each gene [8]. Both studies showed considerable overlap with [9], respectively, for 62% and 80% of the eGenes. However, if we consider the percentage of unique eGenes, not replicated across all studies, 88%, 37%, and 21% of eGenes were identified, respectively, by Peng et al. [9], Delahaye et al. [7] and Kikas et al. [8].
When comparing our results with previous studies, the overall level of overlap between eGene lists is 51%, which could be due to differences in population stratification, sample size as well as differences in analysis workflow. In particular, the number of overlapping eGenes is n = seventeen for the [9] dataset, n = eight for [7], and n = five for [8]. The most significant overlap is with the work of [8]; Fisher's exact tests p < 2.75 × 10 −8 , 1-2 orders of magnitude more significant than the overlap with the two other studies. This could be explained by the fact that in terms of analytical design, our study is closer to the one of [8]. In both analyses, cis-QTLs were analyzed in a window of 100 kb, for a total number of variants of 417,114, in the same order of magnitude, similarly; the sample sizes are also close with 40 and 57 placentas in [8] and our study respectively. Probably the most crucial similarities are the use of the same software for the eQTL analysis (i.e., MatrixEQTL in R [24]) and the inclusion of placentas from pregnancies complicated by PE. The use of the same analytical software improves the replicability of the results since the statistical tests are the same and can have a huge impact on which eSNP-eGene pairs pass the threshold of significance.
Similarly, the presence of PE samples implies a similar pattern of gene expression; however, in our study design, we did consider "disease group" as a confounding factor and included it as a covariate, which should reduce the effects on the output.
With our analysis comparing the two datasets, the ALL COVARIATES vs. MINUS DISEASE, that differ only by removing the disease as a confounding covariate in the second, we propose a new strategy for finding eQTL that are relevant in pathological pregnancies. In particular, this approach allowed us first to validate our study design as well as sample quality by being able to compare our findings with the existing literature. In the second instance, we were able to further investigate the role of genotype changes on the expression of genes for which disease has a relevant influence and could therefore be important for placental function.
Genome-Wide Association Studies (GWAS) use large sample sizes (in the tens of hundreds) in an attempt to identify the enrichment of variant alleles in the disease group compared to healthy controls. GWAS conducted in the placenta has identified SNPs upstream of the FLT1 locus, rs4769613, as harboring the risk allele (T) associated with PE [41]. Despite the limited functional understanding offered by GWAS approaches, the use of combinatorial approaches that exploit knowledge from different types of datasets helps draw these functional conclusions [42][43][44]. Recently, the rs4769613 locus was genotyped in a cohort of placental samples, including 57 PE and 277 control; the presence of the risk allele correlated with changes in the placental gene expression level of FLT1 only in the PE group, defining rs4769613 as a conditional eQTL [31]. This suggests that the risk variant, when present in the placentas of mothers with PE, favors the increase in expression of the FLT1 gene and, therefore, could be located in a regulatory region for the gene itself, targeted by regulatory factors that are expressed specifically in the PE placenta.
In our work, we identified multiple genes that presented a promising 'genotype and disease' interaction, which could therefore represent placental conditional eQTLs relevant to disease etiology. The genes found during this stage are supposed to be influenced by the disease either exclusively or mainly but with an influence of other factors.
Among these, AQP11, IL36R, FUT10, PTTG1, CBLB, and DNAJC15 are interesting both in terms of behavior, as well as the documented role in placental function.
Among the 'disease exclusive' genes represented in detail (in green boxes), there are PTTG1, DNAJC15 (Figure 6), TAS2R64P and LINC00654 (Figure 7), C8ORF89 and ERICH1 (Figure 8). The profiles of expression are very different; for instance, in the case of ERICH1, the SNP influence of the profile is the same way in disease and controls, but overall in the disease, the expression is systematically inferior. By contrast, C8ORF89 has an SNPdependent profile in controls but not in disease; PTTG1 is overexpressed in disease but only rather in specific SNP configuration.
PTTG1 has been involved in trophoblast invasion through the regulation of matrix metalloproteases MMP2 and MMP9 by regulating integrin/rho signalling [34]. AQP11 belongs to the family of water transporters through the cellular and endoplasmic reticulum membrane. It has recently been implicated in redox regulation, and other members of the same family, such as AQP, were found to be involved in the mechanism of aspirin action in preventing PE [45][46][47]. IL36RN is highly expressed in the placenta, where it could be involved in trophoblast proliferation [48]. FUT10, involved in the fucosylation of proteins, has been shown to promote the binding of TNF to its receptor, thus promoting the inflammatory reaction, which is relevant for placental diseases [49]. Interestingly, we identified a significant interaction between disease and genotype on the levels of FUT10. DNAJC15 is reported as an important factor in mitochondrial function and regulation of the respiratory chain. It has been abundantly demonstrated by us and many others how these cascades of oxidative phosphorylation and oxidative stress are pivotal in placental diseases [50,51]. CBLB has been recently shown to influence Natural Killer cell differentiation [52] and Uterine NK cells are pivotal for efficient implantation. In summary, all these genes may be involved in placental disease by a contribution to various pathways, which explains why they are not explicitly regrouped in a biological function.

Study Limitations and Conclusions
Amongst the limitations of our study, there is, of course, the sample size, which makes our data a preliminary incitement that needs confirmation in further studies. The choice to subtract extensively the covariate effect encompasses a risk of missing positive hits (Type II error), but our objective was to decrease the Type I error and minimize the finding of false positives. Some additional analyses could be performed by removing some of the covariates to detect eSNPs influenced by sex, but this is beyond the objective of our current work. Furthermore, among the control placentas, one-third were from natural deliveries, which we included in our study by adding the delivery mode as an additional covariate). Gestational age was found to induce 10% of the variability, consistent with previous observations in the field. Subtracting its effect and using it as a covariate might be considered a mathematical trick that could mask complex interactions between the disease effects and the gestational age effects. Nevertheless, it is currently difficult to adopt a better strategy in the field of placental disease.
The approach based on two-hits, using unusually the disease as a covariate and then adding it afterward, allows by subtraction to identify novel couples of SNP-eGenes that may be of interest for placental disease studies or other cases of human diseases where confounding factors are important. We hope that our work will be a starting point for researchers using other complex datasets available for analysis.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cells12040578/s1, Figure S1, Results of the Principal Component Analysis performed to assess population stratification and establish admixture. Table S1 present the parameters for threshold evaluation of the residual variance, Table S2 give the dataset that was used for the analyses, Tables S3 and S4 list statistically significant couples eSNP-eGene for the dataset including disease as a covariate and for the one excluding it. Supplementary Tables S5 and S6 lists the input genes with their expression levels, and residual variances, respectively. Table S7 ists the top-eSNP in terms of statistical significance.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The significant data of the study are provided in detail in the Supplementary Tables. Acess to the complete dataset can be obtained on request from the authors.