C1431T Variant of PPARγ Is Associated with Preeclampsia in Pregnant Women

Peroxisome proliferator-activated receptor γ (PPARγ) is essential for placental development, and SNPs in its gene have been associated with increased susceptibility to pregnancy-related diseases such as preeclampsia. Our aim was to investigate the association between preeclampsia and three PPARγ SNPs (Pro12Ala, C1431T, and C681G), which, together with nine clinical factors, were used to build a pragmatic model for preeclampsia prediction. Data were collected from 1648 women in the EDEN cohort, of whom 35 had preeclamptic pregnancies and the remaining 1613 had normal pregnancies. In univariate analysis comparing preeclamptic patients to controls, the SNP C1431T was the only factor significantly associated with preeclampsia (p < 0.05), with a 95% confidence interval for the odds ratio of 4.90 to 8.75. In addition, three methods of multivariate feature selection highlighted eight features as potential predictors of preeclampsia: the maternal C1431T and C681G variants, obesity, body mass index, number of pregnancies, primiparity, cigarette use, and education. These features were then used as input to eight different machine-learning algorithms to create predictive models, whose performance was evaluated using accuracy and the area under the receiver operating characteristic curve (AUC). The boost tree-based model performed best, with accuracy and AUC values of 0.971 ± 0.002 and 0.991 ± 0.001 in the training set and 0.951 and 0.701 in the testing set, respectively. A flowchart based on the boost tree model was constructed to depict the procedure for preeclampsia prediction. This final decision tree shows that the C1431T variant of PPARγ is significantly associated with susceptibility to preeclampsia, and we believe it could be applied in the clinical prediction of preeclampsia in the very early stages of pregnancy.


Introduction
Preeclampsia, which is characterized by high blood pressure and concurrent proteinuria, is a complication of pregnancy that usually manifests after 20-25 weeks of pregnancy [1]. This disease is highly associated with morbidity and mortality for both the mother and the fetus because of its serious risks to fetal maturity and the maternal cardiovascular system [2]. Preeclampsia occurs in 5% to 7% of all pregnant women. A common analytical approach in studies of its risk factors is to apply generalized linear models, which are easy and fast to implement; however, erroneous specification of model parameters or assumptions can bias the results. We therefore applied machine-learning methods that are able to fully consider complex relationships between the predictors and the outcome, with fine-scale argument tuning, so that such potential bias can, to some extent, be diminished. A summary of the study procedure is shown in Figure 1.

Figure 1. Schematic diagram of the study. Of the 2002 pregnant women in the EDEN mother-child cohort study, 1648 satisfied the inclusion and exclusion criteria and were recruited to this study. The dataset was randomly stratified into two parts, a training set and a testing set, according to a 3:1 ratio. Three methods were used to evaluate the importance of correlated features within the training set: logistic regression, lasso regression, and the Boruta algorithm. Eight machine-learning models were built, tuned, and trained on an oversampled training set with five-fold cross-validation (CV), followed by validation on the testing set. The performance of the models was evaluated using accuracy and the area under the receiver operating characteristic curve (AUC). The final model was used to build a decision tree for predicting preeclampsia. GW: gestational week; PE: preeclampsia; SNPs: single nucleotide polymorphisms.

Table 1 presents a summary of maternal clinical features from the control (n = 1613) and preeclampsia (n = 35) groups.
According to t-tests and chi-square tests, the only factor that differed significantly between the two groups was the presence in the maternal genome of the C1431T variant of PPARγ (Table 1). Similarly, logistic regression and analysis of the log odds ratio found that mothers carrying this variant had a higher risk of developing preeclampsia (p < 0.05; Figure 2). The 95% confidence interval of the odds ratio for maternal C1431T ranged from 4.90 to 8.75 (Supplementary Table S2). A comparison of the three genotype models confirmed that the maternal C1431T variant was the only factor associated with a significant difference under the dominant and co-dominant models (Table 2). Clinical factors and genetic data were missing from 9.1% of the dataset (Supplementary Figure S1), but there was no significant change in the results before and after imputation (Table S1).
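For concreteness, a 95% confidence interval for an odds ratio such as the one reported above can be derived from a 2x2 exposure table with Woolf's logit method. The sketch below uses hypothetical counts for illustration only, not the EDEN cohort data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI from a 2x2 table.
    a/b: exposed cases/controls; c/d: unexposed cases/controls."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocals
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts, NOT the study data:
or_, lo, hi = odds_ratio_ci(12, 80, 23, 1533)
print(f"OR={or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A CI that excludes 1 corresponds to a significant association, which is how the interval of 4.90 to 8.75 for C1431T should be read.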
Overview of Maternal Clinical Features

Figure 2. The log odds ratios of all clinical features are presented on the left, with corresponding p-values on the right. Bars indicate the mean and 95% confidence interval of the log odds ratio. A p-value less than 0.05 indicates significance. Age: maternal age at delivery; Height: maternal height; Education: maternal education; BMI: body mass index before pregnancy; Smoking: cigarette use.

Selection of Candidate Features for the Prediction of Preeclampsia Using Three Methods: Boruta Algorithm, Lasso Regression, and Logistic Regression
Three methods were used for feature selection: the Boruta algorithm, lasso regression, and logistic regression. The Boruta algorithm highlighted four features as important for predicting preeclampsia: education, the maternal C681G and C1431T variants, and obesity; relaxing the inclusion criteria added maternal age at delivery and BMI, giving six important features (Figure 3A). Lasso regression singled out the maternal C1431T and C681G variants and primiparity (Figure 3B), while logistic regression selected maternal C1431T, number of pregnancies, primiparity, number of cigarettes, education, and BMI (Figure 3C). Therefore, the final features included in the model-building process were maternal C681G and C1431T, obesity, BMI, number of pregnancies, primiparity, number of cigarettes, and education.

Figure 3. (A) Feature selection with the Boruta algorithm, a wrapper built around the random forest classification algorithm in which "shadow features" are generated by shuffling the values of each column of the dataset. This yields a range of importance values for each feature; blue boxplots depict the quantiles (minimum, mean, and maximum) of the Z-score-based thresholds used for selecting features. Red boxplots represent features found to be unimportant, green boxplots indicate important features, and yellow boxplots show features that may be important depending on the criteria used. (B) Feature selection with lasso regression, which minimizes a penalized cost function that shrinks the coefficients of useless or redundant features to 0, with a soft importance threshold of 0.5. (C) Feature selection with logistic regression, a model of a binary dependent variable, with a soft importance threshold of 1.
Age: maternal age at delivery; Height: maternal height; Education: maternal education; BMI: body mass index before pregnancy; Smoking: cigarette use.

Modeling Based on Machine Learning
The dataset was divided into a training set and a testing set. Because preeclampsia cases were scarce in the dataset, the training set was oversampled with respect to the incidence of preeclampsia in order to balance negative and positive cases, as described in the Materials and Methods section. There were no differences between the datasets before and after oversampling in the representation of categorical factors or in the means and standard deviations of numeric factors (Supplementary Table S3). The distributions of positive and negative cases in the training set overlapped widely before and after balancing (Supplementary Figure S2D,E). The optimal combination of features was selected following a thorough tuning process based on the optimal AUC of the models (Supplementary Figure S6).
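The oversampling step can be illustrated with a minimal sketch. The study used the R package `imbalance`; the Python stand-in below simply duplicates randomly chosen minority-class rows until a target positive-to-negative ratio is reached, which conveys the idea without reproducing that package's algorithm.

```python
import random

def oversample(rows, labels, pos_label=1, ratio=3 / 5, seed=42):
    """Duplicate minority-class rows until positives/negatives ~= ratio.
    Returns new rows plus 0/1 labels (negatives recoded as 0)."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == pos_label]
    neg = [r for r, y in zip(rows, labels) if y != pos_label]
    target = int(len(neg) * ratio)          # desired number of positives
    extra = [rng.choice(pos) for _ in range(max(0, target - len(pos)))]
    new_rows = neg + pos + extra
    new_labels = [0] * len(neg) + [1] * (len(pos) + len(extra))
    return new_rows, new_labels
```

Note that only the training set should be oversampled; duplicating cases before the train/test split leaks information into the testing set, a point the Discussion returns to.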
The results of the final eight machine-learning models, with respect to model accuracy and AUC, are shown in Table 3 for both the training and testing sets. The optimal model was the boost tree model, with accuracy and AUC values in the training set of 0.971 and 0.991, respectively, suggesting possible overfitting. We then validated the model on the testing set, where the accuracy and AUC were 0.951 and 0.701, respectively. The diagnostic performance (AUC) of each of the machine-learning models is depicted in Figure 4. Table 3. Predictive performance of the eight machine-learning models. The performance of the models was evaluated on the training set by accuracy and AUC, followed by verification on the testing set. The values of the optimal model are shown in bold. Data are presented as mean ± S.D.
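The AUC used throughout equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case (the Mann-Whitney formulation). A minimal rank-comparison sketch, independent of any modeling library:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly; ties count as half a win. labels are 0/1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This makes the gap between the training AUC (0.991) and testing AUC (0.701) easy to interpret: on the testing set, a positive case outranks a negative one only about 70% of the time.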


Prediction Procedures of Boost Tree
A boost tree-based decision tree, along with a heatmap of scaled feature values, was constructed using the balanced training set (Figure 5). The clinical features that were determined to be important for prediction included the maternal PPARγ genotypes, primiparity, number of pregnancies, obesity, BMI, and education. As expected from the univariate tests, the maternal C1431T variant was the first key branching node of the tree. The simplicity of this procedure is intended to facilitate its possible use in clinical practice for the prediction of preeclampsia.
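The kind of branching logic such a flowchart encodes can be sketched as nested conditions. The node order below follows the description of Figure 5 (C1431T first, then primiparity and C681G, then number of pregnancies and BMI), but every threshold is hypothetical; the real cut points are those in the published tree, not in this illustration.

```python
def predict_preeclampsia_risk(c1431t, primiparity, c681g, n_pregnancies, bmi):
    """Illustrative decision flow with HYPOTHETICAL thresholds.
    Genotype coding: 1 = no mutated allele, 2 = one, 3 = two.
    primiparity: 1 = no, 2 = yes. Returns 'higher' or 'lower'."""
    if c1431t >= 2:                       # carrier of the C1431T variant
        if primiparity == 2 or c681g >= 2:
            return "higher"
        return "higher" if bmi >= 30 else "lower"
    # Non-carrier branch: fall back to later nodes (hypothetical cutoffs)
    if n_pregnancies >= 3 and bmi >= 30:
        return "higher"
    return "lower"
```

The appeal of this representation in clinical practice is that each step is a question answerable at the bedside or from pre-pregnancy records.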

Figure 5. Boost tree and heatmap for predicting preeclampsia. The nodes of the boost tree represent the contributing features, while the branches depict threshold values. The first row of the heatmap presents the outcomes, and the lower rows present the predictor values; the colors represent the scaled value of each feature for each sample. Primiparity and maternal C681G function as the second nodes, followed by the number of pregnancies and BMI, while education played a less important role in the final decision. For genotypes, "1" represents no mutation in the allele, "2" a single mutation, and "3" a double mutation; for primiparity and obesity, "1" means "no" and "2" means "yes".

Discussion
In the present study, we identify a significant association between the C1431T SNP of PPARγ and preeclampsia. We also present a decision tree based on a boost tree model that represents a possible diagnostic procedure for pragmatic preeclampsia prediction, which could benefit early diagnosis regardless of gestational age.
The relationship between PPARγ and preeclampsia has been widely reported. For example, the expression of PPARγ was increased in late-onset, but not early-onset, preeclampsia compared to normal pregnancy [40]; activation of PPARγ by its agonist rosiglitazone may relieve preeclampsia by reducing uterine perfusion pressure and benefiting the placental vasculature [41,42]; and PPARγ plays an important role in controlling endothelial function and blood pressure homeostasis, which is crucial in the pathology of preeclampsia [43]. However, a previous report found no relationship between PPARγ polymorphisms and the occurrence of preeclampsia [44]. In our study, the presence of the C1431T variant of PPARγ in the mother played a significant role in distinguishing between preeclamptic and normal pregnancies, whereas no such role was detected for either the Pro12Ala or C681G SNPs. Interestingly, even though a chi-square test found no evidence of a link between C681G and preeclampsia, the inclusion of this variant in the final predictive model, as suggested by the feature selection process, improved both the accuracy and AUC values of the machine-learning model.
Machine-learning algorithms are widely used to obtain better predictive accuracy than conventional generalized linear models in decision-making scenarios, and they offer alternative strategies for the diagnosis of diseases based on clinical features [45][46][47][48]. Currently, eight machine-learning algorithms are in wide use for modeling and building diagnostic procedures from appropriate medical history: elastic net regression, random forest, support vector machine, decision tree, k-nearest neighbor, naïve Bayes, boost tree, and multilayer perceptron [49,50]. Several studies have compared different machine-learning methods for disease prediction under various clinical conditions, with mixed results [51][52][53], suggesting that the optimal algorithm may vary depending on context. In our study, we applied and compared these eight machine-learning algorithms and addressed two common challenges in modeling: insufficient positive cases and overfitting. To counter the former, we oversampled the positive cases to prevent inaccuracy due to the imbalance between positive and negative cases [54,55] and evaluated both the balanced and unbalanced versions of the training and testing sets. To avoid the latter, we used different approaches for preprocessing our dataset and applied five-fold CV to the training set, followed by validation on the testing set.
First, the original dataset was split into a training set and a testing set without balancing. In this case, the boost tree was the optimal model, with accuracy and AUC values of 0.99 and 0.92 on the training set and 0.98 and 0.77 on the testing set, respectively (Table S4 and Figure S3). We then repeated this procedure, but balanced the original dataset before splitting it into the training and testing sets. The boost tree remained the optimal method, with accuracy and AUC values of 0.957 and 0.990 for the training set and 0.975 and 0.996 for the testing set, respectively (Table S5 and Figure S4). We suspected that overfitting may have influenced this model, owing to the internal relationship between the training set and the testing set that resulted from the data simulation. Lastly, we balanced only the training set, by oversampling the positive cases, and kept the testing set as it was after the split of the original dataset; those results are shown in Table 4 and Figure 4. Additionally, in the final model, we performed failure mode and effects analysis and calculated the F-score to verify the suitability of accuracy as a metric. We obtained high values for both the training and testing sets (Table S6), generally in line with the accuracy values. However, the AUC value of the testing set in the final boost tree model was not high enough to be considered convincing; we hypothesize that this may be due to the relatively small number of positive cases in the testing set. Further studies with larger datasets are needed to resolve this question. Table 4. Confusion matrix. Each row of the matrix represents the predicted condition, while each column represents the true condition, allowing more detailed analysis including accuracy, sensitivity, specificity, precision, and F1-score.
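The metrics derived from the confusion matrix in Table 4 follow directly from the four cell counts. A compact sketch of the standard definitions:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, precision, and F1-score
    from confusion-matrix counts (true/false positives/negatives)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)      # recall: positives caught
    specificity = tn / (tn + fp)      # negatives correctly cleared
    precision = tp / (tp + fp)        # predicted positives that are real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1
```

These definitions also show why accuracy alone can mislead on imbalanced data: with few true positives, a model can score high accuracy while precision and F1 remain low, which motivated the F-score check described above.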
In our study, the boost tree consistently yielded the highest accuracy and AUC values for both the training and testing sets, regardless of the preprocessing method used. For this reason, we used this approach to build a clinical flowchart for the evaluation of preeclampsia. This model outperformed both the screening methods currently recommended by the National Institute for Health and Care Excellence (a combination of maternal factors, uterine artery pulsatility index, mean arterial pressure, and PlGF; 41% accuracy) and ACOG guidelines (94% accuracy) [56], as well as methods based on biomarkers such as soluble fms-like tyrosine kinase 1 (sFlt1) and PlGF (77% accuracy) [57]. In addition, our model can be used to evaluate women before they become pregnant, as all of the predictors can be examined pre-pregnancy, and thus facilitates earlier diagnosis than existing alternatives [5,50,56,58,59]. However, despite the high degree of accuracy achieved here, the clear procedure for prediction, and the potential for earlier diagnosis, our model has some deficiencies that should be addressed. First, further studies in other regions or nations are needed, since patterns of SNPs can vary among populations and this can lead to inconsistent conclusions [60]. Second, certain clinical data were missing for some of our study subjects, and future studies on larger samples may be able to avoid this problem. Lastly, a larger number of positive cases should be included to balance the representation of preeclamptic and healthy pregnancies. Even though an appropriate algorithm was used to account for the imbalance here, it is possible that the difference between simulated and real cases may subtly influence model performance.

Study Population
The EDEN study (study of pre- and post-natal determinants of children's growth and development) is an ongoing mother-child cohort study set up in two locations in France, Nancy and Poitiers. A total of 2002 pregnant women were enrolled, on average at the 24th week of gestation. More details about the EDEN study are available in [39]. The study received approval from the ethics committee (Comité Consultatif de Protection des Personnes dans la Recherche Biomédicale, N° 02-70, 12 December 2002) of Kremlin Bicêtre Hospital and from CNIL (Commission Nationale Informatique et Liberté), the French data protection authority. Written informed consent was obtained twice from parents: at enrollment and again after the child's birth. All research was performed in accordance with the relevant guidelines and regulations. Of the 2002 pregnancies, 1648 met the inclusion/exclusion criteria (Figure 1) to be included in the present work.

Clinical Features
At 24-28 gestational weeks, each mother was clinically examined and asked to complete a self-administered questionnaire. Clinical features, such as maternal weight and height, were measured during the examination, while data on personal history, such as weight before pregnancy, educational level (scored from 1 to 10: 1 = none; 2, 3, 4 = tertiary; 5, 6, 7 = secondary; 8, 9 = high school; 10 = other), and smoking habits, were collected during an initial interview. Additional clinical features, such as gestational age at delivery and the number of previous pregnancies, were extracted from clinical records. Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in meters (kg/m²).
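The BMI formula referred to above is a one-line computation; a trivial helper for completeness (weight in kilograms, height in meters):

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2
```

For example, `bmi(70, 1.75)` gives roughly 22.9, within the conventional normal range, while values of 30 and above correspond to the obesity category used as a feature in this study.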

Genotyping
Maternal blood samples were collected during pregnancy and stored in −80 °C freezers with alarm control. DNA was extracted from leukocytes using the QIAamp DNA Blood Mini Kit (QIAGEN) according to the manufacturer's instructions. Genotyping of the SNPs was conducted using one of two techniques. For the first 729 women enrolled in the study, a LightCycler apparatus (Roche Diagnostics, Meylan, France) and hybridization probes were used, with primers and probes designed and synthesized by TIB MOLBIOL (Berlin, Germany). The PCR mixture (10 µL total volume) contained 20 ng of DNA, 1X Fast Start DNA master hybridization probes, 0.5 µM primers, 0.15 µM probes, and 3 mM MgCl₂. Melting curve analysis was used to monitor SNP genotyping. For the remaining 1024 women, the TaqMan procedure (Applied Biosystems, Foster City, CA, USA) was used with similar reagent preparation, as described above. DNA samples were amplified by PCR on a 96-well plate with the following cycling parameters: denaturation at 95 °C for 10 min, followed by 40 cycles of 92 °C for 15 s and 60 °C for 1 min. The results of the TaqMan assays were read on a 7900HT Fast Real-Time PCR System (Applied Biosystems, Foster City, CA, USA), and alleles were called using SDS software (Applied Biosystems, Foster City, CA, USA). The genotyping call rate for each of the three SNPs was above 98%, including the duplicate controls. Further details on the genotyping primers, probes, and PCR conditions are available from the corresponding author.

Basic Statistical Analyses
Basic statistical analyses were performed with R software (version 4.0.4) and base packages in RStudio (PBC, Boston, MA, USA, http://www.rstudio.com/), an integrated development environment for R. Maternal clinical features were described separately for women with and without preeclampsia. The distribution of missing data was visualized using the R package mice (version 3.11) [61], and imputation of missing data was carried out and visualized using missForest (version 1.4) [62]. A general view of the features and individuals in the dataset was obtained by principal component analysis using FactoMineR (version 2.3) and visualized with factoextra (version 1.0.7), taking three components into consideration [63]. Student's t-test was used to compare continuous features between groups, and chi-square tests were used for discrete features; Fisher's exact test was applied when any cell of a contingency table contained fewer than five observations. Multivariate logistic regression was used to calculate the odds ratio for preeclampsia, followed by log-transformation. A p-value less than 0.05 was considered statistically significant. Since each SNP has a major allele (M) and a minor allele (m), the genotype can be a major-allele homozygote (MM), a heterozygote (Mm), or a minor-allele homozygote (mm). We therefore compared allele frequencies among groups using three genetic models: a dominant model (MM versus Mm + mm), a recessive model (MM + Mm versus mm), and a co-dominant model (MM + mm versus Mm). Chi-square tests were used to analyze the ratios in the different groups under the different genetic models. An R script with reproducible code and detailed comments is provided in the Supplemental Materials (Text S1).
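The three genetic models amount to different ways of collapsing the three genotype counts into two groups before a standard 2x2 test. A sketch of that collapsing step, plus a plain chi-square statistic (no continuity correction); the counts in the example are hypothetical:

```python
def collapse(genotype_counts, model):
    """genotype_counts = (MM, Mm, mm). Returns a two-group split
    according to the chosen genetic model."""
    MM, Mm, mm = genotype_counts
    if model == "dominant":       # MM versus Mm + mm
        return MM, Mm + mm
    if model == "recessive":      # MM + Mm versus mm
        return MM + Mm, mm
    if model == "codominant":     # MM + mm versus Mm
        return MM + mm, Mm
    raise ValueError(f"unknown model: {model}")

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no correction)."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den
```

Collapsing the case and control genotype counts with `collapse` and feeding the two resulting pairs into `chi_square_2x2` reproduces the comparisons summarized in Table 2.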

Feature Selection
The original dataset was randomly divided into a training set and a testing set at a ratio of 4:3. The training set was used for feature selection and model building, while the testing set was used to evaluate the models. To account for the imbalance of positive and negative cases in the training set, the R package imbalance (version 1.0.2) was used to oversample the smaller class to a positive-to-negative ratio of 3:5 [64]. Feature selection was then performed using three algorithms: logistic regression, lasso regression, and the Boruta algorithm. The Boruta algorithm, executed using the Boruta package [65], is a wrapper built around the random forest classification algorithm in which "shadow features" are generated by shuffling the values of each column of the dataset; features are selected based on their Z-scores, ranging from the minimum to the maximum. Lasso regression minimizes a penalized cost function that shrinks the coefficients of useless or redundant features to 0, thereby discarding them. Logistic regression fits a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable. Feature importance was calculated as the sum of the decrease in error when splitting on a feature. Soft thresholds were determined based on the "principle of the mean": the importance of a selected feature should be higher than the mean importance of all features. In this way, the soft threshold was set to 0.5 in lasso regression, 1 in logistic regression, and 2 with the Boruta method. The features highlighted by the three methods were then curated manually based on preliminary screening of clinical features using univariable logistic regression analysis, odds ratios, and clinical knowledge.
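The "principle of the mean" described above can be sketched generically: keep the features whose importance exceeds the mean importance of all features. This is a simplified stand-in; the study fixed per-method thresholds (0.5, 1, and 2) derived from this principle rather than recomputing the mean each time.

```python
def select_by_mean(importances):
    """importances: dict mapping feature name -> importance score.
    Returns the sorted names of features above the mean importance."""
    mean = sum(importances.values()) / len(importances)
    return sorted(f for f, v in importances.items() if v > mean)
```

For example, with scores {"C1431T": 3.0, "height": 0.5, "age": 0.5}, the mean is about 1.33 and only "C1431T" survives the cut.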
An R script with reproducible code and detailed comments is provided in the Supplemental Materials (Text S1).

Modeling and Evaluation
Machine-learning model building was performed with the tidymodels series of packages (https://www.tidymodels.org/) written by the RStudio team, including tidymodels (version 0.1). An R script with reproducible code and detailed comments is provided in the Supplemental Materials (Text S1), and a package intended to automate this process is under development. Using the features selected in the previous step, we chose eight widely used machine-learning algorithms (elastic net regression, support vector machine, random forest, boost tree, decision tree, k-nearest neighbor, naïve Bayes, and multilayer perceptron) to build and evaluate models. Argument tuning was performed using a 1000-candidate maximum entropy design, a method for choosing argument combinations based on Shannon's definition of entropy as the amount of information. The oversampled training set was then resampled with five-fold cross-validation, with two repeats. To evaluate the performance of the models, receiver operating characteristic (ROC) curves and AUC values were used; the closer the AUC value is to 1, the better the performance. The quality of each model was also evaluated using accuracy, sensitivity, specificity, and the adjusted F1-score, calculated from the confusion matrix in Table 4 and Equations (1)-(5). The testing set was retained for final validation.
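The five-fold resampling step relies on stratification so that each fold preserves the (scarce) proportion of preeclampsia cases. The study used tidymodels' resampling in R; the Python sketch below conveys the same idea by distributing each class's indices round-robin across folds.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices; each fold keeps roughly the class
    proportions of the full label vector."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)                 # randomize within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)        # deal indices round-robin
    return folds
```

In cross-validation, each fold serves once as the held-out set while the model trains on the remaining k−1 folds, and the metrics are averaged across folds (and across repeats, here two).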

Conclusions
By comparing preeclamptic and healthy patients, our study reveals a significant association between preeclampsia and the C1431T variant of PPARγ. By combining data on this variant with several clinical features, including the maternal C681G variant, obesity, body mass index, number of pregnancies, primiparity, cigarette use, and education, we built an efficient boost tree model that can predict preeclampsia in very early pregnancy. This model could be invaluable for screening high-risk pregnancies in clinical practice and could serve as a decision-making reference for clinicians.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/life11101052/s1. The supplementary materials provide more details about the study and describe additional analyses conducted to support the results shown in the text. They include information on the procedures used to clean the dataset (Table S1, Figure S1), the evaluation of clinical features (Figure S2), a summary of the odds ratios of clinical features (Table S2), more information on machine-learning-based modeling (Tables S3-S6, Figures S3-S5), and appendix materials regarding parameter tuning (Figure S6).

Data Availability Statement:
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.