A Prediction Model for Preoperative Risk Assessment in Endometrial Cancer Utilizing Clinical and Molecular Variables

The utility of comprehensive surgical staging in patients with low risk disease has been questioned. Thus, a reliable means of determining risk would be quite useful. The aim of our study was to create the best performing prediction model to classify endometrioid endometrial cancer (EEC) patients into low or high risk using a combination of molecular and clinical-pathological variables. We then validated these models with publicly available datasets. Analyses between low and high risk EEC were performed using clinical and pathological data, gene and miRNA expression data, gene copy number variation and somatic mutation data. Variables were selected to be included in the prediction model of risk using cross-validation analysis; prediction models were then constructed using these variables. Model performance was assessed by area under the curve (AUC). Prediction models were validated using appropriate datasets in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. A prediction model with only clinical variables performed at 88%. Integrating clinical and molecular data improved prediction performance up to 97%. The best prediction models included clinical, miRNA expression and/or somatic mutation data, and stratified pre-operative risk in EEC patients. Integrating molecular and clinical data improved the performance of prediction models to over 95%, resulting in potentially useful clinical tests.


Introduction
Endometrial cancer is the most common gynecologic cancer diagnosed in the United States, with an estimated 61,380 new cases and 10,920 deaths in 2017 [1]. Endometrioid endometrial cancer

Results
The flowchart of patients included in the UI analysis is represented in Figure 1. Clinical and pathological characteristics of these patients are described in Table 1.

Results
The flowchart of patients included in the UI analysis is represented in Figure 1. Clinical and pathological characteristics of these patients are described in Table 1.

Survival Analysis
Five-year survival was 98%for UI low risk EEC patients and 75% for high risk patients (p < 10 −4 , Figure 2A). For TCGA dataset, five-year survival was 95% for low risk patients and 75% for high risk EEC patients (Figure 3, p < 10 −4 ). In a multivariate analysis of the UI dataset, previous stroke, number of positive LN, and risk level were independently associated with disease-specific survival (p < 0.05, Figure 2B). These analyses demonstrate that level of risk is an independent outcome measure and that low risk patients have excellent five-year survival.

Survival Analysis
Five-year survival was 98%for UI low risk EEC patients and 75% for high risk patients (p < 10 −4 , Figure 2A). For TCGA dataset, five-year survival was 95% for low risk patients and 75% for high risk EEC patients (Figure 3, p < 10 −4 ). In a multivariate analysis of the UI dataset, previous stroke, number of positive LN, and risk level were independently associated with disease-specific survival (p < 0.05, Figure 2B). These analyses demonstrate that level of risk is an independent outcome measure and that low risk patients have excellent five-year survival.

Survival Analysis
Five-year survival was 98%for UI low risk EEC patients and 75% for high risk patients (p < 10 −4 , Figure 2A). For TCGA dataset, five-year survival was 95% for low risk patients and 75% for high risk EEC patients (Figure 3, p < 10 −4 ). In a multivariate analysis of the UI dataset, previous stroke, number of positive LN, and risk level were independently associated with disease-specific survival (p < 0.05, Figure 2B). These analyses demonstrate that level of risk is an independent outcome measure and that low risk patients have excellent five-year survival.

Variable Selection for Prediction Modeling
RNA extracted from primary tumor tissue samples of 62 EEC patients was of sufficient quality for subsequent RNA sequencing (RNA-seq, Figure 1). This sub-cohort included tissue from 26 high risk patients and 36 low risk patients. RNA-seq analysis resulted in gene expression data for 26,336 genes and 1916 micro-RNAs (miRNAs), along with identification of 12,340 somatic mutations and 26,720 segments with gene copy number variations (CNV). Cross-validation selection analysis identified 255 genes, 55 miRNAs, 398 somatic mutations and 846 CNVs that were most informative in the prediction process ( Figure 4). Only those variables selected by cross-validation were included in prediction analyses.

Variable Selection for Prediction Modeling
RNA extracted from primary tumor tissue samples of 62 EEC patients was of sufficient quality for subsequent RNA sequencing (RNA-seq, Figure 1). This sub-cohort included tissue from 26 high risk patients and 36 low risk patients. RNA-seq analysis resulted in gene expression data for 26,336 genes and 1,916 micro-RNAs (miRNAs), along with identification of 12,340 somatic mutations and 26,720 segments with gene copy number variations (CNV). Cross-validation selection analysis identified 255 genes, 55 miRNAs, 398 somatic mutations and 846 CNVs that were most informative in the prediction process ( Figure 4). Only those variables selected by cross-validation were included in prediction analyses. Variables that passed a cut-off p-value < 0.05 in a univariate linear model and were present in each fold of the k-fold cross-validation were selected. In A, B and C: patients are on the X axis and molecular variables are on the Y axis. (A) Heatmap of expression of 255 selected genes out of a total of 26,336 genes. Normalized gene expression is represented in a red-green scheme from lower to higher expression, respectively; (B) Heatmap of 55 selected miRNAs out of a total of 1,916 miRNAs. Normalized miRNA expression is represented in a blue-orange scheme from lower to higher expression, respectively; (C) Heatmap of 398 selected somatic mutations out of a total of 12,340 mutations. Each somatic mutation for each patient is represented in purple. Grey represents non-mutated genes; (D) Manhattan plot of 846 selected loci with copy number variation out of a total of 26,720 loci. The Y axis represents how many times the locus was involved in the prediction process with cross-validation (k-fold with 25 replications). The X axis represents the chromosomal location. The horizontal red line denotes 25 replications. See the Variable Selection Section 4 for more details.

UI Prediction Models
To build models that include only one type of data, or data class, we used only the variables that were selected with the cross-validation analysis. Accordingly, the input variables for models with Variables that passed a cut-off p-value < 0.05 in a univariate linear model and were present in each fold of the k-fold cross-validation were selected. In A, B and C: patients are on the X axis and molecular variables are on the Y axis. (A) Heatmap of expression of 255 selected genes out of a total of 26,336 genes. Normalized gene expression is represented in a red-green scheme from lower to higher expression, respectively; (B) Heatmap of 55 selected miRNAs out of a total of 1916 miRNAs. Normalized miRNA expression is represented in a blue-orange scheme from lower to higher expression, respectively; (C) Heatmap of 398 selected somatic mutations out of a total of 12,340 mutations. Each somatic mutation for each patient is represented in purple. Grey represents non-mutated genes; (D) Manhattan plot of 846 selected loci with copy number variation out of a total of 26,720 loci. The Y axis represents how many times the locus was involved in the prediction process with cross-validation (k-fold with 25 replications). The X axis represents the chromosomal location. The horizontal red line denotes 25 replications. See the Variable Selection Section 4 for more details.

UI Prediction Models
To build models that include only one type of data, or data class, we used only the variables that were selected with the cross-validation analysis. Accordingly, the input variables for models with one type of data were built using 17 clinical variables, 255 mRNAs, 55 miRNAs, 398 somatic mutations and 846 CNVs. The prediction analysis with Lasso discarded those variables that had no influence in the prediction model and incorporated the most informative, or resulting variables, for the prediction process: 7 clinical variables, 38 mRNAs, 28 miRNAs, 35 somatic mutations, and 65 CNVs ( Table 2, Prediction models including one data class). For prediction analyses using more than one data class, we used the resulting variables and integrated them in the same model, as these were the best predictors for that data class ( Table 2, Prediction models including two, three, four and five data classes). Each prediction model with more than one data class had different combinations of resulting variables, depending on which were more informative for that particular integrative model (Table 2). Table 2. Prediction models for levels of risk using diverse clinical, pathological and molecular data. Models that included only 1 data class used all variables selected with the cross-validation analysis (input variables: 17 clinical features, 255 mRNAs, 55 miRNAs, 398 somatic mutations and 846 CNVs). The Lasso analysis selected only the most informative resulting variables for the prediction process: 7 clinical features, 38 mRNAs, 28 miRNAs, 35 somatic mutations, and 65 CNVs. For prediction analysis using more than one data class, we used the resulting variables. Model performances were measured by AUC and their 95% confidence interval (CI). Models with the best performance are marked with * Prediction models using only one data class. # Number of variables  Table 2). Models of selected molecular variables (gene expression, miRNA expression, CNVs or somatic mutations) did not perform as well. However, integrating clinical data with one or more of the molecular data categories improved prediction performance by 10-15% as compared to models with only single classes of selected molecular variables ( Table 2). The best prediction models included clinical data combined with miRNA expression and/or somatic mutations (models M2-B, M2-C and M3-C in Table 2, Figure 5). Although prediction models including selected variables from gene expression performed fairly in UI models, we were not able to replicate those results in the validation set.
prediction models included clinical data combined with miRNA expression and/or somatic mutations (models M2-B, M2-C and M3-C in Table 2, Figure 5). Although prediction models including selected variables from gene expression performed fairly in UI models, we were not able to replicate those results in the validation set.

TCGA/GEO Replication
To assess how UI models perform in independent datasets, we replicated the analysis with the same variables that were selected by cross-validation in the UI database and used those variables in TCGA and GEO datasets (Table 3). While prediction models including only selected clinical variables performed rather well, other models with only molecular variables, as well as models with combinations of clinical and molecular variables, performed 10 to 20 AUC percentage points lower than UI models. This could be explained in part because some of the variables in the UI dataset were not available in TCGA and GEO datasets. Another factor that must be considered is that patients in the different cohorts are abstracted from different populations that may have dissimilar genetic background compositions. Better performing models included clinical data and selected CNVs, with somatic mutations and/or miRNA variables, all had AUC performances of over 75% (TCGA model M3-B, TCGA model M3-E, and TCGA model M4-D in Table 3). Models including gene expression did not replicate as well. Part of this decline in performance may be because two selected transcripts in the UI model (FAM134B and LOC101927701, a non-coding RNA) had no expression data in TCGA dataset.

TCGA/GEO Replication
To assess how UI models perform in independent datasets, we replicated the analysis with the same variables that were selected by cross-validation in the UI database and used those variables in TCGA and GEO datasets (Table 3). While prediction models including only selected clinical variables performed rather well, other models with only molecular variables, as well as models with combinations of clinical and molecular variables, performed 10 to 20 AUC percentage points lower than UI models. This could be explained in part because some of the variables in the UI dataset were not available in TCGA and GEO datasets. Another factor that must be considered is that patients in the different cohorts are abstracted from different populations that may have dissimilar genetic background compositions. Better performing models included clinical data and selected CNVs, with somatic mutations and/or miRNA variables, all had AUC performances of over 75% (TCGA model M3-B, TCGA model M3-E, and TCGA model M4-D in Table 3). Models including gene expression did not replicate as well. Part of this decline in performance may be because two selected transcripts in the UI model (FAM134B and LOC101927701, a non-coding RNA) had no expression data in TCGA dataset. Table 3. External replication of prediction models for levels of risk. In the analysis of TCGA and GEO datasets, we used resulting variables from UI analyses of 1 data class or type; these results are included for comparison in this table and denoted as "UI model #") (The definitions for input variables and resulting variables are the same as in Table 2). In most cases, variables resulting from the UI analyses were not available in external sets (marked by *). Model performances were measured by AUC and their 95% confidence interval (CI).

TCGA Validation
For external validation of UI prediction models in TCGA data, we chose models with high performance using the UI dataset that also replicated the best in TCGA dataset: (1) Model M2-B: clinical and miRNA data; (2) model M2-C: clinical and somatic mutation data; (3) model M3-C: clinical, gene expression (mRNA), and somatic mutation data; and (4) model M3-D: clinical, miRNA, and somatic mutation data (Tables 3 and 4). For each model, we defined a threshold, or cut-off (Table 4), which is necessary to test the predictive accuracy of the models. Thresholds are set at the values for the score of the model above which patients were classified as high risk with a sensitivity of >90% (see also Section 4). Models that were best validated in an independent database (TCGA) included clinical and miRNA data (M2-B) and somatic mutations (M3-D), with accuracies of around 60%, and negative predictive values (NPV) of >75%. Table 5 details which variables were used in validated models. Scores for each model are calculated by multiplying the value of the clinical variable, normalized miRNA expression, normalized gene expression, or number of somatic mutations by the weight of the variable, followed by adding each of these weighted values (for additional details, see Tables A1-A3). Table 4. Validation of prediction models using data from TCGA EEC dataset. As described in the Methods section, the threshold cut-off values were selected to attain a sensitivity of around 90% and the specificity and negative predictive value. The goal was to create models that would capture at least 90% of the high-risk cases, while ruling out most low risk ones. Recurrence probability scale *: 1/(exp(-score) + 1), where score is the resulting value of the prediction model on a log scale.

Model M3-D Clinical + miRNAs + Mutations
Recurrence probability scale * Cut-off = 0.  Table 5. Variables included in the best performing prediction models. Performance of these models was measured by AUC, sensitivity, specificity, and positive and negative prediction values in both the UI (testing) and TCGA (validation) datasets. * Weights for clinical variables were calculated as the exponential of the estimate in the Lasso regression model. Then, the score of the variable is calculated multiplying the weight of the variable by its value. For example, for age, the score is the weight of age, 1.03 times the age in years; for grade, the score is the weight, 12.99, 1.48, or 1.71 times the numerical grade of the tumor (weights differs among the various models). ** Details of individual weights for miRNA expression are in Table A1. # Details of individual weights for somatic mutations are in Table A2. ## Details of individual weights for gene expression are in Table A3.

Discussion
Our study aimed to create a prediction model that could have an immediate impact on treatment decisions. As anticipated, five-year survival was significantly associated with our definitions of high risk (HR) and low risk (LR) patients in both our UI patient database and in TCGA dataset for EEC. In our prior study using somatic mutations and variant allele frequencies, we created a prediction model with an AUC of 91% [19]. The limitation of that model was the technology required to determine allele frequencies of mutated genes. Selected genes had to be sequenced to a depth of at least 50× to have an accurate allele frequency determination, a process requiring considerable time and expense. Herein, we built on this concept of integrating detailed clinical and molecular data by expanding the classes of molecular data that were used in the models. With the addition of more variables associated with levels of risk, we substantially improved our model performance based on AUC values.
Although clinical data alone had a performance of 88% to predict if a patient will be high risk, the best prediction models combined clinical and miRNA expression data, with or without somatic mutations. Based on our results, simply adding more data classes did not substantially improve the models. Rather, finding the optimal molecular data to include in combination with comprehensive and accurate clinical data creates the best prediction model. The addition of molecular markers is an improvement over our current pre-operative models that use only clinical-pathological markers, grade and age (95% CIs are not overlapping), which solidifies our contention that molecular parameters are necessary to improve our clinical ability to predict which patients are high risk prior to surgery.
Validation in TCGA and GEO databases indicated consistency in model performance. However, performance in the validation models was overall lower, which may be attributed to less comprehensive clinical data in those databases. For example, comparing only clinical data resulted in a performance of 75% using TCGA clinical-pathological features versus 88% using the UI dataset. This supports the importance of obtaining thorough clinical information. In addition, patients in TCGA, GEO and UI datasets were recruited from different populations that may have different genetic backgrounds. For example, in a study aimed to determine the genetic substructure of the population being recruited, we observed that in our institution we only identified one subpopulation, mostly of European origin [23]. However, 4-6 subpopulations with a more diverse background were identified in patients included in the TCGA dataset. Patient population sub-structure can therefore limit the utility of a database to validate results from a single institution. However, consistently higher performances in the training set (UI) than in the testing sets (TCGA, GEO) could also be due to overfitting, which is a common problem that occurs when there are more variables in the analysis than in the samples [24], despite using adequate methods and resampling (cross-validation techniques). Overfitting may contribute to overoptimistic results in some prediction models, and could also be minimized in future prospective validations.
A strength of our prediction model is that it can distinguish between low risk patients and high risk patients that would likely benefit from a more aggressive surgical approach, surgical staging, and adjuvant treatment. Our prediction models for high risk patients are not based only on LN involvement. We also included as high risk patients those with adnexal or cervical involvement, or patients with other local and/or distant metastases. These patients have poorer prognosis, despite a lack of LN involvement [10]. The resulting models have the potential to be more comprehensive with more coverage than those only including LN involvement as a high risk factor, like those models based on SLN biopsies [15,16]. This is especially important for obese patients with multiple medical comorbidities, who are at increased surgical risk. A pre-operative risk stratification test would be highly useful to guide management and tailor pre-operative discussions, risk evaluation, and surgical planning.
Our study was performed on tumors from hysterectomy specimens, with access to all dissected tumor material. We acknowledge that there are some feasibility issues that must be addressed before these prediction models could be implemented as a laboratory test to prospectively predict risk using biopsy specimens. For example, uterine biopsies may not properly represent the heterogeneity of the whole tumor, which may influence interpretation of tumor grade. However, retrospective studies have demonstrated that molecular parameters are highly concordant between diagnostic preoperative biopsies and surgical specimens [25]. Molecular-based prediction models from biopsy specimens will likely be consistent with prediction models created with surgical specimens. The relatively low amount of material from biopsy specimens is not a concern because sequencing can be accomplished using small RNA quantities, and with rapid turn-around time that would allow for risk prediction prior to surgery (typically 3-4 weeks after biopsy) [25,26]. We envision that the ideal test would utilize fresh tissue from the uterine biopsy and polymerase chain reaction (PCR) to gather molecular data. Such a test would therefore be quick, affordable, and widely available. Finally, our validation strategy used retrospective data, and prospective validation of these initial prediction models is necessary. An additional strength of our study is that it was validated using TCGA, one of the most comprehensive, publicly available databases from EEC specimens.
Variables in the best prediction models can be considered classifiers for high risk patients. A classifier can be used to select and stratify patients for therapy. However, the components of a classifier are not necessarily biological markers of disease severity [27]. Indeed, there have been other attempts to stratify EEC based on tumor characteristics. Tumors in the TCGA study were grouped based on similar features by clustering, which is an unsupervised learning method [28]. Other histological types of endometrial cancer, such as the more aggressive serous type, were used to build these clusters, but poor prognosis or clinical outcomes of these patients were not included. Moreover, clustering is different than classification. Classification is a supervised learning method that assigns previously predefined labels based on molecular features. One suggestion is to perform classification of risk based on TCGA endometrial molecular groups built with clustering: POLE ultramutated, microsatellite instability hypermutated, copy-number low and copy-number high [28]. Accuracy of those TCGA-based models range from 59% to 73% [29][30][31]. Models described herein better predict high risk status in EEC patients, with AUC >90%.
Finally, an additional benefit of our integrated models is the identification of putative mechanisms of risk, which are suggested by the specific molecular variables in the best models. For example, miRNAs such as MIR-181, MIR-30b, and MIR-200b as well as specific gene loci such as ADAMTS13, NOTCH4 and PIGN have been linked to many cancers, including EEC [32], and should be evaluated in vitro in the future.

Classification of EEC Risk
Classification of EEC risk was based on the criteria and results from Gynecologic Oncology Group (GOG) 33 and GOG 99 clinical trials [6,17]. High risk patients were defined as those presenting with stage II, III and IV disease as defined by 2009 FIGO classification and sanctioned in 2014 [33] as well as patients with stage I disease and high-intermediate risk features as defined by GOG 99 criteria [17]. High intermediate risk features in stage I tumors included grade 2 or 3 tumors, presence of lymphovascular invasion, and outer-third myometrial invasion with the following criteria: (1) At least 70 years of age with at least one risk factor; (2) at least 50 years of age with two risk factors; and (3) any age with all three risk factors. Low risk patients include all other stage I patients either with no myometrial invasion or low-intermediate risk features as defined by GOG 99 criteria [17]. There were 206 low risk and 194 high risk patients from TCGA dataset and 70 low risk and 56 high risk patients in the UI dataset.

University of Iowa (UI)
Endometrial cancer patients with endometrioid histology and complete clinical and pathological data were included. Patients with secondary gynecologic malignancies, neoadjuvant chemotherapy or radiation, and/or incomplete data were excluded. A total of 70 low risk patients and 56 high risk patients were identified and included in the validation analysis. One patient had no information about level of risk recorded and was excluded from the analysis. An outline of the study population is shown in Figure 1. Clinical and pathological characteristics are described in Table 1.
The institutional review board (IRB) of the UI approved the current study including human subjects/materials on 28 July 2016 (IRB Number 201607815: 'Prediction Model for Risk Assessment in Endometrial Cancer').

The Cancer Genome Atlas (TCGA)
Patients with non-endometrioid histology were excluded. Of those patients with Type I endometrial cancer, or EEC, clinical and RNA-seq data were downloaded. Patients were divided into risk categories, high risk (N = 194) and low risk (N = 206), as described above. Clinical and pathological characteristics of TCGA EEC patients included in the analysis are provided in Appendix B Table A4. The distribution between low and high risk patients in UI and TCGA cohorts was similar (chi square p-value = 0.33).

Gene Expression Omnibus (GEO)
Only clinical data and gene expression data from a microarray expression experiment were available from this dataset, reference number GSE17025 [34] (Table A5).

University of Iowa (UI)
RNA Purification and Sequencing: The University of Iowa Department of Obstetrics and Gynecology maintains a Women's Health Tissue Repository (WHTR) containing more than 60,000 biological samples, including more than 2500 primary gynecologic tumors [35]. All tissues in the WHTR are collected under informed consent (IRB#200910784 and IRB#200209010). Of the 126 patients identified in the original EEC panel, we were able to obtain 62 primary tumor tissues with sufficient RNA yield and quality for analysis, 36 low risk and 26 high risk (no differences in distribution relative to the complete patients' cohort: chi square p-value = 0.74).
Total cellular RNA was purified from primary tumor tissue using the mirVana (Thermo Fisher, Waltham, MA, USA) RNA purification kit following manufacturers' instructions. Yield and quality of purified cellular RNA was assessed using a Trinean DropSense 16 spectrophotometer and an Agilent Model 2100 bioanalyzer. Only RNAs with an RNA integrity number (RIN) [36] greater than or equal to 7.0 were selected for RNA sequencing.
Equal mass total RNA (500 ng) from each qualifying tumor was fragmented, converted to cDNA and ligated to bar-coded sequencing adaptors using Illumina TriSeq stranded total RNA library preparation (Illiumina, San Diego, CA, USA). Molar concentrations of the indexed libraries were confirmed on the Agilent Model 2100 bioanalyzer and libraries were then combined into equi-molar pools for sequencing. The concentration of the pools was confirmed using the Illumina Library Quantification Kit (KAPA Biosystems, Wilmington, MA, USA). Sequencing was then carried out on the Illumina HiSeq 4000 genome sequencing platform using 150bp paired-end SBS chemistry. All library preparation and sequencing was performed in the Genome Facility of the University of Iowa Institute of Human Genetics (IIHG).
File pre-processing of diverse biological data: Briefly, sequence reads were mapped and aligned to the human reference genome (version hg38) using STAR, a paired-end enabled algorithm [37]. BAM files were produced after alignment. We used featureCount to measure gene expression from BAM files [38]. After the gene counts were generated, we used DESeq2 package to import, normalize and prepare the data for analysis [39]. We independently used gene expression and miRNA expression for the association analysis.
BAM files for each sample were also used for mutation discovery and base-calling against the human genome reference utilizing SAMtools and BCFtools [40]. Results were annotated with ANNOVAR and formatted to display the number of mutations per gene and sample [40]. We included only non-synonymous somatic mutations. CNV was determined with SAMtools and CopywriteR using BAM files as input [41].

Survival Analysis
To assess the association of survival with risk levels and other clinical variables, survival analysis was performed using Cox proportional hazard ratios. All variables associated with survival in a univariate analysis (p ≤ 0.05) were included in the multivariate regression model.

Variable Selection for Prediction Modeling
In the prediction model, we only used those variables that could be assessed at baseline, prior to initiation of treatment. Our approach was to (1) reduce the number of variables using a univariate selection of prediction variables with cross-validation; (2) introduce those significant variables from the univariate selection process to the prediction model of level of risk. Rather than introducing all variables directly in the prediction model, this approach was chosen because it would likely lead to a model that is more sparse (i.e., simpler, with fewer variables) and can be more easily validated retrospectively and prospectively. To reduce the number of variables, we used the caret R package [42]. This software fits a simple linear model between a single feature and the outcome. Features that were statistically significant (p-values < 0.05) in this univariate analysis were then used for multivariate Lasso regression modeling. Cross-validation of the subsequent models will be biased unless some sort of resampling is included in the feature selection step [43]. Thus, variable selection for all classes of clinical and biological data (gene and miRNA expression, gene copy number and mutation analysis) were performed using k-fold cross-validation with the caret R package to decrease the possibility of overfitting the final model [42].

Prediction Model Construction
Selected clinical and molecular variables from the k-fold cross-validation process were analyzed individually and in combination to determine their prediction potential for preoperative risk. The Lasso method, as implemented in the glmnet R package [44], was used to develop a regression model to predict low risk versus high risk patients. We selected Lasso because it is a multivariate regression method that allows simultaneous selection and estimation of the effects of variables, while accounting and adjusting for confounding factors. In our experience, Lasso consistently handles missing values and lower number of samples and computes the AUC without reporting any errors, as compared to other prediction methods [45]. We evaluated the performance of our model using the AUC and its 95% CI. AUC was estimated with 1000 replicates of 10-fold cross-validation to avoid over-fitting of the model (internal validation) [27]. Bias-corrected and accelerated bootstrap CIs were computed for resulting AUCs. A value of 0.5 indicates a lack of model predictive performance, and 1.0 indicates perfect predictive performance.

The Cancer Genome Atlas Replication and Validation
We replicated (or repeated) the same analysis performed in the UI dataset in TCGA dataset. For this external replication, we included the same variables used in modeling with the UI dataset, but extracted data from TCGA datasets. Then, the same Lasso analysis used in the UI cohort was performed in this TCGA data. The performance of the replication analysis was measured in terms of AUC and 95% CIs.
For validation of UI prediction models in the TCGA dataset, we took the models built using UI dataset and inserted TCGA data to see if they could discriminate between low and high risk patients in the TCGA dataset. The validation analysis does an inference: with the UI-built model and TCGA data, the model attempts to predict low or high risk classes. For the validation analysis, we took the best UI prediction models for risk that replicated well in TCGA and applied them to the TCGA dataset to obtain a predicted probability of high risk for each patient [44]. Then, we used the R package pROC to determine thresholds, or cut-offs, for the UI model applied to the TCGA data [46]. Thresholds were treated as a tuning parameter for which values were sought to produce a final classification model and were computed with 2000 bootstrap replicates. Threshold values that yielded sensitivities around 90% were ranked from highest to lowest sensitivity and negative predictive value. Among the ranked results, the top-ranked set of tuning parameters was used to fit a final score of the model to the entire set of patients and define the classification rule. The goal was to create models that would identify at least 90% of high risk patients, while ruling out most patients with low risk.
An example of R coding for the variable selection, Lasso prediction analysis, replication and validation analysis is provided in a supplementary file (Appendix C).

Conclusions
Combining clinical and molecular data on EEC tumor specimens allows us to stratify patients into high and low risk categories with greater than 95% confidence. The performance of our prediction model is superior to the current standard of care using clinical factors alone. Identification of crucial molecular variables from next generation technologies can be developed using conventional and quantitative PCR and expression arrays to enable design of a novel diagnostic test to be used on tissue obtained from endometrial biopsies.
Conflicts of Interest: K.W.T. is a co-founder of Immortagen, Inc., and had a role in the interpretation of data and writing of the manuscript. All other authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.

Appendix A
Individual scores for data classes included in the best performing UI prediction models.