A Critical Evaluation of Validation and Clinical Experience Studies in Non-Invasive Prenatal Testing for Trisomies 21, 18, and 13 and Monosomy X

Non-invasive prenatal testing (NIPT) for trisomies 21, 18, and 13 and monosomy X is widely utilized, with massively parallel shotgun sequencing (MPSS), digital analysis of selected regions (DANSR), and single nucleotide polymorphism (SNP) analyses being the most widely reported methods. We searched the literature for all NIPT clinical validation and clinical experience studies published between January 2011 and January 2022. Meta-analyses were performed using bivariate random-effects and univariate regression models for estimating summary performance measures across studies. Bivariate meta-regression was performed to explore the influence of testing method and study design. Subgroup and sensitivity analyses evaluated factors that may have led to heterogeneity. Based on 55 validation studies, the detection rate (DR) was significantly higher for retrospective studies, while the false positive rate (FPR) was significantly lower for prospective studies. Comparing the performance of NIPT methods for trisomies 21, 18, and 13 combined, the SNP method had a higher DR and lower FPR than other methods, significantly so for MPSS, though not for DANSR. The performance of the different methods in the 84 clinical experience studies was consistent with validation studies. Clinical positive predictive values of all NIPT methods improved over the last decade. We conclude that all NIPT methods are highly effective for fetal aneuploidy screening, with performance differences across methodologies.


Introduction
Non-invasive prenatal testing (NIPT) describes a family of tests that rely on the analysis of cell-free DNA (cfDNA) fragments in the plasma of pregnant women to screen for fetuses affected by the common autosomal trisomies (trisomy 21, 18, and 13). Some NIPT laboratories also test for sex chromosome abnormalities (Turner syndrome, Klinefelter syndrome, XXX, XYY, and various more complex karyotypes), other autosomal aneuploidy, chromosome segmental imbalances (typically >7 Mb), select microdeletion syndromes, Rhesus blood group typing, and some monogenic disorders [1].
Initial clinical validation studies were focused primarily on the prenatal identification of the common autosomal trisomies by detecting an overall quantitative difference in the proportion of cfDNA for those chromosomes where a copy number difference could exist [2][3][4]. The analysis requires a sufficiently high amount of circulating cfDNA derived from the conceptus such that fetal aneuploidy would be detectable, even in the presence of the larger quantities of cfDNA of maternal origin. Ensuring a sufficient "fetal fraction" (the proportion of cfDNA derived from placental trophoblasts) was therefore a key factor in the development of successful testing. Generally, adequate fetal fraction is achievable for testing after 9 or 10 weeks gestational age [5].
The number of records excluded at each stage was not recorded because there were multiple overlapping searches with early exclusion of ineligible studies. Because of potential ascertainment biases (see below), studies were subclassified as either "validation" or "clinical experience". "Validation" was defined as a study in which a set of maternal plasma samples, drawn at varying first- or second-trimester gestational ages, were tested for the presence or absence of fetal chromosome abnormality and where the actual aneuploidy status of the pregnancy ("truth") was known for all samples included in the analysis. NIPT validation studies were typically conducted prior to formal use in clinical practice, and the test results were not used clinically. "Clinical experience" studies were defined as actual experience of a laboratory service that was routinely providing cfDNA screening for the purposes of patient management. Clinical experience studies typically involved the laboratory requesting follow-up data from ordering physicians for cases with high-risk but not low-risk NIPT results. Consequently, knowledge of outcomes was incomplete.
For both validation and clinical experience studies, results of cytogenetic or cytogenomic analyses of amniotic fluid cells or chorionic villus sampling (CVS), or the occurrence of a normal livebirth or an abnormal birth with the expected phenotype, were considered evidence of truth. Without confirmatory genetic testing, the presence of either abnormal ultrasound findings or a spontaneous abortion was not considered sufficient evidence for aneuploidy. Because most NIPT methods were not designed to detect mosaicism, we excluded cases where a mosaic karyotype was detected in follow-up genetic testing. Where separately identified, twins or higher multiples were also excluded. Samples not yielding a result were also excluded.
Meta-analyses of test performance including all eligible studies were performed separately for "validation" and "clinical experience" studies. The outcome measures assessed and statistical methods applied were chosen based on what was most appropriate for each study subclassification as described below. We conducted statistical analyses using R, a language and environment for statistical computing and graphics. The R package mada was used for diagnostic meta-analysis implementing the approach of Reitsma et al., 2005 [18].

Validation Studies
Validation studies were further sub-classified as either "retrospective" or "prospective." Retrospective was defined as a study where a set of maternal plasma specimens were collected and frozen, and samples were only analyzed if truth was known (mostly from CVS or amniocentesis samples) and sample truth status conformed to inclusion criteria in the study design. Case-control studies were considered to be retrospective for the purpose of this analysis. Prospective studies were based on a set of cases where cfDNA screening and the decision to include samples in the analysis was carried out prior to the knowledge of truth and where efforts were made to gather outcome on all tested cases. In some studies, methods were insufficiently documented to determine whether the design was prospective or retrospective.
For the validation studies, the outcome measures used to assess test performance were prevalence (Prev), detection rate (DR, sensitivity), false positive rate (FPR, 1 − specificity), diagnostic odds ratio (DOR), and positive predictive value (PPV). These outcome measures were based on the formulae below. Screen-positive, affected cases were considered true positives (TP); screen-negative, affected cases were considered false negatives (FN); screen-positive, unaffected cases were considered false positives (FP); and screen-negative, unaffected cases were considered true negatives (TN). The DOR is defined as the ratio of the likelihood ratio for a positive test in an affected case (LR+) to the likelihood ratio for a negative test in an unaffected case (LR−):

DOR = LR+/LR− = (sensitivity × specificity)/((1 − sensitivity) × (1 − specificity))

A higher DOR is indicative of better test performance. A rationale for using this single measure is that it is independent of prevalence and includes information about both sensitivity and specificity. However, this combination means that it cannot distinguish between tests with high sensitivity and low specificity and tests with low sensitivity and high specificity. Therefore, it was important to also implement methods considering both performance measures simultaneously.
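As a concrete illustration, the 2 × 2 counts map onto these measures as follows. This is a minimal sketch in Python (the analyses themselves were run in R); the function name and the counts in the example are hypothetical, not drawn from any study.

```python
def performance(tp, fn, fp, tn):
    """Compute DR, FPR, and DOR from the four cells of a 2x2 table."""
    dr = tp / (tp + fn)            # detection rate (sensitivity)
    fpr = fp / (fp + tn)           # false positive rate (1 - specificity)
    lr_pos = dr / fpr              # likelihood ratio for a positive result
    lr_neg = (1 - dr) / (1 - fpr)  # likelihood ratio for a negative result
    dor = lr_pos / lr_neg          # diagnostic odds ratio = LR+/LR-
    return dr, fpr, dor

# Hypothetical study: 99 of 100 affected detected; 5 false positives
# among 9,900 unaffected pregnancies.
dr, fpr, dor = performance(tp=99, fn=1, fp=5, tn=9895)
```

Note that the DOR reduces algebraically to (TP × TN)/(FN × FP), which is why a zero in any cell makes it degenerate (motivating the continuity correction described below).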
Differences in PPV observed in different studies may be attributable to differences in the proportion of affected pregnancies included in the various studies. To adjust for this difference in prevalence across studies, we used a standard prevalence for each syndrome representative of first trimester rates for a US population (population prev) [19]. The rates used were 1/365 for T21, 1/1208 for T18, 1/3745 for T13, and 1/1291 for MX. A standardized PPV (stdPPV) using the population prevalence was then calculated for each syndrome and testing method with the formula:

stdPPV = (population prev × DR)/(population prev × DR + (1 − population prev) × FPR)

Study-level data for validation studies were stratified by study type (retrospective and prospective) and by the three main methods of testing (MPSS, DANSR, and SNP). Other test methods were excluded due to insufficient data. Categorical variables were summarized as the proportion of studies within each category. Continuous variables were summarized by the median (25%ile and 75%ile). Forest plots showed the sensitivity and specificity for each study by syndrome (Trisomy 21, Trisomy 18, Trisomy 13, and Monosomy X).
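The standardization step can be sketched as below, using the population prevalences listed above. The DR and FPR values in the usage example are illustrative placeholders, not estimates from any particular study.

```python
def std_ppv(dr, fpr, prev):
    """Standardized PPV: Bayes' rule evaluated at a fixed population prevalence."""
    return (prev * dr) / (prev * dr + (1 - prev) * fpr)

# First-trimester US population prevalences used for standardization [19]
POP_PREV = {"T21": 1 / 365, "T18": 1 / 1208, "T13": 1 / 3745, "MX": 1 / 1291}

# Illustrative: a method with DR 98.7% and FPR 0.12% for T21
ppv_t21 = std_ppv(dr=0.987, fpr=0.0012, prev=POP_PREV["T21"])
```

Even with a very low FPR, the low prevalence pulls the standardized PPV well below 100%, which is why PPVs reported from high-prevalence study cohorts overstate what an average-risk patient should expect.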
Pooled estimates of the observed test performance were calculated as a descriptive summary of the data across studies by syndrome, method of testing, and study type. However, simple pooling is a fixed-effect method that ignores both the characteristics of the individual studies being pooled and the dependence of binary measures of test performance on the particular threshold used to determine the outcome. As the threshold is varied across all possible values, a trade-off is induced between sensitivity (DR) and specificity (1 − FPR). To take this into account, we used a bivariate approach, as recommended by the Cochrane Diagnostic Test Accuracy Working Group, to estimate the test performance across studies. Bivariate random effects regression models were used to estimate average sensitivity or detection rate (DR) and specificity (1 − FPR) through a joint distribution, accounting for the correlation between the two performance measures. This random effects approach incorporates unexplained variability in test performance measures between studies. Because there was also variation due to sampling, as studies differed in size, the precision by which DR and FPR were estimated in each study was incorporated by giving higher weight to studies with more precise estimates [18]. Because the model requires non-zero cells, we added a continuity correction of 0.1 to each cell of a study where a zero was encountered.
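The bivariate model itself was fitted with the R package mada [18]; the Python sketch below is only an illustration of two preprocessing steps it relies on: the 0.1 continuity correction applied when any cell is zero, and the logit transformation of DR and FPR on which the bivariate random-effects model operates. The function name and example counts are our own.

```python
import math

def corrected_logits(tp, fn, fp, tn, cc=0.1):
    """Apply the 0.1 continuity correction to every cell when any cell is
    zero, then return (logit DR, logit FPR), the scale on which the
    bivariate random-effects model is fitted."""
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + cc for x in (tp, fn, fp, tn))
    logit = lambda p: math.log(p / (1 - p))
    dr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return logit(dr), logit(fpr)

# A study with zero false negatives gets the correction before transforming;
# without it, logit(DR) would be infinite.
l_dr, l_fpr = corrected_logits(tp=50, fn=0, fp=2, tn=1948)
```

On the logit scale, logit(DR) simplifies to log(TP/FN) and logit(FPR) to log(FP/TN), which makes explicit why zero cells must be corrected before fitting.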
In addition to fitting models for each syndrome and method of testing or study type separately, bivariate meta-regression was performed on all validation study data to explore the influence of method of testing and study type by including them as covariates in the model. Studies using the SNP-based method were matched with studies using other methods with respect to study-level characteristics including start year, country, and prevalence. As a sensitivity analysis, meta-regression was performed for matched studies, for the subgroup of studies that differed from SNP-based studies with respect to study-level characteristics and for the full set of studies. Results were compared. Covariates were added to the bivariate model to examine their effects on DR and FPR as well as adjust for potential effects while testing for differences in performance by method of testing and study type. No corrections were made for multiple comparisons.
Summary ROC plots (sROC) showed the observed pairs of sensitivity (DR) and FPRs for each study as well as summary estimates of DRs and FPRs from bivariate models for each method of testing or study type including the corresponding 95% confidence ellipse showing the region of confidence that describes the uncertainty of the estimates.

Clinical Experience Studies
Clinical experience studies included reports from reference laboratories, regional fetal screening programs, and individual maternal-fetal medicine programs offering NIPT. Testing may have followed prior conventional screening, with NIPT offered to high-risk women ("secondary screening"), or it may have been offered to a general pregnancy population ("primary screening"). For all clinical experience studies, the outcome data were judged incomplete. In particular, it is not possible to make a reliable estimate of the DR because a proportion of test-negative affected pregnancies will not come to attention. Even where a high proportion of screen-negative livebirths are followed up, affected cases can be missed because of the relatively high risk of post-screening spontaneous loss of aneuploid pregnancies. For this reason, TN were not tabulated for clinical experience studies. FN are included where reported, but it is important to note that these were largely data volunteered by ordering physicians rather than requested by laboratories; as such, the FN count is expected to be an underestimate. Analyses that only include participants for whom the outcome was obtained may produce biased estimates of test performance. We calculated the following test performance measures based on the observed data as well as using methods to correct for potential bias due to missing outcome data.
A minimum estimate of prevalence (minPrev) was calculated, recognizing the limitation that some false negatives are under-ascertained.

minPrev = [(TP + FN)/Proportion of high-risk calls with outcome data]/Number of women with results
An approximate estimate of the FPR (estFPR) was calculated on the basis that positive cases with follow-up were reflective of all positive cases and that essentially all test-negative cases were unaffected (see Discussion). To correct for verification bias when calculating these performance metrics, we implemented inverse probability weighting: the number of confirmed high-risk calls was inflated by the inverse of the probability of having confirmation, under the assumption that confirmation is missing at random for high-risk calls and that positive cases with follow-up were reflective of all positive cases. We did not consider tests where no results were obtained. Standard statistical methods for verification bias correction have been shown to be inadequate when there are few false negatives. Observed PPVs were not considered to provide a reliable overall estimate and also did not reflect differences in prevalence. To allow for these factors, we also calculated and compared standardized PPVs that incorporated the rate of false negatives among low-risk calls from validation studies and used prevalence rates for a US population drawn from Benn et al. (2015) [19]. Study follow-up was calculated as the percentage of high-risk calls that were confirmed.
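A rough sketch of these clinical-experience estimates, under the stated assumption that high-risk calls with follow-up are representative of all high-risk calls, is given below. This is illustrative only (not the authors' code), and all counts in the usage example are hypothetical.

```python
def clinical_experience_metrics(tp, fp, fn, n_results, n_high_risk, n_followed):
    """Illustrative minPrev, estFPR, and observed PPV for a clinical
    experience cohort.
    n_results:   women with a test result
    n_high_risk: high-risk calls issued
    n_followed:  high-risk calls with outcome data"""
    follow_up = n_followed / n_high_risk       # follow-up proportion
    # Inverse probability weighting: inflate confirmed counts by 1/follow_up
    min_prev = ((tp + fn) / follow_up) / n_results
    est_fp = fp / follow_up                    # IPW-inflated false positives
    # Treat essentially all test-negative cases as unaffected
    est_fpr = est_fp / (n_results - (tp + fn) / follow_up)
    obs_ppv = tp / (tp + fp)                   # among confirmed cases only
    return min_prev, est_fpr, obs_ppv

# Hypothetical cohort: 50,000 results, 125 high-risk calls, 100 followed up,
# of which 90 confirmed affected and 10 false positives; no reported FN.
min_prev, est_fpr, obs_ppv = clinical_experience_metrics(
    tp=90, fp=10, fn=0, n_results=50000, n_high_risk=125, n_followed=100)
```

Because minPrev omits unascertained false negatives, it is a floor on the true cohort prevalence, which is why it is labeled a minimum estimate.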

Overall Performance
We identified 55 eligible validation studies, of which 22 were retrospective, 24 prospective, and 9 were both or of unknown study design (Supplemental Table S1). Of these 55 studies, 29 used an MPSS methodology, 13 used DANSR, 6 used SNP, and 7 used other methods. Supplemental Figure S1 summarizes, using forest plots, the sensitivity and specificity of these studies. Additional summary tabulations of these studies, by the three main methods of testing, publication year, year of laboratory testing, laboratory country, population country, gestational age, maternal age, and major study groups included, are given in Supplemental Tables S2 and S3. Studies across the three methods of testing were broadly comparable with respect to study-level characteristics, although there appeared to be differences between studies grouped by method of testing for start year, population country, and prevalence; these characteristics were used for matching in sensitivity analyses. Raw study-level data can be found in Supplemental Table S4.
For all validation studies combined, the pooled DR for T21 was 99.44% (95% CI 99.06-99.67%), and the FPR was 0.07% (CI 0.05-0.09%) (Table 1). From a bivariate random-effects regression model, the mean DR for T21 was 98.72% (CI 97.97-99.19%), and the FPR was 0.12% (CI 0.07-0.21%) (Table 2). The DRs for T18, T13, and MX were lower than that for T21, but the FPRs were similar. Supplemental Table S5 shows the mean DORs stratified by study design (prospective or retrospective). Table 3 summarizes the mean DORs from a univariate model for each syndrome by testing method. For T21, T18, and T13, the highest DOR values were achieved with the SNP-based NIPT; for MX, the highest DOR was seen for DANSR.
Test performance stratified by study design is shown in Table 5. The mean FPR for retrospective studies for T21 was 0.21% (95% CI 0.08-0.57%) and for prospective studies 0.09% (95% CI 0.05-0.17%). The DRs for T18, T13, and MX were lower, but the FPRs were similar. In a model controlling for syndrome, both the mean DR and the mean FPR for retrospective studies were significantly higher than those of prospective studies (p = 0.004 and 0.003, respectively). Cohort prevalences for all syndromes were 4-6 times higher in retrospective compared to prospective studies. The differences in DR and FPR between prospective and retrospective studies were no longer statistically significant when prevalence was included as a covariate in the model.

Method of Testing
For T21, the mean FPR for SNP was significantly lower than that of MPSS among all validation studies (p = 0.029) ( Table 2). In a sensitivity analysis that separately considered studies similar to, or different from, SNP-based studies with respect to study-level characteristics, the mean FPR for SNP remained statistically significantly lower than that of MPSS (p = 0.01 and p < 0.001, respectively). For T18, when considering matched studies or all studies combined, there was no statistically significant difference between methods. For T13 and MX, there were no statistically significant differences between methods of testing. When evaluating test performance in a model for each syndrome separately, there were no statistically significant differences for DR between methods of testing and no statistically significant difference between DANSR and SNP.
To evaluate the relationship between DR/FPR and method of testing using data for all four syndromes combined, we fitted a bivariate model including syndrome and method of testing as covariates. The mean FPR for SNP was significantly lower than that of MPSS when controlling for syndrome among both matched studies (p = 0.046) and all studies combined (p = 0.008). When considering the three trisomies combined (excluding monosomy X), the mean DR and FPR for SNP were significantly better than those of MPSS when controlling for syndrome among matched studies (p = 0.04 and p = 0.02, respectively) (Figure 2).
The summary ROC (sROC) plot shows the observed pairs of sensitivity and false positive rate (1 − specificity) for each study as small symbols, with summary estimates from the bivariate models represented by larger bolded symbols, together with the corresponding 95% confidence ellipses describing the uncertainty of the estimates for each method of testing. sROC plots typically include a curve estimated using a regression model fitted as closely as possible to the observed data. The curve is estimated from 0 to 1 on both axes, but because the range of observed sensitivities and FPRs in our studies is limited, the model extends well beyond the range of the data into regions with limited or no data and is therefore not shown.

Country
Studies conducted in China tended to differ from the other studies in several ways: Chinese women tend to have lower maternal weight; some early studies did not measure fetal fraction; some centers initially had compensation insurance for false negatives; testing was done mostly in the second trimester; and methods typically used lower sequencing depth. We therefore performed a comparative analysis. The mean FPR for studies conducted in China was significantly lower than that of studies conducted in other countries (p = 0.01). To assess whether these findings influenced our results for differences between testing methods, we performed a sensitivity analysis excluding studies conducted in China. The mean FPR for SNP remained significantly lower than that of MPSS when controlling for syndrome and excluding studies conducted in China (p = 0.001). In a model including the three trisomies (excluding monosomy X), the mean FPR for SNP remained significantly lower than that of MPSS (p = 0.006), but the mean DR was no longer statistically significantly different (p = 0.07) when excluding Chinese studies.
Comparing studies conducted in European countries with those conducted in the US, the mean DR for US studies was significantly higher than that of European studies when controlling for syndrome and method of testing (p = 0.04).

Clinical Experience Studies
Overall Performance

A total of 84 eligible clinical experience studies were identified (Supplemental Table S1). Of these, 55 studies used MPSS, 12 DANSR, 12 SNP, and 5 used other technologies. Supplemental Table S6 summarizes the time of the study, laboratory country, population country, maternal age, and major indications for testing for those cases where the testing was based on MPSS, DANSR, or SNPs. Raw data from these studies are summarized in Supplemental Table S7. As summarized in Supplemental Table S8, the percentage of high-risk results with follow-up was highly variable. Table 6 presents the estimated FPRs (estFPR), observed PPVs for confirmed cases (obsPPV), and standardized PPVs based on estimates of the population prevalence, as described in the Methods section. Standardized PPVs were much lower than observed PPVs for confirmed cases for all syndromes. This difference can be explained by the higher prevalence seen across the clinical experience studies compared with the population rates used, together with the higher false negative rates taken from the validation studies.
When results were stratified by study start year, the data were consistent with declining minimum prevalence estimates for each of the conditions (Table 7). Despite the declining prevalence estimates, the PPV for confirmed cases increased over time for T21, T18, and T13 and remained approximately constant for MX. We observed trends in test performance among the clinical experience studies consistent with those observed in the validation studies: standardized PPVs for SNP trended higher than those for MPSS for all syndromes, and FPRs for SNP trended lower than those for MPSS for all syndromes.

Discussion
In this study, we corroborate and update previous meta-analyses that showed that both validation studies and clinical experience studies demonstrate high efficacy of cell-free DNA in the prenatal screening of fetal trisomies 21, 18, 13, and monosomy X.
We show here that performance in validation studies depends on study design, with FPRs in prospective studies being significantly lower than in retrospective studies. Retrospective study designs generally focused on ascertainment based on affected pregnancies identified through amniocentesis or CVS, i.e., unambiguously affected or unaffected pregnancies, with mosaic cases, pregnancy losses, and unrelated abnormalities excluded. These studies are usually weighted to include sufficient affected pregnancies to allow robust estimation of detection rates. In prospective studies, there are variable policies with respect to the inclusion of cases with fetal loss, mosaicism, and other abnormalities. Generally, prospective studies involve maternal plasma sample collection at the time of conventional maternal serum screening, and therefore these studies are more representative of an average-risk population, although inclusions may be weighted towards women with higher risks who undergo additional testing. The major distinction between the two designs appears to be prevalence. Our data show that, after adjusting for prevalence, differences in DR and FPR between prospective and retrospective studies were no longer statistically significant.
This observation that testing performance is related to prevalence runs counter to classical views in screening where DR and FPR are considered intrinsic to the test and independent of the population screened. However, NIPT differs from conventional screening because many false positives have a biological, non-technical basis. For example, fetal or placental mosaicism can explain some false-positive results. Mosaicism can arise through a primary meiotic error (with correction to disomy via trisomy rescue), and the frequency of these cases can be expected to be maternal age dependent. Similarly, false positives due to a vanished twin, maternal cancer, somatic X-chromosome loss, and other maternal health conditions are also anticipated to be dependent on maternal age. A lower FPR and better than expected PPV in younger women compared to older women has been described [20]. An association between test performance and prevalence has been shown in other screening test settings [21,22]. Therefore, it is important to consider prevalence when assessing differences in study population or design.
We also found that the performance of the three main clinically available methods differed according to chromosome, with the SNP and DANSR methods trending towards better performance as compared to MPSS, although the differences were often not statistically significant. When considering the three trisomies combined and when controlling for syndrome among matched studies, the SNP method showed significantly better performance than MPSS. The trends were most clear when considering the diagnostic odds ratio, a measure that combines sensitivity and specificity. The same trends were also seen in the clinical experience data, where prevalence-adjusted PPVs provided a composite assessment of performance (Table 4).
NIPT has received increasing acceptance in medical practice, and consistent with this, we observed a trend towards declining prevalence for all four aneuploidies in women receiving the testing from 2010 to 2019 as the proportion of high-risk women in the testing population decreased (Table 5). Although this might be expected to result in lower PPV, in practice, the observed PPVs remained approximately constant. This may, in part, be explained by our additional observation that test performance is not independent of prevalence. It is likely also that improvements in testing have occurred. We acknowledge the observed PPVs are subject to verification bias, the effect of which could lead to less accurate results.
For clinical experience studies, we evaluated the observed PPV and estimated FPR under the assumption that test-positive cases with a definitive diagnosis or pregnancy outcome information are representative of all test-positive cases. In our experience, incomplete follow-up is primarily driven by the fact that some practices do not have the time to respond to requests for follow-up information. Additionally, although it has been strongly recommended that all women with positive NIPT results receive definitive follow-up by CVS or amniocentesis, follow-up is generally incomplete, and it is possible that in some cases with high-risk NIPT results where major malformations were detected by ultrasound, women chose to terminate their pregnancy without prenatal diagnostic testing. Conversely, some women with positive NIPT results for conditions expected to show major malformations on ultrasound exam (e.g., trisomies 13 and 18) may have chosen not to pursue invasive testing in the absence of ultrasound findings. Additional difficulties exist for MX, where mosaicism, partial X-chromosome deletions or unbalanced rearrangements, and the associated variable presentation are confounders. While drawing conclusions from positive cases with incomplete follow-up may therefore seem tenuous, our analysis does indicate that the FPRs and prevalence-adjusted PPVs in clinical practice are, in fact, consistent with the validation studies.
A greater difficulty exists in assessing DR from clinical experience studies. First, although some studies did report follow-up on FN that was offered, the majority of clinical experience studies did not request follow-up on cases with a low-risk result. Additionally, a high proportion of false-negative pregnancies may result in spontaneous losses without coming to attention. Others could have resulted in live-borns but not have been reported to the physician who ordered the NIPT. Some laboratories have attempted to infer DR based on extrapolation of outcome data from positive tests [23,24]. However, given the entirely different patient management of positive versus negative cases and the lack of supporting data to validate their underlying assumptions, these estimates must be viewed as lacking an evidence base.
We have not considered samples that did not provide results, as very few of the studies reported outcome on such cases. It is known that most of these are attributable to low fetal fraction, which is dependent on maternal weight and gestational age. Furthermore, there are no absolute standards for the measurement of fetal fraction or the threshold at which testing can be considered reliable. With low fetal fraction, there is a trade-off between successful testing that provides results and accuracy, both of which need to be considered when comparing test methodologies. Various strategies have been proposed to deal with low fetal fraction, including reflexively resequencing at a higher depth of read, re-assessing risk based on fetal fraction as a biomarker, use of artificial intelligence, use of alternative screening, or proceeding directly to diagnostic testing. As cfDNA testing is refined, the challenge posed by a low fetal fraction is being minimized.
Our review and analysis have limitations. We restricted our search to English language references, and some studies did not contain the information necessary for inclusion. We were restricted to the study-level characteristics that we were able to extract, which limited our ability to assess and account for heterogeneity in the analyses. Furthermore, we did not assess the quality of individual studies. Due to the challenges with ascertainment bias in the clinical experience studies previously described, we were required to make assumptions and adjustments to calculate certain test performance measures. The bivariate random-effects regression model required a correction to be added to zero cells; this is known to lead to an underestimate of performance in cases where the FNR and FPR are considerably lower than the correction factor, though this is not expected to affect overall trends. We used these methods to account for various sources of bias and acknowledge that our attempts may be inadequate. We did not evaluate the performance of NIPT for microdeletion syndromes and other chromosome imbalances. These newer additional areas of testing are not offered by all laboratories or may be limited to specific risk groups. Women with high prior risks due to abnormal ultrasound findings, family history, or maternal serum screening tests often proceed directly to cytogenetic and microarray diagnostic testing through CVS or amniocentesis. Exclusion of high-risk populations will lower the observed screening test PPVs.
In summary, we have shown that prospective validation studies demonstrate the excellent performance of NIPT for trisomies 21, 18, and 13 and monosomy X and that methodological performance differences exist. The available data from clinical experience studies show that the performance of NIPT in clinical care is consistent with the FPRs and PPVs obtained in clinical validation studies.

Supplementary Materials: Figure S1: Forest plots of all data. Additional references are cited in the supplementary materials.