Reducing False-Positive Results in Newborn Screening Using Machine Learning

Newborn screening (NBS) for inborn metabolic disorders is a highly successful public health program that by design is accompanied by false-positive results. Here we trained a Random Forest machine learning classifier on screening data to improve prediction of true and false positives. Data included 39 metabolic analytes detected by tandem mass spectrometry and clinical variables such as gestational age and birth weight. Analytical performance was evaluated for a cohort of 2777 screen positives reported by the California NBS program, which consisted of 235 confirmed cases and 2542 false positives for one of four disorders: glutaric acidemia type 1 (GA-1), methylmalonic acidemia (MMA), ornithine transcarbamylase deficiency (OTCD), and very long-chain acyl-CoA dehydrogenase deficiency (VLCADD). Without changing the sensitivity to detect these disorders in screening, Random Forest-based analysis of all metabolites reduced the number of false positives for GA-1 by 89%, for MMA by 45%, for OTCD by 98%, and for VLCADD by 2%. All primary disease markers and previously reported analytes such as methionine for MMA and OTCD were among the top-ranked analytes. Random Forest’s ability to classify GA-1 false positives was found similar to results obtained using Clinical Laboratory Integrated Reports (CLIR). We developed an online Random Forest tool for interpretive analysis of increasingly complex data from newborn screening.


Introduction
Newborn screening (NBS) using tandem mass spectrometry (MS/MS) has transformed our ability to identify and provide early, lifesaving treatment to infants with hereditary metabolic diseases. Because screening is designed to identify affected infants at high sensitivity, it is accompanied by frequent false-positive results [1]. Additional biochemical and DNA testing of all screen-positive cases is performed to confirm (true positive) or reject (false positive) the primary screening result and to reach a final diagnosis. In some cases, this two-tier strategy can lead to iterative testing rounds and diagnostic delays, placing undue burden on the healthcare system including physicians and clinical laboratories, and on the patients and their families.
At present, only one or a few metabolic analytes or ratios from MS/MS screening panels are used to identify infants with a metabolic disorder. For example, screen-positive cases for methylmalonic acidemia (MMA) are identified using specific cutoff values for propionylcarnitine (C3) and its ratio with acetylcarnitine (C2), two of the 39 analytes measured in MS/MS screening. As an alternative approach to analyte cutoffs, Clinical Laboratory Integrated Reports (CLIR, formerly R4S) postanalytical testing employs a large database of dynamic reference ranges for disease-related analytes and many additional informative analyte ratios in order to improve separation of true- and false-positive cases [2][3][4][5]. The ranges and overlap of analyte values between patient and control groups can be adjusted in CLIR for multiple continuous and clinical variables (e.g., birth weight, sex, age at blood collection), an adjustment that has been shown to significantly reduce false-positive results [6].
Machine learning is an emerging strategy for the classification of metabolic disorders in newborns [7,8]. In particular, Random Forest (RF) or Random Decision Forests [9,10] are powerful tree-based methods for supervised machine learning with numerous applications in high-throughput genomic [11] and metabolomic data analysis [12][13][14]. We recently showed that analysis of all 39 MS/MS analytes in the California NBS panel using RF improved the separation of true- and false-positive cases [15]. We compared results from our RF analysis to results obtained from CLIR for the same cohort of MMA screen positives. This comparison showed that the prediction of MMA false positives was significantly improved by utilizing the entire set of MS/MS analytes measured at birth. Here we adapted our RF approach developed for MMA [15] to the study of additional metabolic disorders, both to improve the diagnosis of glutaric acidemia type 1 (GA-1) and very long-chain acyl-CoA dehydrogenase deficiency (VLCADD) and to facilitate detection of ornithine transcarbamylase deficiency (OTCD), which is not currently on the Recommended Uniform Screening Panel (RUSP) [16]. The performance and stability of the RF model were evaluated using NBS data from screen-positive infants for these disorders reported by the California NBS program. Based on these findings, we developed open-source web-based software (https://rusptools.shinyapps.io/RandomForest) that incorporates our RF model for the analysis and interpretation of newborn screening data. The new RF tool could be used to identify false-positive results in conjunction with CLIR tools and established second-tier confirmatory testing using biochemical and DNA analysis of all screen-positive cases.

Data Summary
This study was approved by the Institutional Review Boards at Yale University (protocol ID 1505015917, 10 May 2019), Stanford University (protocol ID 30618, 25 February 2019) and the State of California Committee for the Protection of Human Subjects (protocol ID 13-05-1236, 7 June 2019). We analyzed newborn screening data from a cohort of 2777 infants, consisting of 235 cases with confirmed glutaric acidemia type 1 (GA-1), methylmalonic acidemia (MMA), ornithine transcarbamylase deficiency (OTCD), or very long-chain acyl-CoA dehydrogenase deficiency (VLCADD), and 2542 false positives for one of these disorders (Table 1). A number of confirmed positive cases had metabolic marker concentrations below the established cutoff values and thus were not technically screen positive for the respective disease. Positive predictive value (PPV) was calculated after removing these cases, which included 5 of the 48 GA-1 cases and 4 of the 103 MMA cases. OTCD is detected through decreased citrulline levels, and 6 of the 24 confirmed positive OTCD cases had levels above the established cutoff. All babies had newborn screening performed through the California NBS program between 2005 and 2015, except for OTCD screening, which was performed between 2010 and 2015. Data included 39 analytes (free carnitine, 26 acylcarnitines, and 12 amino acids), as well as gestational age (GA, in days), birth weight (BW, in grams), sex (male, female or unknown), race/ethnicity status that was self-reported by the parents, age at blood collection (AaC, in hours), and total parenteral nutrition (TPN, yes or no) status (Table 2).

Table 1. Number of patients, false positives and PPV of first and second-tier testing (newborn screening (NBS), glutaric acidemia type 1 (GA-1), methylmalonic acidemia (MMA), ornithine transcarbamylase deficiency (OTCD), or very long-chain acyl-CoA dehydrogenase deficiency (VLCADD)).

NBS Metabolic Data Analysis Using Random Forest
Random Forest [10] was used to evaluate information from all 39 metabolic analytes and the six clinical variables (GA, BW, sex, race/ethnicity, AaC, TPN) in the 2777 screen-positive cases. In this study, analyte ratios were not included in the RF model for three reasons: (1) the difficulty of selecting ratios from the large number of possible ratios for 39 analytes, (2) the ability of RF decision trees to capture nonlinear relationships between analytes (e.g., change of C3 in relation to C2), and (3) the possibility that adding ratios could bias the ranking of analytes by mean decrease in accuracy (MDA). One-hot encoding was used to convert the following three categorical variables into a form that could be used by the machine learning algorithm: sex (male, female, or sex-NA), race/ethnicity (Asian, Black, Hispanic, White, or Other/Unknown) and TPN (TPN-Yes, TPN-No, or TPN-NA). Leave-one-out cross-validation (LOOCV) was used to estimate the reliability of RF to correctly predict true- and false-positive cases. For each disease, the RF model was trained on all screen-positive cases for that disease except for one blinded case, for which a prediction was made. This process was repeated for all screen-positive cases for each disease. For example, 604 MMA screen positives were combined for training, while 1 MMA screen-positive case was blinded and used for testing. This process was repeated until all 605 MMA cases were classified by RF. Only RF assignments from testing cases (and not from training) were used for final outcome prediction. This prediction was based on counting the "vote" from each RF decision tree in a binary classification with only two possible outcomes: a screen positive can be either a true positive or a false positive for the disorder. The number of decision trees was set to 1000 in each RF model [17]. The fraction of decision trees that voted for a case as true positive among all 1000 decision trees was defined as the RF score in this study.
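The encoding, hold-out scheme, and score definition described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the helper names (`one_hot`, `rf_score`, `loocv_splits`) and the example vote counts are hypothetical, and no actual forest is trained here.

```python
from typing import Iterator, List, Sequence, Tuple


def one_hot(value: str, levels: Sequence[str]) -> List[int]:
    """Encode one categorical value as 0/1 indicators, one per level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels]


def rf_score(tree_votes: Sequence[int]) -> float:
    """Fraction of trees voting 'true positive' (1) among all trees."""
    return sum(tree_votes) / len(tree_votes)


def loocv_splits(n_cases: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training indices, blinded test index), one case held out per round."""
    for held_out in range(n_cases):
        train = [i for i in range(n_cases) if i != held_out]
        yield train, held_out


# Levels for sex as listed in the text.
SEX_LEVELS = ["male", "female", "sex-NA"]
encoded = one_hot("female", SEX_LEVELS)

# Hypothetical forest of 1000 trees in which 120 trees vote "true positive".
votes = [1] * 120 + [0] * 880
score = rf_score(votes)

# 605 MMA screen positives: each round trains on 604 cases and tests on 1.
splits = list(loocv_splits(605))
```

In a full analysis, each round would fit a forest on the training indices and record the RF score of the single blinded case, so that every case receives exactly one out-of-sample score.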
As a result of the LOOCV, each of the 605 MMA screen positives was assigned one RF score ranging from 0 to 1. This RF score was used to plot the receiver operating characteristic (ROC) curve (Figure 1) and to calculate the area under the curve (AUC). The ROC curve shows the trade-off between sensitivity and specificity at different cutoffs of the RF score. The AUC summarizes the performance of the model and ranges between 0.5 and 1: an AUC of 1 indicates a perfect model for separating true and false positives, while an AUC of 0.5 indicates no capacity for class separation. A high RF score indicates a high probability of a case being a true positive, while a low RF score indicates a high probability of a false-positive NBS result. The choice of RF score cutoff therefore directly determines screening sensitivity and specificity.
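As an illustration of how an ROC curve and its AUC follow from a set of RF scores, the sketch below computes both with the standard library only. The study itself used the pROC package in R; the function names and the six example scores here are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) pairs, one per distinct cutoff.

    A case is called true positive when its score is >= the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical RF scores; label 1 = confirmed case, 0 = false positive.
scores = [0.95, 0.80, 0.40, 0.30, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)  # 8 of the 9 case/control pairs are ranked correctly
```

The AUC equals the fraction of (true positive, false positive) pairs in which the true positive has the higher score, which is why 0.5 corresponds to no separation and 1 to perfect separation.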

Validation of the Random Forest Model
The LOOCV approach is similar in concept to the analysis of individual screen-positive cases in NBS. However, there is no sampling difference for each repeat in LOOCV and only the final LOOCV error estimate on the testing set is reported [18]. Thus, for each disorder only one AUC is generated without an estimate of variation. To rigorously assess the stability of the RF method, we performed a 10-fold cross validation that was repeated 1000 times for each disorder. For each disorder, all screen-positive cases were divided into ten sample groups with an equal proportion of true and false positives in each group. For example, the 605 MMA screen-positive cases were divided into ten groups each containing approximately 10 true positives and 50 false positives. At each validation step, nine sample groups were combined for training, and one group of blinded samples was used for testing. As a result of this 10-fold cross validation, each of the 605 MMA samples received one RF score. Only RF scores from testing cases (and not from training) were used to plot the ROC curve and calculate the AUC. This process was repeated 1000 times for each disorder in order to assess the variation in AUC values (Figure 2). For each disorder, the median number of predicted false positives across the 1000 repeats was determined at the sensitivity level at which this disorder is detected in the California NBS program. Finally, the mean decrease in accuracy (MDA) was used to measure the contribution of individual metabolic analytes in the RF model (Figure 3). Because some of the metabolic analytes were highly correlated (Pearson correlation coefficient > 0.9), MDA was selected instead of the alternative approach using the mean decrease in Gini (MDG) index [19].
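The stratified splitting described above can be sketched as follows. This is a simplified stand-in for the caret-based procedure used in the study: the function name `stratified_folds` is hypothetical, and the split of the 605 MMA screen positives into 103 true and 502 false positives is an assumption inferred from the counts given in the text.

```python
import random
from typing import List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Assign case indices to k folds, keeping the class proportions similar.

    Each class is shuffled separately and dealt round-robin across the folds.
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Assumed composition of the MMA cohort: 103 true and 502 false positives.
labels = [1] * 103 + [0] * 502
folds = stratified_folds(labels, k=10, seed=1)

tp_per_fold = [sum(labels[i] for i in fold) for fold in folds]
fp_per_fold = [len(fold) - tp for fold, tp in zip(folds, tp_per_fold)]
```

Repeating the call with a different `seed` yields a different partition, which is the source of the sampling variation across the 1000 repeats.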

Figure 2. Assessing the performance of RF using cross validation. A 10-fold cross validation (1000 repeats) of the RF model was performed for each disorder to classify each screen positive as either a true or false positive. Only RF scores from testing samples were used to plot the ROC curve and to calculate the AUC. The small variation in AUC values without extreme outlier cases for each disorder demonstrates the overall stability of our RF model.

Web-Based RF Tool and Statistical Analysis
Open-source web-based software was developed for the analysis and interpretation of MS/MS data from newborn screening (https://rusptools.shinyapps.io/RandomForest). The new online tool incorporates our RF model for the four studied diseases and was developed with the R shiny package [20], which is used to build user-friendly interactive web apps with R. The tool's graphical user interface (GUI) was designed to streamline the process of NBS data reanalysis and to facilitate deployment in the NBS laboratory. A cutoff value is required to separate true- and false-positive cases. The estimated sensitivity for detecting true positives was calculated as the median sensitivity from our 10-fold CV (1000 repeats). The default sensitivity cutoff in the software corresponds to the current sensitivity of detecting each disorder in the California NBS program. Users can also customize the cutoff value. A cutoff chosen for high sensitivity corresponds to a low RF score threshold and low specificity. A detailed description of the input data format, output results, and a user guide are available at https://peng-gang.github.io/RUSP_RF_UserGuide/. Statistical analyses, graphs, and design of the research and online tool were done in R software 3.6.1 [21] using these R packages: randomForest [22], ggplot2 [23], pROC [24], caret [25] and shiny [20].
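The link between a sensitivity target and the RF score cutoff can be made concrete with a small sketch. It does not reproduce the online tool's code: the function names and all scores below are hypothetical, and the rule shown (take the highest cutoff that still reaches the target sensitivity) is one plausible way to derive a cutoff.

```python
import math
from typing import Sequence


def cutoff_for_sensitivity(tp_scores: Sequence[float], target: float) -> float:
    """Highest cutoff at which at least `target` of true positives score >= cutoff."""
    ranked = sorted(tp_scores, reverse=True)
    # Small tolerance guards against floating-point error in target * n.
    needed = max(1, math.ceil(target * len(ranked) - 1e-9))
    return ranked[needed - 1]


def sensitivity(tp_scores: Sequence[float], cutoff: float) -> float:
    """Fraction of confirmed cases called true positive at this cutoff."""
    return sum(s >= cutoff for s in tp_scores) / len(tp_scores)


def specificity(fp_scores: Sequence[float], cutoff: float) -> float:
    """Fraction of false positives correctly called false positive at this cutoff."""
    return sum(s < cutoff for s in fp_scores) / len(fp_scores)


# Hypothetical RF scores for confirmed cases and for false positives.
tp_scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.15, 0.04]
fp_scores = [0.02, 0.05, 0.08, 0.10, 0.20]

cut_90 = cutoff_for_sensitivity(tp_scores, 0.90)    # 0.15
cut_100 = cutoff_for_sensitivity(tp_scores, 1.00)   # 0.04
spec_90 = specificity(fp_scores, cut_90)            # 4 of 5 removed
spec_100 = specificity(fp_scores, cut_100)          # 1 of 5 removed
```

In this toy example, raising the sensitivity target from 90% to 100% lowers the cutoff from 0.15 to 0.04 and reduces specificity from 0.8 to 0.2.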

Metabolic Pattern Analysis Using Random Forest
To demonstrate that machine learning could improve discrimination between true- and false-positive cases without compromising sensitivity, we trained an RF classifier on NBS data from screen positives for four metabolic disorders reported by the California NBS program. Without changing the sensitivity of first-tier screening for these disorders, RF reduced the number of false positives by 89% for GA-1, 45% for MMA, 98% for OTCD and 2% for VLCADD (Table 1). Accordingly, the positive predictive value (PPV = TP/(TP + FP)) was significantly improved for three out of the four disorders, with a 6.2-fold increase to 22.3% for GA-1, a 0.6-fold increase to 26.4% for MMA and a 16.7-fold increase to 62.1% for OTCD. The performance of RF for the four disorders ranged from an AUC of 0.80 to 1.00 (95% CI) (Figure 1). The ROC curve shows the relationship between sensitivity and specificity in the RF model. Users can choose any point on the ROC curve to select the desired sensitivity for disease screening, while the vertical line through that point corresponds to the specificity of the RF model for detecting the disease at the selected sensitivity. To further investigate the potential variability in performance of our RF model, we performed a 10-fold cross validation with 1000 repeats for each disorder. This design maximized the sampling differences during cross validation and revealed only small variations in the AUC for each disorder without extreme outlier cases (Figure 2), which indicated the overall stability of the RF model.
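The effect of removing false positives on PPV can be checked with a short worked example for GA-1, using PPV = TP/(TP + FP). The counts are taken from elsewhere in the text (48 confirmed cases, 5 of them below the screening cutoff, and 1344 false positives); the post-RF false-positive count is an approximation derived here from the rounded 89% reduction, so the result only approximates the reported 22.3%.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


# GA-1 counts from the text: 48 confirmed cases, 5 not screen positive.
ga1_tp = 48 - 5
ga1_fp_before = 1344

ppv_before = ppv(ga1_tp, ga1_fp_before)  # about 3.1%

# Approximation: apply the rounded 89% reduction to the false positives.
ga1_fp_after = round(ga1_fp_before * (1 - 0.89))
ppv_after = ppv(ga1_tp, ga1_fp_after)    # about 22.5% with this rounding

# Relative increase in PPV, expressed as a multiple of the starting value.
fold_increase = (ppv_after - ppv_before) / ppv_before
```

Because the number of true positives is held fixed (sensitivity unchanged), every false positive removed shrinks only the denominator, which is why a large reduction in false positives translates into a several-fold increase in PPV.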

Ranking of Metabolic Analytes
The MDA index was used to identify the individual contribution of specific MS/MS analytes and covariates in our RF model. The 20 top-ranked analytes and variables for each disorder are shown in Figure 3. Notably, each of the primary markers used to detect the four diseases in the California NBS program was among the five top-ranked analytes for each disorder. Two of the top-ranked analytes are part of informative ratios for GA-1 (C8) and MMA (methionine) [26,27], while several other analytes were found to be related to these disorders based on a literature search (Table 3). Except for birth weight and TPN ranked at 10 and 17 for MMA, respectively, none of the clinical variables were among the 20 top-ranked RF features.
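The idea behind ranking features by mean decrease in accuracy can be sketched with a simplified permutation test: shuffle one feature, re-score the model, and record how much accuracy drops. This is an illustration under stated assumptions, not the randomForest implementation, which computes the decrease per tree on out-of-bag samples. The toy classifier, the `noise` feature, and the C3 threshold of 5 are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def accuracy(predict: Callable[[Row], int], rows: Sequence[Row], labels: Sequence[int]) -> float:
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict: Callable[[Row], int], rows: Sequence[Row],
                           labels: Sequence[int], feature: str,
                           n_repeats: int = 20, seed: int = 0) -> float:
    """Mean drop in accuracy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops: List[float] = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def toy_classifier(row: Row) -> int:
    """Hypothetical rule: call a case true positive when C3 exceeds 5."""
    return 1 if row["C3"] > 5 else 0


rows = [{"C3": float(c3), "noise": float(i % 3)} for i, c3 in enumerate(range(1, 11))]
labels = [toy_classifier(r) for r in rows]

importance_c3 = permutation_importance(toy_classifier, rows, labels, "C3")
importance_noise = permutation_importance(toy_classifier, rows, labels, "noise")
```

Shuffling the feature that the classifier relies on lowers accuracy, whereas shuffling a feature it ignores leaves accuracy unchanged, so the informative feature receives the higher rank.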

Comparison of CLIR and Random Forest
The performance of RF and CLIR postanalytical tools was compared using MS/MS data for GA-1 screen-positive newborns. To preclude any bias in this comparison, 366 of the 1344 false-positive cases in our cohort (Table 1) were removed from this analysis based on the "Tool Runner" function in CLIR. The remaining cohort of 1026 GA-1 screen positives (48 TP and 978 FP) was analyzed with each method. In CLIR, analysis was performed separately for derivatized (407 FP and 25 TP) and underivatized (571 FP and 23 TP) data, while RF analysis was done for all 1026 GA-1 screen positives combined. As a result, the number of GA-1 false-positive cases was reduced by 93.1% using CLIR (four false negatives) and by 94.6% using RF (five false negatives) (Table 4). Adjusting the RF score cutoff from 0.12 (default cutoff based on 10-fold cross validation) to 0.086 reduced the number of GA-1 false positives by 92.6% (four false negatives).
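The reported percentages can be reproduced with simple arithmetic. The remaining false-positive counts used below are not stated in this section; they are inferred from the Discussion, which gives 72 false positives for RF at the adjusted cutoff (five more than CLIR) and 14 fewer than CLIR at the default cutoff. The function names are hypothetical.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage of false positives removed."""
    return 100 * (before - after) / before


def sensitivity_pct(true_positives: int, false_negatives: int) -> float:
    """Percentage of confirmed cases retained after false negatives are removed."""
    return 100 * (true_positives - false_negatives) / true_positives


FP_TOTAL = 978   # false positives in the comparison cohort
TP_TOTAL = 48    # confirmed GA-1 cases in the comparison cohort

# Remaining false positives, inferred from the Discussion.
fp_rf_adjusted = 72
fp_clir = fp_rf_adjusted - 5   # 67
fp_rf_default = fp_clir - 14   # 53

clir_reduction = round(percent_reduction(FP_TOTAL, fp_clir), 1)
rf_default_reduction = round(percent_reduction(FP_TOTAL, fp_rf_default), 1)
rf_adjusted_reduction = round(percent_reduction(FP_TOTAL, fp_rf_adjusted), 1)

sens_four_fn = round(sensitivity_pct(TP_TOTAL, 4), 1)
sens_five_fn = round(sensitivity_pct(TP_TOTAL, 5), 1)
```

With these counts, four false negatives correspond to a sensitivity of about 91.7% and five false negatives to about 89.6% in this cohort of 48 confirmed cases.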

Web-Based RF Tool
The new software tool is available at https://rusptools.shinyapps.io/RandomForest/ and the GUI is shown in Figure 4. The RUSPtools user guide is also available under Supplementary Material. The workflow starts with selecting a "disorder" and an "NBS program" reporting the data, which in this example are MMA and California, respectively. Users then upload an MS/MS sample data file (i.e., sample_input_file.csv on the website) and click "Run RUSP_RF." Output results containing two boxplots and a table are shown on the right panel and are computed in less than 30 seconds, depending on input file size and server connection. An error message is provided for an incompatible user file format. The boxplots show the distribution of the RF score for each sample in the groups of false positives (blue boxplot, left) and true positives (red boxplot, right). The RF scores of user-uploaded samples are shown in between the two boxplots. The table in Figure 4 lists the corresponding RF predictions (TP or FP) for each sample based on a suggested RF cutoff value. This default sensitivity cutoff was set to be the same as the sensitivity in the state NBS program for each disorder. Users can also customize the cutoff value in the left panel of the online tool.

Discussion
Although MS/MS screening identifies most infants with a metabolic disorder on the RUSP, it also creates a high number of false positives that require additional confirmatory testing of all screen-positive cases. At present, NBS relies on the detection of abnormal levels of only one or a few disease-specific markers and their ratios. We recently showed improved separation of true- and false-positive cases through Random Forest-based analysis of all analytes on the MS/MS screening panel [15]. Here we expanded this RF-based approach for analysis of four metabolic disorders (GA-1, MMA, OTCD and VLCADD), each of which is compromised by high false-positive rates and diagnostic delays following a positive newborn screen. Without changing the sensitivity for detecting these disorders in screening, RF was able to reduce the number of false positives by 89% for GA-1, 45% for MMA, 98% for OTCD and 2% for VLCADD (Figure 1). By reducing false positives in first-tier screening, this RF-based second-tier approach increased the PPV, in particular for detecting GA-1 (from 3% to 22%) and OTCD (from 3% to 62%) (Table 1). These results support our previous findings of improved performance using RF-based analysis of the entire newborn metabolic profile [15].
Metabolic analytes with a large mean decrease in accuracy (MDA) in the RF model are more important for classification of disease status. MDA was used to identify the top-ranked MS/MS analytes and clinical variables for each disorder (Figure 3). All primary MS/MS markers currently in use for identifying screen positives for the four disorders in the California NBS program were among the five top-ranked analytes (Table 3). RF also identified several secondary analytes that are part of important analyte ratios with primary analytes for GA-1 (C5DC/C8), MMA (C3/Methionine) and VLCADD (C14:1/C2) [26][27][28][34,35]. Methionine, which was top ranked by MDA analysis for MMA, has been associated with differences in MMA phenotypic subgroups, with lower levels in patients with remethylation defects (CblC, D or F) compared to mutase deficiency (mut0/−) [15,30]. Notably, methionine was also the top-ranked analyte in the RF model for reducing OTCD false positives (Figure 3). The methionine/citrulline ratio was identified as an OTCD screening marker [30]. Similar in concept to separating MMA subgroups, these results suggest that methionine could be associated with OTCD phenotypic subgroups. However, abnormal levels of multiple serum amino acids such as methionine, proline, alanine and glycine could also be a sign of generalized liver damage seen in OTCD patients [29]. In contrast to the other three diseases, there was only a very small reduction in false-positive cases for VLCADD, which indicates the need for discovery of novel screening markers and for molecular confirmatory testing to identify VLCADD carriers who could mistakenly be classified as false positives [37]. In comparison, a retrospective study using R4S tools showed that sequential postanalytical analysis could have reduced follow-up testing in 25.8% of VLCADD cases [38].
Random Forest incorporates information from all metabolic analytes and clinical variables collected at birth. Analytes and variables with lower association to a particular disorder would be assigned a smaller weight in RF and downranked in the MDA analysis. By including clinical variables in RF, the metabolic analytes can be adjusted in relation to the variable. For example, if an analyte level was higher in males than in females, the cutoff value for this analyte would be automatically adjusted higher for a male compared to a female. The inclusion of additional important analyte ratios could further improve RF performance. Because it may be difficult to simultaneously adjust the levels of many analytes for multiple interacting variables, RF provides a new solution for this problem by directly integrating all the information from screening into a single RF score. A single RF score could improve prediction of metabolic disease status, particularly as the amount of NBS data and the consequent challenges of analyzing these data increase in the future.
To further evaluate the performance of RF, a comparison to CLIR postanalytical tools was performed. Using MS/MS data for GA-1 screen-positive cases, the performances of CLIR and RF were found to be similar for predicting false positives (Table 4). Based on the default RF score cutoff, RF predicted 14 fewer false positives and one more false negative compared to CLIR. Lowering the RF cutoff to reach the same sensitivity as CLIR resulted in four false negatives (same as CLIR) and 72 false positives (five more than CLIR). Notably, CLIR incorporates several million normal screening test results and profiles of screen-positive cases from NBS programs across the US and worldwide. The RF tool in comparison is currently limited to the data from this study, which cover one state (California NBS program) and four diseases. Similar in concept to CLIR, additional NBS data could be readily incorporated to further improve RF-based predictions. RF and CLIR utilize different methodologies with different advantages for reducing FP screens. When comparing results between CLIR and RF for detecting GA-1 false-positive cases, we found that 40 infants were categorized as TP by CLIR and as FP by RF, while an additional 26 infants were categorized as TP by RF and as FP by CLIR. Results from the two tools could be integrated using ensemble methods to achieve better predictive performance than could be obtained from either method alone.
We note that data for metabolic analytes and clinical variables may be collected differently across NBS programs. Age at blood collection, for example, is an important covariate for metabolite levels [39], and some states may collect blood spots earlier than 24 hours. Age at collection was included in the RF model to adjust for its effect on marker levels and to make the algorithm applicable to other NBS programs. However, there may be other distinguishing factors that limit the application of this RF model (built using CA NBS data) to these programs. To address this problem, we could either collect data from different NBS programs and make adjustments in the RF tool (e.g., batch effect correction), or develop different RF models that are tailored to the specific needs of each program.
To facilitate broader application of RF in second-tier analysis and interpretation, we established novel web-based software (https://rusptools.shinyapps.io/RandomForest/). This RF tool could be of primary interest to NBS reference laboratories for evaluating MS/MS data from screen-positive cases. Analysis of individual NBS data and prediction of false-positive screens can be obtained within minutes, provided that the RF model has been established for that particular disease. However, RF-based predictions should always be considered in conjunction with established second-tier confirmatory analysis using biochemical and DNA testing of all screen-positive cases. Ideally, such combined analysis should be performed more rapidly to reduce the number of "false alarms" and positive callouts before parent contact. This is particularly important for inborn metabolic disorders that can present in the first weeks of life and require fast turnaround time of NBS results. The new open-source software creates a low barrier for entry that enables users to rapidly analyze case data and, in turn, to help improve the RF algorithm for newborn screening.
Supplementary Materials: Supplementary materials can be found at http://www.mdpi.com/2409-515X/6/1/16/s1.
Author Contributions: G.P. and C.S. designed the study, G.P. and Y.T. performed the statistical analysis, T.M.C., G.M.E., H.Z., and C.S. provided input on data analysis and interpretation, and G.P., Y.T. and C.S. wrote the manuscript, which all authors edited and approved. All authors have read and agreed to the published version of the manuscript.
Funding: This work was funded in part by institutional support from Yale University and a grant from the National Institute of Child Health and Human Development (R01HD081355).