New Empirical Bayes Models to Jointly Analyze Multiple RNA-Sequencing Data in a Hypophosphatasia Disease Study

Hypophosphatasia is a rare inherited metabolic disorder caused by the deficiency of tissue-nonspecific alkaline phosphatase. More severe and early onset cases present symptoms of muscle weakness, diminished motor coordination, and epileptic seizures. These neurological manifestations are poorly characterized. Thus, it is urgent to discover novel differentially expressed genes for investigating the genetic mechanisms underlying the neurological manifestations of hypophosphatasia. RNA-sequencing data offer a high-resolution and highly accurate transcript profile. In this study, we apply an empirical Bayes model to RNA-sequencing data acquired from the spinal cord and neocortex tissues of a mouse model, individually, to more accurately estimate the genetic effects without bias. More importantly, we further develop two integration methods, weighted gene approach and weighted Z method, to incorporate two RNA-sequencing data into a model for enhancing the effects of genetic markers in the diagnostics of hypophosphatasia disease. The simulation and real data analysis have demonstrated the effectiveness of our proposed integration methods, which can maximize genetic signals identified from the spinal cord and neocortex tissues, minimize the prediction error, and largely improve the prediction accuracy in risk prediction.


Introduction
Hypophosphatasia (HPP) is a rare inherited metabolic disorder, and its severe forms were estimated to affect 1 in 100,000 births in Canada and 1 in 300,000 in Europe, while moderate forms of HPP are 50 times more frequent [1].US data suggest that HPP is more prevalent in white people than in black people [1].The ethnic group with the highest reported incidence of HPP is the Mennonites in Manitoba, Canada, with 1 in 25 individuals carrying a tissue-nonspecific alkaline phosphatase (TNAP) mutation and around 1 in 25,000 newborns having lethal HPP [2].In fact, HPP is caused by the deficiency in TNAP, and TNAP activity is essential for bone formation, mineralization, and differentiation of bone marrow stromal cells.Specifically, TNAP deficiency in bone and muscle progenitor cells results in mitochondrial hyperfunction and increased ATP levels, both of which greatly influence cell function and survival [3].The clinical symptoms of HPP widely vary according to the patient's age at the onset of the disorder.Less severe forms of HPP are often characterized by pain or fractures over time [4,5] due to poor bone mineralization.More severe and early onset cases of HPP present issues with muscle weakness, diminished motor coordination [3], and epileptic seizures [6].These neurological manifestations are poorly characterized, especially in children.Therefore, the development of strategies to improve the neurological outcomes is urgently needed.One goal of this study is to propose a statistical model to accurately reveal the genetic mechanism underlying the neurological manifestations of HPP.In this study, the tissue samples come from Alpl +/+ (wild type) and Alpl−/− (Global TNAP knockout) mice to represent the phenotype of patients with infantile HPP in the human population [3].We have collected two RNA-sequencing (RNA-seq) data from the spinal cord and neocortex tissues of a mouse model of infantile hypophosphatasia.The more important task is to develop a new method to integrate multiple RNA-seq data for discovering influential genes associated with the HPP disease and predicting the disease presence in subjects.
In 2008, RNA-seq was first introduced to the field of transcriptomics [7][8][9][10], and it has since greatly transformed the genetic study.Transcriptomics is the study of the complete set of RNA transcripts that are produced by the genome under specific conditions or in a specific cell.Studying transcript levels is essential to understand the translation process from information encoded in the genome to cellular functions.Consequently, transcript profiling is an important tool for predicting the presence of disease.RNA-seq offers several advantages over other types of transcriptome profiling, particularly single-cell RNA-seq which can identify novel transcripts not corresponding to an existing genomic sequence.Additionally, RNA-seq has a low background signal, and a large dynamic range of expression levels to detect transcripts, and it can precisely locate transcription boundaries down to a single-base resolution [11].More practically, a large amount of RNA transcript is needed, but the cost of mapping whole genome is lower than other data, such as microarrays or cDNA sequencing [11].
While the applications of RNA-seq study have become widespread, this complex process faces a big challenge on how to properly analyze RNA-seq data [12].There exist several pipelines and publications outlining best practices for the entire process from the extraction to the data analysis [12][13][14][15].Many traditional statistical methods have been applied to RNA-seq data such as a likelihood ratio test [16] or negative binomial models [16,17].With the fast growing of deep learning and machine learning in many different domains, deep learning models have recently been applied to RNA-seq data [18,19] with limited success since high-throughput sequencing data fights with the imbalance between a large number of genetic markers and a small number of samples.
As mentioned before, the number of genetic markers in a RNA-seq data is extremely greater than the number of samples, and this kind of data analysis faces a challenge on how to accurately explore the genetic effects from high-dimensional variants.Several frequentist methods have been developed to shrink genetic estimates and reduce bias, such as LASSO [20] and the shrunken centroids methods [21].However, these methods tend to overestimate the genetic effects and include noisy signals.Fortunately, Bayesian approach produces estimates which are immune to estimation bias in a high-dimensional data analysis.In addition, an empirical Bayes (EB) method [22] can simplify the calculations of Bayesian approaches and help address the selection bias problem occurred in a large feature space.This method has been successfully applied to microarray data for a prostate cancer study [22,23].Later, the EB method is extended to exome DNA sequencing data for predicting the risk of cardiovascular disease [24].Recently, a weighted EB model is derived to incorporate multiple traits, a quantitative disease trait, and a binary disease status, in whole-genome DNA sequencing analysis for improving the genetic mechanism underlying a hypertension disease [25].Here, with the collection of multiple RNA-seq data from the spinal cord and neocortex tissues of a mouse model, there remains a need to develop a new EB method which combines two RNA-seq data into a model for maximizing the effects of genetic markers, and effectively reducing the estimate bias in a high-dimensional data analysis.
In this study, we apply the EB method to the spinal cord and neocortex RNA-seq data, separately, for more accurately discovering the differentially expressed (DE) genes in the HPP disease study.More importantly, we further propose two integration methods, weighted gene approach and weighted Z method, to incorporate two RNA-seq data into a model for enhancing the effects of genetic variants in the diagnostics of HPP disease.We expect that the proposed integration methods can strengthen the genetic signals and im-prove the prediction performance, because some disease-causal genes may be differentially expressed at one specific tissue, not at both tissues.To test this conjecture, we compare the empirical Bayes method, which is applied to the single RNA-seq data, with two integration empirical Bayes methods jointly analyzing two RNA-seq data.The results will demonstrate the efficiency of our proposed integration methods.
This paper is outlined as follows.Section 2 discusses four EB methods.Section 3 proves the effectiveness of two integration methods.In the following Section 4, the specifics of a real RNA-seq data analysis are discussed, including a description of the datasets, distributions of the statistics, ranking of important features, and comparison of various methods.We conclude with a few remarks from the data analysis, and further discuss the limitations of current methods, and give some directions of future work in Section 5.

Step I: Gene Expression
The experimental procedures of this study were approved by the Institutional Animal Care and Use Committee (IACUC) of the University of Michigan (Protocol number: PRO00010860, date of approval: 22 August 2022).Here, two RNA-seq data are collected from spinal cord and neocortex tissues.Each data is composed of tens of thousands of genes collected from 16 mice.Those mice are classified into two categories, 8 mice are assigned to a wild type group and another 8 mice are in a knock-out group.A balanced design is applied here.The wild type is treated as a normal group, and the knock-out class is treated as a disease group.Specifically, two RNA-seq data share common samples, and the main discrepancy between two data may come from the assumption that combining DE genes detected from two tissues may strengthen the genetic signal in the diagnostics of HPP disease.Moreover, raw RNA-seq data is transformed via a binary logarithm function, so that its distributions based on the transformed data more closely resemble the normal distribution, which better aligns with the empirical Bayes method.The relevant workflow of study is summarized in Figure 1.

2.2.
Step II: Empirical Bayes Prediction Rule for a Spinal Cord RNA-Sequencing Data RNA-seq data collected from the spinal cord is critical to the central nervous system, and its analysis helps understand the genetic mechanism of the HPP disease.We assume that there are N genes X p = (X 1|p , . . ., X N|p ) where X p is a n × N matrix measuring N log transformed genes of n subjects, and p is a label of spinal cord.The corresponding standardized gene matrix W p = (W 1|p , . . ., W N|p ) is defined by an equation W i|p = where X i|p is one column of matrix X p measuring the i th transformed gene for all subjects, 1 n is a vector of size n containing ones, µ i,1|p and µ i,2|p denote the mean of the i th gene in the normal and disease groups, individually.σ i|p is the standard deviation of the i th gene, and n represents the total number of subjects consisting of n 1 controls (normal subjects) and n 2 cases (disease subjects) with n 1 = n 2 = n 2 .Thus, the prediction rule is to classify subjects into the disease group if where the parameter 2 ) measures the impact of the i th gene on the HPP disease.In this study, a balanced design is applied (n 1 = n 2 ), then, a threshold zero is selected to define the decision boundary.In the above equation, a vector 0 is used to allocate subjects having large prediction values to a disease group.
In reality, σ i|p is often unknown, but it can be estimated by a statistic , where s 2 i,1|p and s 2 i,2|p are sample variances of the i th gene in the normal class and disease group, respectively.Hence, the parameter δ i|p will be estimated by Note that Xi,k|p = is the sample mean of the i th gene in the normal or disease group where Y j|p (= 0 or 1) is the jth subject's disease status.To take the advantage of normal property, t i|p is converted into a normal variable Z i|p = Φ −1 (P(T ≤ t i|p )), where Φ −1 is the inverse of the cumulative distribution function of standard normal and P(T ≤ t i|p ) is a cumulative distribution function of t distribution with n − 2 degrees of freedom.Thus, Z i|p becomes an estimate of δ i when we analyze a spinal cord RNA-seq data, and it follows a normal distribution with mean δ i and the standard deviation close to 1 [22] Remark F. The standardized gene W i|p is estimated by Ŵi|p = . The prediction rule based on Z i|p will be defined below, ∑ i≤N Z i|p Ŵi|p > 0 However, the estimate Z i|p may result in a selection bias, wherein genetic effects are often overestimated in the above prediction rule.It is known that Bayesian methods may provide less biased estimates [26,27], and previous studies [28,29] have shown that the marginal density function of statistic Z i|p is capable of deriving the Bayesian estimate without knowing the prior of δ i .For any pair (Z, δ), the Bayesian estimate δ is obtained by the first derivative of the logarithm of the marginal density of Z [22], where l f (.) estimates the logarithm of the marginal density of Z, l f ′ (z) is its first derivative function, and s 2 (sample variance) is numerically calculated.Basically, the empirical Bayes method provides a numerical approximation of the Bayes estimates [22].
Finally, a subset of genes with large EB estimates denoted as δi|p define the following prediction rule: where a set I 1 collects all potential genes with strong EB estimates ( δi|p , i ∈ I 1 ) in a spinal cord RNA-seq analysis.

Step III: Empirical Bayes Prediction Rule for a Neocortex RNA-Sequencing Data
The second RNA-seq data is collected from the brain neocortex of the mouse, and it is also used to explore the genetic mechanism of HPP disease.Here, the feature dimension of the neocortex RNA-seq data is the same as that of the above spinal cord data.For N genes, X c = (X 1|c , . . ., X N|c ) denotes all genes measured from the neocortex tissue, and the corresponding standardized gene matrix W c = (W 1|c , . . ., W N|c ) is calculated by and µ i,2|c denote the mean of the i th gene in the normal and disease groups, individually, σ i|c denotes the standard deviation of the i th gene, and n represents the total number of subjects.Since two RNA-seq data share the common samples, n, n 1 , and n 2 can be defined in the same way.Thus, the prediction rule is to classify subjects to the disease group if where the parameter assesses the impact of the i th gene on the disease when we analyze neocortex RNA-seq data.Since an equal number of samples are assigned to the normal and disease groups, a balanced design may select a zero threshold to determine the decision boundary.
Similarly, an estimator t i|c of δ i is calculated below: Note that this statistic t i|c is also converted into a normal variable Z i|c = Φ −1 (P(T ≤ t i|c )), which has a selection bias in practice.We shrink Z i|c to obtain the relevant EB estimator δi|c for the i th gene (i = 1, ..., N).The final prediction rule of a neocortex RNA-seq data analysis is defined as where Ŵi|c is the i th sample standardized gene vector calculated from the neocortex data, and its formula is similar to that of Section 2.2, and a set I 2 collects all genes with important EB estimates ( δi|c , i ∈ I 2 ).

2.4.
Step IV: Empirical Bayes Prediction Rule When Integrating the Spinal Cord and Neocortex RNA-Seq Data 2.4.1.Weighted Gene Approach Since two RNA-seq data are collected from spinal cord and neocortex tissues, we expect that combining the genetic signals from multiple RNA-seq data may improve the genetic effects in the diagnostics of HPP disease.The idea of weighted gene method was first developed in the empirical Bayes model by integrating the synonymous gene and non-synonymous gene together for improving the cardiovascular disease prediction performance in an exome DNA sequencing study [24].Here, we extend this idea to combine two RNA-seq data acquired from the spinal cord and neocortex tissues, and it may maximize the genetic signals.
For all subjects, X p and X c denote N genes of the spinal cord and neocortex, individually, where p represents spinal cord, and c denotes neocortex.While we analyze two RNA-seq data, a weight (w i ) measuring the relative importance of the i th gene effect detected from the spinal cord data compared to that from the neocortex data is calculated by where p i|p and p i|c are p-values calculated from a simple logistic regression model which detects the i th gene effect on the HPP disease corresponding to the spinal cord data and neocortex data, individually.A larger weight suggests that the genetic effect identified from the spinal cord may have a relatively stronger association with the disease than that from the neocortex.A new weighted gene (X * i ) will make use of this weight to combine the genetic effects from two RNA-seq data, and it is defined below: These N weighted genes (X * 1 , . . ., X * N ) are capable of maximizing the genetic effects detected from the spinal cord and neocortex tissues.The relative empirical Bayes prediction rule will classify subjects to the disease if Note that δ i measures the genetic effect of the i th standardized weighted gene, where is the i th standardized weighted gene which is defined similarly as in the aforementioned section.Similarly, µ * i,1 and µ * i,2 are the mean values of the i th weighted gene corresponding to the normal group and the disease group, and σ * i is the standard deviation of the i th weighted gene.As before, the statistic (t * i ) is calculated to estimate the parameter δ i which reflects the effect of the i th weighted gene.
where X * i,1 and X * i,2 are sample means of the i th weighted gene in the normal class and disease class, respectively, s * i is the pooled sample standard deviation of the i th weighted gene, and i is converted into an estimator Z * i through an inverse standard normal function Φ −1 and a cumulative distribution function P(T ≤ t * i ).The normal property of Z * i guarantees the effectiveness of parameter δ i estimation.To avoid the selection bias, the empirical Bayes method will shrink the estimator Z * i to an EB estimate δi .The final prediction rule will be based on all important EB estimates, where a set I 3 collects all standardized weighted genes having strong EB estimates, Ŵ * i is the estimate of the i th standardized weighted gene combining spinal cord and neocortex genetic signals, and its standardized formula is defined similarly as in the previous section.

Weighted Z Method
The principle of weighted gene method is to calculate a weight (w i ) for each gene, and it enlarges the genetic signal from two RNA-seq data.An alternative way for strengthening the genetic signal is to calculate a weighted Z score, which combines a Z statistic detected from the spinal cord data with that from the neocortex data.We expect to increase the genetic effects when scanning two RNA-seq data.
Z i|p defined in Section 2.2 is a genetic estimate of the i th gene, and this statistic reflects the importance of the i th gene identified from spinal cord data.Similarly, Z i|c defined in Section 2.3 is also an estimate of the i th gene, and it explores the genetic effect calculated from the neocortex data.A weight zw i is defined below: where |Z i|p | and |Z i|c | are absolute values of statistics calculated from the spinal cord data and neocortex data, respectively.For a fixed sample data, the weight zw i reflects the relative importance of the i th gene effect detected from a spinal cord data compared to that from a neocortex data.A larger weight indicates that the i th gene detected from the spinal cord shows a stronger signal than that from the neocortex.To better capture the genetic effect from two RNA-seq data, a weighted Z statistic is calculated below: According to Sections 2.2 and 2.3, Z i|p and Z i|c follow the normal distribution with mean δ i and variance close to 1, respectively.The weighted Z w i is a function of two variables Z i|p and Z i|c .While data is fixed, this new statistic Z w i approximates a normal distribution with mean δ i and the standard deviation s Z w i (s , where ρ is the correlation coefficient between two statistics Z p and Z c calculated from the spinal cord and neocortex data, and it will help calculate the standard deviation of the statistic Z w i .In real application, this new statistic Z w i reflects the genetic effect of the i th gene after combining two statistics Z i|p and Z i|c .Then, we apply the empirical Bayes method to shrink this statistic (Z w i ), and the relevant EB estimate δi is estimated to avoid selection bias.The final prediction rule will be determined by all important EB estimates.

Simulation Simulation Analysis
To assess the performances of our proposed methods, a semi-parametric simulation method (SPsimSeq: [30]) is applied to generate two RNA-seq data corresponding to spinal cord and neocortex RNA-seq data.This simulation method requires to input raw count data, then, it is designed to maximally retain the characteristics of real RNA-seq data.In particular, two simulation data are capable of capturing the gene-wise distributions and the between-genes correlation structure of real source data.In the simulation scenario, 3000 genes and 16 samples are simulated.Specifically, around 2% of genes (65 genes) are selected to be DE genes which are important to HPP disease, and the remaining genes are null genes.A total of 16 samples are allocated to two classes, then, eight subjects are assigned to disease and eight subjects are assigned to normal.Figures 2 and 3 compare the variability and distribution of mean expression levels between simulation data and real data.It clearly illustrates that both simulation data have retain major characteristics of real spinal cord and neocortex RNA-seq data.
Feature selection is a key step in the risk prediction, where genes are ranked by their importance to the disease.All discussed methods are compared by their discoveries of true DE genes.We expect more DE genes selected by our proposed methods to possess larger effects and occupy top positions in the list.Table 1 summarizes the number of DE genes among top 100 or 200 genes identified by each method.The more the discovery of DE genes is, the better the corresponding method is.The first two methods (Spinal and Cortex) apply the empirical Bayes model to single spinal cord simulation data and neocortex data, respectively (Table 1).This shows that single spinal cord data analysis can discover 21 true DE genes among top 100/200 genes, and the neocortex data analysis only identifies five DE genes among top 100/200 genes.These two simulation results are consistent with the findings of real data analysis.Compared to spinal cord data, neocortex data shows poor genetic singles.Table 1 also summarizes two integration methods' results, such as weighted gene approach which detects 53 and 58 DE genes among top 100 genes and top 200 genes, respectively.This method tries to maximize the genetic signals from two RNA-seq data, thus, 53 selected DE genes consist of 44 spinal cord DE genes and nine neocortex DE genes, and 58 DE genes include 49 spinal cord DE genes and nine neocortex DE genes.The alternative integration method, weighted Z, can even detect a larger number of true genes, such as 56 DE genes among top 100 genes, and 61 DE genes among top 200 genes.This finding is also consistent with the real data analysis, where weighted Z performs better in the areas of prediction error and accuracy.In addition, we also display the details of top 30 genes ranked by various methods (Table 2).In particular, yellow color highlights the spinal cord DE genes, and cyan color denotes the neocortex DE genes.This illustrates that two integration methods are capable of strengthening the genetic signals from two simulation RNA-seq data.
We further compare the prediction errors of various methods.Both simulation data are split into two sets, a training dataset (50% samples) and a test dataset (50% samples).Each proposed model is fit into a training set, then its model estimate is applied to a test set to calculate the prediction error.We repeat this cross-validation (CV) procedure five times to calculate the average test error.Table 3 summarizes the error rate and its standard error of each method.The results demonstrate that two integration methods perform best, like weighted Z has the smallest error rate (0.16), and weighted gene approach (0.225), followed by single spinal cord data analysis (0.275), and single neocortex data analysis (0.3).In general, the simulation study helps understand RNA-seq data properties.(1) Two simulation data retain major characteristics of real spinal cord and neocortex RNA-seq data.(2) Two integration methods, weighted gene and weighted Z, detect a larger number of DE genes and receive a smaller prediction error, compared to the single spinal cord data analysis or single neocortex data analysis.(3) If we focus on two single data analyses (spinal cord and neocortex) compared to each other, the spinal cord data makes it easier to detect DE genes.It seems that the stronger effects of genes are detected from the spinal cord tissue.(4) The performances of simulation data are generally consistent with those of real data.

Real Application Data Analysis
Hypophosphatasia is a rare inheritable disorder caused by TNAP deficiency.Previous studies have demonstrated that TNAP deficiency results in sensorimotor dysfunction [3].Here, two RNA-seq data acquired from the brain neocortex and spinal cord of Alpl +/+ (wild-type) and Alpl −/− (Global TNAP knockout) mice have been collected.This transgenic mouse model of infantile HPP represents the more severe phenotype of HPP in the human population.We expect that RNA transcriptional profile data near candidate genes may enrich the effects of genetic variants that fall within a sequence data, and any new findings will help better understand the mechanisms of pediatric brain injuries during brain development.In this study, both RNA-seq data include 36,500 gene markers and 16 samples.According to one covariate, gender, male and female mice are equally allocated to normal and disease groups, as seen in Appendix A Table A5.Noticeably, the number of genes is much larger than the number of samples.Consequently, both spinal cord and neocortex data are cleaned to further reduce their feature dimensionality.The initiate analysis, providing log 2 fold-change, log fold-change standard error, and t statistic, serves as reference for cleaning the data.The criteria suggest that few genes having a small difference in fold change between the wild-type and knock-out mice, and having an extremely smaller standard error, should be removed.These genes may easily provide the false positive signals and increase the detection noise when we scan all genes in an RNA-seq study.After cleaning the data, 28,426 genes and 16 samples are included in two RNA-seq datasets.
It is known that the statistic Z estimates the genetic effect for each gene, and its normal property will assure the successful application of the empirical Bayes method.Figure 4 summarizes four histograms of all genes' Z scores calculated from the spinal cord, neocortex, weighted gene, and weighted Z methods, respectively.Specifically, the red color line in each subplot represents the standard normal density curve.Compared to the integration methods, the first two subplots corresponding to the spinal cord and neocortex methods show that Z statistics have a small tail, which makes the detection of novel genes more challenging.One possible reason is that these two analyses only consider single RNA-seq data information.If we further compare the subplots between the spinal cord and the neocortex, it may be revealed that the spinal cord analysis seems to have a heavier tail, and a good normal property, which may result in a better performance on genetic detections in the RNA-seq study.Conversely, two integration methods, weighted gene and weighted Z methods, display the heavier tails on both sides (negative/positive), which suggests we may easily detect more causal genes which are associated to the HPP disease.
One key step in the empirical Bayes method [22] is to rank the selected genes (features).All methods rank the most important genes by their respective empirical Bayes estimates.The top gene lists of four methods (spinal cord, neocortex, weighted gene, and weighted Z) are summarized.It is known that the genetic effects from the spinal cord and brain neocortex are important to the central nervous system.When we integrate two RNAseq data, we expect that the combined method will more efficiently search for the most important genetic markers, and its performance will be better than the analysis result based on a single RNA-seq data.Most importantly, HPP is characterized by the reduced serum alkaline phosphatase (ALP), and its molecular diagnosis is established by identifying the loss-of-function ALPL variants [31], which is the most important gene highly related to HPP disease.In this study, Table 4 displays top 10 most important genes from a spinal cord data analysis, and it successfully detects the Alpl gene (position: No. 4).Table 4 also summarizes top 10 genes based on a neocortex data analysis, and it also detects Alpl, but it is ranked much lower at No. 26.It is interesting to see that another gene, Cirbp, is ranked much higher (position: No. 2).This gene (Cirbp) is related to the cold inducible RNA binding protein, and may be important in the diagnosis of the HPP disease.Since two RNA-seq data are collected, it is capable of proposing two integration methods described in Sections 2.4.1 and 2.4.2 to strengthen as many genetic signals as possible.In particular, Table 5 summarizes top 10 gene list from the weighted gene method, and its list includes Alpl (Position: No. 6) and Cirbp (Position: No. 9).It seems that the weighted gene method can identify these two genes among top 10 genes by maximizing the genetic effects from two RNA-seq datasets.Additionally, more genes, namely GM13230, zfp990, Tmprss11d, and Eno1b, detected by the single spinal cord data analysis, are also observed in the top four gene positions.The second integration method, weighted Z method, displays its top 10 genes in Table 5.It also detects two genes (Alpl and Cirbp), but two genes (Cirbp and Fkbp5) identified by the single neocortex data analysis are ranked at the front, followed by the genes Alpl, GM13230, zfp990 and Eno1b detected by the single spinal cord data analysis.This may suggest that weighted Z method provides an alternative way to maximize the genetic effects from two RNA-seq data.When we extend the top 10 genes to the top 40 genes, it is better to see that more top-ranked genes detected by a spinal cord data analysis or a neocortex data analysis are selected by the integration methods.The relevant top 40 gene lists are summarized in Appendix A (Tables A1-A4).Generally, the most important HPP disease causal gene, Alpl, is successfully detected by four methods, especially the integration methods which are capable of maximizing multiple gene signals from a spinal cord data and a neocortex RNA-seq data.HPP is a complex disease, and its diagnosis is based on the collective actions of multiple genes.Identifying more genes will help us better understand their biological effects in the diagnosis of HPP disease.
The identifications of most important genes complete the feature selection procedure.The next step is to evaluate the risk prediction based on the selected candidate genes.We split samples into a train and a test set.Since the sample size is rather small, 50% of the subjects are randomly assigned to a train set, and the remaining goes to a test group.This cross-validation is repeated five times to calculate the average test error.Table 6 summarizes and compares the prediction errors of four methods.Specifically, the prediction error of a neocortex data analysis gives the largest error rate (0.15).In fact, this result is consistent with its poor performance in the Z histogram.While the EB prediction rule applies to a spinal cord data, it performs slightly better and a smaller prediction error (0.125) is obtained.The proposed integration methods, weighted gene and weighted Z methods, expect to better search the most important genes for risk prediction.In particular, the error rate of weighted gene method is 0.1, which is smaller than the error rates based on the single spinal cord and the neocortex data analysis.Weighted Z method combines two Z statistics detected from two RNA-seq datasets to select the causal genes, and those selected genes will be separately applied to spinal cord and neocortex data to compute the test error rates.The final error is based on the average value of two data results, and it is around 0.0625, which shows the minimum prediction error.In summary, the integration methods, weighted Z and weighted gene methods, provide the smaller prediction error rates, and these two methods may better explore the genetic effects in the risk prediction of HPP disease.
We also calculate the area under the receiver operating characteristic (ROC) curve (AUC) to compare the prediction accuracy among four different methods.Table 7 compares the prediction accuracy across four methods.Specifically, the neocortex data analysis gives the smallest prediction accuracy (AUC: 85%), followed by the spinal cord data analysis, where its AUC achieves 87.5%; then, weighted gene approach achieves an AUC of 90%, and weighted Z method has the highest prediction accuracy (AUC: 93.75%).This analysis reveals the importance of empirical Bayes estimates in developing a large-scale risk prediction model, and two integration methods further improve the prediction accuracy while jointly analyzing two RNA-seq data.Figure 5 illustrates the comparison of ROCs across four methods under the cross-validation procedure.This figure demonstrates that two integration methods outperform the individual RNA-seq data analysis in terms of prediction accuracy.The results imply that our proposed methods are efficient prediction tools for a large-scale RNA-seq data analysis.

Distribution of Spinal
Cord Z Scores

Discussion
Hypophosphatasia is a rare inherited metabolic disorder caused by tissue-nonspecific alkaline phosphatase deficiency which influences cell function and survival [3].It manifests as a variety of clinical symptoms where more severe and early onset cases are often characterized by neurological symptoms, such as sensorimotor dysfunction [3] or epileptic seizures [6].However, these neurological manifestations are poorly characterized, particularly in children.RNA-seq data offer a high-resolution and highly accurate transcript profile, which is an important tool for predicting disease risk.In this study, two RNA-seq data are acquired from spinal cord and neocortex tissues of a mouse model to explore the genetic mechanism underlying the diagnostics of HPP disease.Typically, RNA-seq data result in a large number of predictors and a limited number of samples, which challenges many traditional statistical methods because these methods may easily overestimate the genetic effects due to a selection bias.A method such as empirical Bayes [22] attempts to mitigate this problem by shrinking the estimates in the gene expression study.Therefore, we propose two integration methods, weighted gene and weighted Z, to jointly analyze multiple RNA-seq data for maximizing the genetic signals in an empirical Bayes prediction rule.This study will help develop age-specific strategies and improve neurological outcomes in patients.
This project focuses on the discoveries of differentially expressed genes from multiple RNA-seq data, but HPP disease may be caused by the complex relationships between genetic factors and non-genetic factors.In real applications, genes often interact with each other or act with the environment factors, therefore, research adding gene-gene and gene-environment interactions seems to be beneficial.More importantly, the differential expressed genes in mammals play an important role in controlling brain growth and development, especially for children.The malfunction of genes at any developmental stage could lead to substantially abnormal characters such as genetic disorders.As a highly complex process, changes in gene expression can redirect its developmental trajectory to better adapt to environmental conditions.For this reason, incorporating such information into the empirical Bayes model should provide more information about the genetic architecture of a dynamic developmental trait.
The long-term goal is to discover the interaction between genetic transcriptome and metabolic markers underlying the diagnostics of HPP disease.In addition to RNA-seq data, we have collected metabolome data from the spinal cord and neocortex tissues of mice.These multiple omics data will help develop the age-specific and pathway-specific personalized/targeted treatments.Since metabolites data directly reflect the changes in disease phenotype and ensuing effects of post-translational modifications [32], it is increasingly collected for biomarker discovery.Current integration approaches fail to capture complex or indirect relationship between transcripts and metabolites, furthermore, the pathway methods are limited to metabolites, and only a small fraction of metabolites have been mapped to it.A promising method IntLIM [33] integrates trancriptomic and metabolomic data via a linear model, which provides a way to discover different performances of the gene-metabolite interaction between normal and disease groups.Thus, our future work is going to integrate and augment different omics data, such as RNA-seq data and metabolomics data, for recognizing the complex architecture of markers identified from multiple omics data during the development of HPP disease.
The most difficult obstacles in this study is the small number of samples (n = 16).Due to the input uncertainty, our analysis may make the genetic estimates not very stable.Specifically, while calculating the average prediction error, it is impossible to strictly follow the cross-validation procedure.We have to split raw data into 50-50 subsets, and repeat it five times.This procedure leads to a potential issue (a kind of false positive) that five training sets/five test data share a large proportion of common samples.However, it is rather difficult to enlarge the sample size due to the budget limitation.For this reason, we plan to work on a new theoretical method, distribution robust optimization (DRO) approach, to obtain steady genetic estimates.DRO method [34] is attractive because it combines features of robust optimization method and stochastic optimization method to inherit the benefits of both methods.In general, DRO can overcome the conservative estimates of the robust optimization method without the exact distribution required in stochastic optimization.Thus, a new EB model incorporating the principle of DRO method may take the advantage of its theoretical properties and computational advantage, which will benefit from both the discovery of DE genes and the genetic interaction between metabolomic markers and DE genes when analyzing a small number of samples.The ranking of the top 40 genes using the EB method on the neocortex data; EB: empirical Bayes estimate; Neocortex: neocortex RNA-seq data.

Figure 1 .
Figure 1.The workflow of this study.

Figure 4 .Figure 5 .
Figure 4. Distribution of Z statistics generated by a spinal cord data, a neocortex data, and two integration methods.

Table 1 .
Number of true genes in the top 100/200 genes.
Note: Spinal: EB method is applied to a single spinal cord simulation data.Cortex: EB method is applied to a single neocortex simulation data.Weighted Genes (p): the weighted gene approach is jointly applied to two simulation data.Weighted Z: the weighted Z method is jointly applied to two simulation data.Row 1 is used to calculate how many true DE genes are among top 100 genes detected by various methods.Row 2 is used to calculate how many true DE genes are among top 200 genes identified by various methods.
Note: yellow color denotes the spinal cord DE genes.Cyan color represents the neocortex DE genes.

Table 3 .
Prediction errors of various methods.

Table 4 .
Top 10 gene list for a spinal cord data analysis and a neocortex data analysis.

Table 5 .
Top 10 gene list for two integration methods, weighted gene and weighted Z.
Note: Top 10 genes based on two integration methods; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.

Table 6 .
Average prediction error of four methods.
Note: Error denotes the average prediction error; SD is a standard deviation of prediction error; Spinal cord: a spinal cord RNA-seq data analysis; Neocortex: a neocortex RNA-seq data analysis; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.

Table 7 .
The prediction accuracy of four methods.
AUC: is the area under the receiver operating characteristic.Spinal cord: a spinal cord RNA-seq data analysis; Neocortex: a neocortex RNA-seq data analysis; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.

Table A2 .
Top 40 most important genes for the EB classifier using a neocortex data analysis.

Table A3 .
Top 40 most important genes for the EB classifier using the weighted gene method.