MetPC: Metabolite Pipeline Consisting of Metabolite Identification and Biomarker Discovery Under the Control of Two-Dimensional FDR

Because metabolomics data are complex, a unified platform that covers everything from preprocessing to data analysis has long been in demand. We therefore developed a new bioinformatics tool that includes several preprocessing steps and a biomarker discovery procedure. For metabolite identification, we use a hierarchical statistical model coupled with an Expectation–Maximization (EM) algorithm to handle latent variables. For biomarker metabolite discovery, our procedure controls the two-dimensional false discovery rate (fdr2d) when testing multiple hypotheses simultaneously.


Metabolite identification: MetID
We perform identification by calling the MetID function, which requires inputs such as the dissimilarity score, the competition score, and initial parameter values.
# Metabolite identification
MetID(500,pldata,0.8,obdata,muT=2,muF=15,muF2=50,sigmaT=3,sigmaF=15,sigmaF2=50)
The first argument, 500, is the number of iterations of the EM algorithm, and 0.8 is the cutoff for the confidence measure. pldata is the library data after peak merging, and obdata contains the scores calculated from the sample and library data. The last six arguments are the initial values of the means and variances in the three-component normal mixture model. After identification, we obtain the results below:

Parameter estimation by the EM algorithm
For parameter estimation, we assume that the score density is a two- or three-component normal mixture. Accordingly, we implemented three functions: estpar tf, estpar ttf, and estpar tff. For a two-component normal mixture, estpar tf is used. For a three-component normal mixture, we consider two scenarios: depending on the situation, either the true-score density or the false-score density can itself be modeled as a two-component normal mixture. estpar ttf is used when the distribution of the true scores is a mixture, while estpar tff is used when the distribution of the false scores is a mixture. As a quick check of the density estimate, we suggest comparing it with the kernel density estimate.
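The EM updates for the simplest case, a two-component normal mixture, can be sketched in base R as follows. This is an illustration only and not the code of the estpar functions: the function name em_two_normal, its arguments, the simulated scores, and the starting values are all assumptions made for the example.

```r
# Illustrative EM for a two-component normal mixture (hypothetical helper,
# not the package's estpar functions).
em_two_normal <- function(x, mu, sigma, p = 0.5, n_iter = 500) {
  # mu, sigma: length-2 vectors of initial means and standard deviations
  # p: initial mixing weight of component 1
  trace <- matrix(NA_real_, nrow = n_iter, ncol = 5,
                  dimnames = list(NULL, c("p", "mu1", "mu2", "sigma1", "sigma2")))
  for (i in seq_len(n_iter)) {
    # E-step: posterior probability that each score belongs to component 1
    d1 <- p * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
    w <- d1 / (d1 + d2)
    # M-step: weighted updates of mixing weight, means, standard deviations
    p <- mean(w)
    mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))
    sigma <- c(sqrt(sum(w * (x - mu[1])^2) / sum(w)),
               sqrt(sum((1 - w) * (x - mu[2])^2) / sum(1 - w)))
    trace[i, ] <- c(p, mu, sigma)   # keep every iterate for trace plots
  }
  list(p = p, mu = mu, sigma = sigma, trace = trace)
}

# Simulated scores (not real data): 30% from N(2, 1), 70% from N(15, 4)
set.seed(1)
scores <- c(rnorm(300, mean = 2, sd = 1), rnorm(700, mean = 15, sd = 4))
fit <- em_two_normal(scores, mu = c(1, 10), sigma = c(2, 5), n_iter = 200)
```

Each column of fit$trace can be plotted against the iteration index, for example with plot(fit$trace[, "mu1"], type = "l"), to check that the estimates have stabilized.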
For illustration, we used estpar tff to estimate the parameters with 500 iterations of the EM algorithm. The following code generates Figure 3, which shows trace plots for four selected parameter estimates.

Kernel density estimator
The kernel density estimator (a non-parametric estimator) is used for two purposes. First, it helps decide the type of normal mixture: two- or three-component. Second, it can be used to check the accuracy of the parameter estimates by examining how well the two density estimates overlap.
Both density estimates are shown in Figure 4. The normal mixture density estimate closely overlaps the kernel density estimate.
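A check of this kind can be sketched in base R. The scores are simulated and the mixture parameters are assumed values standing in for EM estimates. The "SJ" bandwidth and the integrated absolute difference used as an overlap summary are choices made for this example, not part of the tool.

```r
# Simulated scores (not real data) from a two-component normal mixture
set.seed(2)
scores <- c(rnorm(300, mean = 2, sd = 1), rnorm(700, mean = 15, sd = 4))

# Assumed mixture parameters; in practice these would be the EM estimates
p <- 0.3
mu <- c(2, 15)
sigma <- c(1, 4)

# Kernel density estimate (Gaussian kernel, Sheather-Jones bandwidth)
kde <- density(scores, bw = "SJ")

# Parametric mixture density evaluated on the same grid
mix <- p * dnorm(kde$x, mu[1], sigma[1]) +
  (1 - p) * dnorm(kde$x, mu[2], sigma[2])

# Numerical overlap summary: integrated absolute difference between the curves
step <- diff(kde$x[1:2])
iad <- sum(abs(kde$y - mix)) * step

# Visual check: overlay both estimates (only in an interactive session)
if (interactive()) {
  plot(kde, main = "Kernel vs. mixture density")
  lines(kde$x, mix, lty = 2)
  legend("topright", legend = c("kernel", "mixture"), lty = c(1, 2))
}
```

A small value of iad, together with curves that visually coincide, suggests that the chosen number of components and the parameter estimates describe the scores adequately.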

Biomarker Discovery
Biomarker metabolites are discovered under the control of the two-dimensional local false discovery rate (2d-fdr), which was implemented by Ploner et al. (2006). For biomarker discovery, a separate data set is used. The data are pre-processed before use by log-transformation and standardization. The following code shows how to conduct biomarker discovery with the fdr2d function.
# 2d-fdr
fdr <- fdr2d(ctdat, colnames(ctdat), nperm=500)
summary(fdr)
nperm is the number of permutations of the group labels used to estimate the 2d-fdr. Here, we used 500 permutations. The following code shows how to generate two plots: a tornado plot and a volcano plot.
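As a supplementary illustration of the preprocessing and the permutation step described above (this is not the plotting code, and not the internals of fdr2d), the sketch below log-transforms and standardizes a simulated intensity matrix, then computes a t-statistic and a log standard error per metabolite under permuted group labels. The data layout (metabolites in rows, samples in columns), the sample-wise standardization, the group names, and the helper tstat_logse are all assumptions made for the example.

```r
# Simulated intensities (not real data): 50 metabolites x 12 samples
set.seed(3)
raw <- matrix(rlnorm(50 * 12, meanlog = 5, sdlog = 1), nrow = 50)
grp <- rep(c("case", "control"), each = 6)   # hypothetical group labels

# Preprocessing: log-transform, then standardize each sample (column)
# to mean 0 and sd 1; the standardization axis is an assumption here
dat <- scale(log(raw))

# Hypothetical helper: pooled-variance t-statistic and log standard error
tstat_logse <- function(x, g) {
  a <- x[, g == "case", drop = FALSE]
  b <- x[, g == "control", drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  sp2 <- ((na - 1) * apply(a, 1, var) + (nb - 1) * apply(b, 1, var)) /
    (na + nb - 2)
  se <- sqrt(sp2 * (1 / na + 1 / nb))
  cbind(t = (rowMeans(a) - rowMeans(b)) / se, logse = log(se))
}

# Observed statistics, one row per metabolite
obs <- tstat_logse(dat, grp)

# Permutation null: shuffle the group labels and recompute the t-statistics
nperm <- 500
null_t <- replicate(nperm, tstat_logse(dat, sample(grp))[, "t"])
```

Each column of null_t holds the t-statistics from one permutation of the labels. Null statistics of this kind are what a permutation-based procedure compares the observed values against, and a larger nperm gives a smoother null estimate at a higher computational cost.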

Software availability
The current version of the bioinformatics tool is available at https://github.com/jjs3098/CNU-Bioinformatics-Lab. The example data used in our paper are provided there as well. A snapshot of the website is given in Figure 6.