This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
New metabolomics applications of ultra-high resolution and accuracy mass spectrometry can provide thousands of detectable isotopologues, with the number of potentially detectable isotopologues increasing exponentially with the number of stable isotopes used in newer isotope tracing methods like stable isotope-resolved metabolomics (SIRM) experiments. This huge increase in usable data requires software capable of correcting the large number of isotopologue peaks resulting from SIRM experiments in a timely manner. We describe the design of a new algorithm and software system capable of handling these high volumes of data, while including quality control methods for maintaining data quality. We validate this new algorithm against a previous single-isotope correction algorithm in a two-step cross-validation. Next, we demonstrate the algorithm by correcting for the effects of natural abundance of both ^{13}C and ^{15}N isotopes on a set of raw isotopologue intensities of UDP-N-acetyl-D-glucosamine derived from a ^{13}C/^{15}N-tracing experiment. Finally, we demonstrate the algorithm on a full omics-level dataset.
Stable isotope tracing has long been used to decipher pathways in cellular metabolism [
However, to be able to properly quantify the relative amount of each isotopologue in a SIRM experiment, the contribution of natural abundance (NA) must be factored out of each isotopologue peak. We previously reported the development of an algorithm specifically tailored for correcting FTMS SIRM isotopologue peaks [
The analysis of high-throughput metabolomics experiments requires the development of an integrated, high-performance system capable of performing the natural abundance correction on thousands of isotopologue peaks in a timely manner. Such development involves many considerations of computational and software architecture and best practices to have a working, easily extensible system. Below we describe the architecture of such a software system, verification of its correctness and its utility for performing natural abundance correction of large numbers of isotopologue peaks on reasonable timescales.
Correcting isotopologue peak intensities from ultra-high resolution FT-MS experiments is accomplished using the previously derived equations from [
As previously described, correction of the isotopologue intensities is performed on each isotopologue peak in turn, and iterated until convergence [
The P- and S-correction terms are generated separately for each element and possible combination of x and i. Equation (2) shows the calculation of the P-correction terms for ^{13}C, while Equation (3) gives the calculation of the S-correction terms.
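As a concrete illustration, the binomial term of Equation (2) can be sketched in a few lines of Python. The function name `p_term` is ours, and `math.comb` stands in here for the interleaved multiplication the released software actually uses (see Section 3.6):

```python
import math

def p_term(x: int, i: int, na: float) -> float:
    """Binomial probability that i of x atoms of an element are the heavier
    natural-abundance isotope (e.g., 13C with na ~= 0.0107).
    Sketch of Equation (2); not the interleaved production implementation."""
    return math.comb(x, i) * na**i * (1.0 - na)**(x - i)

# With 2 atoms and na = 0.5, exactly one heavy atom has probability 0.5.
print(p_term(2, 1, 0.5))
```

Because these are binomial probabilities, the terms for a fixed x sum to one over all i, which is a convenient sanity check on any implementation.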
The Python scripting language [
The software system takes a large collection of peak intensities representing multiple molecular entities (a data collection) and performs the correction on the set of peaks corresponding to the isotopologues of each molecular entity in turn (a dataset). For a given data collection, a configuration must be defined that specifies which columns contain relevant information such as the peak intensities, molecular formulas and isotopologue numbers, as well as the locations of the input and output files. Initialization reads all of the peak data from the input file and generates both P and S lookup tables as caches (see below for a description of caching and the data generated). Correction then corrects each set of isotopologue peaks (see Section 3.4 below for a description of the correction algorithm), using the previously cached values from the P and S lookup tables.
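A minimal configuration might look like the following sketch; the key names, column names, and file format here are hypothetical and only illustrate the kinds of settings described above, not the released software's actual configuration schema:

```python
# Hypothetical configuration for a correction run; every name below is
# illustrative, not the released software's actual schema.
config = {
    "input_file": "peaks.tsv",            # raw isotopologue intensities
    "output_file": "peaks_corrected.tsv", # corrected intensities plus any QC flags
    "columns": {
        "molecular_formula": "Formula",   # e.g., "C17H27N3O17P2" for UDP-GlcNAc
        "isotopologue_count": "C13",      # number of labeled atoms in the peak
        "intensity": "Intensity",         # observed peak intensity
    },
    "labeling_isotopes": ["13C", "15N"],  # one entry per labeling source
}
print(sorted(config["columns"]))
```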
To accommodate a general approach, Equations (2) and (3) are pre-computed and stored in 2D and 1D lookup tables, respectively, which we refer to as “P” and “S” tables. This is useful because the same values of P and S will be used multiple times to correct the peak intensities in a given dataset. However, it is important to note that the values calculated in Equations (2) and (3) depend only on the labeling element’s maximum count for a given molecule, and not on the molecule itself. Therefore, the same pre-computed values can be used for any molecule with the same number of atoms of the natural abundance correction element. Moreover, many experimental datasets include replicate and time series entries for the same molecule, which augments the utility of caching these tables.
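A minimal sketch of such a cache, assuming a plain dictionary keyed by the element's maximum atom count and its natural abundance (the names and `math.comb` shortcut are ours, not the released code's):

```python
import math

_P_CACHE = {}  # keyed by (max atom count, natural abundance)

def p_table(x_max: int, na: float):
    """Return (and cache) the 2D table of Equation (2) values for every
    0 <= i <= x <= x_max. Any molecule with the same number of atoms of
    the labeling element reuses the same cached table."""
    key = (x_max, na)
    if key not in _P_CACHE:
        _P_CACHE[key] = [
            [math.comb(x, i) * na**i * (1.0 - na)**(x - i) for i in range(x + 1)]
            for x in range(x_max + 1)
        ]
    return _P_CACHE[key]

# Two molecules with six carbons each share one cached table (same object).
print(p_table(6, 0.0107) is p_table(6, 0.0107))
```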
The implementation of Equation (1) encompasses a general strategy that is applicable for 1, 2 or 3 labeling sources (see the derivation in the
Procedural diagram of the isotopic natural abundance correction algorithm. Starting with the shape and order of the set of observed isotopologues, the algorithm is initialized, followed by the calculation of the P and S tables or their recovery from a cache. Next, the corrected isotopologue intensities (I
The shape and order of I
To increase maintainability and to reduce the complexity of the code base that implements this algorithm, the proper functions for iteration are determined during NACorrector’s object initialization (in the constructor method). If the algorithm is initialized to operate on datasets with only a single labeling source, the standard Python function “range” is used. However, if the algorithm is initialized to operate on datasets that have multiple labeling sources, the penultimate and ultimate iteration functions replace the use of the range function where appropriate. The P and S lookups used in the algorithm are also tailored to the dimensionality of the datasets undergoing natural abundance correction and operate in tandem with the penultimate and ultimate iteration functions.
Here, if the algorithm is to operate on a multidimensional dataset, the function pointers for “nacrange” and “xnacrange” are replaced with the ultimate and penultimate iteration functions respectively. This occurs only once, during the initialization phase of the algorithm and doing so reduces the logical complexity of the calculations immensely. If the algorithm is initialized for data with only one labeling source (
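The initialization-time function-pointer swap can be sketched as follows. Everything beyond the NACorrector class name is illustrative: `itertools.product` stands in for the ultimate iteration class, and the real classes also handle the penultimate (all-but-last dimension) case and stopping criteria:

```python
import itertools

class NACorrector:
    """Sketch of choosing the iteration function once, at initialization,
    based on the number of labeling sources (attribute names are ours)."""

    def __init__(self, dims):
        # dims: one (max isotope count + 1) entry per labeling isotope
        self.dims = tuple(dims)
        if len(self.dims) == 1:
            # single labeling source: plain integer indices via range()
            self.nacrange = lambda stop: range(stop)
        else:
            # multiple labeling sources: tuple indices over every dimension
            self.nacrange = self._ultimate_iter

    @staticmethod
    def _ultimate_iter(stop):
        # yield every tuple index strictly below `stop` in each dimension
        return itertools.product(*(range(s) for s in stop))

single = NACorrector([5])
multi = NACorrector([2, 3])
print(list(single.nacrange(3)))      # integer indices
print(list(multi.nacrange((2, 2))))  # tuple indices
```

Because the choice is made once in the constructor, the inner correction loops contain no per-iteration branching on dimensionality.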
The PyNAC (black square) module encompasses all classes and functions related to our implementation, however only the Core submodule (green) implements the actual algorithm. NACorrector is the actual algorithm class, and it is supported by NAProduct and NASumProduct, which are classes that represent the P and S lookup tables respectively. PenultimateNACIter and UltimateNACIter are special iteration classes that return tuple indices describing a location in a multidimensional array. Each takes a stopping criterion: a tuple of length
Algorithm generalization and class relations in the modularization of the code. (
Included in the correction analysis are several data quality control measures. First, the data read from an input file are checked to ensure that each peak conforms predictably to the specifications of the configuration. These checks include ensuring that the isotope count for a given peak does not exceed the maximum number of atoms of that element specified by the peak’s molecular formula. Second, if two peaks with the exact same isotopic composition are found to belong to the same dataset, the second peak is flagged as a duplicate. If a particular peak fails one of these checks, the specific error message related to it is appended at the end of its row in the output file. Generally, these errors occur when the correction analysis has been misconfigured; however, they could also occur in data files that have been corrupted.
In addition to these basic checks, the correction analysis also allows for the configuration of a predicted peak inclusion threshold. This threshold can be defined as a percentage of the minimum, maximum, or average peak intensity, either for the entire data collection or for each dataset individually. If a peak is predicted at or above this threshold value but not observed in the original data, the peak is added to the output file with special notation to alert researchers that the peak was predicted above the specified threshold but not observed in the data file they supplied. The inclusion of these predicted peaks is an important secondary check for researchers. Peak identification must be carried out before natural abundance correction. However, alerting the researcher that significant peaks are predicted but missing from the data collection can ensure better data quality. If, for example, many peaks are predicted but missing across many datasets in the data collection, a researcher may re-evaluate her methods of peak identification and subsequently go back to the raw FT-MS data to either identify the missing peaks manually or relax the restrictions for the software identifying the peaks.
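The two basic checks can be sketched as follows. This is an illustrative version only: the formula parsing is deliberately naive (it would confuse C with Cl, for instance), the function name is ours, and the released software records errors per output row rather than returning them from a helper:

```python
import re

def check_peak(formula: str, element: str, isotope_count: int, seen: set):
    """Return a list of QC error messages for one peak (sketch only)."""
    errors = []
    # Check 1: the labeled-atom count may not exceed the number of atoms of
    # that element in the molecular formula. Naive parse: first occurrence of
    # the element symbol followed by an optional count (no two-letter symbols).
    match = re.search(rf"{element}(\d*)", formula)
    max_atoms = int(match.group(1) or 1) if match else 0
    if isotope_count > max_atoms:
        errors.append(f"isotope count {isotope_count} exceeds {element}{max_atoms}")
    # Check 2: two peaks with identical isotopic composition in one dataset
    # make the second a duplicate.
    key = (formula, isotope_count)
    if key in seen:
        errors.append("duplicate peak")
    seen.add(key)
    return errors

seen = set()
print(check_peak("C17H27N3O17P2", "C", 18, seen))  # exceeds the 17 carbons
print(check_peak("C17H27N3O17P2", "C", 5, seen))   # passes both checks
print(check_peak("C17H27N3O17P2", "C", 5, seen))   # flagged as duplicate
```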
The calculation of the P correction terms is implemented as an interleaving for-loop constructed to emulate a full expansion of the binomial term and the exponents in Equation (2), while mitigating some of the effects of multiplying very large and very small double-precision values together. To verify that the interleaving (org) does in fact mitigate these types of errors, alternative methods for calculating the P correction terms were tested using: (i) factorials from Python’s math module (choose); (ii) the “comb” function from SciPy, which is an “exact” multiplicative calculation (comb); (iii) the “log-gamma” function from SciPy (comb2); and (iv) a log10 version of the algorithm (logReal). P correction terms for the full range of n and k using an iMax of 500 and the natural abundance of deuterium (0.00015) were generated using each of the methods, and relative differences between all of the methods were calculated.
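The alternative calculations can be reproduced in miniature; the function names below are ours, with the standard library's `math.comb` standing in for the exact multiplicative calculation and `math.lgamma` for the log-gamma version, and modest n and k chosen so none of the methods underflows:

```python
import math

def p_interleaved(x: int, i: int, na: float) -> float:
    """Interleave binomial-coefficient factors with probability factors so
    intermediate products stay moderate (sketch of the `org` approach)."""
    value = 1.0
    for k in range(i):
        value *= (x - k) / (k + 1) * na  # one coefficient factor, one na factor
    for _ in range(x - i):
        value *= 1.0 - na                # remaining (1 - na) factors
    return value

def p_direct(x: int, i: int, na: float) -> float:
    """Exact combinatorial calculation (the `comb`/`choose` style)."""
    return math.comb(x, i) * na**i * (1.0 - na)**(x - i)

def p_loggamma(x: int, i: int, na: float) -> float:
    """Log-gamma calculation (the `comb2` style)."""
    log_p = (math.lgamma(x + 1) - math.lgamma(i + 1) - math.lgamma(x - i + 1)
             + i * math.log(na) + (x - i) * math.log(1.0 - na))
    return math.exp(log_p)

# Deuterium natural abundance, as in the verification described above.
na_2H = 0.00015
a, b, c = (f(50, 10, na_2H) for f in (p_interleaved, p_direct, p_loggamma))
print(max(abs(a - b) / b, abs(a - c) / b))  # relative differences stay tiny
```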
The singly labeled ^{13}C data is from glycerophospholipids separated from crude cell extracts derived from MCF7-LCC2 cells in tissue culture after 24 h of labeling with uniformly labeled ^{13}C-glucose. The doubly labeled ^{13}C/^{15}N data is from polar metabolites separated from crude cell extracts derived from MCF7-LCC2 cells in tissue culture after 24 h of labeling with uniformly labeled ^{13}C/^{15}N glutamine. Samples were directly infused in positive (glycerophospholipid) and negative (metabolites) ion modes on a hybrid linear ion trap 7T FT-ICR mass spectrometer (Finnigan LTQ FT, Thermo Electron, Bremen, Germany) equipped with a TriVersa NanoMate ion source (Advion BioSciences, Ithaca, NY, USA), with peaks identified as previously described [
We used a progressive approach to cross-validate, in the analytical sense, all parts of both the single- and multi-isotope implementations of the algorithm. First, we performed a cross-validation between the single-isotope Python implementation and the original single-isotope Perl implementation from Moseley, 2010 [
Comparison of the old Perl and new Python single-isotope algorithm implementations using isotopologues of UDP-GlcNAc.
^{13}C Count ^{a} | Intensity ^{b} | Python (New) ^{c} | Perl (Old) ^{d} | Difference
---|---|---|---|---
5 | 187.9 | 214.81 | 214.81 | 2.27 × 10^{−10}
6 | 60.5 | 39.81 | 39.81 | 1.79 × 10^{−11}
7 | 109.8 | 116.15 | 116.15 | 1.78 × 10^{−10}
8 | 418.4 | 449.36 | 449.36 | 3.58 × 10^{−10}
9 | 23.1 | 0 | 0 | 0
10 | 165 | 176.39 | 176.39 | 3.68 × 10^{−10}
11 | 1,438 | 1,523.77 | 1,523.77 | 2.63 × 10^{−9}
12 | 1,215.9 | 1,183.78 | 1,183.78 | 3.59 × 10^{−9}
13 | 4,235.8 | 4,360.57 | 4,360.57 | 3.63 × 10^{−9}
14 | 1,562.5 | 1,420.73 | 1,420.73 | 2.17 × 10^{−9}
15 | 1,253.9 | 1,231.68 | 1,231.68 | 4.81 × 10^{−9}
16 | 175.8 | 149.9 | 149.9 | 4.44 × 10^{−10}
^{a} Zero valued isotopologue intensities have been omitted from the table for the sake of brevity; ^{b} Observed uncorrected isotopologue intensities; ^{c} Corrected intensities using the Python implementation; ^{d} Corrected intensities using the older Perl implementation.
Next, the multi-isotope Python implementation is cross-validated against the single-isotope Python implementation via the creation and use of a simulated multi-isotope isotopologue intensity dataset.
Using the validated addNA function from the single-isotope Python implementation, we added the effects of natural abundance to both a ^{13}C simulated dataset and a ^{15}N simulated dataset, with the results also shown in
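The forward addNA model and its iterative inversion can be sketched as follows, assuming the standard binomial model of natural abundance. The function names are ours and this is not the published implementation; under this model the leading ^{15}N entry 0.5 becomes ≈0.489, consistent with the 0.4890 value in the simulated ^{15}N addNA results:

```python
import math

def add_na(intensities, n_atoms, na):
    """Forward model of addNA (sketch): an isotopologue with j labeled atoms
    leaves (n_atoms - j) positions free to pick up the heavy isotope at
    natural abundance `na`, spreading its intensity binomially upward."""
    observed = [0.0] * (n_atoms + 1)
    for j, ij in enumerate(intensities):
        free = n_atoms - j
        for k in range(free + 1):
            observed[j + k] += ij * math.comb(free, k) * na**k * (1.0 - na)**(free - k)
    return observed

def correct_na(observed, n_atoms, na, tol=1e-12, max_iter=500):
    """Iteratively remove the natural-abundance contribution until convergence,
    clamping negatives to zero (a sketch of the single-isotope correction
    loop, not the released implementation)."""
    corrected = list(observed)
    for _ in range(max_iter):
        predicted = add_na(corrected, n_atoms, na)
        updated = [max(c + (o - p), 0.0)
                   for c, o, p in zip(corrected, observed, predicted)]
        if max(abs(u - c) for u, c in zip(updated, corrected)) < tol:
            return updated
        corrected = updated
    return corrected

# Round trip: adding natural abundance and then correcting recovers the input.
truth = [0.5, 0.0, 0.0, 0.1, 0.0, 0.0, 0.4]      # simulated 15N isotopologues
observed = add_na(truth, 6, 0.00364)              # 15N natural abundance
recovered = correct_na(observed, 6, 0.00364)
print([round(v, 4) for v in recovered])
```

The round trip is exactly the cross-validation logic described above: any residual difference between `recovered` and `truth` bounds the combined error of the forward and inverse steps.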
Validation of the multi-isotope natural abundance correction algorithm. (
Simulated ^{13}C and ^{15}N single-isotope isotopologue intensity datasets.
^{13}C simulated | 0.5 | 0 | 0 | 0.15 | 0.1 | 0 | 0 | 0 | 0 | 0.25
^{13}C addNA | 0.4523 | 0.0456 | 0.0020 | 0.1403 | 0.1040 | 0.0056 | 1.2 × 10^{−4} | 1.4 × 10^{−6} | 7.6 × 10^{−9} | 0.25
^{15}N simulated | 0.5 | 0 | 0 | 0.1 | 0 | 0 | 0.4 | - | - | -
^{15}N addNA | 0.4890 | 0.0109 | 0.0001 | 0.0989 | 0.0011 | 4 × 10^{−6} | 0.4 | - | - | -
Note: The addNA rows are the results produced by the addNA function from the single-isotope Python implementation. Each row of values is normalized to a sum of 1.
The P correction terms generated using the interleaving method were compared to alternative implementations of Equation (2) (see Methods) using various methods, including the original interleaving (org), factorials (choose), SciPy combinatorials (comb, comb2), and a log-version of the interleaving algorithm (logReal). The full set of pairwise differences is shown in
Maximum differences between each of the methods used to calculate the P correction terms.
| org | comb | comb2 | choose | logReal
---|---|---|---|---|---
org | 0 | −2.36 × 10^{−16} | −5.67 × 10^{−14} | −2.36 × 10^{−16} | −2.36 × 10^{−15}
comb | - | - | −5.66 × 10^{−14} | 0 | −2.25 × 10^{−15}
comb2 | - | - | - | 5.66 × 10^{−14} | 5.48 × 10^{−14}
choose | - | - | - | - | −2.25 × 10^{−15}
Corrected and observed ^{13}C/^{15}N isotopologues of UDP-GlcNAc. Each graph represents a set of ^{13}C-labeled isotopologues with a specific number of ^{15}N nuclei incorporated. I_{M+i,0}, I_{M+i,1}, I_{M+i,2}, and I_{M+i,3} represent 0, 1, 2, and 3 ^{15}N nuclei, respectively. Observed intensities are in red and the isotopic natural abundance corrected intensities are in blue. The calculation of the corrected intensities required 12 iterations of the algorithm.
To test the effect of caching the P and S table calculations on the running time of the software, we ran it both in the default mode, where caching is enabled, and in an alternative mode that forces recalculation of the P and S tables for each dataset of peaks. With caching enabled, the run time averaged 530 s (~9 min); without caching, it averaged 890 s (~15 min). Both runs were performed on an Intel^{®} Xeon X5650 processor running at 2.67 GHz, using a data file of 9,066 different metabolites with an average of 5 isotopologue peaks per metabolite.
Correction for the effects of natural abundance for multiple isotopes simultaneously is both computationally feasible and numerically stable when the raw isotopologues are isotopically resolved and identified. Our algorithm is numerically stable both with respect to increasing isotope incorporation and to increasing dimensionality of the correction due to multiple isotopes. Moreover, these corrections of isotopologue intensities are required before further quantitative analyses can be applied to SIRM experimental datasets, especially for determining metabolic flux. In general, SIRM experiments can generate massive volumes of data in relatively short periods of time. A single experimental dataset may contain well over 100,000 isotopologue intensities and can be collected in as little as five minutes with current FT-ICR mass spectrometers, like the one described in the Methods section. This makes natural abundance isotopic correction a high-throughput computational problem. Fortunately, our current algorithm can correct for natural abundance on this time scale (
The software system is available as a tarball in the supplemental materials, and includes an example of running the software on a file to correct multiple compounds in a single file. The software may also be downloaded from [
We thank Richard Higashi and Pawel Lorkiewicz for support and helpful discussion. This work was supported in part by DOE DE-EM0000197, NIH P20 RR016481S1, NIH 1R01ES022191-01, and NSF 1252893.
The authors declare no conflict of interest.
Supplementary Materials (ZIP, 1100 KB)