A Comprehensive Plasma Metabolomics Dataset for a Cohort of Mouse Knockouts within the International Mouse Phenotyping Consortium

Mouse knockouts facilitate the study of gene functions. Often, multiple abnormal phenotypes are induced when a gene is inactivated. The International Mouse Phenotyping Consortium (IMPC) has generated thousands of mouse knockouts and catalogued their phenotype data. We have acquired metabolomics data from 220 plasma samples from 30 unique mouse gene knockouts and corresponding wildtype mice from the IMPC. To acquire comprehensive metabolomics data, we used liquid chromatography (LC) combined with mass spectrometry (MS) to detect polar and lipophilic compounds in an untargeted approach. We also used targeted methods to measure bile acids, steroids and oxylipins. In addition, we used gas chromatography time-of-flight mass spectrometry (GC-TOFMS) to measure primary metabolites. The metabolomics dataset reports 832 unique structurally identified metabolites from 124 chemical classes as determined by ChemRICH software. The GCMS and LCMS raw data files, intermediate and finalized data matrices, R scripts, annotation databases, and extracted ion chromatograms are provided in this data descriptor. The dataset can be used in subsequent studies to link genetic variants with molecular mechanisms and phenotypes.

Metabolites 2019, 9, 101 2 of 14 known as gene pleiotropy. Similarly, genetic variants have been found to be associated with more than one phenotype in population-level genome-wide association studies (GWAS) [2][3][4][5]. GWAS catalogues such as the database of Genotypes and Phenotypes (dbGaP) have started associating various phenotypes with genetic variants [6], but such associations do not establish causal relationships. Gene functions can be characterized at different biological levels, from metabolite to cellular to whole-body phenotypes.
Here, animal models help chart molecular pathways from genetic variant to phenotype [7]. The International Mouse Phenotyping Consortium (IMPC) is a network of centers with expertise in mouse genetics and phenotyping. The IMPC has established pipelines to generate knockout mice for over 7000 genes and aims to cover all 20,000 protein-coding genes in mice [8,9]. The consortium has also identified mouse models for 360 diseases among the first 3328 mouse knockouts phenotyped [10]. The IMPC uses high-throughput assays to measure phenotypes throughout the life of a knockout mouse and has successfully associated 974 genes with metabolic phenotypes and diseases [11]. Biomedical researchers can access IMPC services to receive specific knockout biospecimens and search associated phenotype data using the mousephenotype.org website. All IMPC-generated data are publicly available at http://www.mousephenotype.org.
Up to 10% of all human genes are involved in operation and regulation of metabolism [12] and it is well known that metabolism is dysregulated in many diseases. Several genes have well-characterized metabolic phenotypes that can be detailed by associating changes in metabolite levels (such as high cholesterol or low plasma uric acid) with genetic variants. Currently, the IMPC measures only a few metabolic endpoints such as body mass, plasma triglycerides, glucose tolerance, and basal blood glucose levels, warranting the need to expand their metabolic phenotype spectrum [11]. Over the past 20 years, metabolomics [13][14][15] has achieved an increased breadth and depth of analysis due to advances in sensitivity and accuracy of mass spectrometers, and up to 900 identified metabolites can be measured in blood plasma [16].
In this data descriptor, we provide a comprehensive metabolomics dataset and a phenotype dataset for plasma specimens of 30 mouse knockouts and their strain-matched wild type controls. Data were acquired by integrating three non-targeted assays (on primary metabolism, biogenic amines and complex lipids) with two targeted assays (oxylipins and combined bile acids and steroids), using both GC-TOFMS and different LC-MS protocols.

Data Description
Raw GC-TOFMS and LC-MS mass spectra files are available at the NIH Metabolomics Workbench database (http://metabolomicsworkbench.org) (Accession number ST001154). Processed data matrices for all assays are provided in Table S11. The filtered metabolomics dataset is provided in Table S12. Phenotype data for the mouse strains are provided at (Data citation 10). The data dictionary (Table S13), data matrix (Table S14), and sample metadata (Table S15) are provided in the supplementary section. The mapping of data files to sample labels is provided in Table S10. That file also maps sample labels to IMPC accession IDs, so metabolite and phenotype data can be linked. Analysis sequences for each assay are provided in Table S17 to check for batch effects or systematic error within the datasets. Annotation files are provided in Tables S3-S7. Multiple reaction monitoring (MRM) transitions for the targeted assays are provided in Tables S8 and S9.
To ensure high data quality, the following strategies were adopted while analyzing these samples: (a) use of an internal standards mixture; (b) analysis of quality control blood plasma samples; (c) analysis of blank samples to monitor carry-over and chemical artifacts, including laboratory contaminants; (d) removal of duplicate detections of the same metabolite across metabolomic platforms; (e) signal correction using SERRF normalization for GC-TOFMS data; (f) removal of compounds with > 50% missing values; (g) removal of compounds with > 50% relative standard deviation (RSD) technical variance; (h) use of curated annotation databases to form a target list for peak intensity data processing; and (i) mapping of peaks to compound identifiers and SMILES codes for informatics analyses.
To show the technical reproducibility of the utilized LC-MS assays, RSD for peak heights of the internal standards were computed. Table 1 shows the RSD values for these standards. Median RSDs for the detected compounds were 8% (GCMS), 11.5% (CSH-POS), 13% (CSH-NEG), 12% (HILIC-POS), and 52% (HILIC-NEG). No batch effect was observed from the HILIC-POS, CSH-POS, and CSH-NEG datasets. For HILIC-NEG, four batches were observed, and the signals were corrected using the median-batch normalization, dividing the value for each metabolite by its median value within a batch. Targeted assays utilized a ten-point calibration curve to calculate molar concentrations of the target analytes. Values that did not pass the limit of quantification were not included in the data matrix, leading to many missing values.
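The two calculations described above, the RSD of a set of peak heights and the median-batch normalization, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' R code; the function names, batch labels and peak heights are hypothetical.

```python
from statistics import mean, median, stdev


def rsd_percent(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def median_batch_normalize(intensities, batches):
    """Divide each value of one metabolite by the median of its own batch."""
    by_batch = {}
    for value, batch in zip(intensities, batches):
        by_batch.setdefault(batch, []).append(value)
    medians = {batch: median(vals) for batch, vals in by_batch.items()}
    return [value / medians[batch] for value, batch in zip(intensities, batches)]


# Hypothetical peak heights of one metabolite measured in two batches,
# where the second batch shows a two-fold higher signal level.
heights = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

normalized = median_batch_normalize(heights, batches)
print(round(rsd_percent(heights), 1), round(rsd_percent(normalized), 1))
```

After normalization each batch is centered on 1, so the RSD reflects within-batch variation only.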

IMPC Consortium, Mouse Knockout Selection and Plasma Samples
The International Mouse Phenotyping Consortium (www.mousephenotype.org) provided blood plasma samples for 30 knockout strains (Table 2). For each knockout line, three male and three female mice were selected, and a total of 40 C57BL/6NCrl baseline control wild-type mice were used to match the knockout strains (30 × 6 = 180 knockout samples plus 40 controls, giving 220 plasma samples). Plasma samples were shipped to the West Coast Metabolomics Center (WCMC; http://metabolomics.ucdavis.edu) on dry ice. Samples were stored at −80 °C until analyzed. Each sample was assigned a unique identifier according to the sampling date and time at The Centre for Phenogenomics (TCP) (see Table S15). Twenty additional human pool plasma samples (BioIVT, previously known as BioreclamationIVT) and up to 10 method blanks were analyzed along with the mouse plasma samples for each analytical assay. Mouse knockouts were selected if (1) a plasma sample for the knockout was already available at the IMPC, (2) PubMed literature searches for the gene yielded papers referring to metabolism, and (3) the gene had been assayed in an IMPC proteomics assay.
All experimental procedures on animals received approval from the Animal Care Committee of The Centre for Phenogenomics and were conducted in accordance with the guidelines of the Canadian Council on Animal Care. TCP's approved license numbers are Animal Use Protocol (AUP) 0153, 0275, 0277, and 0279. Additionally, all animal production followed the Animal Research: Reporting of in vivo Experiments (ARRIVE) guidelines within the context of the International Mouse Phenotyping Consortium (IMPC). The human plasma samples were commercially acquired from BioIVT and their use was approved by the Independent Institutional Review Board, Florida. The study identification number for BioIVT plasma samples is 201209942.

Metabolomics Facility
Metabolomics data for the mouse plasma were acquired with seven analytical assays on GC-MS and LC-MS platforms (Table 1). All LC-MS methods were performed using electrospray ionization (ESI). These assays are routinely used to generate metabolomics data at the WCMC for almost 30,000 samples per year, including many blood samples [13,15,17,18]. The WCMC uses large and validated lists of metabolite targets (Tables S3-S9), large mass spectral libraries from the MassBank of North America (MoNA, available at http://massbank.us) to annotate novel compounds, standardized sample preparation and data acquisition methods, robust data processing using the freely available MS-DIAL [19] and SERRF [20] software, the BinBase mass spectral database [21] covering over 150,000 GC-TOFMS samples analyzed over the past 15 years, and a variety of data analysis and interpretation tools, including statistics [22], pathway and network mapping [23], and metabolite enrichment analysis [24]. Figure 1 shows the overview of the metabolomics data generation and quality control workflow.

Figure 1. Overview of the metabolomics data generation and quality control workflow for 220 knockout mouse plasma (KOMP2) samples. Less stringent thresholds for relative standard deviation (RSD) and sample-to-blank ratio were used because knockout studies typically target large effects (effect sizes of two-fold or more). As raw spectra files are provided for this study, a user can re-generate the data matrix with different thresholds. Abbreviations: GCMS, gas chromatography and mass spectrometry; LCMS, liquid chromatography and mass spectrometry; ESI, electrospray ionization.

Gas Chromatography and Mass Spectrometry
Every GC-TOFMS spectrum acquired for blood specimens at the WCMC over the past 15 years has been stored in the BinBase database. The database contains over 150,000 samples, which can be queried through the BinVestigate web GUI (https://binvestigate.fiehnlab.ucdavis.edu/#/) for identified or unknown metabolites that are confidently detected in over 100 tissues and species [21]. The BinBase algorithm [25] utilizes this annotation database to generate a raw result data matrix (Table S11). The current BinBase annotation database is provided in supplementary Table S3 with 1205 annotated spectra for 588 unique compounds detected in biological samples.

Hydrophilic Interaction Liquid Chromatography (HILIC) Mass Spectrometry
A database of target metabolites detected in HILIC-ESI-MS using both positive and negative electrospray modes is provided in supplementary Tables S4 and S5. This target database was generated by searching MS/MS spectra for blood specimens acquired in the past three years against the NIST17 MS/MS, LipidBLAST [26] and MoNA databases, in addition to a specific HILIC retention time MS/MS mass spectral library of 1200 authentic standards [27]. For negative ESI mode, the HILIC-NEG annotation database yielded 107 identified compounds in the mouse plasma dataset presented here by matching mass-to-charge ratio (m/z), retention time (RT), and fragmentation spectra (MS/MS). An additional 45 compounds were annotated by m/z and MS/MS matches, and one compound was annotated by m/z and RT match. The abundance of this one compound was too low to trigger an experimental MS/MS event in data-dependent MS/MS acquisition. For positive ESI mode, the HILIC-POS annotation database yielded 84 compounds annotated by m/z, RT, and MS/MS matching, 86 compounds annotated by m/z and MS/MS data only, and 28 compounds annotated by m/z and RT match.

Charged Surface Hybrid Liquid Chromatography (CSH) and Mass Spectrometry
The CSH database of target mouse plasma lipids for positive and negative electrospray modes is provided in supplementary Tables S6 and S7. The database was generated by searching MS/MS spectra for blood specimens acquired in the past seven years against the NIST17 MS/MS and LipidBLAST mass spectral libraries. The CSH-NEG annotation database contains 215 verified lipids with m/z and MS/MS match; the CSH-POS annotation database contains 304 compounds with validated m/z and MS/MS match.

Sample Preparation
One milliliter of degassed, −20 °C cold solvent mixture of acetonitrile (ACN):isopropanol (IPA):water (H2O) (3:3:2, v/v/v) was added to each 20 µL mouse plasma aliquot. Samples were vortexed for 10 s, shaken for 5 min and then centrifuged for 2 min at 14,000 rcf (relative centrifugal force). Two 450 µL supernatant aliquots were transferred to new tubes. To remove any excess protein, the supernatant was extracted with 500 µL of 1:1 acetonitrile:water, vortexed for 10 s, and centrifuged for 2 min at 14,000 rcf. The supernatant was transferred to a clean tube and then dried down in a CentriVap concentrator. For derivatization, 10 µL of methoxyamine hydrochloride in pyridine (40 mg/mL) was added to each sample, which was then shaken at 30 °C for 90 min. Then 90 µL of N-methyl-N-(trimethylsilyl) trifluoroacetamide (MSTFA, Sigma-Aldrich) was added for trimethylsilylation. C8-C30 fatty acid methyl esters (FAMEs) were added as internal standards (see Supplementary Table S18) for retention time correction. Samples were shaken for 30 min at 37 °C. These derivatized samples were analyzed by GC-MS using a Leco Pegasus IV time-of-flight mass spectrometer. For more details see [28].

Data Acquisition
An Agilent 6890 gas chromatograph was used, equipped with a Gerstel automatic liner exchange system (ALEX), which included a multipurpose sample dual rail and a Gerstel cold injection system (CIS). The CIS temperature program was: 50 °C to 275 °C final temperature at a rate of 12 °C/s, held for 3 min. Injection volume was 0.5 µL with 10 µL/s injection speed. Injection mode was splitless with a purge time of 25 s. The injector liner was changed after every 10 samples. The injection syringe was washed with 10 µL of ethyl acetate before and after each run. An Rtx-5Sil MS column (30 m length, 0.25 mm i.d., 0.25 µm 95% dimethyl 5% diphenyl polysiloxane film) with an additional 10 m integrated guard column was used. The mobile phase was 99.9999% pure helium gas with a flow rate of 1 mL/min. The GC temperature program was: held at 50 °C for 1 min, ramped at 20 °C/min to 330 °C and then held for 5 min. A Leco Pegasus IV time-of-flight mass spectrometer was used to acquire data. The transfer line temperature between gas chromatograph and mass spectrometer was set to 280 °C. Electron ionization at −70 V was employed with an ion-source temperature of 250 °C. Acquisition rate was 17 spectra/s with a scan mass range of 85-500 Dalton (Da).

Data Processing
Raw GC-TOFMS data files were preprocessed directly after data acquisition and stored as ChromaTOF-specific peg files, as generic txt result files and additionally as generic ANDI MS cdf files. ChromaTOF version 4.0 was used for data preprocessing without smoothing, with 3 s peak width, baseline subtraction just above the noise level, and automatic mass spectral deconvolution and peak detection at signal/noise (s/n) levels of 5:1 throughout the chromatogram. Results in .txt format were exported to a data server with absolute spectra intensities and further processed by a filtering algorithm implemented in the metabolomics BinBase database. The BinBase algorithm (rtx5) used the following settings: validity of chromatogram (10^7 counts/s), unbiased retention index marker detection (MS similarity > 800, validity of intensity range for high m/z marker ions), and retention index calculation by 5th order polynomial regression. Spectra were cut to 5% base peak abundance and matched to database entries from most to least abundant spectra using the following matching filters: retention index window ±2000 units (equivalent to about ±2 s retention time); validation of unique ions and apex masses (the unique ion must be included in the apexing masses and present at > 3% of base peak abundance); mass spectrum similarity fitting criteria dependent on peak purity and signal/noise ratios; and a final isomer filter. Failed spectra were automatically entered as new database entries if signal/noise ratios were larger than 25 and mass spectral purity was better than 80%. All thresholds reflect settings for ChromaTOF v. 4.0. Quantification was reported as peak height using the unique ion as default, unless a different quantification ion was manually set in the BinBase administration software BinView. A quantification report table was produced for all database entries that were positively detected in more than 10% of the samples of this mouse knockout study.
A subsequent post-processing module was employed to automatically replace missing values from the .cdf files. Prior to statistical analyses, data were filtered by combining the multiple signals that derivatization reactions can produce for a single metabolite. Metabolic signals were discarded if s/n was < 3 in comparison to blanks, or if replaced values were > 3 times the intensity of truly detected values. Data were normalized using a random forest algorithm-based signal correction method [20] available at http://serrf.fiehnlab.ucdavis.edu.

Sample Preparation
Metabolites were extracted from 20 µL of mouse plasma using 1 mL of degassed, −20 °C cold mixture of ACN:IPA:H2O (3:3:2, v/v/v). Samples were vortexed for 10 s, shaken for 5 min and then centrifuged for 2 min at 14,000 rcf. Two 450 µL supernatant aliquots were transferred to new tubes. One tube was stored as a backup aliquot and the other was dried in a SpeedVac concentrator. Prior to injection, samples were resuspended in 100 µL of ACN:H2O (80:20, v/v) containing deuterium-labeled internal standards (see Supplementary Table S18).

Data Processing
Raw data files were converted to the mzML format using the ProteoWizard MSConvert utility. For each target m/z value, an ion chromatogram was extracted with an m/z threshold of 0.005 Da and a retention time threshold of 0.10 min. The apex of the extracted ion chromatogram was used as the peak height value and exported to a text file. Peak height files for all the samples were merged to generate a data matrix. Targeted peak height signal extraction was performed using an R script, which is provided at the GitHub repository (https://github.com/barupal). HILIC-POS data were not normalized because no batch effect was observed (Supplementary Figure S1). HILIC-NEG data were normalized by the median value for each batch to remove batch effects.
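The targeted extraction step can be illustrated with a short sketch. This is not the authors' R script: it is a simplified Python version that assumes spectra have already been read from mzML into plain tuples, and that the 0.005 Da and 0.10 min thresholds act as symmetric tolerances around the target values. The function name, target m/z and scan data are hypothetical.

```python
def eic_apex_height(scans, target_mz, target_rt, mz_tol=0.005, rt_tol=0.10):
    """Return the apex intensity of the extracted ion chromatogram.

    scans: list of (retention_time_min, [(mz, intensity), ...]) tuples.
    Only scans within rt_tol of target_rt and ions within mz_tol of
    target_mz are considered. Returns None if no signal is found.
    """
    apex = None
    for rt, peaks in scans:
        if abs(rt - target_rt) > rt_tol:
            continue  # scan lies outside the retention time window
        in_window = [inten for mz, inten in peaks if abs(mz - target_mz) <= mz_tol]
        if in_window:
            point = max(in_window)  # one chromatogram point per scan
            if apex is None or point > apex:
                apex = point
    return apex


# Hypothetical scans around a target at m/z 180.0634 and RT 5.00 min.
scans = [
    (4.95, [(180.0630, 1200.0), (181.0000, 500.0)]),
    (5.00, [(180.0636, 5400.0)]),
    (5.05, [(180.0640, 3100.0), (180.0700, 9999.0)]),  # second ion is outside the m/z window
    (5.30, [(180.0634, 8000.0)]),                      # outside the RT window
]

height = eic_apex_height(scans, target_mz=180.0634, target_rt=5.00)
print(height)
```

Returning None for undetected targets keeps missing values explicit, so they can be counted later when features are filtered by their proportion of missing values.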

Sample Preparation
Lipids were extracted from a 20 µL aliquot of plasma. First, 225 µL of cold methanol (MeOH) containing a mixture of deuterated lipid internal standards (see Supplementary Table S18) was added and samples were vortexed for 10 s. Then 750 µL of methyl tertiary-butyl ether (MTBE) was added. Samples were vortexed for 10 s and shaken for 5 min at 4 °C. Next, 188 µL of water was added and samples were vortexed for 10 s and centrifuged for 2 min at 14,000 rcf. Two 350 µL aliquots from the non-polar layer were prepared. One aliquot was stored at −20 °C as a backup and the other was evaporated to dryness in a SpeedVac. Dried extracts were resuspended in 60 µL of methanol/toluene (9:1, v/v) containing the internal standard 12-[[(cyclohexylamino)carbonyl]amino]-dodecanoic acid (CUDA), which was used as a quality control. Method blanks and pooled human plasma (BioIVT) were prepared along with the study samples to monitor data quality.

Data Processing
Raw data files were converted to the mzML format using the ProteoWizard MSConvert utility. For each target m/z value, an ion chromatogram was extracted with an m/z threshold of 0.005 Da and a retention time threshold of 0.10 min. The apex of the extracted ion chromatogram was used as the peak height value and exported to a txt file. Peak height files for all the samples were merged to generate a data matrix. Targeted peak height signal extraction was performed using an R script that is available at https://github.com/barupal. Extracted ion chromatograms for each peak were saved as pictures. CSH-POS and CSH-NEG data matrices were generated. No normalization was applied, as minimal signal drift was observed during analysis (Supplementary Figure S1).
For the targeted assays, extracts were analyzed by liquid chromatography (Waters ACQUITY UPLC I-Class system) coupled to a Sciex 6500+ QTRAP hybrid triple quadrupole linear ion trap mass spectrometer. A volume of 5 µL of each extract was injected. Scheduled multiple reaction monitoring (MRM) was performed with optimized collision energies, de-clustering potentials, and collision cell exit potentials for each analyte. One LC-MRM targeted method was used to analyze both bile acids and steroids with positive and negative polarity switching. Oxylipins were analyzed with a separate LC-MRM method in negative ionization mode only. All analytes were quantified against 6-point calibration curves using internal standards. Turbo Spray Ion Source parameters were: curtain gas (CUR) 25 psi, nebulizer gas (GS1) 50 psi, turbo-gas (GS2) 50 psi, electrospray voltage −4.5 kV/+3 kV, and source temperature 525 °C. Nitrogen was used as the collision gas. Analyst 1.6.3 and MultiQuant 3.0.2 software (AB Sciex) were used for data acquisition and quantification. MRM transitions for the analytes are provided in supplementary Tables S8 and S9.

Data Processing
MultiQuant version 3.0.2 was used for peak integration and peak area computation. Peak integration settings were: Gaussian smooth width at 1.0 points, retention half window at 10-15 s, the update expected RT option set to 'NO', minimum peak width at 8 points, minimum peak height at 750, noise at 40%, baseline subwindow at 1.7 min and peak splitting at 3 points. MultiQuant was also used to compute the molar concentrations of the analytes from calibration curves created using internal standards, as described in the supplementary file (Tables S8 and S9).
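The principle of internal-standard calibration can be sketched in a few lines. This does not reproduce MultiQuant's implementation: it assumes an unweighted linear fit of the analyte/internal-standard area ratio against concentration, and treats the lowest calibrator as the limit of quantification. The calibration levels, peak areas and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantify(analyte_area, istd_area, slope, intercept, loq):
    """Convert a peak area ratio to a concentration; None if below the LOQ."""
    concentration = (analyte_area / istd_area - intercept) / slope
    return concentration if concentration >= loq else None


# Hypothetical 6-point calibration: concentrations (nM) and area ratios
# (analyte peak area divided by internal standard peak area).
cal_conc = [1.0, 5.0, 10.0, 50.0, 100.0, 500.0]
cal_ratio = [0.02, 0.10, 0.20, 1.00, 2.00, 10.00]

slope, intercept = fit_line(cal_conc, cal_ratio)
loq = min(cal_conc)  # assumption: lowest calibrator taken as the LOQ

conc = quantify(5000.0, 10000.0, slope, intercept, loq)       # area ratio 0.5
below_loq = quantify(100.0, 10000.0, slope, intercept, loq)   # area ratio 0.01
print(conc, below_loq)
```

Values that fall below the limit of quantification are returned as None rather than as a number, mirroring how such values appear as missing entries in the targeted data matrices.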

Data Merging and Filtering
Data matrices from each platform were combined to generate a joint dataset for all the samples. It contained a total of 1215 signals of identified metabolites (Table S12). Afterwards, signals were retained if the technical relative standard deviation (RSD) was below 50% and if fewer than 50% missing values were observed (Table S13). The median RSD for compounds in QC samples was less than 20% for all assays except the HILIC-NEG mode data. Overall, up to 70% of compounds had a QC RSD of less than 20% across all assays. A majority of labelled internal standards showed a relative standard deviation of less than 20% in LC-MS assays (Table 3). We note that in gene knockout experiments investigators are usually interested in effect sizes of two-fold or more, so a 50% RSD threshold should not compromise statistical power when large effect sizes are sought. For metabolites that were detected on multiple platforms, the data with the lowest relative standard deviation in the quality control samples were retained. The filtered dataset had 832 metabolites (Table S14). The simplified molecular-input line-entry system (SMILES) codes for all annotated lipids were obtained from the LipidBlast MSP file or from the PubChem Compound Identifier Exchange service (https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi) and are provided in the data dictionary (Table S13). Chemical classes for the identified compounds were estimated using the ChemRICH software. Sample metadata are provided in Table S15.
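The filtering and de-duplication rules above can be expressed compactly. The sketch below is an illustration under stated assumptions, not the authors' script: each feature is a hypothetical dictionary holding its metabolite name, platform, QC intensities and study sample intensities (with None marking a missing value), and the RSD is computed on the QC samples.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def filter_and_deduplicate(features, max_rsd=50.0, max_missing=0.5):
    """Keep features below both thresholds; per metabolite keep the lowest QC RSD.

    Returns a dict mapping metabolite name to the platform that was retained.
    """
    best = {}
    for feature in features:
        samples = feature["samples"]
        missing = sum(value is None for value in samples) / len(samples)
        qc_rsd = rsd_percent(feature["qc"])
        if qc_rsd >= max_rsd or missing >= max_missing:
            continue  # fails the RSD or missing-value threshold
        kept = best.get(feature["metabolite"])
        if kept is None or qc_rsd < kept[0]:
            best[feature["metabolite"]] = (qc_rsd, feature["platform"])
    return {name: platform for name, (_, platform) in best.items()}


# Hypothetical merged feature list from several platforms.
features = [
    {"metabolite": "glucose", "platform": "HILIC-POS",
     "qc": [100.0, 130.0, 70.0], "samples": [10.0, 12.0, 11.0, 9.0]},
    {"metabolite": "glucose", "platform": "GCMS",
     "qc": [100.0, 105.0, 95.0], "samples": [20.0, 22.0, 21.0, 19.0]},
    {"metabolite": "lipid X", "platform": "CSH-POS",
     "qc": [100.0, 10.0, 200.0], "samples": [5.0, 6.0, 7.0, 8.0]},
    {"metabolite": "acid Y", "platform": "HILIC-NEG",
     "qc": [50.0, 52.0, 48.0], "samples": [None, None, None, 3.0]},
]

retained = filter_and_deduplicate(features)
print(retained)
```

Because the raw files are provided, the two thresholds are exposed as parameters so that a stricter or looser matrix can be regenerated.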

Phenotype Dataset
The phenotype dataset for each mouse knockout strain was downloaded from the IMPC database (www.mousephenotype.org) using their R package IMPCData. First, allele accession numbers were matched to the IMPC database identifiers. Then, for each mouse accession, phenotype data were retrieved using the mouse strain identifier and phenotype identifiers (Table S1). The overall phenotype dataset is provided in Table S2.

User Notes
Users can utilize the raw spectra files, processed results, and the merged metabolomics dataset to integrate phenotype and metabolomics data for each knockout strain. Raw spectra files should be used to check the quality of detected peaks and to annotate unknown metabolites with new mass spectral libraries. Raw data files can be converted to mzML format for import into other software such as mzR or MZmine. Proper data transformation and scaling of each assay's data matrix is recommended before performing univariate and multivariate statistical analysis. The dataset is particularly interesting for researchers who focus on the biological functions of the 30 genes studied here, specifically their potential roles in metabolism. We performed a ChemRICH class annotation for the structurally identified compounds and found that almost 80 chemical classes were covered. These chemical groups can be associated with genes and with phenotypes. We foresee this dataset's use in developing next-generation bioinformatics tools, in teaching courses on metabolomics, and as a test case for benchmarking software. As we have provided the annotation databases, mass spectral libraries and protocol details, these resources can be used to create similar datasets for other cohorts of blood plasma specimens.
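As one possible example of the transformation and scaling recommended above, the sketch below applies a log2 transformation followed by autoscaling (mean centering and division by the standard deviation) to a single metabolite. The choice of log2 and unit-variance scaling is an assumption made for illustration; other transformations may suit a given assay better. The function name and intensities are hypothetical.

```python
import math
from statistics import mean, stdev


def log2_autoscale(values):
    """Log2-transform positive intensities, then center to mean 0 and scale to SD 1."""
    logged = [math.log2(value) for value in values]
    center = mean(logged)
    spread = stdev(logged)
    return [(x - center) / spread for x in logged]


# Hypothetical peak heights of one metabolite across five samples.
intensities = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]

scaled = log2_autoscale(intensities)
print([round(x, 3) for x in scaled])
```

Missing values and zeros must be handled before this step, because the logarithm is undefined for them.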