WiPP: Workflow for Improved Peak Picking for Gas Chromatography-Mass Spectrometry (GC-MS) Data

Lack of reliable peak detection impedes automated analysis of large-scale gas chromatography-mass spectrometry (GC-MS) metabolomics datasets. Performance and outcome of individual peak-picking algorithms can differ widely depending on both algorithmic approach and parameters, as well as data acquisition method. Therefore, comparing and contrasting between algorithms is difficult. Here we present a workflow for improved peak picking (WiPP), a parameter optimising, multi-algorithm peak detection for GC-MS metabolomics. WiPP evaluates the quality of detected peaks using a machine learning-based classification scheme based on seven peak classes. The quality information returned by the classifier for each individual peak is merged with results from different peak detection algorithms to create one final high-quality peak set for immediate down-stream analysis. Medium- and low-quality peaks are kept for further inspection. By applying WiPP to standard compound mixes and a complex biological dataset, we demonstrate that peak detection is improved through the novel way to assign peak quality, an automated parameter optimisation, and results in integration across different embedded peak picking algorithms. Furthermore, our approach can provide an impartial performance comparison of different peak picking algorithms. WiPP is freely available on GitHub (https://github.com/bihealth/WiPP) under MIT licence.


Gas Chromatography
Gas chromatographic separation of compounds was performed, as previously described [25], on an Agilent 6890N (Agilent, Santa Clara, CA, USA), equipped with a VF-5ms column of 30 m length (Varian, Palo Alto, CA, USA). The initial temperature of 67.5 °C was held constant for 2 min before heating with a temperature gradient of 5 °C min −1 until 120 °C, followed by a gradient of 7 °C min −1 until 200 °C, followed by the third final gradient of 12 °C min −1 until 320 °C , where the temperature was then held at 320 °C for 6 min. The transfer line was kept at 250 °C throughout. A cold injection system was used with a matching baffled deactivated liner (CIS4, Gerstel, Mülheim an der Ruhr, Germany), operating in split mode (split 1:5, injection volume 1 μL), with the following temperature gradient applied: Hold of the initial temperature of 80 °C for 0.25 min, followed by a temperature increase of 12 °C s −1 to 120 °C, followed by a temperature increase of 7 °C s −1 to 300 °C with a hold time of 2 min.

Sequence Setup
Samples were measured in 10 blocks of decreasing dilution (1:100, 1:10, 1:1) with two washes (containing only MSTFA and retention index standards) in between to counteract possible carryover.

Low Resolution Dataset
Derivatization and gas chromatographic separation were carried out as described above. An MS measurement was performed on a Pegasus 4D-TOF-MS-System (LECO Corp., St. Joseph, MI, USA) with 1 Da mass resolution and −70 eV electron impact ionization and acquisition voltage of 1700 V, complemented with an auto-sampler (MultiPurpose Sampler 2 XL, Gerstel, Mülheim an der Ruhr, Germany). Spectra were recorded in a mass range of 60 m/z-600 m/z with an acquisition rate of 10 scans/s.

High Resolution Dataset
Derivatization and gas chromatographic separation were done as described above. MS measurement was performed on a 7200 Q-TOF (Agilent, Santa Clara, CA, USA) with a mass accuracy of 5 ppm, using −70 eV EI. Spectra were recorded in a mass range of 60 m/z-600 m/z with an acquisition rate of 10 scans/s.

Library Matching
Detected peaks were matched against an in-house library based on the cosine similarity of normalized spectra and the RI difference. The similarity was reduced by a factor corresponding with RI deviation (0.97, 0.95, 0.93, 0.85 for RI differences larger than 1.5, 3, 4, 5, respectively). Scores above 0.9 were considered as matches.

Peak Set Scoring
To score the quality and quantity of detected peaks, both assigned peak classes and number of peaks within each class were considered. Therefore, a class-score was assigned to each of the seven peak classes according to their quality. Correctly detected peaks were rewarded with positive classscores and detected peaks, which were poor quality, were penalized with negative class-scores. The overall score for all detected peaks is the sum of the number of peaks within each class multiplied by the corresponding class-score. The scoring function can be written as: • where i is the class, n is the number of detected peaks within this class, and r is the reward/penalty value for i. Class-scores can be defined in the config file> "optimisation-score", or default values were retained. The class-scores balanced the coverage of compound-related peaks with the detection of poor quality or noise peaks. Naturally, the highest quality, true positive peaks were scored highest (default: 2) and apex shifted peaks were scored second highest (default: 1). All other classes should be scored negatively, as they represented poor quality peaks (default: -2). Wrongly detected peak borders indicated unsuitable parameters, while true peaks, which were incorrectly classified as noise, indicated algorithmic difficulties on the data. Thus, exact class-scores were context-dependent and might vary with instrumentation, sample, and biological question.

Peak Sampling for Training Data Generation
Peak picking algorithms with a range of parameters were run on all training samples. This produced a pool of training peaks which could be used for manual annotation. The peak-pool contains duplicates, as different parameters might result in the selection of the same peak. To minimize the number of duplicates, peaks within the pool were merged based on their retention time in three subsequent steps: First, peaks with the same apex but slightly varying borders; second, peaks sharing the same borders but a slightly varying apex; third, peaks with slightly varying borders and a slightly varying apex. The threshold for the merging steps was user defined and could be adjusted in the config file> merging (default: 0.2).
Subsequently, a user defined number n of peaks (default: 200; config file> training_datageneral) was sampled from the merged-peak-pool. To ensure that peaks were sampled from the whole retention time range, peaks were sorted by retention time and split into 10 equally sized groups. From within each group, 10 peaks were sampled uniformly without replacement. Remaining peaks (if any) were sampled uniformly without replacement independent of the retention time. If the pool contained less than n peaks, all peaks were plotted.

Peak processing for SVM classification
For peak classification via SVM, the original peak data (a N x M intensity matrix, where N is the number of recorded m/z values (extracted-ion chromatograms, hereafter EICs) and M is the number of recorded spectra within the peaks RT) was processed by the following steps: 1. Linear interpolation in the RT dimension. This is necessary because machine learning classification algorithms require input data with a fixed shape, but detected peaks vary in width and, therefore, in shape (the M dimension of the intensity matrix). To affect as few peaks as possible, the average peak width of the training data set was used as the reference width. After this step, the intensity matrix of all peaks was of shape N x O, where O is the number of spectra corresponding with the average peak width.
2. Normalization to [0,1]. By normalization, low intensity peaks did not differ from high intensity peaks anymore, thus, only peak shape and not absolute intensity values were considered.
3. Reordering of the rows (EICs) of the intensity matrix in descending order, based on the maximum normalized intensity value within each row. This step avoided training on particular rows (m/z values).
4. Conversion of the processed intensity matrix to an intensity vector. Machine learning classification algorithms require a vector as an input, therefore, the rows of the intensity matrix were placed into a single row. The intensity vector is of length N*O.
After processing, a peak was represented by a vector of size N*O with values between 0 and 1. These vectors could be placed in a N*O dimensional space.
During training, a SVM segmented this space into X compartments, where X is the number of predefined classes. The segmentation was performed in a way that best fit the classified training data. To classify a new vector, it was simply placed in the N*O dimensional space, and the corresponding compartment was the predicted class. Figure S1. Parameter optimisation results. Each of the cells correspond to a parameter set for the matchFilter algorithm and contain the computed scores (see scoring function in supplementary methods). The optimal parameters for this dataset were assessed to be i) Full width at half maximum height (FWHM) = 2 ii) signal to noise ratio = 0.5. For clarity purposes, only two parameter optimisations are shown here. The grid search may cover more parameter dimensions when three or more parameters are optimised.     Table 2. Parameter set and ranges used to generate the training data. * High resolution data only.

Chromatof parameter
Value Data reduction rate 4 Cut mass range 70-600 Baseline offset 1 (just above the noise) Number of points for smoothing Auto Max number of peaks 600 Signal to noise ratio (S/N) 20