# Partial Least Squares with Structured Output for Modelling the Metabolomics Data Obtained from Complex Experimental Designs: A Study into the Y-Block Coding


## Abstract

**Y** that better reflects the experimental design than simple regression or binary class membership coding commonly used in PLS modelling. The new design of **Y** coding was based on the same principle used by structural modelling in machine learning techniques. Two real metabolomics datasets were used as examples to illustrate how the new **Y** coding can improve the interpretability of the PLS model compared to classic regression/classification coding.

## 1. Introduction

**X** (i.e., the observed data generated by the instruments) to a series of sub-matrices according to the experimental design and perform principal component analysis (PCA) on the decomposed sub-matrices to study the effect of each factor separately. In addition, multi-block models, such as multi-block principal component analysis (MB-PCA) [5], have also been successfully employed to analyse such datasets by repartitioning **X** into blocks according to the experimental design, and then performing MB-PCA on the repartitioned multi-block data [6,7]. Multiple supervised models, mostly based on the well-known partial least squares (PLS) [1,8], have also been proposed in the literature using a similar methodology, such as priority PLS [9], ANOVA-PLS [10], ANOVA-target projection (ANOVA-TP) [11], and multi-block orthogonal PLS [12]. All of these methods have, to date, focused on processing the **X** matrix, where **X** is either re-partitioned into blocks according to the experimental design (multi-block approaches) or decomposed into a series of sub-matrices (ANOVA approaches). However, designing the response matrix **Y** according to the information in the DOE and building a supervised model to fit the designed **Y** may also be an efficient way to analyse the data generated by the DOE. In fact, this type of method has been reported previously, albeit in a rather ad hoc manner [13]. In the present study we aim to investigate such a methodology within the framework of structural modelling and propose a workflow for general use.

**Y** is usually categorised into two types: regression and classification. If the coded output is a series of continuous numbers (e.g., different concentrations of a specific metabolite, time points, temperatures, and so on), these numbers can be used directly as **Y** and the corresponding model is called a regression model (e.g., PLS-R). By contrast, if the target is a number of different groups (classes), such as different types of bacteria or different diseases, then **Y** is normally coded as a binary matrix in which each column represents one distinct group and each row is the target vector of a sample. A sample of a specific class has its element in the corresponding column coded as “1” and all other elements coded as “0”. Regression models are most suitable for modelling a series of continuous or at least ordinal (e.g., ranks) numbers, while classification models are most suitable for discriminating a set of categories “in parallel”; i.e., there is no particular spatial relationship between these categories.
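The binary class membership coding described above can be illustrated with a minimal sketch. The class names are hypothetical, and the `decode` helper, which assigns a predicted row to the column with the largest value, is a common convention assumed here rather than something stated in the text.

```python
def one_hot(labels, classes):
    """Code each label as a binary row with a single 1 in its class column."""
    index = {c: j for j, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return rows


def decode(y_pred_row, classes):
    """Assign a predicted row to the class whose column has the largest value.

    This arg-max rule is an assumption made for illustration.
    """
    best = max(range(len(classes)), key=lambda j: y_pred_row[j])
    return classes[best]


classes = ["A", "B", "C"]                    # hypothetical class names
Y = one_hot(["B", "A", "C", "B"], classes)   # one row per sample
print(Y)
print(decode([0.2, 0.7, 0.1], classes))
```

Each row of `Y` contains exactly one “1”, so the coding carries no information about how the classes relate to one another.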

**Y** and the (usually human) interpretation of the predicted outputs afterwards, this can be easily adapted by any software package which supports PLS.

## 2. Results

#### 2.1. Riboswitch

**Y** vectors in PLS-DA, it is also possible to train two separate PLS-DA models, one for each factor. Thus, we also trained two PLS-DA models, one for strain classification and another for inducer classification. For the PLS-DA model focused on strain classification, the average CCR was 78.51%, which is a significant improvement compared to a full 20-class model, but still slightly worse than the model using structured output PLS. For the inducer condition model the average CCR was 54.9%, which was the worst prediction accuracy among the three types of coding methods investigated.
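As a rough illustration of how an average CCR relates to a confusion matrix, the sketch below takes the mean of the diagonal of a row-normalised matrix. The values are copied from Table 1 (each row sums to about 100%). Whether the CCRs reported in the text were computed exactly this way is not stated, and the calculation assumes equally sized classes.

```python
def average_ccr(confusion_percent):
    """Mean of the diagonal of a row-normalised confusion matrix (in percent).

    Equivalent to the overall correct classification rate only if all
    classes contain the same number of samples (an assumption here).
    """
    diagonal = [row[i] for i, row in enumerate(confusion_percent)]
    return sum(diagonal) / len(diagonal)


# Strain confusion matrix using structured output (values from Table 1).
table1 = [
    [97.20, 0.38, 0.08, 1.73, 0.63],
    [10.03, 71.93, 8.20, 6.23, 3.63],
    [0.00, 2.88, 89.33, 4.83, 2.98],
    [1.20, 8.53, 2.35, 69.10, 18.83],
    [3.35, 12.55, 2.85, 7.78, 73.48],
]
print(round(average_ccr(table1), 2))
```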

#### 2.2. Propranolol

#### 2.3. Significant Metabolites Discovery

## 3. Discussion

**Y** based on the experimental design. After modelling, these can then be interpreted by producing a series of confusion matrices to gain insights into the modelled patterns in the data. While it is also possible to inspect PLS scores to visualise the pattern, this is usually not easy for PLS using a structured output, as the high complexity in **Y** usually requires a large number of latent variables (PLS components) to model it sufficiently. Thus, it is not realistic to expect that the overall pattern can always be well represented by the first few PLS components, and one may be tempted to plot arbitrary latent variables against each other to get the “desired” picture (a practice that is not very objective). Another concern in visualising PLS scores is that they can only present the results of one specific split of the training and test sets while, for robust modelling, it is better to test multiple combinations of training and test sets to get a robust estimation of the errors and avoid over-optimistic results caused by a “lucky” split. Finally, the need to interpret the **Y** predictions rather than the PLS scores has been highlighted in [17,25].

**Y** and interpret the results is an interesting open research question. Finally, we have demonstrated that the variable importance statistics in PLS modelling, such as VIP scores, can also be used for significant metabolite identification. This can be considered a major advantage of PLS compared to more “black-box” machine learning techniques, such as S-SVM or neural networks.

**Y** and the observed **X**, while methods like S-SVM allow a user-defined error to be used directly and optimise the model towards minimising it. This means that, for PLS, the scale of the coded output will influence the final solution, and the results will favour minimising the error of the columns in **Y** that have larger variance. Therefore, if the structured output has multiple blocks, it is important to ensure that different blocks have comparable variance to prevent the block with the largest variation from dominating the results.
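One possible way to give the blocks comparable variance is to rescale each block so that its total variance (the sum of its column variances) equals one. This is a minimal sketch of that idea and not necessarily the scaling used in this study; the matrix `Y` and the block layout are made up for illustration.

```python
from statistics import pvariance


def block_variance(Y, cols):
    """Total variance of a block: the sum of its column variances."""
    return sum(pvariance([row[j] for row in Y]) for j in cols)


def equalise_blocks(Y, blocks):
    """Rescale each block of columns so that its total variance is 1."""
    scaled = [[float(v) for v in row] for row in Y]
    for cols in blocks:
        scale = block_variance(Y, cols) ** 0.5
        if scale == 0:
            continue  # constant block, nothing to rescale
        for row in scaled:
            for j in cols:
                row[j] /= scale
    return scaled


# Hypothetical structured output: a two-column binary block plus one
# continuous column on a much larger scale.
Y = [[1, 0, 0], [1, 0, 10], [0, 1, 20], [0, 1, 30]]
blocks = [[0, 1], [2]]
print([block_variance(Y, cols) for cols in blocks])         # before scaling
Y_scaled = equalise_blocks(Y, blocks)
print([block_variance(Y_scaled, cols) for cols in blocks])  # after scaling
```

Before scaling, the continuous column has a far larger variance than the binary block and would dominate a least-squares fit; after scaling, both blocks contribute equally.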

## 4. Materials and Methods

#### 4.1. Riboswitch Experiment

#### 4.1.1. Materials, Strains, and Culture Conditions

#### 4.1.2. GC-MS Analysis

OD$_{600\,\mathrm{nm}}$ = 0.1, followed by incubation at 37 °C with shaking at 200 rpm for 3 h. Upon reaching OD$_{600\,\mathrm{nm}}$ = 0.5, the samples were exposed to one of the inducing conditions (Table 9), and the incubation temperature was decreased to 20 °C at 200 rpm for 8 h in shaking incubators, giving a total of 11 h of incubation. Fifteen-millilitre samples from each flask were quenched using 30 mL of 60% aqueous methanol (−48 °C) following procedures described in previous studies [26]. The extraction protocol was also adapted from [26], with the exception of the centrifugation speed being set at 15,871× g. All extracts were normalized according to OD$_{600\,\mathrm{nm}}$, followed by combining 100 µL from each of the samples in a new tube to be used as the quality control (QC) sample. One hundred microlitres of internal standard solution (0.2 mg/mL succinic-d$_4$ acid, 0.2 mg/mL glycine-d$_5$, 0.2 mg/mL benzoic-d$_5$ acid, and 0.2 mg/mL lysine-d$_4$) was added to all the samples (including QCs), followed by an overnight drying step using a speed vacuum concentrator (Concentrator 5301, Eppendorf, Cambridge, UK).

#### 4.2. Propranolol Experiment

#### 4.2.1. Materials, Strains, and Culture Conditions

#### 4.2.2. Sample Collection and GC-MS Analysis

$_5$, 0.2 mg/mL benzoic-d$_5$ acid, 0.2 mg/mL lysine-d$_4$, and 0.2 mg/mL succinic-d$_4$ acid) to all samples. The samples were lyophilized for 16 h using a speed vacuum concentrator (Concentrator 5301; Eppendorf, Cambridge, UK), and then the pellet was stored at −80 °C for further analysis.

#### 4.3. PLS Modelling

#### 4.3.1. Structured Output Coding, Error Evaluation and Results Interpretation

#### • Riboswitch Data

#### • Propranolol Data

$^2$, or Q$^2$ [1]. Additionally, because the number of distinct points was very limited (four for dosage and only three for time), a plot of predicted vs. known values could not show a clear monotonic trend either. By assigning the “raw” predicted outputs to the nearest target, a confusion matrix can be calculated; the types of samples with larger misclassification error between each other can be considered more closely related, and vice versa.
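The nearest-target assignment and the resulting confusion matrix can be sketched for a single continuous **Y** column as follows. The coded target levels and the raw predictions are made-up values, and the row normalisation to percentages is an assumption about how the confusion matrices were presented.

```python
def nearest_target(value, targets):
    """Return the coded target level closest to a raw predicted value.

    Ties are resolved in favour of the level listed first.
    """
    return min(targets, key=lambda t: abs(value - t))


def confusion_matrix(known, predicted, targets):
    """Row-normalised confusion matrix in percent; rows are the known targets."""
    index = {t: i for i, t in enumerate(targets)}
    counts = [[0] * len(targets) for _ in targets]
    for k, p in zip(known, predicted):
        counts[index[k]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


targets = [0, 2, 4, 6]                            # hypothetical coded levels
known = [0, 0, 2, 2, 4, 4, 6, 6]
raw = [0.3, -0.4, 2.2, 3.4, 3.9, 4.6, 4.8, 6.5]   # made-up raw predictions
assigned = [nearest_target(v, targets) for v in raw]
cm = confusion_matrix(known, assigned, targets)
for row in cm:
    print(row)
```

Off-diagonal entries concentrated next to the diagonal indicate that neighbouring levels are confused with each other more often than distant ones.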

#### 4.3.2. PLS Modelling

**X** and **Y** matrices (denoted as $\overline{\mathit{x}}$ and $\overline{\mathit{y}}$, respectively) were calculated, recorded, and subtracted from **X** and **Y**, respectively. The PLS model was then built between the mean-centred **X** and **Y**. Then, in the validation or blind test, $\overline{\mathit{x}}$ was subtracted from **X** in the validation/test set and the trained PLS model was applied to calculate the predicted **Y** (denoted as $\widehat{\mathit{Y}}$). The final prediction was $\widehat{\mathit{Y}}$ with $\overline{\mathit{y}}$ added back to it.
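The centring workflow above can be sketched independently of the PLS implementation. In the sketch below, `fit_and_predict` follows the described steps, while `fit_pls1` and `predict_pls1` are a deliberately tiny stand-in (one latent variable, a single **Y** column) and not the multi-component model used in this study; all names and data are hypothetical.

```python
def column_means(M):
    n = len(M)
    return [sum(row[j] for row in M) / n for j in range(len(M[0]))]


def subtract(M, means):
    return [[v - m for v, m in zip(row, means)] for row in M]


def add(M, means):
    return [[v + m for v, m in zip(row, means)] for row in M]


def fit_pls1(Xc, Yc):
    """Stand-in model: one-component PLS for a single, centred Y column."""
    y = [row[0] for row in Yc]
    w = [sum(x[j] * yi for x, yi in zip(Xc, y)) for j in range(len(Xc[0]))]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]                                # weight vector
    t = [sum(xj * wj for xj, wj in zip(x, w)) for x in Xc]   # scores
    q = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return w, q


def predict_pls1(model, Xc):
    w, q = model
    return [[q * sum(xj * wj for xj, wj in zip(x, w))] for x in Xc]


def fit_and_predict(X_train, Y_train, X_test, fit, predict):
    """Centre with the training means, fit, predict, then add the Y mean back."""
    x_bar = column_means(X_train)
    y_bar = column_means(Y_train)
    model = fit(subtract(X_train, x_bar), subtract(Y_train, y_bar))
    Y_hat = predict(model, subtract(X_test, x_bar))  # test set uses training x_bar
    return add(Y_hat, y_bar)


X_train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # made-up data, y = 2 * x1
Y_train = [[2.0], [4.0], [6.0]]
prediction = fit_and_predict(X_train, Y_train, [[4.0, 8.0]], fit_pls1, predict_pls1)
print(prediction)
```

The test samples are centred with the means recorded from the training set, never with their own means, so no information from the test set leaks into the model.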

#### 4.3.3. VIP Scores for Significant Metabolite Identification

**Y**. To simplify the task of inspection, the VIP scores were summarised according to the blocks. This was done by taking the maximum of the VIP scores of the **Y** variables within the block. For the riboswitch dataset, the VIP scores for strain classification were the maximum VIP scores of the first five columns, and those for inducer condition classification were the maximum VIP scores of the last four columns. For the propranolol dataset, the VIP scores for strain classification were the maximum VIP scores of the first three columns; those for dosage and time modelling were the VIP scores of the fourth and fifth columns, respectively.
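The block-wise summary can be sketched as below, assuming the VIP scores are available as a table with one row per **X** variable (metabolite) and one column per **Y** column. The block layout follows the riboswitch description above; the scores themselves are made up.

```python
def summarise_vip(vip, blocks):
    """For each X variable, take the maximum VIP score over the Y columns of a block.

    vip[i][j] is the VIP score of X variable i with respect to Y column j.
    """
    return {
        name: [max(row[j] for j in cols) for row in vip]
        for name, cols in blocks.items()
    }


riboswitch_blocks = {
    "strain": range(0, 5),    # first five Y columns
    "inducer": range(5, 9),   # last four Y columns
}
vip = [  # made-up VIP scores for two X variables
    [0.2, 1.4, 0.3, 0.1, 0.5, 0.9, 0.4, 0.2, 0.1],
    [0.8, 0.6, 0.7, 0.2, 0.3, 0.1, 1.8, 0.5, 0.6],
]
summary = summarise_vip(vip, riboswitch_blocks)
print(summary["strain"])
print(summary["inducer"])
```

For the propranolol layout, the dosage and time blocks each contain a single column, so the maximum reduces to that column's own VIP score.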

## 5. Conclusions

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Brereton, R.G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant; Wiley: New York, NY, USA, 2003.
2. Timmerman, M.E. Multilevel component analysis. Br. J. Math. Stat. Psychol. **2006**, 59, 301–320.
3. Harrington, P.B.; Vieira, N.E.; Espinoza, J.; Nien, J.K.; Romero, R.; Yergey, A.L. Analysis of variance-principal component analysis: A soft tool for proteomic discovery. Anal. Chim. Acta **2005**, 544, 118–127.
4. Smilde, A.K.; Jansen, J.J.; Hoefsloot, H.C.J.; Lamers, R.-J.A.N.; van der Greef, J.; Timmerman, M.E. ANOVA-simultaneous component analysis (ASCA): A new tool for analysing designed metabolomics data. Bioinformatics **2005**, 21, 3043–3048.
5. Smilde, A.K.; Westerhuis, J.A.; de Jong, S. A framework for sequential multiblock component methods. J. Chemometr. **2003**, 17, 323–337.
6. Kassama, Y.; Xu, Y.; Dunn, W.B.; Geukens, N.; Anné, J.; Goodacre, R. Assessment of adaptive focused acoustics versus manual vortex/freeze-thaw for intracellular metabolite extraction from Streptomyces lividans producing recombinant proteins using GC-MS and multiblock principal component analysis. Analyst **2010**, 135, 934–942.
7. Xu, Y.; Cheung, W.; Winder, C.L.; Goodacre, R. VOC-based metabolic profiling for food spoilage detection with the application to detecting Salmonella typhimurium contaminated pork. Anal. Bioanal. Chem. **2010**, 397, 2439–2449.
8. Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemometr. Intell. Lab. **2001**, 58, 109–130.
9. Höskuldsson, A. Experimental design and priority PLS regression. J. Chemometr. **1996**, 10, 637–688.
10. Thissen, U.; Wopereis, S.; van den Berg, S.A.; Bobeldijk, I.; Kleemann, R.; Kooistra, T.; van Dijk, K.W.; van Ommen, B.; Smilde, A.K. Improving the analysis of designed studies by combining statistical modelling with study design information. BMC Bioinform. **2009**, 10, 52–67.
11. Marini, F.; de Beer, D.; Joubert, E.; Walczak, B. Analysis of variance of designed chromatographic data sets: The analysis of variance-target projection approach. J. Chromatogr. A **2015**, 1405, 94–102.
12. Boccard, J.; Rudaz, S. Exploring Omics data from designed experiments using analysis of variance multiblock Orthogonal Partial Least Squares. Anal. Chim. Acta **2016**, 920, 18–28.
13. Martens, M.; Bredie, W.L.P.; Martens, H. Sensory profiling data studied by partial least squares regression. Food Qual. Prefer. **2000**, 11, 147–149.
14. Bakir, G.; Taskar, B.; Hofmann, T.; Schölkopf, B.; Smola, A.; Vishwanathan, S.V.N. Predicting Structured Data; MIT Press: Cambridge, MA, USA, 2007.
15. Tsochantaridis, I.; Joachims, T.; Hofmann, T.; Altun, Y. Large Margin Methods for Structured and Interdependent Output Variables. J. Mach. Learn. Res. **2005**, 6, 1453–1484.
16. Schulz, H.; Behnke, S. Structured Prediction for Object Detection in Deep Neural Networks. In Artificial Neural Networks and Machine Learning – ICANN 2014; Wermter, S., Ed.; Springer: Berlin/Heidelberg, Germany, 2014.
17. Gromski, P.S.; Muhamadali, H.; Ellis, D.I.; Xu, Y.; Correa, E.; Turner, M.L.; Goodacre, R. A tutorial review: Metabolomics and partial least squares-discriminant analysis – A marriage of convenience or a shotgun wedding. Anal. Chim. Acta **2015**, 879, 10–23.
18. Morra, R.; Shankar, J.; Robinson, C.; Halliwell, S.; Butler, L.; Upton, M.; Hay, S.; Micklefield, J.; Dixon, N. Dual transcriptional-translational cascade permits cellular level tuneable expression control. Nucl. Acids Res. **2016**, 44.
19. Muhamadali, H.; Xu, Y.; Morra, R.; Trivedi, D.K.; Rattray, N.J.W.; Dixon, N.; Goodacre, R. Metabolomic analysis of riboswitch containing E. coli recombinant expression system. Mol. Biosyst. **2016**, 12, 350–361.
20. Sayqal, A.; Xu, Y.; Trivedi, D.K.; AlMasoud, N.; Ellis, D.I.; Rattray, N.J.W.; Goodacre, R. Metabolomics analysis reveals the participation of efflux pumps and ornithine in the response of Pseudomonas putida DOT-T1E cells to challenge with propranolol. PLoS ONE **2016**.
21. MTBLS320: Metabolomics Analysis Reveals the Participation of Efflux Pumps and Ornithine in the Response of Pseudomonas putida DOT-T1E Cells to Challenge with Propranolol. Available online: http://www.ebi.ac.uk/metabolights/MTBLS320 (accessed on 26 October 2016).
22. Chong, I.; Jun, C. Performance of some variable selection methods when multicollinearity is present. Chemometr. Intell. Lab. **2005**, 78, 103–112.
23. Sumner, L.W.; Amberg, A.; Barrett, D.; Beger, R.; Beale, M.H.; Daykin, C.; Fan, T.W.-M.; Fiehn, O.; Goodacre, R.; Griffin, J.L.; et al. Proposed minimum reporting standards for chemical analysis. Metabolomics **2007**, 3, 211–221.
24. Currie, F.; Broadhurst, D.I.; Dunn, W.B.; Sellick, C.A.; Goodacre, R. Metabolomics reveals the physiological response of Pseudomonas putida KT2440 (UWC1) after pharmaceutical exposure. Mol. Biosyst. **2016**, 12, 1367–1377.
25. Westerhuis, J.A.; Hoefsloot, H.C.J.; Smit, S.; Vis, D.J.; Smilde, A.K.; van Velzen, E.J.J.; van Duijnhoven, J.P.M.; van Dorsten, F.A. Assessment of PLSDA cross validation. Metabolomics **2008**, 4, 81–89.
26. Winder, C.L.; Dunn, W.B.; Schuler, S.; Broadhurst, D.; Jarvis, R.; Stephens, G.M.; Goodacre, R. Global metabolic profiling of Escherichia coli cultures: An evaluation of methods for quenching and extraction and intracellular metabolites. Anal. Chem. **2008**, 80, 2939–2948.
27. Wedge, D.C.; Allwood, J.W.; Dunn, W.; Vaughan, A.A.; Simpson, K.; Brown, M.; Priest, L.; Blackhall, F.H.; Whetton, A.D.; Dive, C.; et al. Is serum or plasma more appropriate for intersubject comparisons in metabolomics studies? An assessment in patients with small-cell lung cancer. Anal. Chem. **2011**, 83, 6689–6697.
28. Fiehn, O.; Kopka, J.; Trethewey, R.N.; Willmitzer, L. Identification of Uncommon Plant Metabolites Based on Calculation of Elemental Compositions Using Gas Chromatography and Quadrupole Mass Spectrometry. Anal. Chem. **2000**, 72, 3573–3580.
29. Begley, P.; Francis-McIntyre, S.; Dunn, W.B.; Broadhurst, D.I.; Halsall, A.; Tseng, A.; Knowles, J.; Goodacre, R.; Kell, D.B. Development and performance of a GC-TOF-MS analysis for large-scale untargeted metabolomic studies of human serum. Anal. Chem. **2009**, 81, 7038–7046.
30. Dunn, W.B.; Broadhurst, D.; Begley, P.; Zelena, E.; Francis-McIntyre, S.; Anderson, N.; Brown, M.; Knowles, J.D.; Halsall, A.; Haselden, J.N.; et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat. Protoc. **2011**, 6, 1060–1083.
31. Ramos, J.L.; Duque, E.; Huertas, M.J.; Haidour, A. Isolation and expansion of the catabolic potential of a Pseudomonas-putida strain able to grow in the presence of high concentrations of aromatic-hydrocarbons. J. Bacteriol. **1995**, 177, 3911–3916.
32. Ramos, J.L.; Duque, E.; Godoy, P.; Segura, A. Efflux pumps involved in toluene tolerance in Pseudomonas putida DOT-T1E. J. Bacteriol. **1998**, 180, 3323–3329.
33. Rojas, A.; Duque, E.; Mosqueda, G.; Golden, G.; Hurtado, A.; Ramos, J.L.; Segura, A. Three efflux pumps are required to provide efficient tolerance to toluene in Pseudomonas putida DOT-T1E. J. Bacteriol. **2001**, 183, 3967–3973.
34. Biospec/cluster-toolbox-v2.0. Available online: https://github.com/Biospec/cluster-toolbox-v2.0 (accessed on 25 October 2016).
35. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R. Missing value estimation methods for DNA microarrays. Bioinformatics **2001**, 17, 520–525.

**Figure 1.** Structured output coding for the riboswitch set; for example, a wild-type sample under the IPTG inducer condition would be coded as [1 0 0 0 0 0 1 0 0].
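The riboswitch coding in Figure 1 can be reproduced by concatenating one binary block per factor. This is a minimal sketch; the column order within each block is assumed to follow the headers of Tables 1 and 2, which is consistent with the example given in the caption.

```python
# Column order assumed from the headers of Tables 1 and 2.
STRAINS = ["Wild-type", "PET", "EGFP", "iL3EGFP", "iL3PET"]
INDUCERS = ["No inducer", "IPTG", "IPTG + PPDA", "PPDA"]


def binary_block(label, levels):
    """Binary membership vector for one factor."""
    return [1 if level == label else 0 for level in levels]


def structured_code(strain, inducer):
    """Concatenate the strain block (5 columns) and the inducer block (4 columns)."""
    return binary_block(strain, STRAINS) + binary_block(inducer, INDUCERS)


print(structured_code("Wild-type", "IPTG"))  # [1, 0, 0, 0, 0, 0, 1, 0, 0]
```

Compared with coding every strain and inducer combination as its own class, which would need 5 × 4 = 20 columns, the structured output needs only nine and keeps the two factors visible as separate blocks.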

**Figure 2.** Structured output coding for the propranolol set. For example, a sample of S1, D2, and T2 would be coded as [6 0 0 4 6].

**Figure 3.** VIP score plots for the riboswitch data. The variable identifications and their corresponding VIP score values are annotated in the data tips. Note that each metabolite is annotated only once; if a metabolite is significant in both VIP score plots (e.g., variable 27), only the higher score is annotated.

**Figure 4.** VIP score plots for the propranolol data. The variable identifications and their corresponding VIP score values are annotated in the data tips.

**Table 1.** PLS predictions of the riboswitch set. Confusion matrix of strain prediction using structured output.

| | Wild-Type | PET | EGFP | iL3EGFP | iL3PET |
|---|---|---|---|---|---|
| Wild-type | 97.20% | 0.38% | 0.08% | 1.73% | 0.63% |
| PET | 10.03% | 71.93% | 8.20% | 6.23% | 3.63% |
| EGFP | 0.00% | 2.88% | 89.33% | 4.83% | 2.98% |
| iL3EGFP | 1.20% | 8.53% | 2.35% | 69.10% | 18.83% |
| iL3PET | 3.35% | 12.55% | 2.85% | 7.78% | 73.48% |

**Table 2.** PLS predictions of the riboswitch set. Confusion matrix of inducer condition prediction using structured output.

| | No Inducer | IPTG | IPTG + PPDA | PPDA |
|---|---|---|---|---|
| No inducer | 66.62% | 6.56% | 10.92% | 15.90% |
| IPTG | 12.66% | 58.04% | 22.58% | 6.72% |
| IPTG + PPDA | 4.02% | 34.42% | 44.40% | 17.16% |
| PPDA | 19.14% | 6.58% | 10.26% | 64.02% |

**Table 3.** PLS predictions of the riboswitch set. Confusion matrix of strain prediction using binary coding.

| | Wild-Type | PET | EGFP | iL3EGFP | iL3PET |
|---|---|---|---|---|---|
| Wild-type | 79.53% | 4.40% | 1.58% | 9.55% | 4.95% |
| PET | 8.93% | 52.88% | 11.53% | 18.25% | 8.43% |
| EGFP | 0.70% | 8.55% | 83.23% | 3.65% | 3.88% |
| iL3EGFP | 6.53% | 8.28% | 5.60% | 62.08% | 17.53% |
| iL3PET | 7.43% | 10.00% | 10.25% | 6.13% | 66.20% |

**Table 4.** PLS predictions of the riboswitch set. Confusion matrix of inducer condition prediction using binary coding.

| | No Inducer | IPTG | IPTG + PPDA | PPDA |
|---|---|---|---|---|
| No inducer | 64.60% | 7.10% | 12.16% | 16.14% |
| IPTG | 10.44% | 59.40% | 22.50% | 7.66% |
| IPTG + PPDA | 7.58% | 34.44% | 43.46% | 14.52% |
| PPDA | 15.38% | 5.00% | 9.60% | 70.02% |

| | S1 | S2 | S3 |
|---|---|---|---|
| S1 | 78.03% | 21.26% | 0.71% |
| S2 | 10.40% | 87.67% | 1.93% |
| S3 | 4.78% | 3.22% | 92.00% |

**Table 6.** PLS predictions for the propranolol set. Confusion matrix for dosages of propranolol prediction.

| | D0 | D1 | D2 | D3 |
|---|---|---|---|---|
| D0 | 99.08% | 0.91% | 0.01% | 0% |
| D1 | 0.20% | 64.45% | 35.35% | 0% |
| D2 | 0% | 0.05% | 93.23% | 6.72% |
| D3 | 1.67% | 0.10% | 39.35% | 58.88% |

| | T0 | T1 | T2 |
|---|---|---|---|
| T0 | 30.13% | 55.57% | 14.30% |
| T1 | 15.60% | 58.35% | 26.05% |
| T2 | 1.48% | 27.15% | 71.37% |

**Table 8.** PLS predictions for the propranolol set. Confusion matrix for time point prediction using an evenly spaced coding.

| | T0 | T1 | T2 |
|---|---|---|---|
| T0 | 20.64% | 70.40% | 8.96% |
| T1 | 7.15% | 67.78% | 25.07% |
| T2 | 0.78% | 35.72% | 63.50% |

| Inducer Compound | Final Concentration |
|---|---|
| 0.9% NaCl solution (control, no inducer) | - |
| IPTG (lac inducer) | 50 μM |
| PPDA (riboswitch inducer ligand) | 200 μM |
| IPTG + PPDA | 50 μM + 200 μM |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).


Xu, Y.; Muhamadali, H.; Sayqal, A.; Dixon, N.; Goodacre, R. Partial Least Squares with Structured Output for Modelling the Metabolomics Data Obtained from Complex Experimental Designs: A Study into the *Y*-Block Coding. *Metabolites* **2016**, *6*, 38.
https://doi.org/10.3390/metabo6040038
