Untargeted Urinary 1H NMR-Based Metabolomic Pattern as a Potential Platform in Breast Cancer Detection

Breast cancer (BC) remains the second leading cause of death among women worldwide. An emerging approach based on the identification of endogenous metabolites (EMs) and the establishment of the metabolomic fingerprint of biological fluids constitutes a new frontier in medical diagnostics and a promising strategy to differentiate cancer patients from healthy individuals. In this work we aimed to establish the urinary metabolomic patterns from 40 BC patients and 38 healthy controls (CTL) using proton nuclear magnetic resonance spectroscopy (1H-NMR) as a powerful approach to identify a set of BC-specific metabolites which might be employed in the diagnosis of BC. Orthogonal partial least squares-discriminant analysis (OPLS-DA) was applied to a 1H-NMR processed data matrix. Metabolomic patterns distinguished BC from CTL urine samples, suggesting a unique metabolite profile for each investigated group. A total of 10 metabolites exhibited the highest contribution towards discriminating BC patients from healthy controls (variable importance in projection (VIP) >1, p < 0.05). The discrimination efficiency and accuracy of the urinary EMs were ascertained by receiver operating characteristic curve (ROC) analysis that allowed the identification of some metabolites with the highest sensitivities and specificities to discriminate BC patients from healthy controls (e.g. creatine, glycine, trimethylamine N-oxide, and serine). The metabolomic pathway analysis indicated several metabolism pathway disruptions, including amino acid and carbohydrate metabolisms, in BC patients, namely, glycine and butanoate metabolisms. The obtained results support the high throughput potential of NMR-based urinary metabolomics patterns in discriminating BC patients from CTL. Further investigations could unravel novel mechanistic insights into disease pathophysiology, monitor disease recurrence, and predict patient response towards therapy.


Introduction
The global cancer burden is estimated to have risen to 18.1 million new cases and 9.6 million deaths in 2018 (WHO) [1], making cancer the second leading cause of death worldwide. Several genetic and epigenetic factors, including ageing, unhealthy lifestyles (poor diet, tobacco and alcohol consumption), population growth, and the changing prevalence of certain causes of cancer linked to social and economic development, contribute to this sobering fact. Lung and female breast cancers are at the top of the leading cancer types in terms of the number of new cases, with approximately 2.1
The study was approved by the Ethics Committee of Hospital Dr. Nélio Mendonça (Approval no. S.1708625/2017). Written informed consent for the study was obtained from all participants. Each individual (either a patient or a healthy volunteer) provided a sample of morning urine (after overnight fasting) in a 20 mL sterile container. The samples were aliquoted into 4 mL glass vials and frozen at −80 °C until the experiments.
Urine samples were thawed and centrifuged (8000 rpm for 5 min) to remove any suspended cells and other precipitated material [12]. Then, 540 µL of urine was mixed with 60 µL of a buffer solution (KH2PO4, 1.5 M in D2O) containing 0.1% of TSP-d4 (used as chemical shift reference) and sodium azide (NaN3, 2 mM). The pH was adjusted to 7.00 ± 0.02 by adding small amounts of KOD.
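The mixing step above implies a simple dilution. The following minimal sketch works through the arithmetic, assuming that the stated 1.5 M refers to the stock buffer before mixing (the text does not say so explicitly); the function and variable names are illustrative only.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration of a stock component after dilution into the final mixture."""
    return stock_conc * stock_volume_ul / total_volume_ul


urine_ul = 540.0    # urine volume stated in the text
buffer_ul = 60.0    # buffer volume stated in the text
total_ul = urine_ul + buffer_ul

# Assumption: 1.5 M is the phosphate concentration of the stock buffer.
phosphate_final_m = final_concentration(1.5, buffer_ul, total_ul)

# Fraction of the final NMR sample that is urine.
urine_fraction = urine_ul / total_ul
```

Under this assumption the phosphate concentration in the NMR sample is one tenth of the stock value, and urine makes up 90% of the final volume.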

NMR Measurements
NMR spectral acquisition was performed at 300 K using a Bruker Avance II Plus NMR spectrometer equipped with a 400 MHz UltraShield™ 400 Plus magnet. All NMR spectra acquisition and pre-processing were performed under the control of a workstation with TopSpin 3.1 (Bruker BioSpin). Two types of NMR experiments were performed using standard Bruker pulse programs: a 1D 1H spectrum provided quantitative metabolite data for statistical analysis, while 2D HSQC and 2D J-resolved (Jres) experiments assisted in peak assignment and metabolite identification. For each sample, a 1D nuclear Overhauser enhancement spectroscopy (NOESY) pulse sequence (noesypr1d) was used, and solvent signal suppression was achieved by presaturation during the relaxation delay and mixing time (SW 4807.692 Hz, TD 64 K data points, relaxation delay 5 s, 128 scans). The shimming was calibrated automatically. All spectra were processed using a line broadening of 1.0 Hz and were baseline corrected automatically. The NMR spectrum of each sample was aligned with reference to the TSP signal at δ 0.00 ppm. Spectral regions within the range of 0.94 to 10 ppm were analyzed after excluding the sub-region δ 4.55-6.05 ppm to remove variability arising from water suppression and possible cross-relaxation effects on the urea signal via solvent-exchanging protons.
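The spectral window described above can be expressed as a simple filter on the chemical shift axis. The sketch below is illustrative rather than the processing actually used in TopSpin or Chenomx: the function names are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
def keep_chemical_shift(ppm, low=0.94, high=10.0, excluded=(4.55, 6.05)):
    """Return True if a chemical shift lies inside the analyzed window
    (0.94-10 ppm) and outside the excluded water/urea region (4.55-6.05 ppm).
    Inclusive boundaries are an assumption, not stated in the text."""
    in_window = low <= ppm <= high
    in_excluded = excluded[0] <= ppm <= excluded[1]
    return in_window and not in_excluded


def select_region(ppm_axis, intensities):
    """Keep only the (ppm, intensity) pairs that pass the window filter."""
    return [(p, i) for p, i in zip(ppm_axis, intensities) if keep_chemical_shift(p)]
```

For example, a point at 3.03 ppm is kept, whereas points at 4.70 ppm (water region) or 0.00 ppm (TSP reference) are discarded.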
As is well known, the TSP signal may be affected by proteins or other macromolecules present in samples [41]. For that reason, a centrifugation step was included in the preparation of urine samples before NMR analysis in order to remove any proteins present in the samples. The identification of metabolites was accomplished using the Chenomx NMR Suite 8.2 (Chenomx Inc., Alberta, Canada), and relative concentrations (in mM) of metabolites were determined using the 400 MHz library from the same software, which compares the integral of a known reference signal (TSP) with signals derived from a library of compounds containing chemical shifts and peak multiplicities. In addition, the identification of selected metabolites was cross-checked against the Human Metabolome Database (HMDB) [42] and the literature [43]. For metabolites that were not available in the library, identification was accomplished by running a standard solution, and the relative concentration was calculated manually. This software allows not only the identification of compounds but also their quantification based on advanced algorithms, making it a very straightforward tool for analyzing NMR spectra.

Statistical Analysis
Statistical analyses were performed using the web server MetaboAnalyst 3.0 [44], where sample-specific normalization allows the manual adjustment of relative concentrations based on biological inputs (i.e., volume, mass) and row-wise normalization allows a general-purpose adjustment for differences among samples. Data transformation and scaling were used to make features more comparable: the data were cubic root transformed and mean-centered. Intensities in each spectrum were normalized by the sum to reduce the contribution of urine dilution. Then, multivariate statistical analyses, namely PCA, PLS-DA, and OPLS-DA, were applied to the urinary metabolomic profile dataset to provide insights into the separation between the groups. Furthermore, hierarchical cluster analysis by k-means of the two groups in the study was performed, and Pearson's correlation was used to generate a heat map of the metabolites to identify clustering patterns. Moreover, ROC curves were obtained to verify which metabolites had the highest sensitivity/specificity for a BC diagnosis. Finally, the metabolites were used for metabolic pathway analysis to identify the most relevant metabolic pathways involved in the BC and CTL groups.
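The pre-processing described above (normalization by sum, cubic root transformation, and mean-centering) can be sketched in a few lines of plain Python. This is an illustrative sketch and not MetaboAnalyst's own code; the order of operations (normalize, transform, then center) and the function names are assumptions.

```python
def normalize_by_sum(row):
    """Row-wise normalization: divide each value by the sample total,
    which removes overall dilution differences between urine samples."""
    total = sum(row)
    return [v / total for v in row]


def cube_root(row):
    """Cubic root transformation (assumes non-negative concentrations)."""
    return [v ** (1.0 / 3.0) for v in row]


def mean_center(matrix):
    """Subtract the column (metabolite) mean from every value."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def preprocess(matrix):
    """Rows are samples, columns are metabolites."""
    transformed = [cube_root(normalize_by_sum(row)) for row in matrix]
    return mean_center(transformed)
```

Two samples that differ only by a dilution factor, such as `[1.0, 3.0]` and `[2.0, 6.0]`, become identical after normalization by sum, so their mean-centered values are all zero.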

Results and Discussion
Urinary Metabolomic Pattern Based on 1H NMR
1H NMR analysis was performed according to the procedure described in the Methods section.
A representative one-dimensional urine 1H NMR spectrum, referenced to TSP (δ 0.00 ppm), from a BC patient is shown in Figure 1, with metabolites indicated based on their chemical shifts. Table 2 lists the identified metabolites as well as their minimum and maximum relative concentrations (mM) for each group and the respective percentage of occurrence (FO). Each sample analysis was performed in triplicate and the relative standard deviation (RSD) was lower than 2%.
Figure 1. Typical 400 MHz urine 1H NMR spectrum from a BC patient, referenced to TSP (δ 0.00 ppm). For peak identification please see Table 2.
For most metabolites, the FO was higher than 90%, with the following exceptions: valine, glutamine, carnitine, trigonelline, 4-cresol sulphate, and hypoxanthine for the BC group; and α-hydroxyisobutyrate, trimethylamine N-oxide, hypoxanthine, and glycine for the CTL group. Regarding relative concentrations (in mM), the highest level was obtained for creatinine, followed by hippurate in the BC group and citrate in the CTL group. In addition, taurine and mannitol presented higher levels in the BC group. It can also be highlighted that the majority of metabolites were down-regulated in the BC group, except formate, α-hydroxybutyrate, hippurate, and phenylalanine, which were up-regulated; these metabolites were also identified in a study by Carrola et al. [12] with lung cancer patients.
Thirty-six metabolites were identified and quantified relative to TSP (Figure 1). The main metabolites identified in urine resulted mainly from tricarboxylate (e.g., citrate, cis-aconitate), methane (e.g., dimethylamine, trimethylamine N-oxide), and amino acid metabolisms (e.g., hippurate, glycine). The most intense signals were obtained from creatinine, creatine, hippurate, citrate, and trimethylamine N-oxide (Figure 1). These metabolites have already been identified in several studies that used urine from cancer patients [12,37,45]. Trimethylamine N-oxide is produced in the liver by flavin-containing monooxygenase 3 (FMO3) from trimethylamine (TMA), which intestinal bacteria generate from dietary quaternary amines such as choline and carnitine; its levels in urine or plasma are used to determine FMO3 deficiency [46][47][48][49][50]. Creatinine is produced via a biological system involving creatine, phosphocreatine, and adenosine triphosphate (ATP), whereas hippurate and citrate are derived from phenylalanine metabolism and the citrate cycle, respectively [51]. The concentration of creatinine is age- and sex-dependent, decreasing with age and varying throughout the day.
Normally, the concentration of creatinine is higher in males than in females, given the increased body mass index [52]. Creatinine production from the muscles is proportional to the total muscle mass and muscle catabolism. In individuals with a relatively low muscle mass, including children, women, and cancer patients, serum creatinine levels are reduced for a given glomerular filtration rate (GFR), which is the flow rate of filtered fluid through the kidney, thus providing information on kidney function [53].

Multivariate Statistical Analysis of Urinary Metabolomic Profile
The first step before performing multivariate statistical analysis was to verify the normal distribution of the urinary metabolomic profile dataset using the Kolmogorov-Smirnov (K-S) test (Table 2). Metabolites that exhibited a normal distribution within each assigned group (those with p-values > 0.05 in the K-S column) were tested using the t-test to compare the means of the two groups (BC and CTL). For those that rejected the normality assumption (p-values < 0.05), the Mann-Whitney-Wilcoxon test was applied. The corresponding p-values associated with these tests are presented in a mean comparison column in Table 2. Furthermore, to obtain a reliable dataset for multivariate analysis, the metabolites that had an FO < 90% were excluded. The dataset used to perform the statistical analysis also excluded creatinine because, as mentioned above, its concentration depends on age, gender, and disease status, decreasing with age and varying throughout the day. Given the relative concentrations obtained for creatinine and their differences between the groups under study, this metabolite might be considered a potential artefact, since creatinine generation may not simply be a product of muscle mass but may also be influenced by muscle function, muscle composition, activity, diet, and health status [54]. Based on this, the dataset composed of 33 metabolites and 70 samples (32 BC and 38 CTL) that fulfilled these conditions was subjected to principal component analysis (PCA). PCA, as an unsupervised method, was performed to visualize the similarities and differences between the urine sample profiles of the groups in this study. In this step, the samples were analyzed individually, i.e., without classification according to the groups.
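The FO-based exclusion described above can be sketched as follows. This is a hedged illustration: encoding an undetected metabolite as `None` (or a non-positive value) and requiring the 90% threshold to hold in every group are assumptions, and the function names are hypothetical.

```python
def frequency_of_occurrence(values):
    """Percentage of samples in which a metabolite was detected.
    Assumption: undetected values are stored as None or a non-positive number."""
    detected = sum(1 for v in values if v is not None and v > 0)
    return 100.0 * detected / len(values)


def filter_by_fo(table, labels, threshold=90.0):
    """Keep metabolites whose FO reaches the threshold in every group.

    table  : dict mapping metabolite name -> list of concentrations (one per sample)
    labels : list of group labels (e.g., "BC" or "CTL"), one per sample
    """
    kept = {}
    for name, values in table.items():
        fo_per_group = []
        for group in sorted(set(labels)):
            group_values = [v for v, g in zip(values, labels) if g == group]
            fo_per_group.append(frequency_of_occurrence(group_values))
        if min(fo_per_group) >= threshold:
            kept[name] = values
    return kept
```

With two BC and two CTL samples, a metabolite missing from one BC sample has an FO of 50% in that group and would be dropped, while a metabolite detected in all samples is retained.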
A PCA score plot and loading plot from urine samples are presented in Supplementary Figure S1a,b. Although the projection of the variance between samples was performed without classification, the urine samples from BC patients and those from CTL showed a tendency to form two clusters along the first principal component (PC1), which explains 54.6% of the total variance. Most of the metabolites contributed strongly to the variance projection of the samples. Then, partial least squares-discriminant analysis (PLS-DA) was used as a supervised method to maximize the separation between the groups, and it demonstrated that the samples tended to be grouped according to the health condition of the subject (BC or CTL) through their variance/covariance along the first component. Ten differentially expressed metabolites with a variable importance in projection (VIP) score greater than 1 were identified: creatine, glycine, serine, dimethylamine, trimethylamine N-oxide, α-hydroxyisobutyrate, mannitol, glutamine, cis-aconitate, and trigonelline (Figure 2a-d).
Many of these metabolites have already been identified in previous reports on various cancer types, including lung [12,55], breast, ovarian [3,5], bladder [56], and gastric cancers [57]. Additionally, Zhou et al. [58] performed an NMR-based metabonomics study using serum and urine from BC patients, in which citrate, phenylacetylglycine, and guanidoacetate were significant in the discrimination of BC patients from healthy volunteers. In addition, Slupsky et al. [3] carried out a study using urine from breast and ovarian cancer patients to discover metabolites for early diagnosis. The authors found that certain intermediates of the tricarboxylic acid cycle and metabolites relating to energy metabolism, amino acids, and gut microbial metabolism were perturbed. Since amino acids are the raw materials of protein synthesis and products of catabolism in vivo, changes in their composition or concentration can reflect the metabolic status of patients [58]. In addition, Cala et al. [59], studying the urinary and lipid profiles of Hispanic women, identified some of the same metabolites as this study, namely amino acids (valine, alanine, glycine, threonine), whose levels were decreased in BC patients when compared to controls. This might be related to the requirement for amino acids in cancer metabolism in order to facilitate proliferation and cancer progression [60].
Moreover, a heat map was constructed using Pearson's correlation, providing intuitive visualization of the data set. The heat map contained the metabolites and was used to identify samples or features that are unusually high or low (Figure 3). As noted in Figure 3, the higher relative concentrations for most metabolites were found in the CTL group, whereas the lowest relative concentrations were noted in the BC group.
Also, a heat map was generated for the dataset using Pearson's correlation, providing an immediate visualization of data and possible correlations between samples (Figure 3).
Metabolites 2019, 9, x FOR PEER REVIEW
Additionally, orthogonal partial least squares-discriminant analysis (OPLS-DA) was applied to the urinary metabolomic profile dataset to maximize the separation between the CTL and BC groups. Figure 4a,b presents the scores and the loading plots for the OPLS-DA, where it can be observed that a good separation was achieved with 54.8% of total variance. The OPLS-DA tool provides insights into the separation between the groups, demonstrating which variables are responsible for class discrimination. The robustness of the model was tested using a random permutation test with 1000 permutations (Figure 4b). This test yielded an R² (which represents the goodness of fit) of 0.846 and a Q² (which represents the predictive ability) of 0.770, indicating that the model is not overfitted and has a good predictive ability to distinguish between the groups under study.
Moreover, receiver operating characteristic curves (ROCs) were generated for the two groups (CTL-BC) using the 10 identified metabolites with VIP values higher than 1 and are presented in Figure 5a,b.
As noted in the figure, as the number of metabolites increases, the area under the curve (AUC) also increases. Using only 4 metabolites, the AUC value obtained was 0.91 for the CTL-BC comparison, demonstrating a high sensitivity/specificity to distinguish the groups. The metabolites with significance were creatine, glycine, serine, and trimethylamine N-oxide. These results are in accordance with the literature, where Xia et al. [61] report that an AUC value between 0.9 and 1.0 is excellent, and a value between 0.8 and 0.9 is good. By this criterion, the AUC of 0.91 obtained here falls in the excellent range. A greater AUC value indicates a greater ability to distinguish the CTL from the BC group. The AUC can be interpreted as the probability that a randomly selected diseased subject is ranked as more likely to be diseased than a randomly selected healthy subject [61].
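The probabilistic interpretation of the AUC can be illustrated by counting, over all patient/control pairs, how often the patient receives the higher score. The sketch below is a generic illustration of this pairwise definition and not the routine used by MetaboAnalyst; the function name, the example scores, and the convention of counting ties as one half are assumptions.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs in which the case
    has the higher score; ties count as one half."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores, for illustration only (not data from this study).
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

In this toy example, eight of the nine pairs are ordered correctly, so the AUC is 8/9; a value of 1.0 would mean every patient scored higher than every control.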
Moreover, a 10-fold cross validation was used to generate a logistic regression model, and the performance was calculated according to Equation (1):

$$\operatorname{logit}(P) = \log\left(\frac{P}{1 - P}\right) = 0.471 + 1.434\,\text{creatine} + 1.327\,\text{glycine} + 0.25\,\text{serine} + 0.285\,\text{trimethylamine N-oxide} \tag{1}$$

where P is Pr(y = 1|x). The best threshold (or cutoff) for the predicted P was 0.39 (Tables S1 and S2). Figure 6a,b shows the results obtained for the predicted probabilities using the OPLS-DA model and the average of the predictive accuracy for the same model, where it can be observed that the model allowed good classification of samples (>90%). Moreover, 14 samples without known labels were processed together with the ones with known labels in order to obtain the probability of class labels, as presented in Table S3. Most of the cases were classified in their respective group except for CTL49, BA71 and BA1, with a probability score ranging from 0.729 to 0.829 (Table S3).
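To make the logistic model concrete, the sketch below converts the logit into a predicted probability and applies the 0.39 cutoff. The coefficients, intercept, and cutoff are those reported in the text; everything else is an assumption for illustration, in particular that the inputs are the pre-processed (normalized, transformed, and centered) metabolite values and that class 1 is the positive class, since the text does not state which group is coded as y = 1.

```python
import math

# Coefficients, intercept, and cutoff as reported in the text.
INTERCEPT = 0.471
COEFFICIENTS = {
    "creatine": 1.434,
    "glycine": 1.327,
    "serine": 0.25,
    "trimethylamine N-oxide": 0.285,
}
THRESHOLD = 0.39


def predicted_probability(features):
    """Invert the logit: P = 1 / (1 + exp(-logit)).
    `features` maps metabolite name -> pre-processed value (assumed scale)."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))


def classify(features, threshold=THRESHOLD):
    """Return 1 if the predicted probability reaches the cutoff, else 0."""
    return 1 if predicted_probability(features) >= threshold else 0
```

With all four inputs at zero (the mean of centered data), the logit equals the intercept and the predicted probability is above the cutoff; with all four inputs at -1, the logit is 0.471 - 3.296 = -2.825 and the sample falls below it.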
Finally, metabolic pathway analysis was performed to determine which pathways were altered in the groups under study. Figure 7a,b presents the impacted pathways in the CTL-BC groups. The most impacted metabolic pathways were the glycine metabolism, the glutamate metabolism, the butanoate metabolism, glycolysis, the citrate cycle (TCA cycle), the taurine metabolism, and the pyruvate metabolism, indicated by the red and yellow colors. It can also be highlighted that the pathway with the highest impact was the glycine metabolism (Figure 6b). The alterations are related to the downregulation of the tricarboxylic acid (TCA) cycle (e.g., the Warburg effect) and the increased energy demand in tumors [62]. It is well known that cancer cells convert more glucose into lactic acid than normal cells, even in aerobic conditions, leading to a disturbance of the TCA cycle and its intermediates.
Based on these results, a successful differentiation and discrimination of samples was achieved between the CTL and BC groups. The results indicate that the 1H NMR urinary profile represents a useful approach to identifying potential BC biomarkers.

Conclusions
This study assessed the urinary metabolomic profile of BC patients in active and follow-up stages compared with that of healthy volunteers, using 1H NMR combined with multivariate statistical tools (PCA, PLS-DA, and OPLS-DA) applied to the two groups (BC and CTL). Thirty-three metabolites were identified and quantified using Chenomx software. Multivariate statistical analysis revealed that some metabolites were significantly altered in BC patients. Of the metabolites detected, creatine, glycine, serine, dimethylamine, trimethylamine N-oxide, α-hydroxyisobutyrate, mannitol, glutamine, cis-aconitate, and trigonelline exhibited the highest sensitivities and specificities to discriminate BC patients from healthy controls. A plot analysis revealed a metabolomic biosignature comprising an array of several biochemical pathways altered in BC patients. A metabolic pathway analysis indicated that the discriminatory metabolites potentially originated from several dysregulated pathways in BC: the glycine metabolism, the glutamate metabolism, the butanoate metabolism, glycolysis, the citrate cycle (TCA cycle), the taurine metabolism, and the pyruvate metabolism. Moreover, although this was a small cohort without discrimination between BC subtypes, the results obtained were promising, indicating the usefulness of endogenous metabolites for biomarker discovery and the need to investigate the related metabolic pathways in order to improve the diagnostic tools for BC.