Discriminative Analysis of Different Grades of Gaharu (Aquilaria malaccensis Lamk.) via 1H-NMR-Based Metabolomics Using PLS-DA and Random Forests Classification Models

Gaharu (agarwood, Aquilaria malaccensis Lamk.) is a valuable tropical rainforest product traded internationally for its distinctive fragrance. It is popular not only as incense and in perfumery but also in traditional medicine, owing to its sedative, carminative, cardioprotective and analgesic effects. The current study addresses the chemical differences and similarities between commercially obtained gaharu samples of different grades using 1H-NMR-based metabolomics. Two classification models, partial least squares-discriminant analysis (PLS-DA) and Random Forests, were developed to classify the gaharu samples on the basis of their chemical constituents. The gaharu samples could be reclassified into a ‘high grade’ group (samples A, B and D), characterized by high contents of kusunol, jinkohol, and 10-epi-γ-eudesmol; an ‘intermediate grade’ group (samples C, F and G), dominated by fatty acids and vanillic acid; and a ‘low grade’ group (samples E and H), which had higher contents of aquilarone derivatives and phenylethyl chromones. The results show that 1H-NMR-based metabolomics is a potential method for grading the quality of gaharu samples on the basis of their chemical constituents.

Metabolites labeled with numbers were tentatively identified. The spectra are arranged according to the selling price of the gaharu samples, from grade A (the most expensive) through B, C, D, E, F and G to H (the cheapest).
Table 1. Identified metabolites in the 1H-NMR spectra of A. malaccensis gaharu samples.
The representative signals of two phenylethyl chromones, i.e., 6-hydroxy-2-(2-phenylethyl)chromone (6HC, 1) and 6-hydroxy-2-[2-(4-hydroxyphenyl)ethyl]chromone (6DHC, 2), were observed in the 1H-NMR spectra of the gaharu extracts. Compound 1 was identified based on the 1H-NMR signals for an ABX coupling system at δ 7.14 (1H, d, J = 8. [...] Besides the signals for the phenylethylchromones, signals attributable to 5,6,7,8-tetrahydrochromone (14) were also observed at [...] Table S2), supported the tentative identification of the compound as an aquilarone derivative. Minor constituents from the phenolic class of compounds were also identified in the gaharu samples, tentatively assigned as vanillic acid (8), cinnamic acid (9), o-cresol (10), xanthosine (11) and catechol (12) (Table 1).
Sesquiterpenes have been reported to be a major class of compounds present in the resin of A. malaccensis [13]. Yoneda et al. [41] also reported the presence of jinkohol, kusunol, α-agarofuran, 10-epi-γ-eudesmol, and agarospirol in the agarwood oil obtained from A. malaccensis. After detailed analysis and comparison with literature data and online databases, the sesquiterpenoids jinkohol (3) and kusunol (4) [42], agarofuran (5) and epieudesmol (6) [43], and isoeugenol (7) (HMDB, http://www.hmdb.ca/) were also identified in the gaharu extracts in the present study. Further examination of the 1H-NMR spectra also showed the presence of very small amounts of aldehydic compounds (δ 9.32 ppm), as can be seen in Figure 1. However, owing to technical limitations of the present study, the structures of these aldehydes could not be identified.

Discriminative Analysis of Gaharu Samples
The processed 1H-NMR data were initially subjected to principal component analysis (PCA) to examine the differences between the eight groups of gaharu. However, PCA did not show any clear clustering or differences among the gaharu samples, possibly because of the high variability of the different gaharu samples. Partial least squares-discriminant analysis (PLS-DA) was then used to model the relationships between the eight groups of gaharu samples. A permutation test was applied to evaluate the reliability of the model (Supplementary Figure S1). Overall, the PLS-DA model was found to be reliable for the classification. The model did not show over-fitting, based on the Y-axis intercept values of R² = 0.07 and Q² = −0.14 and the fact that the R² line was far from horizontal.
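The Y-axis intercepts quoted above are the kind of value read off a permutation plot. The following is a minimal sketch, assuming the common convention in which the R² (or Q²) of each permuted model is plotted against the correlation between the permuted and original class labels, and a least-squares line is extended to zero correlation. The function name and the numbers in the example are illustrative assumptions and are not taken from the study.

```python
def intercept_at_zero(correlations, scores):
    """Least-squares intercept of scores (R2 or Q2) regressed on correlations.

    correlations: correlation of each (permuted or original) Y with the original Y
    scores:       the R2 or Q2 value of the model fitted to that Y
    """
    n = len(correlations)
    mean_x = sum(correlations) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in correlations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(correlations, scores))
    slope = sxy / sxx
    # Value of the fitted line where the correlation with the original Y is zero.
    return mean_y - slope * mean_x


# Hypothetical permutation results; the last point is the unpermuted model.
example_corr = [0.05, 0.20, 0.35, 0.50, 1.00]
example_r2 = [0.12, 0.22, 0.33, 0.47, 0.88]
print(round(intercept_at_zero(example_corr, example_r2), 3))
```

A low intercept relative to the score of the unpermuted model indicates that models fitted to shuffled labels perform much worse than the real one, which is the pattern the text describes.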
The PLS-DA score plot showed that the eight groups of gaharu samples were differentiated into three distinct clusters (Figure 2a) [...] (9) and aquilarone derivatives (14). The samples A, B and D were marked by higher levels of jinkohol (3), kusunol (4), agarofuran (5), and 10-epi-γ-eudesmol (6), whereas C, F and G were characterized by higher levels of isoeugenol (7), vanillic acid (8), xanthosine (11), catechol (12) and fatty acids (13). A blind test was carried out to evaluate the performance of the PLS-DA model (Supplementary Figure S2). A new batch of gaharu samples (test samples) belonging to low, medium and high grades was analyzed by 1H-NMR and subjected to PLS-DA together with the training set (previous NMR data of the different grades). In the PLS-DA score plot, the new gaharu samples clustered well within the corresponding grades.
To further validate the results obtained from the PLS-DA, the Random Forests classifier was applied to the same 1H-NMR data. In contrast to PLS-DA, the application of Random Forests as a classification model [44] is relatively rare in metabolomics data analysis. Although it is available in free software such as the randomForest package in R [45] and MetaboAnalyst [46], its applicability in metabolomics studies still needs to be explored. Figure 2b shows the Random Forests multi-dimensional scaling (MDS) plot of the proximity matrix. The MDS plot showed the same clustering as was found in the PLS-DA score plot. The accuracy of the models was evaluated using confusion matrices, as shown in Table 2.
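As a reminder of what a confusion matrix summarizes, here is a minimal sketch that tabulates actual against predicted grades and derives the overall accuracy. The function names and the example labels are hypothetical; they do not reproduce Table 2.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical labels for six samples, for illustration only.
grades = ["high", "intermediate", "low"]
actual = ["high", "high", "intermediate", "intermediate", "low", "low"]
predicted = ["high", "high", "intermediate", "low", "low", "low"]
matrix = confusion_matrix(actual, predicted, grades)
for grade, row in zip(grades, matrix):
    print(grade, row)
print("accuracy:", accuracy(matrix))
```

Off-diagonal entries show which grades are confused with each other, which is more informative than the accuracy figure alone when the classes are chemically close.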

Identification of Discriminating Metabolites
The discriminating metabolites were identified from the chemical shifts in the PLS-DA loading plot (Figure 3), and from the VIP (variable importance) values in the Random Forests for each cluster of gaharu samples (Figure 4). In the latter, metabolites with high VIP values are deemed to contribute strongly to the clustering. For the PLS-DA model, the high grade cluster (groups A, B and D) was characterized by higher levels of jinkohol (3), kusunol (4), and 10-epi-γ-eudesmol (6), whereas the intermediate grade cluster (groups C, F and G) contained higher levels of isoeugenol (7), vanillic acid (8), xanthosine (11), catechol (12) and fatty acid (13). The low grade cluster (groups H and E) was distinguished from the other groups by higher levels of aquilarone derivatives (14) and phenylethylchromones 1 and 2. The Random Forests analysis identified essentially the same discriminant metabolites in the established clusters as the PLS-DA model.
Identification of the discriminant metabolites from the PLS-DA and Random Forests models was confirmed by analysis of the variable importance (VIP) values for all clusters, as shown in Supplementary Figure S3a,b, respectively. Furthermore, the relative quantification of the discriminant metabolites in the three clusters (high, intermediate and low grade clusters) was carried out using Tukey post hoc analysis, based on the average peak area of the corresponding 1H-NMR signals (Supplementary Figure S4).
The sesquiterpenoids jinkohol (3), kusunol (4), α-agarofuran (5) and 10-epi-γ-eudesmol (6) are well known volatile constituents that are associated with the fragrance of gaharu [14,26]. In the present study, the high grade gaharu samples were indeed characterized by high levels of jinkohol (3), kusunol (4) and 10-epi-γ-eudesmol (6), in agreement with previous findings [14,15,21]. The higher grade gaharu samples were also observably darker in color than the low grade gaharu samples, which reflected their higher contents of resinous constituents. According to the literature, non-infected A. malaccensis wood is brighter in colour and almost odourless, whereas the infected wood is heavier and dark brown to black in colour [14,15]. On the other hand, chromones have been reported to be the metabolites responsible for the warm, sweet, balsamic and long-lasting odor when gaharu wood is burned or heated [14]. Therefore, the lower grade gaharu samples, which were richer in these chromones, are more suitable for use as incense. Jinkohol (3), the proposed chemical marker for high grade gaharu, also has a distinctive and extremely strong woody smell, which contributes to the suitability of the high grade gaharu extract/resin as a perfume ingredient. Several studies have also reported high levels of agarofuran (5) in gaharu samples of high grade [24-27]. In the present study, however, both the PLS-DA and Random Forests models showed that the levels of this constituent in the different groups were not significantly different.
Interestingly, although PLS-DA and Random Forests are based on different concepts, both models yielded similar results in terms of class plots (score and MDS plots), the chemical constituents characterizing the new group clusters, and the VIP values. Because Random Forests draws random subsets of both the variables and the individual observations, repeating the Random Forests analysis may not always yield exactly the same results, although the results were similar when we reran the analysis. Nevertheless, the Random Forests results may still help to explain uncertainties in the biological system.
1H-NMR-based metabolomics was shown to be effective in classifying A. malaccensis gaharu samples of varying quality, as sampled from the marketplace. Using this approach, it was also possible to propose a new grouping of the samples based on their chemical constituents. Random Forests and PLS-DA were found to be reliable chemometric methods for assessing the differences and similarities among the different gaharu samples. Using the two methods, the gaharu samples analysed in the present study could be reclassified into three groups based on their chemical characteristics. From the identified gaharu constituents, eight metabolites could be proposed as differentiating chemical constituents between the high (jinkohol (3), kusunol (4), and 10-epi-γ-eudesmol (6)), intermediate (fatty acid (13) and vanillic acid (8)) and low (phenylethyl chromones (1 and 2) and aquilarone derivatives (14)) grades of A. malaccensis gaharu. However, these results were based on relative quantification of the metabolites. Further confirmatory analyses are required for the absolute quantification and identification of these chemical constituents.

Samples and Chemicals
Samples of A. malaccensis gaharu in the form of wood chips were purchased from an experienced collector and trader. The samples were of varying quality (Supplementary Table S1) and had been graded and valued (RM/kg) by the collector according to the ABC Agarwood Grading System. The gaharu samples were inspected for authenticity, and the grading was double checked and confirmed by an in-house expert. For the purpose of the present study, the samples were grouped according to their selling price and labeled A to H. Each group of gaharu samples consisted of six replicates. Lab grade methanol (redistilled prior to use), methanol-d4 (CD3OD, 99.8%), KH2PO4, sodium deuterium oxide (NaOD) and deuterium oxide (D2O) (99.8%) were purchased from Merck (Darmstadt, Germany).

1 H-NMR Sample Preparation
Each sample of gaharu wood chips was pulverized into a fine powder using a mortar and pestle and a grinder. To ensure that the wood particles were of uniform size, the ground samples were sieved using a sieve shaker (Retsch) to collect ≤140 µm particles. The sieved samples (1 g) were then extracted with 10 mL lab grade methanol (sample:solvent ratio of 1:6 (w/v)) by sonication (Ultrasonic LC 60H, Elma, Singen, Germany) for 1 h at ambient temperature. The extracts were filtered and the collected filtrates were taken to dryness under vacuum (MiVac, Genevac, Ipswich, UK), followed by lyophilization. All extracts were kept at −80 °C until further analysis.
Samples for NMR measurements were prepared by resuspending 20 mg of each sample extract in 700 µL CD3OD to which 0.5% tetramethylsilane (TMS) had been added as reference standard. The sample-solvent mixtures, contained in 1.5 mL Eppendorf tubes, were then sonicated for 15 min at room temperature to facilitate resolubilization of the extract in the NMR solvent. After centrifuging for a further 15 min at 13,000 rpm, the clear supernatant solution of each sample (700 µL) was transferred into a 5 mm NMR tube for NMR data acquisition.

1 H-NMR Data Acquisition and Data Preprocessing
In total, 48 samples were analyzed (8 groups × 6 replicates). 1H-NMR spectra were recorded at 25 °C on a Unity Inova 500 MHz NMR spectrometer (Varian, Palo Alto, CA, USA) using 128 scans over a proton frequency range of 15 ppm. The PRESAT program was used to suppress undesirable signals caused by residual water. J-resolved and 2D NMR analyses were also performed for structural elucidation of the chemical constituents present in the extract. The generated 1H-NMR FIDs were manually processed, phase corrected and referenced to the internal standard, TMS (δ 0.00 ppm). Baseline correction was applied to all spectra before they were converted to ASCII files and binned using Chenomx software, after which the processed data were saved in an Excel spreadsheet (Microsoft, Washington, DC, USA). The spectra were binned into buckets of width δ 0.04 over the chemical shift region δ 0.5 to δ 10.00 ppm. Water and solvent peaks in the regions δ 4.70-4.96 and δ 3.28-3.33 ppm, respectively, were excluded from the multivariate analysis.
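The binning and exclusion step can be illustrated with a short sketch. It is not the Chenomx procedure itself: it assumes simple summation of intensities within each δ 0.04 bucket and drops any bucket that overlaps the water or solvent region. The function name, the summation rule and the toy spectrum are assumptions made for illustration.

```python
import math


def bin_spectrum(ppm, intensity, start=0.5, stop=10.0, width=0.04,
                 excluded=((4.70, 4.96), (3.28, 3.33))):
    """Sum intensities into fixed-width chemical-shift buckets.

    Returns a list of (bucket_centre_ppm, summed_intensity). Buckets that
    overlap an excluded region (water, residual solvent) are dropped.
    """
    n_bins = math.ceil((stop - start) / width - 1e-9)
    totals = [0.0] * n_bins
    for shift, value in zip(ppm, intensity):
        if start <= shift < stop:
            index = min(int((shift - start) / width), n_bins - 1)
            totals[index] += value
    kept = []
    for i, total in enumerate(totals):
        low = start + i * width
        high = low + width
        overlaps = any(low < hi and high > lo for lo, hi in excluded)
        if not overlaps:
            kept.append((round(low + width / 2, 3), total))
    return kept


# Toy spectrum: five data points, two of which fall in excluded regions.
toy_ppm = [0.51, 0.53, 3.31, 4.80, 9.32]
toy_intensity = [1.0, 2.0, 50.0, 100.0, 5.0]
buckets = bin_spectrum(toy_ppm, toy_intensity)
print(len(buckets), buckets[0])
```

Each spectrum then becomes one row of the data matrix X, with one column per retained bucket.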

Metabolite Assignment
The metabolites were identified by comparing the characteristic peak signals in the 1H-NMR spectra of the samples with published data and existing databases (www.hmdb.ca; Chenomx NMR Suite ver. 7.1, Chenomx, Edmonton, AB, Canada). Identification of compounds was also supported by 2D NMR and LC-MS analysis. Table 1 and Supplementary Table S2 show the identified metabolites in the gaharu samples.

Development of PLS-DA and Random Forests Models
PLS-DA is a supervised classification technique. It optimizes the separation between different groups of samples and links two data matrices, X (the data, i.e., binned spectra) and Y (the groups, i.e., class membership), by maximizing the covariance between X and Y and finding a linear subspace of the explanatory variables [30,47]. The Y-variables are represented by binary 'dummy' variables [30,48]. Data were Pareto-scaled and PLS-DA was carried out using SIMCA-P software (version 12.0, Umetrics, Umeå, Sweden). Random Forests is a tree-based ensemble method in which random subsets are drawn at two levels: from the independent variables at each node, and from the individual observations by bootstrapping. Random Forests can be used for unsupervised and supervised classification as well as regression. In the current study, Random Forests was used for supervised classification, performed using the randomForest R package [45].
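Two of the preprocessing ideas mentioned above, Pareto scaling of X and binary dummy coding of Y, can be sketched in a few lines. This is an illustration under the usual definition of Pareto scaling (mean-centre each variable, then divide by the square root of its standard deviation); it does not reproduce SIMCA-P internals, and the function names are hypothetical.

```python
import math


def pareto_scale(rows):
    """Mean-centre each column and divide by the square root of its
    sample standard deviation."""
    n = len(rows)
    n_cols = len(rows[0])
    scaled = [[0.0] * n_cols for _ in rows]
    for j in range(n_cols):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / (n - 1)
        # sqrt(std) == variance ** 0.25; fall back to 1.0 for constant columns.
        divisor = math.sqrt(math.sqrt(variance)) or 1.0
        for i, x in enumerate(column):
            scaled[i][j] = (x - mean) / divisor
    return scaled


def dummy_matrix(labels):
    """Encode class labels as one binary column per class."""
    classes = sorted(set(labels))
    return classes, [[1 if label == c else 0 for c in classes]
                     for label in labels]


# Toy data: three samples, two binned variables, three class labels.
x = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
print(pareto_scale(x))
print(dummy_matrix(["A", "B", "H"]))
```

Compared with unit-variance scaling, Pareto scaling reduces the dominance of large signals while keeping noise in small signals from being inflated as strongly, which is why it is a frequent choice for NMR bucket data.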

Statistical Analysis
The relative quantification of chemical constituents was based on the mean binned peak height of the related 1H-NMR signals. ANOVA and Tukey's honest multiple comparison tests were conducted to evaluate the significant difference (p < 0.05) between the differentiating metabolites. The statistical analysis was performed using SPSS version 16.0 (SPSS Inc., Chicago, IL, USA) software.
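For readers unfamiliar with the first step of this analysis, the sketch below computes the one-way ANOVA F statistic for one metabolite across grade clusters. It stops at the F statistic: the p-value and the Tukey pairwise comparisons require the F and studentized range distributions, which a statistics library such as SciPy (`scipy.stats.f_oneway`, `scipy.stats.tukey_hsd`) or SPSS, as used in the study, would supply. The function name and the numbers are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical binned peak heights for one metabolite in three clusters.
high = [5.0, 6.0, 7.0]
intermediate = [2.0, 3.0, 4.0]
low = [1.0, 2.0, 3.0]
print(one_way_anova_f([high, intermediate, low]))
```

A large F value indicates that differences between cluster means are large relative to the replicate-to-replicate scatter; the post hoc test then identifies which pairs of clusters differ.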

Conclusions
The study showed that, using 1H-NMR-based metabolomics, it is possible to discriminate between A. malaccensis gaharu samples of different quality. The results provide an insight into the chemical characteristics of gaharu, categorizing the samples into high, intermediate and low grades. Although more extensive work needs to be done, such as applying the analysis to other gaharu-producing species, the information obtained in this study is of importance and contributes towards development of a 'chemical assay' that could make the process of grading gaharu or agarwood samples more accurate, practical and efficient.