UHPLC-Orbitrap-MS Tentative Identification of 51 Oleraceins (Cyclo-Dopa Amides) in Portulaca oleracea L. Cluster Analysis and MS2 Filtering by Mass Difference

Oleraceins are a class of indoline amide glycosides found in Portulaca oleracea L. (Portulacaceae), or purslane. These compounds are characterized by 5,6-dihydroxyindoline-2-carboxylic acid N-acylated with cinnamic acid derivatives, and many are glucosylated. Herein, hydromethanolic extracts of the aerial parts of purslane were subjected to UHPLC-Orbitrap-MS analysis in negative ionization mode. Diagnostic ion filtering (DIF), followed by diagnostic difference filtering (DDF), was used to automatically filter the MS data and select plausible oleracein structures. After an in-depth MS2 analysis, a total of 51 oleracein compounds were tentatively identified. Of these, 26 had structures matching one of the already known oleraceins, and the other 25 were new compounds of the oleracein class, not previously described in the literature. Moreover, clustering algorithms and visualizations based on selected diagnostic fragment ions were applied. As we demonstrate, clustering methods provide valuable insights into the elucidation of the mass fragmentation of natural compounds in complex mixtures.


Introduction
Portulaca oleracea L. (Portulacaceae), or purslane, is a widespread annual plant found in many parts of the world. Purslane is considered an edible vegetable in many areas of Europe, the Mediterranean, and tropical Asian countries, and is added to soups and salads [1][2][3]. Purslane has been used in folk and traditional medicine as a remedy for many ailments [4].
Herein, using UHPLC-Orbitrap-MS in negative ionization mode, we carried out an extensive tentative identification (level 2 annotation) of oleraceins in hydromethanolic extracts from the aerial parts of purslane. The MS2 characterization of oleraceins was limited to compounds with a mass up to 1 kDa, although heavier ones were also detected. After an in-depth MS2 analysis, a total of 51 oleracein compounds were tentatively identified and characterized, of which 26 had structures matching one of the already known oleraceins, and the other 25 were new structures of the oleracein class, not previously described in the literature. Diagnostic ion filtering (DIF) and diagnostic difference filtering (DDF) were used to refine the selection of compounds. Additionally, the oleraceins were clustered based on their MS2 features, and the results are presented with heatmaps, k-means and PAM clustering, principal component analysis (PCA), and hierarchical clustering. As we demonstrate, clustering methods can provide valuable insights into the structure elucidation of natural compounds by mass spectrometry in complex mixtures.

Results and Discussion
A workflow diagram of the study is shown in Figure 1. In summary, a hydromethanolic extract of purslane was obtained and subjected to HR-MS2 analysis. The raw data were filtered by DIF, and then by DDF, to select compounds possessing both specific fragment ions and specific mass differences corresponding to the presence of 5,6-dihydroxyindoline-2-carboxylic acid, a common scaffold for all oleraceins. The filtered MS2 data were then analyzed manually to structurally elucidate the oleracein compounds. In the course of elucidation, 43 fragment ions were selected as diagnostic fragment ions, which were afterwards used to describe every oleracein as a vector of length 43 whose values are the relative percentage intensities of the diagnostic ions. This made it possible to carry out clustering analyses to establish structural similarities between oleracein compounds based on their MS2 features. The results of the clustering analysis were used to corroborate, supplement, or correct the structural elucidation. Oleraceins are characterized by 5,6-dihydroxyindoline-2-carboxylic acid N-acylated with either coumaric, caffeic, or ferulic acid. Most of them are 6-O-glucosylated (Table 1).
The sugar moieties of all characterized oleraceins in the literature are identified as β-D-glucopyranose [12,27,29], and so, in this paper, the hexose moieties are regarded as glucoses.
In our study, the lowest-m/z oleracein detected was oleracein A, with a deprotonated molecular ion [M−H]− at m/z 502.135. The characterization was limited to oleraceins with a mass up to 1 kDa, although heavier oleraceins were also detected.
In total, 82 candidate substances were automatically selected by DIF followed by DDF, based on the abovementioned criteria, and their MS2 fragmentation was manually inspected. The DDF results are presented in the Supplementary Material (Table S1) as m/z transitions for all identified oleraceins. Of the 82 candidates, 19 had too low an MS2 intensity (base peak below 1.5 × 10⁴) and were not interpreted, 12 were false positives (not having an oleracein structure), and 51 were identified as oleraceins (Table 2). Of these, 25 were new (undescribed in the literature) oleraceins. The other 26 compounds matched the structure of one of the already known oleraceins, namely: A, B, C, D (two isoforms), H, I (two isoforms), J, K, L, M (two isoforms), N/S (four isoforms), O (three isoforms), P, Q, W glu, X (three isoforms). Figure 2 presents the full-scan total ion chromatogram (TIC) and the extracted ion chromatogram (EIC) of the identified oleraceins. As observed, oleraceins are major components of the purslane hydromethanolic extract. Table 2 presents the chemical structures of the 51 identified oleraceins, and Table 3 provides their chromatographic and mass spectral characteristics. For complete chromatographic and mass spectral data, see Tables S2 and S3.
As mentioned earlier, all oleraceins identified in the literature are N-acylated with either coumaric, caffeic, or ferulic acid (i.e., bearing the substructure IC, IA, or IF). Each oleracein identified in our study is composed of either GIC, GIA, or GIF alone, or linked with several other moieties that include hydroxybenzoyl

Individual MS2 Fragmentation Analysis
Below, the tentative identification of all 51 oleraceins (Tables 2 and 3)

Diagnostic Ions
Since oleracein compounds have similar structures, we sought to refine the "fragment ion pool" and to select diagnostic fragment ions that can be used to describe the identified oleraceins. Thus, after thorough MS2 fragmentation analysis, we selected 43 fragment ions, with their elemental compositions and exact masses determined, as diagnostic ions for the identified substances (Table S4). Hence, each oleracein was described as a vector of length 43 with values equal to the relative intensities of the corresponding diagnostic ions. A fragment ion in the MS2 data of a particular oleracein was assigned to a diagnostic ion if its m/z was within 15 ppm of the diagnostic ion's m/z. All diagnostic ions had the following features: a mass greater than 100 Da, occurrence in two or more oleraceins, a mean percent intensity greater than 5%, and a maximum percent intensity greater than 10%. The diagnostic fragment ions, along with their structures, are shown in Table S4 and discussed below in order of increasing mass.
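The assignment rule described above (a 15 ppm window around each diagnostic m/z, zero where nothing matches) can be sketched as follows. This is only an illustration, not the procedure used in the study: the list holds six of the 43 diagnostic ions quoted in the text, the function names are hypothetical, and two choices are assumptions, namely that intensities are scaled to the base peak of the spectrum and that the most intense peak is kept when several fall inside the window.

```python
# Six of the 43 diagnostic ions quoted in the text (the full list is in Table S4).
DIAGNOSTIC_MZ = [137.0243, 145.0294, 161.0243, 175.0400, 194.0458, 340.0826]


def ppm_error(observed, reference):
    """Signed mass error of `observed` relative to `reference`, in ppm."""
    return (observed - reference) / reference * 1e6


def to_diagnostic_vector(spectrum, diagnostic_mz=DIAGNOSTIC_MZ, tol_ppm=15.0):
    """Describe an MS2 spectrum as a fixed-length vector of relative intensities (%).

    `spectrum` is a list of (m/z, intensity) pairs. Diagnostic ions with no
    matching peak get zero intensity.
    """
    if not spectrum:
        return [0.0] * len(diagnostic_mz)
    # Assumption: relative intensity is expressed against the base peak.
    base_peak = max(intensity for _, intensity in spectrum)
    if base_peak <= 0:
        return [0.0] * len(diagnostic_mz)
    vector = []
    for ref in diagnostic_mz:
        matches = [i for mz, i in spectrum if abs(ppm_error(mz, ref)) <= tol_ppm]
        # Assumption: if several peaks fall in the window, keep the most intense.
        vector.append(100.0 * max(matches) / base_peak if matches else 0.0)
    return vector


# Hypothetical spectrum, for illustration only.
example = [(145.0296, 5000.0), (194.0455, 2500.0), (340.0830, 10000.0), (300.0, 800.0)]
print(to_diagnostic_vector(example))
```

Stacking one such vector per compound gives the rows of the ions matrix used for clustering.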
Fragment ion 137.0243 m/z is derived from the hydroxybenzoyl (O) moiety, as described in the characterization of OGGIC and OGGICG. The coumaroyl (C) moiety is evidenced by fragment ion 145.0294 m/z. The caffeoyl (A) can be confirmed by fragment ion 161.0243 m/z; however, if it is not linked to the indoline core (I), fragment 179.0349 m/z may be present as well. Fragment 161.0243 m/z might also indicate a feruloyl (F); however, in our study, the caffeoyl produced > 60%, whereas the feruloyl produced < 30% intensity of fragment ion 161.0243 m/z. Fragment ion 175.0400 m/z indicates the presence of F. The appearance of fragment 193.0506 m/z might indicate F not linked to I. The I is characterized by fragment ions 194.0458, 150.0560, and 148.0403 m/z, in decreasing intensity. Fragment 194.0458 m/z, as well as the characteristic fragments for the HCAs, are usually very prominent; their intensity decreases with increasing mass of the molecule, unless the molecule possesses easily cleavable moieties, like consecutive glucoses, as in GGGICG or GGICGG. Fragment 205.0505 m/z is observed in all identified substances bearing the sinapoyl (S) moiety, as is fragment 223.0611 m/z, at lower intensity. As S is encountered only linked to a glucosyl (G), fragment ions 562.1565 and 367.1034 m/z, corresponding to the SGI and SG substructures, respectively, can indicate the presence of S.
As mentioned above, the ind-HCA structures IC, IA, and IF are confirmed by their corresponding fragment ions: 340.0826, 356.0775, and 370.0931 m/z, respectively. However, as the mass of the oleracein increases, the intensities of these characteristic fragments may decrease. If the oleraceins bear easily cleavable moieties, like two or three consecutively linked G, the fragment ions indicating the ind-HCA structures may be more prominent. Thus, in the oleracein GGGICG, where the GGG cleave together as a neutral loss of 486.159 Da, a high intensity of fragment 340.0826 m/z (64%) is observed, as well as of fragment 145.0294 m/z (100%). On the other hand, in the oleracein AGGIC, lower intensities of both these fragments are observed (11% and 37%, respectively).
The IC fragment at 340.0826 m/z undergoes consecutive CO2 and CO cleavages, resulting in fragments 296.0927 and 268.0978 m/z, respectively, in decreasing intensity.
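The neutral losses quoted in this section can be checked with simple monoisotopic arithmetic. The sketch below uses standard monoisotopic atomic masses and assumes that each lost glucosyl corresponds to an anhydrohexose residue (C6H10O5); the variable and function names are illustrative only.

```python
# Standard monoisotopic atomic masses (Da).
MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def formula_mass(counts):
    """Monoisotopic mass of a neutral fragment given element counts."""
    return sum(MASS[element] * n for element, n in counts.items())


CO2 = formula_mass({"C": 1, "O": 2})
CO = formula_mass({"C": 1, "O": 1})
# Assumption: a cleaved glucosyl is lost as an anhydrohexose residue, C6H10O5.
HEXOSE_RESIDUE = formula_mass({"C": 6, "H": 10, "O": 5})

ic_fragment = 340.0826               # IC fragment ion from the text (m/z)
after_co2 = ic_fragment - CO2        # compare with 296.0927 m/z in the text
after_co = after_co2 - CO            # compare with 268.0978 m/z in the text
ggg_loss = 3 * HEXOSE_RESIDUE        # compare with the 486.159 Da loss in the text

print(f"{after_co2:.4f} {after_co:.4f} {ggg_loss:.4f}")
```

The computed values agree with the reported fragments to within about 0.1 mDa; the residual comes from rounding the starting m/z to four decimals.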

Clustering Methods
The clustering methods express the similarity between the oleraceins based on their MS2 features. Initially, an m × n ions matrix was created, describing every oleracein (rows, m = 51) by its diagnostic ions (columns, n = 43), shown in Table S5. Every cell in the matrix represents the percentage intensity of a diagnostic ion for a particular oleracein. If a diagnostic ion was missing in the MS2 data of a compound, zero intensity was assigned. These data were imported into RStudio and processed further with the R programming language. Different methods were used to cluster the oleraceins based on their MS2 features. As a primary step, a 51 × 51 distance matrix was created by calculating the Euclidean distances between the rows of the ions matrix. Figure 5 shows the ordered and unordered heatmaps, using the data from the distance matrix.
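As an illustration of the primary step, the sketch below builds a pairwise Euclidean distance matrix from a small ions matrix. The study performed this step in R; this is a plain-Python re-expression with three hypothetical compounds and three hypothetical ion columns, not the data of Table S5.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclidean(rows[i], rows[j])
    return dist


# Hypothetical ions matrix: rows are compounds, columns are diagnostic ions (%).
ions = [
    [100.0, 0.0, 64.0],   # compound 1
    [37.0, 0.0, 11.0],    # compound 2
    [0.0, 100.0, 0.0],    # compound 3
]
dist = distance_matrix(ions)
for row in dist:
    print([round(d, 1) for d in row])
```

With the full 51 × 43 matrix the same function would return the 51 × 51 matrix that underlies the heatmaps.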
Then, k-means and PAM (partitioning around medoids) clustering were used to estimate the optimal number of clusters. The clustering observed in the ordered heatmap (Figure 5), as well as the results of the k-means and PAM clustering (Supplementary Material Figures S1 and S2), suggested partitioning the data into eight clusters (Figure 6). The distribution of individuals among the groups can be found in Table S6.
Next, principal components were calculated. The scree plot (representing the percentage of variance explained by each principal component) can be found in Figure S3. The PCA visualization is presented in Figure 7, where the color gradient from orange (darker) to blue (lighter) represents the quality of representation (cos2), from high to low. A high cos2 indicates a good representation of the variable on the principal component, and vice versa. Three groups can be distinguished, characterized by either the IA (1st quadrant), IF (2nd quadrant), or IC (4th quadrant) substructure. The quality of representation (cos2) of individuals, as well as a visualization of the contribution of individuals to PC1 and PC2, are given in Figures S4 and S5. Figure 8 depicts the clustering of the individual fragment ions: positively correlated variables (fragment ions) point to the same side of the plot, and vice versa. Thus, we can clearly observe which fragment ions "go together" (are parent ion → daughter ion). As expected, and consistent with Figure 7, the fragment ions corresponding to IA (1st quadrant), IF (2nd quadrant), and IC (4th quadrant) are observed as described in the "diagnostic ions" section and presented in Figure 3.
Next, hierarchical clustering was employed with method = "average". The cophenetic correlation coefficient was calculated to be 0.85. The hierarchical clustering is visualized as a dendrogram and a phylogenetic tree in Figure 9, where the tree was "cut" into 8 parts.
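For readers unfamiliar with the cophenetic correlation coefficient, the following sketch shows how it can be obtained for average-linkage (UPGMA) clustering. It is a minimal plain-Python illustration on a hypothetical four-point distance matrix, not the R code used in the study; ties between equally close clusters are broken by cluster index, which is an arbitrary choice.

```python
import math


def average_linkage_cophenetic(dist):
    """Run average-linkage agglomeration on a square distance matrix.

    Returns the cophenetic matrix: entry [i][j] is the height at which
    observations i and j are first joined in the same cluster.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}   # cluster id -> member indices
    coph = [[0.0] * n for _ in range(n)]

    def linkage(a, b):
        # Average linkage: mean of all pairwise distances between two clusters.
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        height, a, b = min(
            (linkage(p, q), p, q) for k, p in enumerate(ids) for q in ids[k + 1:]
        )
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


def cophenetic_correlation(dist, coph):
    """Pearson correlation between original and cophenetic distances."""
    n = len(dist)
    x = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    y = [coph[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical distances between four points on a line at 0, 1, 5, and 6.
toy = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
]
toy_coph = average_linkage_cophenetic(toy)
print(round(cophenetic_correlation(toy, toy_coph), 3))
```

A coefficient close to 1 indicates that the dendrogram heights preserve the original pairwise distances well.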
Overall, the different clustering methods resulted in similar clustering. In accordance with the proposed structures, different isoforms with the same structure are clustered together (such as FGGIC, FGGIC.1, FGGIC.2, etc.). In general, oleraceins are grouped depending on the presence of one of the three common substructures, GIC, GIA, or GIF, which give rise to other diagnostic fragment ions. Beyond that, different combinations or permutations of substructures lead to specific diagnostic fragment ions.
Oleraceins GIC, FGIC, SGIC, GICG, FGICG, and SGICG have either GIC or GICG in common and are grouped together. Next, GGICGG has the unique feature of a GG attached to the N-coumaroyl. It does, however, show some similarity to GGICG, GGGICG, and GGGICG.1. The latter three oleraceins cluster together, with the ICG substructure in common. Next, a cluster composed of GGIC, CGGIC, FGGIC, FGGIC.1, FGGIC.2, FGGIC.3, SGGIC, and SGGIC.1 is formed, with GGIC as a common feature and no G attached to the N-coumaroyl. The two representatives possessing O are clustered together; fragment ion 137.0243 m/z, corresponding to the O substructure, is exhibited at high intensity and is unique to OGGIC and OGGICG. Oleracein AGGICG shows structural uniqueness but nevertheless demonstrates similarity with AGGIC, GAGGIC, and GAGGIC.1, which is consistent with their proposed structures. Another cluster, composed of AGGIA, AGGIF, AGGIC, GAGGIC, and GAGGIC.1, is formed with the AGGI substructure in common. The A not attached to I is indicated by the appearance of fragment ion 179.035 m/z, in addition to 161.0243 m/z, which is characteristic of A. Moreover, fragment ion 680.1831 m/z, which indicates GGIA, GIAG, or AGGI, is exhibited at prominent intensity. AGGGIC shows distinctive MS2 features due to, on the one hand, the proximal A and, on the other hand, the GGG. Other clustered oleraceins bearing the GIA substructure include GGIA, FGGIA, and SGGIA, with the GGIA substructure in common.
GIAG and GGIAG, as well as GIA and SGIA, are clustered pairs, which is also evident from their structures. Oleraceins FGGIF, FGGIF.1, FGGIF.2, GGIF, GGIF.1, SGGIF, SGGIF.1, and SGGIF.2 are grouped with GGIF as a common substructure. Another grouping is formed between oleraceins with the GIFG common structural feature, namely GIFG, GGIFG, and GIFG.1. The rest of the oleraceins bearing the GIF do not possess G attached to the N-feruloyl, have a single G attached to I, and not GG or GGG. They are grouped and represented with oleraceins GIF, FGIF, GFGIF, and SGIF.
It is worth mentioning that the calculated similarities do not provide a direct quantitative measure of structural similarity. Some substructures (or moieties) are represented by more than one diagnostic fragment ion, and others are not. Additionally, some diagnostic fragment ions exhibit, in general, higher intensity than others. Nevertheless, the clustering analysis demonstrates that this approach can provide additional insight into the relationships between MS2 fragment ions and outline groups of parent ion → daughter ion.

Extraction and Sample Preparation
The hydromethanolic extracts of purslane were obtained as described in our previous study [28] with small modifications. In brief, air-dried aerial parts of purslane were powdered, and 3.00 g of plant material was extracted twice by sonication with 10 mL of 80% MeOH at 50 °C for 15 min in an ultrasonic bath. The combined extracts were filtered and diluted to 25 mL in volumetric flasks with 80% MeOH. The solutions were filtered through a 0.22 µm syringe filter, and 1 µL was injected into the LC instrument for LC-MS analysis.

UHPLC-HR-MS Instrument
The UHPLC system consisted of Dionex UltiMate 3000 RSLC HPLC, equipped with an SRD-3600 solvent rack degasser, an HPG-3400RS binary pump with solvent selection valve, a WPS-3000TRS thermostated autosampler, and a TCC-3000RS thermostated column compartment (Thermo Fisher Scientific, Germering, Germany). The UHPLC system was controlled by Chromeleon software, version 7.2. The effluents were connected on-line with a Q Exactive Plus mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) equipped with a heated electrospray ionization (HESI-II) probe.

Chromatographic Parameters
Elution was carried out on a Kromasil EternityXT C18 column (1.8 µm, 2.1 × 100 mm, Akzo Nobel, Bohus, Sweden) maintained at 40 °C. The chromatographic conditions were as described elsewhere [28] with slight modifications. The binary mobile phase consisted of A: 0.1% formic acid in water, and B: 0.1% formic acid in acetonitrile. The total run time was 23.5 min. The acquisition time during which substances were analyzed by MS was 18 min, set between the 2nd and the 20th min. The following gradient was used: the mobile phase was held at 5% B for 0.5 min and then gradually increased to 33% B over 19.5 min. Next, % B was increased gradually to 95% over 1 min and maintained at 95% B for 2 min. The system was returned to the initial condition of 5% B in 1 min and re-equilibrated over 4 min. Oleraceins eluted between the 5th and 17th min. The flow rate and the injection volume were set to 300 µL/min and 1 µL, respectively.

Mass Spectrometric Parameters
For MS2 fragmentation analysis, several normalized collision energies (NCE) were tested to select the optimal conditions. An NCE of 20 gave a satisfactory abundance of a variety of heavier fragment ions, and an NCE of 40 provided good intensity for specific fragment ions of lower m/z; thus, a stepped NCE of 20-40 was selected for the initial screening of oleraceins. Mass spectrometric parameters for full-scan MS were as follows: resolution 17,500 (at m/z 200); AGC target 1.0 × 10⁶; maximum IT 83 ms; scan range 500-2000 m/z. For dd-MS2, the following parameters were used: TopN 10; isolation window 1.0 m/z; stepped NCE 20-40; minimum AGC target 8.0 × 10³; intensity threshold 9.6 × 10⁴; apex trigger 2 to 6 s; dynamic exclusion 3 s. The structural elucidation of the oleraceins was achieved by manual inspection of the MS2 spectra in Xcalibur 4.2 software (Thermo Fisher Scientific).

Mass Spectral Filtering by Diagnostic Ion Filtering (DIF) and Diagnostic Difference Filtering (DDF)
Initially, vendor *.raw (Thermo Fisher Scientific) files were converted to *.mzML files by msConvertGUI 3.0 (ProteoWizard) [31] and imported into MZmine 2.53 [32]. Then, DIF was applied based on the presence of two of the specific fragment ions for 5,6-dihydroxyindoline-2-carboxylic acid (referred to below as the "indoline core"): 194.0459 m/z (chemical formula: C9H8O4N−) and 150.0560 m/z (chemical formula: C8H8O2N−) (Figure 3), with a ±15 ppm threshold. MZmine also offers "diagnostic neutral loss" filtering, which searches for specific mass difference(s) only between the precursor ion and each of its fragments. However, since we were also interested in the specific mass difference between fragment ions of the same precursor ion (Figure 4), a DDF approach was applied to refine the selection of molecules that presumably possess the 5,6-dihydroxyindoline-2-carboxylic acid substructure. DDF involved searching for a specific mass difference between each fragment (including the precursor ion, even if it was not present in the MS2 spectrum) and all lower-m/z fragment ions. That difference was set to 195.05316 Da and suggested a neutral loss of 5,6-dihydroxyindoline-2-carboxylic acid. DDF was implemented as an in-house script written in the Python 3.7.1 programming language. The threshold was set to ±15 ppm of each of the two ions from which the difference originated. Thus, in the fragmentation transition 340.0848 m/z → 145.0300 m/z, with a threshold of ±15 ppm, the searched difference was between 145.0300 ± 15 ppm and 340.0848 ± 15 ppm (i.e., from 195.0475 Da to 195.0621 Da). Consequently, if the difference originated from heavier fragments, a wider mass window was used, and vice versa.
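The in-house DDF script is not reproduced in this paper. The following is a minimal sketch of the logic described above, under the assumption that the tolerance window for a difference is the sum of the ±15 ppm windows of the two ions involved; the function name and the example precursor and fragment values are illustrative.

```python
INDOLINE_LOSS = 195.05316  # Da, neutral loss of the indoline core (from the text)


def ddf_transitions(precursor_mz, fragment_mzs, target=INDOLINE_LOSS, tol_ppm=15.0):
    """Return (higher m/z, lower m/z, difference) for every ion pair whose
    mass difference matches `target` within the combined ppm window."""
    # The precursor is always included, even if absent from the MS2 spectrum.
    ions = sorted(set(fragment_mzs) | {precursor_mz}, reverse=True)
    hits = []
    for k, high in enumerate(ions):
        for low in ions[k + 1:]:
            diff = high - low
            # Assumption: window = 15 ppm of the higher ion + 15 ppm of the lower ion.
            window = tol_ppm * 1e-6 * (high + low)
            if abs(diff - target) <= window:
                hits.append((high, low, diff))
    return hits


# Illustrative input built around the transition quoted in the text.
hits = ddf_transitions(502.1355, [340.0848, 194.0459, 145.0300])
for high, low, diff in hits:
    print(f"{high:.4f} -> {low:.4f} (difference {diff:.4f} Da)")
```

For the transition 340.0848 → 145.0300 m/z the combined window is about ±0.0073 Da, which reproduces the 195.0475 to 195.0621 Da range given in the text.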

Grouping of MS2 Scans
In order to group MS2 scans that presumably derive from the same substance, MS2 scans with precursor ion m/z within 15 ppm and within 1.5% deviation in retention time were added together and afterwards manually checked. In these grouped MS2 scans, fragment ions within 15 ppm m/z of each other were considered identical, their intensities were added, and their masses were recalculated by weighted mean averaging:

$$(m/z)_{avg} = \frac{\sum_i (m/z)_i \cdot int_i}{\sum_i int_i}$$

where $(m/z)_{avg}$ is the recalculated m/z value, and $(m/z)_i$ and $int_i$ are the m/z and the intensity of the ith fragment ion, respectively. Fragment ions having less than 0.5% intensity and mass < 100 Da were excluded. The retention time of the precursor ion with the highest intensity was chosen as the retention time of the grouped MS2 scans, i.e., the peak apex.
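The merging of fragment ions across grouped scans can be sketched as follows. This is an illustrative implementation only: it assumes a greedy pass over peaks sorted by m/z, comparing each peak with the running intensity-weighted mean of the current group, and it omits the intensity and mass exclusion filters. The function name and example peaks are hypothetical.

```python
def merge_fragments(peaks, tol_ppm=15.0):
    """Merge (m/z, intensity) peaks pooled from grouped MS2 scans.

    Peaks within `tol_ppm` of the current group are combined: intensities
    are summed and m/z becomes the intensity-weighted mean.
    """
    merged = []  # each entry: [sum of m/z * intensity, sum of intensity]
    for mz, intensity in sorted(peaks):
        if intensity <= 0:
            continue  # zero-intensity peaks carry no weight
        if merged:
            weighted_sum, intensity_sum = merged[-1]
            current_mz = weighted_sum / intensity_sum
            if abs(mz - current_mz) / current_mz * 1e6 <= tol_ppm:
                merged[-1] = [weighted_sum + mz * intensity, intensity_sum + intensity]
                continue
        merged.append([mz * intensity, intensity])
    return [(w / i, i) for w, i in merged]


# Hypothetical peaks pooled from two scans of the same precursor.
pooled = [(145.0290, 1000.0), (194.0458, 500.0), (145.0300, 3000.0)]
result = merge_fragments(pooled)
for mz, intensity in result:
    print(f"{mz:.5f} {intensity:.0f}")
```

Here the two peaks near 145.03 m/z are about 7 ppm apart, so they collapse into a single ion whose m/z is pulled toward the more intense peak.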

Used Abbreviations
For simplicity and clarity of presentation, the following abbreviations are used throughout this paper: hydroxycinnamic acid: HCA; hydroxybenzoyl: hb or O; coumaroyl: coum or C; caffeoyl: caff or A; glucosyl: glu or G; feruloyl: fer or F; indoline core: ind or I; sinapoyl: sin or S. In case multiple oleraceins bore the same structure, the names of the compounds were suffixed with numbers, e.g., FGGIC, FGGIC.1, FGGIC.2, etc. (Tables 2 and 3).
Table S7. Representative raw HR-Orbitrap-MS2 spectra of the 51 identified oleraceins in negative ionization mode; Figure S1: Estimating the optimal number of clusters with k-means clustering; Figure S2: Estimating the optimal number of clusters with PAM clustering; Figure S3: Scree plot representing the percentage of variance explained by each principal component; Figure S4: Visualization of the quality of representation of individuals (cos2); Figure S5: Visualization of the contribution of individuals to PC1 and PC2.