Hopomics: Humulus lupulus Brewing Cultivars Classification Based on LC-MS Profiling and Nested Feature Selection

Omics approaches in plant analysis find many different applications, from classification to new bioactive compounds discovery. Metabolomics seems to be one of the most informative ways of describing plants’ phenotypes, since commonly used methods such as liquid chromatography–mass spectrometry (LC-MS) and nuclear magnetic resonance spectroscopy (NMR) could provide a huge amount of information about samples. However, due to high efficiency, many disadvantages arise with the complexity of the experimental design. In the present work, we demonstrate an untargeted metabolomics pipeline with the example of a Humulus lupulus classification task. LC-MS profiling of brewing cultivars samples was carried out as a starting point. Hierarchical cluster analysis (HCA)-based classification in combination with nested feature selection was provided for sample discrimination and marker compounds discovery. Obtained metabolome-based classification showed an expected difference compared to genetic-based classification data. Nine compounds were found to have the biggest classification power during nested feature selection. Using database search and molecular network construction, five of them were identified as known hops bitter compounds.


Introduction
Humulus lupulus is one of the species representing the genus Humulus in the family Cannabaceae. It is a perennial, herbaceous, climbing liana that grows in a temperate climate. This plant left a remarkable trace in different countries' cultures, mainly as a medicinal plant. As well as a multifunctional beer drinks component, it acts both as a preservative and one of the main flavoring properties of the drink's sources. Nowadays, the brewing industry remains the main consumer of hops. Volume of annually harvested hops is estimated at hundreds of thousands of tons [1]. For the needs of brewing, female flowers are required, which have a cone-shaped form and glands on their surface. The last secrete the bulk of the secondary metabolites that compose the plant flavor profile. Breeding techniques development and wide distribution of hops around the world let us have more than a hundred varieties cultivated nowadays [2], some of which were bred already in the 21st century [3,4]. In addition, wild hops are also considered as a raw material for brewing [5,6]. The multiple techniques used in beer hopping as well as the wide range of hop varieties cultivated today [7] make it possible to produce beverages with a rich palette of flavors and aromas [8].
The lack of genealogical information about the individual species' origin is formed due to the absence of systematizing information. The protocols about the derived cultivars and their initial stage of breeding development are unavailable due to commercial secrecy. All this together makes it difficult to systematize used varieties. To solve this problem, there are many approaches that can be divided into two groups: genetic and phenotypic. The development of genetic techniques, including whole genome sequencing, has opened absence of a predictive marker in the chosen set (for methods such as EST-SSR) or the big influence of a single gene expression on overall plant phenotype, and as a result, on its metabolome [19,20]. For human cultivated species, both genetic and metabolomic approaches are important: it is necessary to estimate connections between cultivars on both genetic and small organic molecules' composition levels. Considering all mentioned details, Humulus lupulus was chosen as our study object, for proving the hypothesis that there is no clear correlation between genome-and metabolome-based classification for plants. Influence of the region of growth was also going to be estimated. The collection was formed (see Section 3.2) based on the sample's region of growth and genetic origin, which was obtained from the study on hops EST-SSR-based classification [10]. A total of 18 cultivars were analyzed in the present study, each was represented by three biological repeats (2019-2020 harvests). Samples were divided into three groups based on their geographic origin: Central Europe, North America and Oceania (six samples in each); in two groups by genetic classification [10]: North American (6) and European germplasm (12); and also in two groups according to plant usage in beer flavoring: aroma (10) and dual use (8). Detailed information about selected cultivars is provided in Table S1.

Data Pre-Treatment
Usage of LC-MS instruments provides a huge, informative dataset for metabolomics research, but the complexity of both LC and MS instruments together with specifics of the large-scale analysis brings many issues, such as batch and matrix effects [21], which could affect results of the following statistical analysis. Most of the negative effects could be reduced by the development of experimental design, including experimental sample randomization [22], usage of quality control (QC) and blank samples [23] and careful data pre-processing steps before results interpretation. There are various workflows and tools designed for LC-MS data pre-treatment [24][25][26][27], and most of them are based on Rprogramming language and corresponding packages. The pipeline of the present study was based on "Omics Untargeted Key Script" [28] (computational details in Section 3.5, experimental design and LC-MS instrumentation-3.3, 3.4). The missing value filling step was done by Random Forest model-based imputation strategy described in [29], as this approach works well on samples with large similarity, which positively affects model construction time. Principal component coordinates were computed for QC and individual samples for quality assurance purposes. As displayed in Figure 1, all QC samples are placed in the center in case of both positive and negative ion signals, which indicates properly prepared QC samples and well-constructed experimental sequences.
Before the signal correction step, signal reproducibility was checked by RSD [30] calculation between QC samples-88% (3127) and 90% (3023) of signals had RSD less than 30% for negative and positive polarity, respectively. The XGBoost model was utilized for signal correction as described in Section 3.5. After the correction step, RSD value among QC samples was checked again as mentioned above and, as corrected peak areas had good reproducibility between QC samples, no features have been filtered. All steps mentioned above allowed us to reduce the influence of the unwanted technical variations and perform further classification and feature selection tasks correctly. Before the signal correction step, signal reproducibility was checked by RSD [30] calculation between QC samples-88% (3127) and 90% (3023) of signals had RSD less than 30% for negative and positive polarity, respectively. The XGBoost model was utilized for signal correction as described in Section 3.5. After the correction step, RSD value among QC samples was checked again as mentioned above and, as corrected peak areas had good reproducibility between QC samples, no features have been filtered. All steps mentioned above allowed us to reduce the influence of the unwanted technical variations and perform further classification and feature selection tasks correctly.

Metabolome-Based Classification
A large amount of data obtained during metabolomic profiling (more than 3000 peaks in the present study) leads to confusion in further results. Dimensionality reduction

Metabolome-Based Classification
A large amount of data obtained during metabolomic profiling (more than 3000 peaks in the present study) leads to confusion in further results. Dimensionality reduction techniques can be performed in two ways: feature selection by statistical tests or machine learning methods to eliminate statistically non-informative variables and feature extraction by principal component analysis (PCA) or other algorithms in order to transform the whole feature space into new variables. PCA and other similar data representation techniques [31] are also often used in the first steps of LC-MS experimental data treatment for exploratory purposes, such as quality assurance [23]. In some cases, it is even possible to cluster sample dots for group labels on a score plot by visual analysis of the first two or three principal components spaces [32,33]. In the present study, a combination of two unsupervised methods was utilized (see Section 3.6 for more details). PCA was employed for multivariate projection and HCA was performed both for distance calculation between sample dots on the score plot and assignment of the cluster group. The procedure of receiving a new sample group label should be considered as one of the primary aims of the present study, due to the lack of solid confidence in cultivars labels based on genetic data, despite its relevance compared to information about the geographic origin or cultivar's aroma characterization [34]. In a similar way, a combination of different algorithms (both univariate and multivariate) was applied for feature selection and dimensionality reduction.
Thus, class labels were received from the combination of both information sources: HCA results (Section 3.6, Figure 2, Table 1) and the geographic origin of the samples, and PCA was used to describe the distribution of experimental samples in the high-dimensional feature space ( Figure 3).
niques [31] are also often used in the first steps of LC-MS experimental data treatment for exploratory purposes, such as quality assurance [23]. In some cases, it is even possible to cluster sample dots for group labels on a score plot by visual analysis of the first two or three principal components spaces [32,33]. In the present study, a combination of two unsupervised methods was utilized (see Section 3.6 for more details). PCA was employed for multivariate projection and HCA was performed both for distance calculation between sample dots on the score plot and assignment of the cluster group. The procedure of receiving a new sample group label should be considered as one of the primary aims of the present study, due to the lack of solid confidence in cultivars labels based on genetic data, despite its relevance compared to information about the geographic origin or cultivar's aroma characterization [34]. In a similar way, a combination of different algorithms (both univariate and multivariate) was applied for feature selection and dimensionality reduction.
Thus, class labels were received from the combination of both information sources: HCA results (Section 3.6, Figure 2, Table 1) and the geographic origin of the samples, and PCA was used to describe the distribution of experimental samples in the high-dimensional feature space ( Figure 3).  Table S1. Table 1. List of cultivars analyzed in the present study. The last column is dedicated to groups, to which cultivars were assigned by the results of the study.

Cultivar
Region   the Table S1. North American Aroma 1 Figure 3. Projection of study samples on principal components surface. Newly assigned groups are demonstrated by color, geographic origin is represented by points shape.
As shown, HCA assigned cluster membership similar to their visually observable location. Furthermore, for each sample, biological repeats are grouped together, so it can be concluded that there is no significant difference between harvests. Compared to genetic classification results from [10], all "European germplasm" origin samples were assigned to the same cluster with four more cultivars from the "North American germplasm" group: Hallertau Mittelfruh, Amarillo, Cascade and Kohatu, which are reported to contain rela-  As shown, HCA assigned cluster membership similar to their visually observable location. Furthermore, for each sample, biological repeats are grouped together, so it can be concluded that there is no significant difference between harvests. Compared to genetic classification results from [10], all "European germplasm" origin samples were assigned to the same cluster with four more cultivars from the "North American germplasm" group: Hallertau Mittelfruh, Amarillo, Cascade and Kohatu, which are reported to contain relatively low (3-7%) alpha-acids in their composition. Such results prove the hypothesis that there is no clear correlation between plants' genome and metabolome. Genomic research, unfortunately, could not tell us about expression level of genes related to enzymes responsible for synthesis of the small organic molecules in the plant. Additionally, it is typical for plants to contain wide non-coding or some kind of regulatory regions, which do not have a direct influence on the resulting phenotype. Additionally, obtained sample labels, in most cases, match accepted cultivars classification [34] by aroma and flavor, even in conditions when volatile compounds, which primarily affect Humulus lupulus taste and odor characteristics, have not been the aim of our work. Another interesting task is a comparison between cultivar genetic and geographic origins. As an example, studies [35,36] were focused on untargeted metabolomics analysis for cultivar and geographic origin discriminations of olive oil and hazelnut (Corylus avellana L.) and significant overlapping between marker compounds for those two levels of classification was not reported. As the plot above shows, only some cultivars from the Oceania region seem to be distancing from the studied dataset (see Figure S6 for plot with marked sample names). Internationally standardized harvest conditions of studied species, as well as the major influence of genotype, could explain such a situation. As the region of growth does not have a significant influence on our sample, in the next step, most discriminative marker compounds for our metabolome-based classifi-cation, which is based on HCA results, were derived using a rational cut-off approach in further statistical analysis.
Rational cut-off is one of the most frequently used techniques in chemometrics for feature selection. It is usually based on the simultaneous application of statistical tests combination to a studied dataset with different rigors and natures, which leads to the selection of only variables that satisfy a list of statistical conditions (cut-off). In the present work for feature selection, two classical mean comparison tests with different rigor levels were utilized: moderated t-test (as one of the most rigorous test for comparing means due to shrink deviation of variables) and fold change calculation, and also partial least squares (PLS) classification model computation as an "orthogonal" supervised multivariate method. A moderated t-test was implemented by Limma package [37] with Benjamini−Hochberg pvalue adjustment method for multiple comparisons. Those calculations were implemented in a nested-like feature selection scheme (Section 3.6) [38,39] to avoid negative effects associated with small dataset size. Cross-validation based on 10-fold splitting (outer loop, Section 3.6) was performed in order to avoid selection bias effect [40] and to provide multiple estimations of classification ability. Classification accuracy was tested with a generalized linear model (GLM) on each outer-fold, which showed maximum classification accuracy for each of 10 dataset splits. Bootstrapping on each inner fold was performed to reduce the influence of the small dataset size. Obtained bootstrapped datasets were further proceeded through a rational cut-off step.
Sixty-two signals met the criteria established for maximum prediction power. Finally, the descriptive ability of selected metabolites was checked by providing HCA as well as the distribution of sample dots on the space of the first two principal components (Supplementary material, Figures S4 and S5). Misclassifications were not observed in the case of cluster analysis and the visual separation of groups on the PCA plot was increased.
It should be noted before structure elucidation that signal annotation was not performed at the preprocessing stages. As a result, among 62 derived signals (33 for positively charged ions and 29 for negative), 19 groups of signals were clustered by retention time. Each group contained expected a de-or protonated molecular ion signal, isotope and adduct signals, as well as smaller m/z values, which could be assigned to individual compounds or to ions formed during in-source fragmentation. The signal's type was investigated during subsequent analysis of the fragmentation spectra.

Identification of Marker Compounds
The correct way of unknown identification in complex mixtures such as plant extracts is to isolate compounds in their individual state and to provide multi-methods analysis for structure verification, as it was previously performed for studied plants in works [41][42][43][44]. However, as target compounds could occur in analyzed samples at extremely low concentrations, their isolation becomes very time-consuming. In that case, information obtained from LC-MS experiments: relative retention time, accurate mass, isotope distribution pattern and fragmentation spectra can help to identify a putative structure of a target compound with variable levels of confidence [45]. Furthermore, as not every plant species is studied enough and literature data is not systematized, not much information could be found in public massspectra databases compared to a situation with mammalian metabolites. So, the only way to identify minor plant components is to use such databases and resources as KNaPSAcK [18] or GNPS [46] in combination with tools for fragmentation spectra prediction [47,48] or molecular networking [49]. Such a pipeline allows us to find connections with similar compounds if the target one is not presented in a database. After filtration inside each peak group, only 11 pairs of signals were left (for positive and negative ions), apparently related to protonated and deprotonated molecules. The molecular formula was predicted using the Shimadzu Formula Predictor tool for each signal. Molecular formulas, prediction errors in ppm, as well as VIP (variable importance in projection) from the PLS model and fold change values are provided in Table 2. Concentration of all of the derived markers are greater in group 1 (Table 1) as the fold change value is greater than 0. As will be discussed below, most of the identified compounds belong to the alpha-acids class. Differences in the abundance between classes correlate with the results of the classification discussed above in Section 2.3: all cultivars in group 1 are reported to have a high concentration of alpha-acids. Obtained formulas were searched within the KNaPSAcK metabolite library and articles associated with Humulus profiling. After that, fragmentation spectra were obtained and processed via the GNPS library search tool (see Section 3.7). To add one more verification level, some major compounds (Xanthohumol, 8-prenylnargenin, lupulone, etc.) were identified on obtained chromatograms to compare relative retention and fragment ions of the marker compounds with data from [50]. Compounds with the same fragmentation spectra as xanthohumol were found in GNPS, but low retention time (10.46 min) and abundance of the peak with the same m/z at the later part of the chromatogram allows us to suspect this as isoxanthohumol. Cohumulone and humulone were also identified by the GNPS library and its relative retention matches information from the source mentioned above. Other markers' relative retention times and fragments were matched to posthumulone and prehumolone from the same article [50]. To add one more verification step, a molecular network between target and major compounds of Humulus lupulus has been constructed. To obtain the major plant's compounds fragmentation spectra, data acquisition in DDA (data-dependent acquisition) mode was provided. Obtained spectra, total ion current (TIC) and extracted ion (XIC) chromatograms and informative parts of the molecular network are presented in Supplementary Materials (Figures S1-S7). There are two informative clusters describing connections between humulone derivatives and between xanthohumol and an unknown compound related to the deprotonated ion with m/z 399.1443. Analysis of neutral losses in humulone derivatives spectra confirms all suspected structures. The connection between xanthohumol and the unknown compound is explained by the presence of ions with m/z 119.048, 93.034 and 65.003, which are supposedly corresponding to the 4-ethylphenol fragment. Other fragments and their neutral losses are not matched, so it is the only information that can be obtained about the structure. To sum up, five marker signals were identified as known hops' bitter components and our findings are consistent with previous reports. Three other signals with fold change values close to 1 have not been annotated properly. Thus, the list of marker substances was proposed along with classification performance statistics ( Table 2). Suggested markers may be used for Humulus lupulus material quality control as well as intragenus discrimination studies.

Sample Collection and Preparation
Hop pellets of different cultivars imported by Beervingem (Russia) were purchased at a local brewing store. All samples were stored in hermetic packages at low temperatures (3-5 • C) before and after the homogenization step. As the study object was presented in pellets, for which production out of hope cones technology is unified, there should be no significant loss of its content [51] in this step.
In the sample preparation step, typical workflows for Humulus lupulus [14,15,52] were used and performed before each experimental batch, with the exception of homogenization and weighing. A 15 mL tube with 300 mg of homogenized pellets was filled with 10 mL of methanol and then placed in the ultrasonic bath for 1 h (at 25 • C) for extraction. The obtained extract was syringed through a PET-membrane filter and then diluted 10 times with methanol before injection into the LC-MS system. Signal correction procedure and quality control samples were used to eliminate the impact of the batch effect. For QC samples preparation, an equal mixture of all study samples was treated in all steps provided above. The obtained stock solution was stored at low temperature and was diluted the same as the samples.

Data Acquisition
An LCMS-IT-TOF (Shimadzu, Kyoto, Japan) system, equipped with two Nexera LC-20 AD XR pumps and SIL-20 AC XR autosampler was used for untargeted profiling. Chromatographic separation was performed on reversed-phase column Acclaim C18 3 µm Hybrid mass-spectrometer consisting of the ion trap and time-of-flight instruments equipped with an electrospray ionization source was used as a detection system. Data were acquired in both positive and negative polarities on scan (MS 1 ) detection mode in conditions listed below: interface voltage-4.5 kV for positive and −3.5 kV for negative charged ions transition; CDL and heat block temperature-200 • C; nebulizing gas flow-1.5 mL/min; cooling gas pressure-100 kPa. Mass-spectrometer was operated at 120-850 Da range, repeated three times during the 300 ms period with ion trap accumulation time set to 30 ms.

LC-MS Profiling
In order to avoid sample selection bias, randomization of the experimental sequence was performed before LC-MS profiling. The resulting batches were filled with blank-and QC-sampls injections (three and five per batch, respectively) for system stability check, and six chosen samples were analyzed twice per batch (total nine batches were acquired), as shown in Figure 4. Each hop species was analyzed in two technical replicates.

LC-MS Profiling
In order to avoid sample selection bias, randomization of the experimental seq was performed before LC-MS profiling. The resulting batches were filled with blan QC-sampls injections (three and five per batch, respectively) for system stability and six chosen samples were analyzed twice per batch (total nine batches were acq as shown in Figure 4. Each hop species was analyzed in two technical replicates. System was cleaned to reduce influence of contamination before and after each (chromatographic column with acetonitrile and mass-spectrometer's ion source, s cones and skimmer with a mixture of water and isopropyl alcohol).

Computation and Software
The total of 154 raw files (samples and QC data), that were obtained after the imental step, were then converted to mzXML format using msConvert [54] too peaking and subsequent alignment of chromatograms were performed by the xcm R package with parameters that were optimized by IPO [56] in order to obtain a pea (samples by row, features-columns). Results of the optimization and peak tables fo step are available at the Github repository. Obtained peak tables for each polarit subjected to the missing value imputation procedure using the missForest [29] pa A QC-based signal correction was performed by the XGBoost model as described and corrected peak area was calculated by the following equation: After the correction step, features were filtered by their reproducibility in QC ples, and as a result, signals with relative standard deviation (RSD) among QC sa more than 30% were removed. In the next step, technical repeats were averaged an System was cleaned to reduce influence of contamination before and after each batch (chromatographic column with acetonitrile and mass-spectrometer's ion source, sample cones and skimmer with a mixture of water and isopropyl alcohol).

Computation and Software
The total of 154 raw files (samples and QC data), that were obtained after the experimental step, were then converted to mzXML format using msConvert [54] tool. Peak peaking and subsequent alignment of chromatograms were performed by the xcms [55] R package with parameters that were optimized by IPO [56] in order to obtain a peak table (samples by row, features-columns). Results of the optimization and peak tables for each step are available at the Github repository. Obtained peak tables for each polarity were subjected to the missing value imputation procedure using the missForest [29] package. A QC-based signal correction was performed by the XGBoost model as described in [28], and corrected peak area was calculated by the following equation: After the correction step, features were filtered by their reproducibility in QC samples, and as a result, signals with relative standard deviation (RSD) among QC samples more than 30% were removed. In the next step, technical repeats were averaged and then all data tables were centered and scaled. The final peak table was generated by binding negative and positive polarity corresponding tables by columns.
All dendrograms were generated using Ward.D2 linkage algorithm based on the Manhattan distance between samples. Data matrix was scaled and centered prior PCA in all cases.
Basic operations and tables processing were performed by dplyr, data.table and stringr packages. Data visualization and projection (PCA, HCA) were implemented by FactoMineR and factoextra packages. Figures were created using ggplot2 library.

Semi-Supervised Classification and Feature Selection
In the first step, the number of data dimensions was reduced by computing principal components. Then, in the space of the first 10 principal components, our samples' coordinates (scores matrix) were extracted, and HCA was computed for clustering samples in two groups (metabolome-based classification). Obtained cluster membership was used to perform subsequent statistical analysis. Nested feature selection was implemented to determine a subset of significant variables, the algorithm is shown in Figure 5.

Semi-Supervised Classification and Feature Selection
In the first step, the number of data dimensions was reduced by computing principal components. Then, in the space of the first 10 principal components, our samples' coordinates (scores matrix) were extracted, and HCA was computed for clustering samples in two groups (metabolome-based classification). Obtained cluster membership was used to perform subsequent statistical analysis. Nested feature selection was implemented to determine a subset of significant variables, the algorithm is shown in Figure 5. Then, 10-fold cross validation splitting was performed on a full experimental dataset (54 samples) as an outer loop. Then, each of all remaining nine inner folds were bootstrapped five times. So, we can validate feature subsets on 10 independent datasets. Feature significance was evaluated and selected by three statistical criteria: adjusted p-value from moderated t-test < 0.05, fold change value > 1, and VIP value from PLS model is greater than 1. Then we intersected the results of rational cut-off between resampled data within each bootstrapped inner fold, and, finally, between the overall outer loop. Before final intersection, each signal's list was validated using a GLM with outer fold for test purposes. The Limma [37] R package was used for adjusted p-value and fold change calculations with following parameters: least squares model as fitting method and Benjamini and Hochberg method for p-value adjustment. The PLS model was constructed using ropls package [57] with 10-fold cross validation and 100 random permutations for tuning Then, 10-fold cross validation splitting was performed on a full experimental dataset (54 samples) as an outer loop. Then, each of all remaining nine inner folds were bootstrapped five times. So, we can validate feature subsets on 10 independent datasets. Feature significance was evaluated and selected by three statistical criteria: adjusted p-value from moderated t-test < 0.05, fold change value > 1, and VIP value from PLS model is greater than 1. Then we intersected the results of rational cut-off between resampled data within each bootstrapped inner fold, and, finally, between the overall outer loop. Before final intersection, each signal's list was validated using a GLM with outer fold for test purposes. The Limma [37] R package was used for adjusted p-value and fold change calculations with following parameters: least squares model as fitting method and Benjamini and Hochberg method for p-value adjustment. The PLS model was constructed using ropls package [57] with 10-fold cross validation and 100 random permutations for tuning hyperparameters and model evaluation, respectively. GLM for final validation in the outer loop was constructed using the caret [40] package. More detailed information about computations is provided on Github repository.
Before the identification step, isotopic and adduct peaks were removed from the obtained list by manual examination of mass-spectra. For identification of the in-source fragmentation signals, MS 2 spectra were obtained for each signal as described in the next part (Section 3.7). If peaks were presented in both scan and fragmentation spectra from the ion with greater m/z value with the same retention time, this was considered as a molecular ion, but the signal with smaller m/z is assumed as an in-source fragmentation product.

Conclusions
Untargeted LC-MS-based studies, especially those dedicated to plants, are always related to various disadvantages due to specifics of the used apparatus and complexity of the object. In the present work, we demonstrated the ways to discard them using the example of a highly important species for the food industry-Humulus lupulus. Critical steps in LC-MS data preprocessing were fully described. The approach of combining HCA-based classification and nested feature selection was used for metabolome-based differentiation of brewing cultivars and group-markers compounds search. Obtained labels showed a low correlation with the cultivars' region of growth but partially agreed with genetic and "usage type" classifications. With the implementation of nested feature selection, the selection bias effect was minimized and an informative feature list was obtained, without annotation-based pre-filtration procedure. Nine compounds that were found to have maximum discriminative power, and multi-step identification including database search and molecular networks construction was performed. Four of them were identified as humulone and its derivatives. Furthermore, isoxanthohumol and structurally similar compounds were found. The markers discovered could be potentially helpful in the development of quality control approaches and further studies oriented to hops metabolome. The experimental design presented in this work could be easily transferred to other types of tasks and objects.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/metabo12100945/s1, Figure S1: TIC chromatograms of QC sample, Figure S2: XIC chromatograms of marker compounds m/z in QC sample, Figure S3: Part of obtained molecular network describing connectivity between fragmentation spectra of marker compounds, Figure S4: Dendrogram obtained after HCA of samples using only selected features, Figure S5: Plot obtained after PCA of samples using only selected features, Figure S6: Projection of study samples on principal components surface. Newly assigned groups are demonstrated by colour. Sample names description could be found in table S1, Figure S7: Fragmentation spectra of marker compounds. Number corresponds to Table 1 in the main text, Table S1: List of cultivars, analyzed in present study. Last column is dedicated to groups, to which cultivars were assigned by results of the study.

Author Contributions:
The manuscript was written with the contributions of all authors. Y.A.I.: conceptualization, methodology, software, investigation, writing. I.V.P.: software, investigation, writing. I.A.R.: supervision, project administration, funding acquisition. All authors have read and agreed to the published version of the manuscript.