WebSpecmine: A Website for Metabolomics Data Analysis and Mining

Metabolomics data analysis is an important task in biomedical research. The available tools do not provide a wide variety of methods and data types, nor ways to store and share the data and results generated. We have therefore developed WebSpecmine to overcome these limitations. WebSpecmine is a web-based application designed to analyse metabolomics data based on spectroscopic and chromatographic techniques (NMR, Infrared, UV-visible, Raman, and LC/GC-MS) and compound concentrations. Users, even those without programming skills, can access several analysis methods, including univariate, unsupervised and supervised multivariate statistical analysis, as well as metabolite identification and pathway analysis. They can also create accounts to store their data and results, either privately or publicly. The tool's implementation is based on the R project, including its Shiny web framework. WebSpecmine is freely available and supports all major browsers. We provide abundant documentation, including tutorials and a user guide with case studies.

: Illustration of how to import the public project Cassava PPD. (A) In the Public Projects page, accessed by clicking the tab of the same name in the sidebar panel, the user selects the Cassava PPD project in the table of projects and clicks the Import Project button. (B) In the user's personal space, accessed by clicking the My Projects tab in the sidebar panel, the user can confirm that the imported project has been copied.
To select the data for analysis (Supplementary Figure S2), the user clicks the Choose Files button. A pop-up window appears, in which the project and the respective data folder and metadata file can be selected. After that, data and metadata options can be set before finalizing the submission of the files for analysis. Once the dataset is loaded into the session, it is named OriginalData by default. All datasets created during the current session are stored.
Figure S2: Illustration of how to choose the correct files, once the Cassava PPD project is copied to the user account. After clicking the Choose Files button in the header panel, the user will see (A) a pop-up window where the project (Cassava PPD), the correct data folder from this project (IR Data (DX files)) and the metadata file (metadata_ir.csv) are chosen. After clicking Next, (B) the data and metadata options must be set to correctly read the files. After setting these options, as shown in the figure, the Submit For Analysis button can be clicked, so that the dataset is created and uploaded into the session.
Before starting the analysis, pre-processing was performed (Supplementary Figure S3). For this, the Pre-Processing page was accessed through the header panel. Smoothing interpolation using the Bin method, with a reducing factor of 10, was applied, followed by converting the numerical values of the PPD metadata variable (ppds) into factors, so that classification can be correctly performed later on. The new dataset was named cassava_processed. After clicking Finish, the new dataset is made available in the Dataset being used section. To visualize the effect of smoothing interpolation on the spectral data (Supplementary Figure S4), the OriginalData dataset can be selected (in the Dataset being used section) and the Data Visualization page accessed through the header panel to observe the spectra plot. Selecting the new dataset cassava_processed then shows the corresponding plot.
Figure S4: To visualize the effect of smoothing interpolation on the spectral data, the user can (A) select the Data Visualization button in the header panel after selecting the OriginalData dataset and select the Spectra Plot option at the left of the page. Then, (B) by selecting the cassava_processed dataset, the spectra plot shown will correspond to this dataset.
The authors started by conducting a Principal Components Analysis (PCA) on the dataset, with the intention of evaluating the most important biochemical events related to the deterioration changes and of discriminating cassava cultivars during PPD.
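The pre-processing described above (binning with a reducing factor of 10, then converting the numerical ppds values into factors) can be sketched outside the web interface as follows. This is only an illustration under stated assumptions: it assumes that binning averages each group of 10 consecutive points, which may differ from what the Bin method in WebSpecmine does internally, and the function names and toy values are hypothetical.

```python
from statistics import mean


def bin_spectrum(intensities, reducing_factor=10):
    """Average consecutive groups of `reducing_factor` points.

    Assumption: bins are averaged; the last bin may hold fewer points.
    The wavenumber axis would be binned in the same way.
    """
    return [
        mean(intensities[i:i + reducing_factor])
        for i in range(0, len(intensities), reducing_factor)
    ]


def to_factor(values):
    """Turn numeric metadata into categorical labels plus sorted levels."""
    levels = [str(v) for v in sorted(set(values))]
    labels = [str(v) for v in values]
    return labels, levels


# Hypothetical toy spectrum with 25 intensity values.
spectrum = [float(i) for i in range(25)]
binned = bin_spectrum(spectrum, reducing_factor=10)

# Hypothetical ppds values; the real metadata file is not reproduced here.
labels, levels = to_factor([0, 3, 5, 3, 0])
```

Treating ppds as categorical labels rather than numbers is what lets a later classification step predict discrete deterioration stages instead of fitting a regression.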
As the authors only used the data in the spectral range of 3000 to 600 cm⁻¹, with the cassava_processed dataset selected, we went back to the Pre-Processing page and created a subset of the dataset by selecting the interval of values we wanted to retain (from 3000 to 600 cm⁻¹), as in Supplementary Figure S6. The new dataset was named cassavaProc_600_3000. With the cassavaProc_600_3000 dataset selected, PCA was performed by entering the Principal Components Analysis (PCA) box in the Run Analysis page, accessed through the header panel. The dataset was centred prior to PCA, as set in the options for this method (Supplementary Figure S7). The analysis was named PCA_3000_600. After being redirected to the results page of this PCA, all available numerical results may be checked and personalized plots may be generated (Supplementary Figure S7). A two-dimensional plot of the first two components was generated by accessing the Scores Plot 2D tab in the Make plots section and colouring the data points by the PPD metadata values.
Figure S7: Illustration of how to perform the PCA and obtain the two-dimensional scores plot from the first and second PCs. After selecting the cassavaProc_600_3000 dataset, the user clicks the button Run Analysis at the header panel. In the Run Analysis page, the Principal Components Analysis (PCA) box should be selected. After that, (A) PCA is performed, by ticking the option to centre the variables prior to the analysis and giving a name to the analysis. Once the user clicks Submit, the analysis starts and, when finished, (B) the site redirects to the results page. Here, (C) the user can go to the Make plots section, select the Score Plot 2D tab and set the option to personalize the results plot. Then, the button Plot should be clicked, so that the plot is created and (D) shown in the Visualize Plots section.
Some interesting conclusions can be deduced (Supplementary Figure S8). For example, there was a clear separation between the varieties BRA and ORI, which are the PPD-susceptible and PPD-tolerant genotypes, respectively. On the other hand, the first component alone explains approximately 95% of the total variance.
Figure S8: Scores plots of the first and second components of PCA on the spectral region from 3000 to 600 cm⁻¹, obtained by WebSpecmine.
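To make the centring option and the "variance explained by the first component" figure more concrete, the following is a minimal sketch of PCA on centred data. It is not the WebSpecmine implementation: it uses plain power iteration on the covariance matrix to find only the first component, and the toy matrix and all function names are hypothetical.

```python
import math


def centre_columns(rows):
    """Subtract each column (variable) mean, as when centring before PCA."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centred):
    """Sample covariance matrix of already-centred data."""
    n = len(centred)
    p = len(centred[0])
    return [
        [sum(row[i] * row[j] for row in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_component(cov, iterations=200):
    """Power iteration: unit-length first loading vector and its eigenvalue."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue


# Hypothetical toy matrix: 4 samples (rows) x 3 spectral variables (columns).
toy_spectra = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]

centred = centre_columns(toy_spectra)
cov = covariance_matrix(centred)
loadings, eigenvalue = leading_component(cov)

# Fraction of total variance captured by the first component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance

# PC1 scores: projection of each centred sample onto the loading vector.
pc1_scores = [sum(x * l for x, l in zip(row, loadings)) for row in centred]
```

The explained-variance ratio is the first eigenvalue divided by the sum of the variances of all variables, which is the quantity reported above for the first component.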
We further separated the dataset into three regions, as mentioned in the original article, typical of carbohydrates (1200-900 cm⁻¹), proteins (1680-1000 cm⁻¹), and lipids (3000-1700 cm⁻¹), to see if any of these regions could discriminate the cultivars according to their biochemical discrepancies over the PPD. For this, the same PCA procedure was repeated three times, after creating three new datasets by subsetting cassavaProc_600_3000 to each of the three spectral regions mentioned. These datasets were named cassavaProc_carbohyd, cassavaProc_proteins and cassavaProc_lipids, respectively, and the corresponding analyses were named PCA_carbohydrates, PCA_proteins and PCA_lipids.
In fact, we also observed that the carbohydrate and protein regions were the best at performing this screening.
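The region subsetting described above amounts to keeping only the spectral variables whose wavenumber falls inside a given interval. A minimal sketch follows, assuming inclusive interval bounds (an assumption, not something stated for the Pre-Processing page) and using a hypothetical toy wavenumber axis. The dataset names and region limits are the ones given in the text.

```python
# Region limits in cm^-1, as listed in the text (low, high).
REGIONS = {
    "cassavaProc_carbohyd": (900.0, 1200.0),
    "cassavaProc_proteins": (1000.0, 1680.0),
    "cassavaProc_lipids": (1700.0, 3000.0),
}


def subset_by_range(wavenumbers, spectra, low, high):
    """Keep the columns whose wavenumber lies in [low, high] (inclusive)."""
    keep = [i for i, w in enumerate(wavenumbers) if low <= w <= high]
    kept_axis = [wavenumbers[i] for i in keep]
    kept_spectra = [[row[i] for i in keep] for row in spectra]
    return kept_axis, kept_spectra


# Hypothetical toy axis and two toy spectra (one intensity per wavenumber).
wavenumbers = [600.0, 900.0, 1100.0, 1500.0, 1700.0, 2500.0, 3000.0, 3500.0]
spectra = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8],
]

subsets = {
    name: subset_by_range(wavenumbers, spectra, low, high)
    for name, (low, high) in REGIONS.items()
}
```

Note that the carbohydrate and protein intervals overlap between 1000 and 1200 cm⁻¹, so some variables belong to both subsets.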
In the original study, the authors trained several classification models, including SVMs and Decision Trees, to test the ability to discriminate deterioration stages (PPD) across different groups of cultivars.
After re-selecting the cassavaProc_600_3000 dataset, we entered the Machine Learning box in the Run Analysis page. Parameter optimization was performed by testing 10 different values for each model's parameters, with 10-fold cross-validation and accuracy as the performance metric (Supplementary Figure S9). The analysis was named models_PPDS. The main conclusion drawn by the authors from this analysis was that the SVM model showed the best performance at clearly separating the samples from different deterioration stages (PPD) across the different cultivars. The same conclusion could be drawn from our results.
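The evaluation scheme mentioned above, 10-fold cross-validation scored by accuracy, can be sketched as follows. This is a hedged illustration only: a nearest-centroid classifier stands in for the SVM and Decision Tree models (it is not what the authors or WebSpecmine use), parameter optimization is left out, and the data and function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each sample to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return [
        min(centroids, key=lambda lab: squared_distance(x, centroids[lab]))
        for x in test_x
    ]


def cross_validated_accuracy(x, y, k=10, seed=0):
    """Mean accuracy over k folds, each fold held out once for testing."""
    accuracies = []
    for fold in k_fold_indices(len(x), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        predictions = nearest_centroid_predict(
            [x[i] for i in train], [y[i] for i in train], [x[i] for i in fold]
        )
        correct = sum(p == y[i] for p, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: two well-separated classes of 10 samples each.
toy_x = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 0.0] for i in range(10)]
toy_y = ["a"] * 10 + ["b"] * 10

folds = k_fold_indices(len(toy_x), k=10)
accuracy = cross_validated_accuracy(toy_x, toy_y, k=10)
```

Each sample is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during training.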
A hierarchical clustering on samples was performed next, with the objective of assessing the similarity among samples. The authors concluded that four clear clusters emerged, although not clearly separated by either cultivar or PPD. The same conclusion could be drawn from our analysis (Supplementary Figure S11). Hierarchical Clustering on samples was performed by entering the Clustering Analysis box in the Run Analysis page. The cassavaProc_600_3000 dataset was selected to perform this analysis. As mentioned by the authors, the hierarchical clustering was performed with Euclidean distance and the complete aggregation method (Supplementary Figure S12). The analysis was named HC_600_3000. After being redirected to the results of Clustering Analysis, the numerical results can be checked and the user can personalize a dendrogram plot (Supplementary Figure S12).
Figure S12: Illustration of how to perform the Hierarchical Clustering and obtain the dendrogram plot of the results. After selecting the cassavaProc_600_3000 dataset, the user clicks the button Run Analysis at the header panel. Once in the Run Analysis page, the Clustering Analysis box should be selected. After that, (A) Hierarchical Clustering is performed, by setting the desired options and naming the analysis. Once the user clicks Submit, the analysis starts and, when finished, (B) the site redirects to the results page. Here, the user can go to the Dendrogram section and personalize the plot. In this example, the samples in the plot were represented by their respective names and coloured by the deterioration stage (PPD).
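The clustering settings used above (Euclidean distance with the complete aggregation method) can be illustrated with a small agglomerative sketch. It assumes the usual definition of complete linkage, in which the distance between two clusters is the largest pairwise distance between their members. It stops at a requested number of clusters instead of building a full dendrogram, and the toy points and function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Cluster distance = maximum pairwise distance (complete linkage).
    Returns clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                distance = max(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical toy samples forming four tight pairs.
toy_points = [
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [0.0, 10.0], [1.0, 10.0],
    [10.0, 10.0], [10.0, 11.0],
]

groups = complete_linkage_clusters(toy_points, n_clusters=4)
```

Cutting the hierarchy at four clusters mirrors the number of groups reported above, although the toy data here bear no relation to the cassava spectra.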