This report was created with MeTaQuaC v0.1.30.
The data as imported, restructured and preprocessed in this report is available for export in the section Preprocessing (via the full data tables provided for each preprocessing step).
Before reading and interpreting this QC report, please make sure you are familiar with the Biocrates kit used, i.e. familiarize yourself with the compounds, sample types, status values, terminology, analytical specification, etc. Please refer to the Biocrates manuals and documents provided with the kit used.
Import of Biocrates AbsoluteIDQ p400 HR Kit data of measurement type LC.
Data files to import:
Importing batch Batch1
… Guessed encoding UTF-8.
The imported data was restructured as follows:
Number of samples in total: 26
List of all samples:
Number of samples per batch:
Number of compounds in total: 42
List of all compounds:
Number of compounds per batch:
This study variables profile visualizes group sizes within study variables as well as intersected with other study variables, if more than one is indicated. Set Size = Number of samples per group in variable (color-coded). Intersection Size = Number of samples with the same combination of groups. The following variable(s) are used: Type.
Status profiles visualize the occurrence of the different possible measurement statuses. Profiles are generated over all measurements, for different sample types (in percentages, since the number of samples varies per type), for different samples, for different compound classes (in percentages, since the number of compounds varies per class), as well as for different compounds.
The following preprocessing procedures are applied on the data in this order:
The preprocessing statuses and filter thresholds can be modified with the following parameters of the create_report function (refer to the readme or function description for details):
preproc_keep_status
filter_compound_qc_ref_max_mv_ratio
filter_compound_qc_ref_max_rsd
filter_compound_qc_pool_max_mv_ratio
filter_compound_qc_pool_max_rsd
filter_compound_bs_max_mv_ratio
filter_compound_bs_min_rsd
filter_sample_max_mv_ratio
The result of each preprocessing or filter step can be found in the tab with the corresponding enumerator. Please note, as each tab contains a table with the complete dataset in the corresponding filtered form, loading the tab for the first time might take a couple of seconds.
Note: The beginning of each following section will indicate on which dataset calculations and visualizations are based. Thereby, the preprocessed datasets will be referenced as 1, 2a, 2b, 3a, 3b and 4, resp.
The status preprocessing discards measurements based on quality statuses.
By default, only “Valid” and “Semi Quant.” measurements are considered as reliable and are kept for further analysis. Hence, all other statuses including “< LLOQ” and “> ULOQ” (out of quantification range / extrapolated calibration), “STD/QC < Limit” and “STD/QC > Limit” (insufficient accuracy in STDs and QCs), “ISTD Out of Range” (internal standard too high or low) and other statuses are usually considered as unreliable and measurements are transformed to missing values (set to NA).
Based on the parameter preproc_keep_status, measurements with the following statuses will be retained: Valid
Missing Concentration values before status preprocessing: 270 of 1092 (25%)
Missing Concentration values after status preprocessing: 320 of 1092 (29%)
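The status preprocessing described above can be sketched in Python as follows. This is a minimal illustration, not MeTaQuaC's implementation: the data layout (pairs of concentration and status) and the function names apply_status_preprocessing and missing_fraction are hypothetical. Only the idea of turning measurements with non-retained statuses into missing values is taken from this report.

```python
# Statuses to retain, mirroring the value of preproc_keep_status in this report.
KEEP_STATUS = {"Valid"}

def apply_status_preprocessing(measurements, keep_status=KEEP_STATUS):
    """Return concentrations, with values of non-retained statuses set to None."""
    cleaned = []
    for concentration, status in measurements:
        if concentration is not None and status in keep_status:
            cleaned.append(concentration)
        else:
            cleaned.append(None)  # treated as a missing value (NA)
    return cleaned

def missing_fraction(values):
    """Fraction of missing (None) entries in a list of values."""
    return sum(v is None for v in values) / len(values)

# Hypothetical measurements of one compound across four samples.
example = [(1.2, "Valid"), (0.4, "< LLOQ"), (None, "Valid"), (2.5, "Semi Quant.")]
cleaned = apply_status_preprocessing(example)
```

With only "Valid" retained, the "Semi Quant." measurement is also turned into a missing value, which is why the share of missing values can rise after this step.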
To alter the threshold, use the parameter filter_compound_qc_ref_max_mv_ratio.
Number of compounds before: 42 Number of compounds left: 41
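A missing value ratio filter of this kind might look like the following Python sketch. The rule that a compound is removed when its missing value ratio in the reference QC samples exceeds the threshold is an assumption inferred from the parameter name; the threshold 0.3 is the value of filter_compound_qc_ref_max_mv_ratio listed in this report. The function names and the example data are hypothetical.

```python
def missing_value_ratio(values):
    """Ratio of missing (None) values among the given measurements."""
    return sum(v is None for v in values) / len(values)

def filter_compounds_by_mv_ratio(data, max_mv_ratio=0.3):
    """Keep compounds whose missing value ratio does not exceed the threshold.

    `data` maps a compound name to its measurements in the reference QC samples.
    Whether the real filter uses "<=" or "<" is an assumption of this sketch.
    """
    return {
        compound: values
        for compound, values in data.items()
        if missing_value_ratio(values) <= max_mv_ratio
    }

# Hypothetical reference QC measurements for two compounds.
qc_data = {
    "compound_a": [1.0, 1.1, None, 0.9],   # 25% missing, kept
    "compound_b": [None, None, 2.0, 2.1],  # 50% missing, removed
}
kept = filter_compounds_by_mv_ratio(qc_data)
```

The same pattern would apply to the other missing value ratio thresholds by swapping the sample subset and the threshold value.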
To alter the threshold, use the parameter filter_compound_qc_ref_max_rsd.
Number of compounds before: 41 Number of compounds left: 40
To alter the threshold, use the parameter filter_compound_qc_pool_max_mv_ratio.
Number of compounds before: 40 Number of compounds left: 40
To alter the threshold, use the parameter filter_compound_qc_pool_max_rsd.
Number of compounds before: 40 Number of compounds left: 40
Number of compounds before: 40 Number of compounds left: 30
To alter the threshold, use the parameter filter_compound_bs_max_mv_ratio.
Number of compounds before: 30 Number of compounds left: 27
To alter the threshold, use the parameter filter_compound_bs_min_rsd.
Number of compounds before: 27 Number of compounds left: 27
To alter the threshold, use the parameter filter_sample_max_mv_ratio.
Number of all samples before: 26 Number of all samples left: 22
Number of biological samples before: 10 Number of biological samples left: 10
Missing value counts, either by samples or by compounds, based on unprocessed data (i.e. no values have been turned into missing values based on status and nothing is filtered) as well as the full unprocessed dataset are available in the tables below.
The following QC analyses are based on the status-preprocessed dataset 1, unless explicitly stated otherwise.
Missing values are counted per sample, i.e. how many compounds were not measured in the sample at all or reliably enough (depending on status preprocessing). First, missing value counts are visualized depending on the total concentration measured in a sample (with Well Positions indicated for a few data points in low density areas).
Furthermore, missing value counts are summarized by histograms (primarily) including visual separation and comparison of relevant sample types, batches (if applicable) or study variable groups (if indicated).
Missing values are counted per compound, i.e. how many samples don’t feature a reliable measurement of the compound (depending on status preprocessing). Counts are calculated and visualized for different compound classes as well as for different sample subsets. I.e., for the same compound, missing values are separately counted within different sample types or within different study variable groups (if indicated).
Bar plots per study variable visualize differences in the occurrence of missing values in different study groups. # Missing Values counts are used as a reference to show whether a compound is missing just a bit or a lot. They should not be used for comparison of groups due to possible differences in group sizes. In contrast, % Missing Values shows the percentage of missing values per group; these percentages are further added up and normalized to one as Normalized % Missing Values. This makes it possible to identify compounds which are missing considerably more often in certain groups (if the missing value count is high). Only compounds with at least one and less than 100% missing values (overall) are shown.
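The relation between % Missing Values and Normalized % Missing Values can be illustrated with a short Python sketch. The group names and values are hypothetical, and the function names are not part of MeTaQuaC; the sketch only follows the description above (percentages per group, added up and normalized to one).

```python
def percent_missing(values):
    """Percentage of missing (None) values in one group."""
    return 100.0 * sum(v is None for v in values) / len(values)

def normalized_percent_missing(groups):
    """Normalize the per-group percentages of one compound so they sum to one.

    `groups` maps a group name to the measurements of the compound in that group.
    """
    percentages = {g: percent_missing(v) for g, v in groups.items()}
    total = sum(percentages.values())
    if total == 0:
        # No missing values in any group: nothing to distribute.
        return {g: 0.0 for g in percentages}
    return {g: p / total for g, p in percentages.items()}

# Hypothetical measurements of one compound in two study groups.
groups = {
    "control": [1.0, None, 1.2, 1.1],    # 25% missing
    "patient": [None, None, None, 0.9],  # 75% missing
}
normalized = normalized_percent_missing(groups)
```

Because the normalization works on percentages rather than raw counts, unequal group sizes do not distort the comparison.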
Progression of sample summary statistics (totals for concentration/area/intensity and missing values) is visualized over the acquisition sequence. This allows a quick overview of variability between replicate samples as well as between batches, while the parallel view on the different data types (concentration, area, ISTD area, etc.) illustrates the effect of calibration/normalization.
After calibration (concentration), replicate samples (such as reference and pooled QCs) should feature a preferably horizontal regression line (in particular for primarily multi-point calibrated data). Furthermore, regression lines of different batches should align reasonably to justify subsequent joint analysis. In comparison, uncalibrated and unnormalized measurements (area and intensity) may feature incomparable fluctuation and batch drifts.
Linear models are calculated to assess the horizontal behavior of measurements (total concentration per sample) within specific sample types. In particular, in a perfect batch with no deviation the resulting regression lines should be horizontal. Statistical analysis is applied to highlight potential deviation of the regression line fitted to the experimental data from the horizontal.
For single batches, the simple linear model Total Concentration ~ Sequence Position is checked for unwanted batch drifts inferred from significant slopes. Significant slopes may indicate not only batch drifts but also general technical variability not sufficiently resolved by internal standard normalization and external standard calibration.
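The single-batch check can be sketched as an ordinary least squares fit of total concentration against sequence position. The following Python example is a simplified illustration with hypothetical data and a hypothetical function name; it computes the slope and its standard error, while the p-value used in the report would additionally require a t-distribution, which is not part of the Python standard library.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, standard_error_of_slope); needs more than 2 points.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted coefficients).
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return intercept, slope, slope_se

# Hypothetical totals of replicate QC samples along the acquisition sequence.
sequence_position = [1, 5, 9, 13, 17]
total_concentration = [100.0, 102.0, 99.0, 101.0, 98.0]

intercept, slope, slope_se = fit_line(sequence_position, total_concentration)
t_statistic = slope / slope_se  # compared against a t-distribution with n - 2 df
```

A slope close to zero relative to its standard error corresponds to the horizontal regression line expected for a batch without drift.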
For batch comparison, the model Total Concentration ~ Batch + Sequence Position / Batch is used to infer unwanted significant differences in intercepts and slopes between batches (with one batch randomly selected as reference). Furthermore, a Kruskal-Wallis rank sum test is used to infer unwanted general differences in batch distributions.
Note: Use the automated rating for guidance only and in addition judge manually (e.g. by visual inspection of the totals overview) whether total concentration fits behave well enough in comparison to area or intensity (i.e. unnormalized and uncalibrated data). The rating remains rather subjective for now, as it will require a great number of experiments and extensive validation to assess which model coefficients and p-values are acceptable to support reliable analysis.
No batch with undesirable Concentration [ng/ml] slope for samples of type QC Level 2.
No batch with undesirable Concentration [ng/ml] slope for samples of type Sample.
Sample summary statistics (missing values and total concentration/area/intensity) are compared with respect to well plate position. This makes it possible to identify unexpected position-based issues (e.g. edge effects), as might be indicated by a cluster of aberrant dots, or to spot outlier samples. In general, samples of the same type should feature similar dot sizes (with minor variations in biological samples). Sample areas and intensities will vary more than concentrations (as the latter are calibrated).
Violin and box plots summarize compound measurements per sample and are grouped and overlaid per sample type, if applicable. This allows a quick overview of variability between replicate samples such as reference and pooled QCs as well as biological samples, while the parallel view on the different data types illustrates the effect of calibration/normalization (i.e. replicate sample concentration profiles should line up rather well).
Violin and box plots summarize compound measurements per replicate sample, grouped per sample type and ordered according to acquisition sequence. This allows a quick overview of variability progression of replicate samples such as Reference or Pooled QC samples over the batch (e.g. to indicate batch drift), while the parallel view on the different data types illustrates the effect of calibration/normalization (i.e. replicate sample concentration profiles should line up rather well).
The plot below shows the standard deviation of each individual metabolite against its average concentration, separately for QC and biological samples (i.e. technical vs biological variability).
A linear regression analysis is used to attempt to separate the technical and biological variability. The outputs of the models are used to score and assess this separation, and can be used as an indicator of data quality.
Warning: Technical and biological variability can be separated; however, the relation between the two is not fully conclusive.
Please see the details and the plot below for more information.
An integrative linear model is used to assess differences in technical and biological variability: log(SD) ~ Group + log(Mean)/Group, with Group being a factor indicating QC or biological samples.
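In R formula notation, log(Mean)/Group expands to log(Mean) + log(Mean):Group, so the model allows a separate intercept and slope per group. The coefficient estimates are therefore the same as when one line is fitted per group, which the following Python sketch uses for illustration. The data, the function name and the sign convention of the differences (biological minus QC) are assumptions of this sketch; it also does not reproduce the p-values of the joint model.

```python
import math

def fit_loglog_line(means, sds):
    """Least squares fit of log(SD) = intercept + slope * log(Mean)."""
    x = [math.log(m) for m in means]
    y = [math.log(s) for s in sds]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical per-compound (mean, SD) summaries.
qc_means, qc_sds = [1.0, 10.0, 100.0], [0.05, 0.5, 5.0]    # 5% RSD throughout
bio_means, bio_sds = [1.0, 10.0, 100.0], [0.2, 2.0, 20.0]  # 20% RSD throughout

qc_intercept, qc_slope = fit_loglog_line(qc_means, qc_sds)
bio_intercept, bio_slope = fit_loglog_line(bio_means, bio_sds)

# Differences between groups (assumed convention: biological minus QC).
delta_intercept = bio_intercept - qc_intercept
delta_slope = bio_slope - qc_slope
```

In this constructed example both lines have a slope of one, and the biological line sits above the QC line by log(4), i.e. the biological SD is four times the technical SD at every concentration.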
Model results are combined to identify four important main cases of differences in technical and biological variability:
k >= 7: A clear biological variability on top of the technical one.
k = 2:6 (i.e. 2 to 6): Inconclusive differences in biological and technical variability.
k = 0:1 (i.e. 0 or 1): A strong technical variability which hides the biological one.
k < 0: Something clearly wrong (when variability in QC decreases with increasing metabolite concentration).
The final score k is computed as follows to allow the classification of the four main cases:
k = 0
k = k + (r_squared > 0.8)
k = k + 2*(delta_intercept > 0 & delta_intercept_pvalue < 0.05)
k = k + 4*(delta_slope > 0)
k = k - 8*(qc_slope < 0)
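The scoring rules and the four cases listed above translate directly into a small function. The Python sketch below follows the formulas as given in this report; the function names and the example inputs are hypothetical.

```python
def variability_score(r_squared, delta_intercept, delta_intercept_pvalue,
                      delta_slope, qc_slope):
    """Compute the score k from the model results, following the listed rules."""
    k = 0
    k += int(r_squared > 0.8)
    k += 2 * int(delta_intercept > 0 and delta_intercept_pvalue < 0.05)
    k += 4 * int(delta_slope > 0)
    k -= 8 * int(qc_slope < 0)
    return k

def classify(k):
    """Map the score k to one of the four main cases."""
    if k >= 7:
        return "clear biological variability on top of the technical one"
    if k >= 2:
        return "inconclusive differences in biological and technical variability"
    if k >= 0:
        return "strong technical variability which hides the biological one"
    return "something clearly wrong"

# Hypothetical model results: good fit, significant positive intercept
# difference, but no positive slope difference.
k = variability_score(r_squared=0.9, delta_intercept=0.5,
                      delta_intercept_pvalue=0.01, delta_slope=-0.1, qc_slope=0.9)
case = classify(k)
```

Since the positive contributions add up to at most 7 and a negative QC slope subtracts 8, any negative QC slope always results in k < 0, regardless of the other terms.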
The results are as follows:
For visual examination, SD as well as %RSDs are plotted against the mean per compound not only separated by sample type but further separated by groups in study variables of the biological samples.
Relative standard deviations per compound or compound class are calculated and visualized for different sample groupings, either by sample type or for biological replicates (i.e. samples from the same group within variables indicated with parameter replicate_variables) as well as for all batches or each batch separately.
This analysis is based on the status-preprocessed dataset 1.
Red RSD% threshold lines in bar plots are currently only based on parameter filter_compound_qc_pool_max_rsd.
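For reference, the relative standard deviation of a compound within one sample group can be computed as in the following Python sketch. The use of the sample standard deviation (n - 1 in the denominator), the skipping of missing values and the function name are assumptions of this illustration; the threshold of 20 is the value of filter_compound_qc_ref_max_rsd in this report and is used here only as an example.

```python
import statistics

def rsd_percent(values):
    """Relative standard deviation in percent, ignoring missing (None) values.

    Returns None if fewer than two values are observed or the mean is zero.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return None
    mean = statistics.mean(observed)
    if mean == 0:
        return None
    return 100.0 * statistics.stdev(observed) / mean

# Hypothetical concentrations of one compound in three replicate QC samples.
rsd = rsd_percent([9.0, 10.0, 11.0])
exceeds_threshold = rsd > 20  # example threshold, see lead-in
```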
Calibration scatter plots are reconstructed to enable an impression of calibration performance and range placement (in particular for 7-point calibrated compounds). To properly reflect the internal standard calibration method, areas and intensities are normalized by the ISTD areas and intensities, resp.
The following visualization is based on the status-preprocessed dataset 2b.
The following multivariate analyses are based on the preprocessed dataset 4, unless explicitly stated otherwise.
Heatmaps show centered and scaled log10 target values (i.e. concentration, intensity or area) by compounds and samples. Sample labels may be colored by additional variables (e.g. Batch or Sample.Type). Compounds as well as samples are hierarchically clustered, using either Pearson correlation as the distance metric or the Euclidean distance (aka L2 norm).
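The transformation and the correlation-based distance mentioned above can be sketched in Python as follows. Defining the distance as one minus the Pearson correlation is an assumption of this sketch, as are the function names and the example values; inputs are assumed to be positive, non-constant and free of missing values.

```python
import math
import statistics

def center_and_scale_log10(values):
    """Log10-transform values, then center to mean 0 and scale to SD 1."""
    logs = [math.log10(v) for v in values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return [(v - mean) / sd for v in logs]

def pearson_distance(a, b):
    """Distance defined as 1 - Pearson correlation (assumed convention)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                     sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / norm

# Hypothetical concentrations of one compound across three samples.
scaled = center_and_scale_log10([1.0, 10.0, 100.0])

# Perfectly correlated profiles have distance 0, anti-correlated ones distance 2.
same_shape = pearson_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
opposite = pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```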
Median imputation is applied for the remaining visualizations on biological, reference and pooled QC samples (separately), as the methods are not suited to handle missing values. Samples which still contain missing values after imputation are completely removed.
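A minimal Python sketch of this imputation step is given below. It assumes that the median is taken per compound across the samples of one sample type; this detail, the data layout and the function name are assumptions and not confirmed by the report.

```python
import statistics

def median_impute(samples):
    """Impute missing values per compound and drop samples that stay incomplete.

    `samples` maps a sample name to a list of values, one per compound, where
    None marks a missing value. All samples are assumed to share one sample type.
    """
    if not samples:
        return {}
    n_compounds = len(next(iter(samples.values())))
    medians = []
    for j in range(n_compounds):
        observed = [values[j] for values in samples.values() if values[j] is not None]
        # A compound without any observed value cannot be imputed.
        medians.append(statistics.median(observed) if observed else None)
    imputed = {
        name: [v if v is not None else medians[j] for j, v in enumerate(values)]
        for name, values in samples.items()
    }
    # Remove samples which still contain missing values after imputation.
    return {name: values for name, values in imputed.items() if None not in values}

# Hypothetical pooled QC samples with two compounds each.
qc_samples = {"qc_1": [1.0, 4.0], "qc_2": [None, 5.0], "qc_3": [3.0, 6.0]}
imputed_qc = median_impute(qc_samples)
```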
Scatter plot matrices illustrate the correlation between samples based on compound concentration.
In the case of high sample numbers (> 9), several matrices will be generated based on sample groups of size up to 9 with one overlapping sample between each group (i.e. the last sample in a matrix is the first sample in the next matrix).
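The grouping rule described above (groups of up to 9 samples, with the last sample of one group repeated as the first sample of the next) can be sketched in Python as follows. The function name and the sample labels are hypothetical.

```python
def overlapping_groups(samples, size=9):
    """Split samples into groups of up to `size`, overlapping by one sample."""
    if size < 2:
        raise ValueError("size must be at least 2 to allow an overlap")
    if not samples:
        return []
    groups = []
    start = 0
    while True:
        groups.append(samples[start:start + size])
        if start + size >= len(samples):
            break  # the current group already reaches the last sample
        start += size - 1  # step back by one so groups share a sample
    return groups

# Hypothetical example with ten samples.
samples = [f"S{i}" for i in range(1, 11)]
groups = overlapping_groups(samples)
```

With ten samples this yields two matrices: one with S1 to S9 and one with S9 and S10.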
This analysis is based on the imputed dataset 4.
Principal component analysis (PCA) is performed to emphasize relationships between sample types, batches and potential study groups based on compound concentrations.
This analysis is based on the imputed dataset 4.
The following figures illustrate reproducibility by comparing compound measures (e.g. concentration) within and between groups such as different sample types and conditions in study variables (as indicated with parameter study_variables).
This analysis is based on status-preprocessed dataset 3a.
Simple metabolite concentration profiles overlapped per sample (point and line plot) or summarized (as box plots) are created to enable a brief impression of overall sample behavior consistency (e.g. to justify later normalization).
This analysis is based on the status-preprocessed dataset 1.
MeTaQuaC package version: 0.1.30
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: LC_COLLATE=English_Europe.1252, LC_CTYPE=English_Europe.1252, LC_MONETARY=English_Europe.1252, LC_NUMERIC=C and LC_TIME=English_Europe.1252
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: metaquac(v.0.1.30)
loaded via a namespace (and not attached): gtools(v.3.8.2), tinytex(v.0.27), tidyselect(v.1.1.0), xfun(v.0.19), purrr(v.0.3.4), pander(v.0.6.3), lattice(v.0.20-41), splines(v.4.0.3), colorspace(v.2.0-0), vctrs(v.0.3.5), generics(v.0.1.0), htmltools(v.0.5.0), viridisLite(v.0.3.0), yaml(v.2.2.1), mgcv(v.1.8-33), rlang(v.0.4.9), R.oo(v.1.24.0), pillar(v.1.4.7), withr(v.2.3.0), glue(v.1.4.2), R.utils(v.2.10.1), ggfortify(v.0.4.11), lifecycle(v.0.2.0), plyr(v.1.8.6), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), R.methodsS3(v.1.8.1), caTools(v.1.18.0), htmlwidgets(v.1.5.2), evaluate(v.0.14), labeling(v.0.4.2), knitr(v.1.30), UpSetR(v.1.4.0), crosstalk(v.1.1.0.1), Rcpp(v.1.0.5), KernSmooth(v.2.23-17), readr(v.1.4.0), scales(v.1.1.1), ggpmisc(v.0.3.7), DT(v.0.16), jsonlite(v.1.7.1), farver(v.2.0.3), gplots(v.3.1.1), gridExtra(v.2.3), ggplot2(v.3.3.2), hms(v.0.5.3), digest(v.0.6.27), stringi(v.1.5.3), dplyr(v.1.0.2), ggrepel(v.0.8.2), grid(v.4.0.3), bitops(v.1.0-6), tools(v.4.0.3), magrittr(v.2.0.1), tibble(v.3.0.4), crayon(v.1.3.4), tidyr(v.1.1.2), pkgconfig(v.2.0.3), Matrix(v.1.2-18), ellipsis(v.0.3.1), MASS(v.7.3-53), assertthat(v.0.2.1), rmarkdown(v.2.5), rstudioapi(v.0.13), viridis(v.0.5.1), R6(v.2.5.0), nlme(v.3.1-149) and compiler(v.4.0.3)
additional_normalization: None
pqn_reference_type: Sample
sample_filter:
metadata_extraction:
zero2na: TRUE
ignore_errors: TRUE
data_files:
kit: Biocrates AbsoluteIDQ p400 HR Kit
measurement_type: LC
generic_data_types:
CONCENTRATION | AREA |
---|---|
Concentration | Area |
generic_index_first_compound:
title: Targeted Metabolomics of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome patients and controls. Biocrates p400 kit QC Report for LC injections. Fernandez-Guerra, Gonzalez-Ebsen, et al. 2021
author: Julie Courraud
pool_indicator:
profiling_variables: Type
study_variables:
replicate_variables: Type
preproc_keep_status: Valid
preproc_q500_urine_limits: FALSE
filter_compound_qc_ref_max_mv_ratio: 0.3
filter_compound_qc_ref_max_rsd: 20
filter_compound_qc_pool_max_mv_ratio:
filter_compound_qc_pool_max_rsd:
filter_compound_bs_max_mv_ratio: 0.3
filter_compound_bs_min_rsd:
filter_sample_max_mv_ratio: 0.8
data_tables: stats
metadata_import:
metadata_import_overlap: rename
metadata_name_mods_org:
metadata_name_mods_add:
metadata_value_mods:
lowcon_conditions:
lowcon_sd_outlier_removal: FALSE
lowcon_minimum_intensity: 20000