## 1. Introduction

Glycans have been recognized to contribute to the pathophysiology of every major disease [1]. To keep pace with the growing interest in understanding the involvement of glycans in biological processes at the molecular level, high-throughput platforms have been developed in recent years. These platforms allow glycans to be profiled in large-scale datasets and across a wide variety of biospecimens.

Similar to all other omics data types, glycomics samples need to be preprocessed prior to statistical analysis in order to minimize intrinsic, non-biological variation. This variation can arise, for example, from fluctuations in the instrument settings, sample preparation, or experimental conditions. The process that aims at reducing technical variation in the data is referred to as normalization. Different normalization procedures make substantially different assumptions regarding the nature of the non-biological variation, which, however, is unknown in most practical cases. Systematic comparisons of commonly implemented preprocessing strategies for various omics technologies have been published in recent years, including transcriptomics [2], proteomics [3], and metabolomics [4,5,6]. A study to guide the choice of normalization strategies for glycomics data has recently been published [7]; however, that study could not identify an optimal preprocessing strategy. Therefore, there is still no consensus on the appropriate normalization methods for glycomics data.

This need for a glycomics-specific evaluation is further supported by the observation that the de facto standard for large-scale glycomics data preprocessing is Total Area (TA) normalization [8], which expresses each glycan intensity in a sample as a percentage of the total. Following this transformation, the normalized intensities of a sample sum to one (or 100%) by definition, leading to the loss of one degree of freedom. The division of each value by the sum of all values in a sample is referred to as a closure operation, and the resulting dataset is known as a compositional dataset [9]. Notably, this type of normalization alters the structure of the covariance matrix, subsequently affecting any downstream correlation-based analysis (for details on this phenomenon, see the Methods Section). Compositional datasets are not unique to glycomics; they also occur widely in other fields, prominently in microbiome profiling [10], where percentages are used to describe the relative abundance of different microbial species. Regular multivariate methods are not appropriate for these types of data, and specific statistical techniques need to be employed [11,12,13,14,15]. Most of these techniques require the definition of new variables, typically ratios between the original compositional values [16,17,18]. This makes interpretation of the results in terms of the original quantities challenging [19,20].
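The effect of the closure operation on the covariance structure can be illustrated with a minimal sketch. The simulated data, the function name `total_area_normalize`, and all parameter values below are assumptions made for illustration and are not taken from the cohorts analyzed in this paper. Because every normalized sample sums to one, each row of the covariance matrix of the closed data sums to zero, so every variable must covary negatively with at least one other variable even when the raw intensities are independent.

```python
import numpy as np

def total_area_normalize(intensities):
    """Closure operation: divide each sample (row) by its total intensity."""
    intensities = np.asarray(intensities, dtype=float)
    return intensities / intensities.sum(axis=1, keepdims=True)

# Simulated raw intensities: 500 samples, 4 independent "glycans" (illustrative only).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(500, 4))

closed = total_area_normalize(raw)

# Covariance of the closed data: variances sit on the diagonal.
cov_closed = np.cov(closed, rowvar=False)

# Each row sums to zero because cov(x_i, sum_j x_j) = cov(x_i, 1) = 0.
row_sums = cov_closed.sum(axis=1)

# Off-diagonal part only: at least one entry per row is forced to be negative.
off_diagonal = cov_closed - np.diag(np.diag(cov_closed))

print(np.round(row_sums, 12))
print(np.round(np.corrcoef(closed, rowvar=False), 2))
```

The zero row sums hold for any dataset after closure, which is why correlations computed on TA-normalized data cannot be interpreted in the same way as correlations on unconstrained data.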

In order to be able to infer biological interactions from the analysis of large-scale glycomics data, the selection of a more suitable alternative to TA normalization is therefore necessary. Given the variety of possible preprocessing strategies available, we need to define a criterion to quantitatively evaluate the performance of each method to select the most appropriate normalization method.

Common evaluation schemes for the performance of preprocessing strategies are mostly based on two approaches: (1) minimizing the variation between technical replicates [21,22], and (2) maximizing the variation across groups [6]. Consistency across technical replicates is a desirable outcome, but on its own it is not sufficient to guarantee good data quality, and technical replicates might not always be available. The maximization of variation across groups, on the other hand, is a viable measure that provides insights into the recovery of true biological signals.

In this paper, we address the question of evaluating normalization strategies for glycomics data with a different, innovative approach. We assess the quality of a normalized dataset by its ability to reconstruct a biochemically correct pathway using statistical network inference. One popular approach for the inference of biological interactions is based on Gaussian Graphical Models (GGMs) [23]. GGMs depict correlating variables in the form of a network, where nodes represent the measurements (e.g., glycans) and edges represent their statistical associations. Specifically, GGMs quantify pairwise associations via partial correlations, an extension of regular Pearson correlations that accounts for the presence of confounding factors. Molecular measurements are generally highly correlated and thus contain a large number of correlations that are indirect and mediated by one or more other variables. Partial correlations remove these indirect correlations automatically. Due to this property, GGMs have been repeatedly shown to selectively identify single enzymatic steps in metabolic [24,25] and glycosylation pathways [26], hence providing a reliable data-driven approach to infer biochemical pathways.
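The difference between Pearson and partial correlations can be sketched on a simulated three-step chain A → B → C. This is an illustrative example only: the function name `partial_correlations`, the simulated data, and the plain inversion of the sample covariance matrix are assumptions, and the estimator actually used for the GGMs in this paper may differ (direct inversion requires more samples than variables).

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation of every variable pair given all other variables.

    Computed from the inverse covariance (precision) matrix P as
    -P_ij / sqrt(P_ii * P_jj). Assumes more samples than variables.
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Simulated chain A -> B -> C: A and C are linked only through B.
rng = np.random.default_rng(1)
n_samples = 5000
a = rng.normal(size=n_samples)
b = a + 0.5 * rng.normal(size=n_samples)
c = b + 0.5 * rng.normal(size=n_samples)
data = np.column_stack([a, b, c])

pearson = np.corrcoef(data, rowvar=False)
pcor = partial_correlations(data)

print("Pearson A-C:", round(pearson[0, 2], 3))  # high: indirect association
print("Partial A-C:", round(pcor[0, 2], 3))     # close to zero: no direct edge
print("Partial A-B:", round(pcor[0, 1], 3))     # clearly non-zero: direct edge
```

In this toy setting, only the direct steps A–B and B–C retain a substantial partial correlation, which mirrors the way a GGM is expected to highlight single enzymatic steps.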

In this paper, we exploit the ability of GGMs to reconstruct biochemical reactions to define a biological measure of normalization quality. The idea is to compare the GGMs inferred from data normalized with different approaches to the known biochemical pathway of glycan synthesis and evaluate the quality of each normalization according to how well the corresponding GGM retrieves known synthesis reactions (Figure 1). By computing the overlap between estimated GGM and glycosylation pathway, we rely on a biological measure of quality, as a higher overlap indicates data whose correlations are able to better reflect known biochemical interactions. Hence, the normalization that produces the highest overlap is defined as the best. Glycomics data provide an ideal test case to demonstrate the validity of this approach, as the known biochemical pathway of synthesis is well characterized.

We compared the performance of different variations of seven commonly implemented normalization methods on data from six cohorts across three different glycomics platforms, including measurements of the Fragment crystallizable (Fc) region of Immunoglobulin G (IgG), total IgG, or total plasma N-glycans. In order to assess how our approach compares to other common normalization evaluation strategies, we additionally investigated how the normalization methods affect the statistical associations of glycans with age.

## 3. Discussion

Several systematic evaluations of preprocessing methodologies have been recently published for different omics data types, but glycomics has received little attention so far in this regard. In order to address this gap, we developed an innovative approach to assess the quality of different normalization strategies applied to glycomics data. The main feature of our procedure lies in the definition of a biological measure of quality. More specifically, we quantify how well significant correlations in the data normalized with a given technique represent known biochemical reactions in the pathway of glycan synthesis. Our quantitative measure of choice for this evaluation was the p-value of a Fisher’s exact test, which allows for an intuitive interpretation of overlap between correlations and biochemical pathway.
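As a hedged illustration of this overlap measure, the sketch below builds a 2×2 contingency table over all glycan pairs (edge in the GGM or not, reaction in the pathway or not) and applies a one-sided Fisher's exact test. The helper names, the five-node toy pathway, and the two toy networks are hypothetical and serve only to show the mechanics; they do not reproduce the pathway model or the exact test configuration used in this study.

```python
import numpy as np
from scipy.stats import fisher_exact

def adjacency_from_edges(n_nodes, edges):
    """Symmetric boolean adjacency matrix from a list of (i, j) pairs."""
    adjacency = np.zeros((n_nodes, n_nodes), dtype=bool)
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = True
    return adjacency

def pathway_overlap_pvalue(ggm_adjacency, pathway_adjacency):
    """One-sided Fisher's exact test on GGM edges versus pathway edges."""
    upper = np.triu_indices_from(ggm_adjacency, k=1)  # count each pair once
    in_ggm = ggm_adjacency[upper]
    in_pathway = pathway_adjacency[upper]
    table = [
        [int(np.sum(in_ggm & in_pathway)), int(np.sum(in_ggm & ~in_pathway))],
        [int(np.sum(~in_ggm & in_pathway)), int(np.sum(~in_ggm & ~in_pathway))],
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    return table, p_value

# Toy pathway: a linear chain of five glycans (hypothetical).
pathway = adjacency_from_edges(5, [(0, 1), (1, 2), (2, 3), (3, 4)])

# Two hypothetical networks inferred after two different normalizations.
ggm_good = adjacency_from_edges(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
ggm_poor = adjacency_from_edges(5, [(0, 2), (1, 3), (0, 4), (1, 2)])

table_good, p_good = pathway_overlap_pvalue(ggm_good, pathway)
table_poor, p_poor = pathway_overlap_pvalue(ggm_poor, pathway)

print(table_good, p_good)
print(table_poor, p_poor)
```

A lower p-value indicates that the inferred edges coincide with pathway reactions more often than expected by chance, so normalizations can be ranked by this value.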

We performed a systematic analysis of 23 preprocessing strategies applied to six large-scale glycomics cohorts across three platforms, with measurements ranging from a single protein and single glycosylation site (LC-ESI-MS), to total plasma N-glycome (MALDI-FTICR-MS). The observed normalization ranking was consistent across platforms; overall, the Probabilistic Quotient appeared to be the most reliable method, as all variations of this procedure ranked consistently in the top performers in all cohorts and across platforms. Log-transformation and normalization per IgG subclass or per total IgG did not seem to significantly affect the ability of this method to correctly retrieve the glycan synthesis pathway. Interestingly, while Total Area normalization did not rank high in comparison to other methods (as expected), the log-transformed Total Area preprocessing was a well-performing method. In fact, TA Probabilistic Quotient was among the best performing approaches overall, suggesting that additional transformations on TA normalized data can neutralize the constraints imposed on the data correlation structure, as shown in Dieterle et al. [36].
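For orientation, a minimal sketch of Probabilistic Quotient normalization is given below, following the general procedure commonly attributed to Dieterle et al.: an optional Total Area step, a median reference profile, per-sample quotients against that reference, and division by the median quotient. The function name, the `total_area_first` flag, and the toy data are assumptions made for illustration; the variants evaluated in this paper (e.g., per IgG subclass, with or without log-transformation) are not reproduced here, and strictly positive intensities are assumed.

```python
import numpy as np

def probabilistic_quotient_normalize(intensities, total_area_first=True):
    """Probabilistic Quotient normalization of a samples x glycans matrix.

    Assumes strictly positive intensities (no zeros or missing values).
    """
    x = np.asarray(intensities, dtype=float)
    if total_area_first:
        # Optional Total Area step applied before the quotient calculation.
        x = x / x.sum(axis=1, keepdims=True)
    reference = np.median(x, axis=0)           # median profile across samples
    quotients = x / reference                  # per-variable ratio to reference
    dilution = np.median(quotients, axis=1, keepdims=True)  # one factor per sample
    return x / dilution

# Toy data: one glycan profile measured at three different dilutions.
profile = np.array([10.0, 20.0, 30.0, 40.0])
factors = np.array([1.0, 2.0, 4.0])
raw = np.outer(factors, profile)

pqn_only = probabilistic_quotient_normalize(raw, total_area_first=False)
ta_pqn = probabilistic_quotient_normalize(raw, total_area_first=True)

print(pqn_only)
print(ta_pqn)
```

In this toy case all three samples collapse onto the same profile, since they differ only by a global dilution factor. Unlike the closure operation alone, the final division uses a median quotient rather than the sample total, so the normalized values of a sample are not constrained to sum to a constant in general.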

One interesting finding was the substantial difference in the evaluation results between MS- and UHPLC-based platforms: while for MS most normalization approaches performed comparably, the variance among the considered strategies was considerable for UHPLC. The origin of this discrepancy is not easy to trace, but it could be due to the fact that UHPLC does not separate glycans according to their mass, as MS-based techniques do, but according to their chemical and physical properties. As a result, most chromatographic peaks represent a mixture of glycan structures. Although it has been shown that there is a predominant structure in the vast majority of IgG chromatographic peaks [31], this contamination is likely to make the data correlation structure noisier and thus more sensitive to different normalizations. Moreover, it is expected to affect the comparison to the biological reference, which does not account for any structure mixture.

While our results suggest that log-transformation does not significantly affect performance, data normality is an assumption of many other statistical tests and approaches, and we therefore still recommend always log-transforming omics data after normalization.

To assess how our approach compares to a more common normalization evaluation strategy, we ranked the preprocessing methods based on how strongly the normalized abundances associated with age. Consistent with our network-based results, Probabilistic Quotient-based approaches clearly outperformed all other methods.

The network approach described in this paper could be employed to evaluate normalization strategies in other types of mass-flow data, e.g., metabolomics data. Moreover, we could extend this approach to evaluate other preprocessing steps. For example, it has already been shown that, for untargeted metabolomics data, different missing value imputation strategies have a prominent impact on the results of the downstream analysis [46]. We could investigate whether the same holds for glycomics data and quantitatively evaluate the performance of each strategy. Similarly, our framework could be applied to the evaluation of batch correction approaches, which aim at reducing the technical variation due to samples being measured at different times.

In conclusion, we recommend normalizing glycan data with the Probabilistic Quotient normalization followed by log-transformation. This technique was robust and reliable regardless of the measurement platform.