This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In the last decade,

Metabolomics,

The main problems tackled within metabonomics are the detection, characterization and classification of external perturbations (including diseases, drugs,

Networks [

Beyond this structural representation, it has been shown that complex networks can be used to support data analysis tasks. For instance, in neuroscience studies, nodes may represent individual sensors detecting the electric or magnetic field generated by groups of neurons, and the links between them may indicate the presence of some kind of correlation between their activity [

In this contribution, we propose the use of a complex network representation of spectral data as a preliminary step for a classification task. We first introduce and describe in

The information relative to each subject is codified by means of a network. Each node represents one of the available spectral measurements, or a bin representing a group of them, and the links between two nodes identify pairs of measurements that exhibit characteristics related to the disease.
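As an illustration of how measurements may be grouped into bins before building the network, the following minimal sketch averages consecutive spectral points. The function name `bin_spectrum`, the parameter `bin_size`, and the choice of the mean as the aggregation rule are assumptions made here for illustration; they are not taken from the method described in this paper.

```python
from statistics import fmean


def bin_spectrum(values, bin_size):
    """Average consecutive measurements into bins of `bin_size` points.

    Assumption: a bin is summarised by the mean of its points. A trailing
    partial bin is kept and averaged over the points it contains.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    return [
        fmean(values[start:start + bin_size])
        for start in range(0, len(values), bin_size)
    ]


# Hypothetical spectrum of five measurements, grouped two by two.
binned = bin_spectrum([1.0, 2.0, 3.0, 4.0, 5.0], bin_size=2)
```

Each element of `binned` would then correspond to one node of the network.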

Mathematically, let us suppose that the initial data available are represented by _{c}_{d}_{c}_{d}_{c}_{d}_{i,j}_{i,j}

The creation of a network representation requires deciding, for each possible pair of nodes, whether a link should be created between them. For each pair of bins, it is therefore necessary to determine whether their values follow two different models for control and disease subjects, respectively. Here, we suppose that both models are linear correlations between the two bins. Specifically, we linearly fit the values of the two bins (in what follows, _{•}_{,j}_{•}_{,i}_{•}_{,j}_{•}_{,i}

In

Example of calculation of the weight of a link. (Left) Linear fit of data corresponding to control subjects and patients; (right) classification of an unlabeled subject (marked as

Taking into account the expected value of the second bin in both models and the corresponding expected error in the linear fit (given by the standard deviation of residuals), the probability

Note that the result of this process is a number, defined within the interval [0, 1], associated with a pair of measurements (or nodes of the network). In other words, we can construct the network of
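The link-weight calculation described above can be sketched as follows. This is only an illustrative reading of the procedure: it assumes Gaussian residuals around each linear fit and normalises the two likelihoods so that the weight lies in [0, 1] and grows as the subject approaches the disease model. The exact expression is not reproduced in this section, so the normalisation and the names `fit_line`, `gaussian_pdf` and `link_weight` are assumptions.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, sigma), where sigma is the standard
    deviation of the residuals of the fit.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    # Assumption: a tiny floor avoids division by zero for perfect fits.
    return slope, intercept, max(sigma, 1e-12)


def gaussian_pdf(value, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) at `value`."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def link_weight(control_fit, disease_fit, x, y):
    """Weight in [0, 1]; values near 1 mean (x, y) follows the disease model.

    Assumption: the weight is the disease likelihood divided by the sum of
    both likelihoods; 0.5 is returned if both likelihoods underflow to zero.
    """
    p_control = gaussian_pdf(y, control_fit[0] * x + control_fit[1], control_fit[2])
    p_disease = gaussian_pdf(y, disease_fit[0] * x + disease_fit[1], disease_fit[2])
    total = p_control + p_disease
    return p_disease / total if total > 0.0 else 0.5


# Hypothetical values of two bins for four control subjects and four patients.
bin_i = [0.0, 1.0, 2.0, 3.0]
control_fit = fit_line(bin_i, [0.1, 0.9, 2.1, 2.9])
disease_fit = fit_line(bin_i, [3.1, 1.9, 1.1, -0.1])

# Weight of the link (i, j) for two unlabeled subjects.
w_control_like = link_weight(control_fit, disease_fit, 3.0, 3.0)
w_disease_like = link_weight(control_fit, disease_fit, 3.0, 0.0)
```

Repeating this calculation for every pair of bins would yield the weighted network of a single subject.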

As a first application and example of the proposed network reconstruction algorithm, we consider a data set of metabolic spectral measurements, corresponding to 25 control subjects, and 25 patients suffering from

The network reconstruction algorithm previously described has been applied to this data set, each node representing a bin of the spectra. The resulting network representations for four subjects, two of them of the control group (upper part, in green) and two of the patient group (lower part, in red), are shown in

Four examples of network representation of spectral data. Top (bottom) networks represent control subjects (patients suffering from glomerulonephritis).

Two features can be easily recognized. Firstly, the two networks of the control group have fewer links (lower link density) than the other two. This effect is to be expected, as data corresponding to GN patients should be closer to the disease model, as defined in

Secondly, while control subject networks lack a clear structure, in the GN networks there are one or a few nodes with a central position,

Analysis of the structural characteristics of networks for control subjects (green) and patients (blue). The three plots represent the histograms for (left) link density, (center) clustering coefficient, and (right) efficiency; see text for definitions.

To further analyze the differences between networks representing both groups of subjects, in

_{i,j}

$N_{\Delta}$ being the number of triangles in the network, and $N_{3}$ the number of connected triples.

_{i,j}
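The three metrics named above (link density, clustering coefficient and efficiency) can be computed as in the following sketch. It uses their standard textbook definitions for unweighted, undirected networks, which may differ in detail from the expressions used in this paper; it also assumes that the weighted network has already been converted into a binary adjacency matrix (for instance by thresholding the link weights), which is an assumption introduced here.

```python
from collections import deque


def link_density(adj):
    """Fraction of possible links that are present."""
    n = len(adj)
    links = sum(adj[i][j] for i in range(n) for j in range(n) if i != j)
    return links / (n * (n - 1))


def clustering_coefficient(adj):
    """Transitivity: 3 * (number of triangles) / (number of connected triples)."""
    n = len(adj)
    closed_triples = 0  # equals 3 * number of triangles
    triples = 0
    for i in range(n):
        neighbours = [j for j in range(n) if j != i and adj[i][j]]
        k = len(neighbours)
        triples += k * (k - 1) // 2  # connected triples centred on node i
        for a in range(k):
            for b in range(a + 1, k):
                if adj[neighbours[a]][neighbours[b]]:
                    closed_triples += 1
    return closed_triples / triples if triples else 0.0


def efficiency(adj):
    """Mean inverse shortest-path length over all ordered pairs of nodes."""
    n = len(adj)
    total = 0.0
    for source in range(n):
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v != u and adj[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable nodes contribute zero (infinite distance).
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))


# Two hypothetical three-node networks: a triangle and a path.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The resulting three numbers per subject can serve as the network features used in the subsequent classification.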

The structural features that have been manually identified in

These two structures are confirmed by the analysis of the distribution of node centrality.
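For reference, eigenvector centrality can be estimated with a simple power iteration, as in the sketch below. It assumes an undirected, connected network given as a binary adjacency matrix, and it iterates on the matrix plus the identity (a common device that leaves the eigenvectors unchanged while avoiding oscillations on bipartite networks). The function name and the default parameters are illustrative assumptions, not part of the method described in this paper.

```python
import math


def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Leading eigenvector of the adjacency matrix, normalised to unit length.

    Assumption: `adj` is symmetric with non-negative entries and the
    network is connected.
    """
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(max_iter):
        # One step of power iteration on (A + I).
        new = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        new = [v / norm for v in new]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x


# Hypothetical star network: node 0 is linked to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(star)
```

In a star-like structure such as this one, the hub receives the largest centrality, which is the kind of signature a histogram of node centralities would reveal.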

Histograms of the eigenvector centrality of nodes of networks represented in

The classification of subjects can be easily performed by any of the standard data mining algorithms available in the literature [

Furthermore, the analysis of the most central nodes in each network provides valuable information about the bins defining the health status of the subject. As can be seen in ^{1}H 9.44–9.6 ppm (associated with CH and CHO signals).

To check the sensitivity of the proposed algorithm to the presence of noise, an ensemble of 100 modified data sets has been created by polluting the original measurements with additive noise drawn from a normal distribution centered at zero.
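Such an ensemble could be generated as in the following sketch. The noise amplitude `sigma`, the random seed and the function name `noisy_ensemble` are hypothetical, since the values actually used are not specified in this section.

```python
import random


def noisy_ensemble(data, sigma, n_copies=100, seed=0):
    """Return `n_copies` noisy versions of `data` (a list of spectra).

    Each measurement is perturbed by additive noise drawn from a normal
    distribution with zero mean and standard deviation `sigma`.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    return [
        [[value + rng.gauss(0.0, sigma) for value in spectrum] for spectrum in data]
        for _ in range(n_copies)
    ]


# Hypothetical data set: two subjects, three bins each.
original = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ensemble = noisy_ensemble(original, sigma=0.1)
```

Each element of `ensemble` can then be passed through the network reconstruction and classification steps, and the scores averaged over the 100 realizations.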

Mean classification scores obtained by four algorithms for data sets polluted with additive noise; the proposed network-based representation is represented by black squares.

To show the wide range of applications in which the proposed network reconstruction algorithm is of help, we move from the previously described classification problem to the issue of monitoring the evolution of the status of a patient through time.

The data set analyzed has been constructed by collecting Raman spectra [

As a first step, both sets of data, corresponding to control subjects and patients, have been used to train the model,

Classification of control and leukemia subjects. (Left) Representation of the position of control subjects (green squares) and leukemia patients (blue points) in the space of network features. (Right) Classification score as a function of the binning size.

For the second phase, we obtained the Raman spectra corresponding to an additional patient, who was diagnosed with leukemia and underwent chemotherapy treatment. Notably, several measurements were available: for the day before the start of the therapy, and for each treatment day (

Chemotherapy seems to globally lower the link density of the network, which would represent an improvement in the status of the patient. Nevertheless, the network corresponding to September 11^{th}, which was measured after a two-week pause in the treatment, suggests a return to the initial status. While no other biomedical information was available for this subject, and thus no significant conclusions can be drawn from this example,

Evolution through time of the link density of networks representing a leukemia patient under chemotherapy treatment.

In this contribution, we proposed a method based on complex network theory for the discrimination of pathological patterns in metabolic spectra associated with different physiological samples (urine and blood of patients suffering respectively from

Our results show that the network structures of control subjects and patients are significantly different, and that this difference can be used to achieve a 100% classification score using Support Vector Machines and

The proposed method offers clear advantages in terms of relevant knowledge extraction from data:

When analyzing a network structure, the addition or deletion of a single link has minimal effects on the overall topology. Therefore, data mining results obtained from features representing such topologies are robust to the presence of noise in the initial data set. Furthermore, the whole initial data set, which may be composed of thousands of individual measurements, is reduced to a few features. Therefore,

The information relative to a subject is synthesized into one or a few features, and

Although further analyses are needed to corroborate and validate these last three points, we expect the proposed methodology to offer new avenues for the solution of problems in metabolomics, such as metabolomic fingerprinting and profiling.

SB acknowledges funding from the BBVA-Foundation within the Isaac-Peral program of Chairs. The authors wish to thank CONACYT for financial support under grant number 45488. The authors also acknowledge the computational resources, facilities and assistance provided by the Centro computazionale di RicErca sui Sistemi COmplessi (CRESCO) of the Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), and the facilities provided by CESVIMA (Spain).