Molecules 2012, 17(4), 3818-3833; doi:10.3390/molecules17043818

Article
1D 13C-NMR Data as Molecular Descriptors in Spectra — Structure Relationship Analysis of Oligosaccharides
Florbela Pereira
CQFB and REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; Email: florbela.pereira@fct.unl.pt; Tel.: +351-21-294-8300; Fax: +351-21-294-8550
Received: 9 February 2012; in revised form: 19 March 2012 / Accepted: 23 March 2012 /
Published: 28 March 2012

Abstract

Spectra-structure relationships were investigated for estimating the anomeric configuration, residues and type of linkages of linear and branched trisaccharides using 13C-NMR chemical shifts. For this study, 119 pyranosyl trisaccharides were used that are trimers of the α or β anomers of D-glucose, D-galactose, D-mannose, L-fucose or L-rhamnose residues bonded through α or β glycosidic linkages of types 1→2, 1→3, 1→4, or 1→6, as well as methoxylated and/or N-acetylated amino trisaccharides. Machine learning experiments were performed for: (1) classification of the anomeric configuration of the first unit, second unit and reducing end; (2) classification of the type of first and second linkages; (3) classification of the three residues: reducing end, middle and first residue; and (4) classification of the chain type. Our previous model for predicting the structure of disaccharides was incorporated in this new model with an improvement of the predictive power. The best results were achieved using Random Forests with 204 di- and trisaccharides for the training set; the model correctly classified 83%, 90%, 88%, 85%, 85%, 75%, 79%, 68% and 94% of the test set (69 compounds) for the nine tasks, respectively, on the basis of unassigned chemical shifts.
Keywords:
machine learning techniques; Random Forest; classification tree; CPGNN; 13C-NMR; oligosaccharides; disaccharides; trisaccharides

1. Introduction

Carbohydrates play key roles in many biological processes; however, their functions and the mechanisms of these processes are still not completely known. As a result, carbohydrates remain the least exploited among the three major classes of biomolecules. Additionally, the number of building blocks of polysaccharide sequences is significantly larger than the number of residues in proteins or nucleic acids: more than 30 monosaccharides have been identified in mammalian polysaccharides and more than 500 in bacterial polysaccharides [1]. The rich variety of possible linkage positions between the monosaccharides, as well as their stereochemistry, increases the difficulty of their structural analysis.

The elucidation of carbohydrate structures from the simplest monosaccharides to the most complex branched polysaccharides is crucial to understand the wide-ranging functions of carbohydrates in biological systems. Their biological activities are mainly due to their surface properties which depend on their structure and conformation.

NMR spectroscopy has become a sophisticated and powerful analytical technology that has found a variety of applications in many disciplines of scientific research, medicine, and various industries. Modern NMR emphasizes applications in biomolecular systems and plays an important role in structural biology. The determination of the linkage type and anomeric configuration, as well as the assignment of the monomer stereochemistry, is frequently the main objective of an NMR study of unknown polysaccharides from biological materials. The complete analysis of carbohydrates is a complex, time-consuming process that usually makes use of a variety of 2D techniques [2,3,4], such as 1H-1H TOCSY and DQF-COSY, 1H-1H NOESY, 1H-1H ROESY, 13C-1H HSQC, and 13C-1H HMQC or HMBC. The challenge and, at the same time, the complexity of these problems have led to the development of computerized approaches [5,6,7].

Jansson and co-workers [8,9,10,11] developed the Computer Assisted SPectrum Evaluation of Regular polysaccharides (CASPER) program [12], which provides a structural analysis of linear oligo- and polysaccharides, as well as branched counterparts, using 1H and 13C chemical shift data and 1JCH or 3JHH scalar coupling constants. CASPER generates the predicted 1H and 13C chemical shifts from input of the oligo- or polysaccharide chemical structure (i.e., constituent monosaccharides, linkage types as well as methoxylated derivatives and anomeric configurations). In addition, the predicted chemical shifts are ranked according to the lowest average total difference between predicted and experimental chemical shifts. The 1H and 13C chemical shift data can also be used as input, in which case the CASPER output displays all possible chemical structures that match these data. These structures are evaluated by comparing the input NMR data with the predicted chemical shifts. However, the structural determination of an oligo- or polysaccharide using CASPER requires, besides the NMR data, information about the residues and their linkages.

Many research works cite the use of NMR to determine the 3D structure of macromolecules and in the detection, identification and quantification of potential drug compounds. However, there are quite few studies reporting the use of NMR data in quantitative structure–activity relationship (QSAR) modeling [13,14] and/or spectra-structure correlations [7,15,16,17,18,19]. A huge amount of information has been created in recent years from NMR data, and this information is processed almost exclusively by human interpretation. We think that it is possible to use this kind of information as input to machine learning techniques with a considerable improvement of the results, since human interpretation may miss important underlying information. The NMR data can be seen as a fingerprint of the 3D chemical structure as well as a fingerprint of the electronic and surface properties of a molecule. This view is in accordance with our previous work [7], in which spectra-structure correlation models to predict the anomeric configuration, type of linkage and residues of disaccharides from an unassigned list of 13C chemical shifts were built with high accuracy.

The aim of this study is the development of new computational tools for structural elucidation of oligosaccharides, such as di- and trisaccharides, using 1D 13C-NMR data. This approach is complementary to the program CASPER because it allows the prediction of the anomeric configuration, residues and type of linkages in oligosaccharides using only the 13C-NMR chemical shifts sorted in ascending order (unassigned chemical shifts). In contrast, the structure determination of an oligosaccharide using CASPER requires, besides the 13C-NMR chemical shifts, information about the residues and their linkages.

2. Results and Discussion

Different representative machine learning techniques, namely RF, CT and CPGNN, were compared to build quantitative spectra–structure relationship models to predict: (1) the three anomeric configurations; (2) the two types of linkages; (3) the three residues; and (4) the chain type of trisaccharides from 1D 13C-NMR descriptors. The results for internal cross-validation (10-fold cross-validation with CPGNN and out-of-bag estimation with RF on the training set) and external validation (on the test set) are presented in Table 1 and Table 2.

Table 1. RF, CT and CPGNN predictions of the anomeric configurations, type of linkages, residues and chain type in trisaccharides from 1D 13C-NMR descriptors.
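All values are given as training set / test set.

| Model | Class | Size | RF ᵃ correct pred. | RF ᵃ sensitivity ᶜ | RF ᵃ specificity ᵈ | CT correct pred. | CT sensitivity | CT specificity | CPGNN ᵇ correct pred. | CPGNN ᵇ sensitivity | CPGNN ᵇ specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ano_F ¹ | A (α) | 46/12 | 33/6 | 0.71/0.50 | 0.75/0.67 | 43/7 | 0.93/0.58 | 0.78/0.58 | 27/5 | 0.59/0.42 | 0.59/0.56 |
| | B (β) | 46/15 | 35/12 | 0.76/0.8 | 0.73/0.67 | 34/10 | 0.74/0.67 | 0.92/0.67 | 27/11 | 0.59/0.73 | 0.59/0.61 |
| Ano_S ² | A (α) | 46/12 | 45/12 | 0.98/1 | 0.98/0.92 | 38/5 | 0.83/0.42 | 0.80/0.67 | 30/5 | 0.65/0.42 | 0.62/0.56 |
| | B (β) | 46/15 | 45/14 | 0.98/0.93 | 0.98/1 | 37/10 | 0.81/0.50 | 0.82/0.59 | 28/11 | 0.61/0.73 | 0.64/0.61 |
| Ano_R ³ | A (α) | 46/16 | 34/12 | 0.74/0.75 | 0.77/0.92 | 38/11 | 0.83/0.69 | 0.93/1 | 23/8 | 0.5/0.5 | 0.74/0.89 |
| | B (β) | 46/11 | 36/10 | 0.78/0.91 | 0.75/0.71 | 43/11 | 0.93/1 | 0.84/0.69 | 38/10 | 0.83/0.91 | 0.62/0.56 |
| F_Link ⁴ | A (1→2) | 33/13 | 30/10 | 0.91/0.77 | 0.83/1 | 28/10 | 0.85/0.77 | 0.78/0.6 | 8/4 | 0.24/0.31 | 0.53/0.8 |
| | B (1→3) | 27/7 | 22/7 | 0.81/1 | 0.85/0.78 | 23/4 | 0.85/0.57 | 0.72/0.57 | 18/6 | 0.67/0.86 | 0.34/0.35 |
| | C (1→4) | 16/4 | 11/2 | 0.69/0.5 | 0.92/1 | 8/0 | 0.5/0 | 0.67/0 | 6/3 | 0.38/0.75 | 0.28/0.6 |
| | D (1→6) | 16/3 | 15/3 | 0.94/1 | 0.83/0.5 | 12/3 | 0.75/1 | 1/1 | 0/0 | 0/0 | 0/0 |
| S_Link ⁵ | A (1→2) | 8/1 | 1/0 | 0.12/0 | 1/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| | B (1→3) | 17/12 | 9/7 | 0.53/0.58 | 1/1 | 12/8 | 0.70/0.67 | 0.86/1 | 7/6 | 0.41/0.5 | 0.32/0.86 |
| | C (1→4) | 51/13 | 50/13 | 0.98/1 | 0.74/0.72 | 47/11 | 0.92/0.85 | 0.75/0.73 | 39/12 | 0.76/0.92 | 0.57/0.6 |
| | D (1→6) | 16/1 | 13/1 | 0.81/1 | 0.93/0.5 | 13/1 | 0.81/1 | 0.87/0.25 | 0/0 | 0/0 | 0/0 |
| Red_end ⁶ | A (Glc) | 26/17 | 21/14 | 0.81/0.82 | 0.6/1 | 18/10 | 0.69/0.59 | 0.75/0.77 | 6/3 | 0.23/0.18 | 1/1 |
| | B (Gal) | 18/5 | 8/5 | 0.44/1 | 0.5/0.71 | 8/2 | 0.44/0.4 | 0.89/1 | 6/2 | 0.33/0.4 | 0.17/0.14 |
| | C (Man) | 18/3 | 9/2 | 0.5/0.67 | 0.64/0.67 | 11/1 | 0.61/0.33 | 0.61/0.2 | 9/1 | 0.5/0.33 | 0.24/0.12 |
| | D (Rha) | 17/2 | 12/2 | 0.70/1 | 0.92/0.67 | 16/1 | 0.94/0.5 | 0.84/0.25 | 3/2 | 0.23/1 | 0.25/1 |
| | E (Fuc) | 13/0 | 10/0 | 0.77/--- | 0.71/--- | 13/0 | 1/--- | 0.59/--- | 1/0 | 0.08/--- | 1/--- |
| M_residue ⁷ | A (Glc) | 26/6 | 19/5 | 0.73/0.83 | 0.61/0.56 | 20/3 | 0.77/0.5 | 0.83/0.33 | 7/3 | 0.27/0.5 | 0.78/0.6 |
| | B (Gal) | 19/5 | 9/4 | 0.47/0.8 | 0.64/0.8 | 8/1 | 0.42/0.2 | 0.53/0.17 | 11/2 | 0.58/0.4 | 0.34/0.2 |
| | C (Man) | 19/5 | 9/4 | 0.47/0.8 | 0.43/0.8 | 12/2 | 0.63/0.4 | 0.63/0.67 | 8/0 | 0.42/0 | 0.35/0 |
| | D (Rha) | 12/4 | 6/3 | 0.5/0.75 | 0.6/1 | 10/3 | 0.83/0.75 | 0.53/0.75 | 7/3 | 0.58/0.75 | 0.25/0.5 |
| | E (Fuc) | 16/7 | 9/4 | 0.56/0.57 | 0.56/0.8 | 11/4 | 0.69/0.57 | 0.73/0.8 | 0/1 | 0/0.14 | 0/1 |
| F_residue ⁸ | A (Glc) | 22/9 | 18/6 | 0.82/0.67 | 0.86/0.67 | 15/5 | 0.68/0.56 | 0.83/0.62 | 10/2 | 0.45/0.22 | 0.91/1 |
| | B (Gal) | 18/3 | 14/2 | 0.78/0.67 | 0.82/0.28 | 11/1 | 0.61/0.33 | 0.69/0.17 | 8/2 | 0.44/0.67 | 0.38/0.17 |
| | C (Man) | 16/6 | 9/2 | 0.56/0.33 | 0.56/1 | 13/5 | 0.81/0.83 | 0.59/0.83 | 6/2 | 0.38/0.33 | 0.21/0.5 |
| | D (Rha) | 19/4 | 14/3 | 0.74/0.75 | 0.7/0.75 | 12/3 | 0.63/0.75 | 0.63/1 | 8/2 | 0.42/0.5 | 0.28/0.28 |
| | E (Fuc) | 17/5 | 15/3 | 0.88/0.6 | 0.83/0.6 | 14/3 | 0.82/0.6 | 0.82/0.75 | 1/1 | 0.06/0.2 | 0.33/0.5 |
| Chain_Type | A (LT) | 39/8 | 29/7 | 0.74/0.88 | 0.83/0.88 | 37/8 | 0.95/1 | 0.80/0.67 | 26/6 | 0.67/0.75 | 0.81/1 |
| | B (BT) | 53/19 | 47/18 | 0.89/0.95 | 0.82/0.95 | 44/15 | 0.83/0.79 | 0.96/1 | 47/19 | 0.89/1 | 0.78/0.90 |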

a Out-of-bag; b 10-Fold cross-validation; c Ratio of true positives to the sum of true positives and false negatives; d Ratio of true positives to the sum of true positives and false positives; 1 Anomeric configuration of the first unit; 2 Anomeric configuration of the second unit; 3 Anomeric configuration of the reducing end; 4 First linkage type; 5 Second linkage type; 6 Reducing end; 7 Middle residue; 8 First residue.

Table 2. Mean predictability of RF, CT and CPGNN predictions of the anomeric configurations, type of linkages, residues and chain type in trisaccharides from 1D 13C-NMR.
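Mean predictability (%) ᵃ

| Group | Model | Training set ᵇ RF | Training set ᵇ CT | Training set ᵇ CPGNN | Test set RF | Test set CT | Test set CPGNN |
|---|---|---|---|---|---|---|---|
| Anomeric configurations | Ano_F ¹ | 73.91 | 83.70 | 58.70 | 65.00 | 62.5 | 57.50 |
| | Ano_S ² | 97.83 | 81.52 | 63.04 | 96.67 | 54.17 | 57.50 |
| | Ano_R ³ | 76.09 | 88.04 | 66.30 | 82.95 | 84.38 | 70.45 |
| Linkage types | F_Link ⁴ | 83.72 | 73.76 | 32.10 | 81.73 | 58.52 | 47.87 |
| | S_Link ⁵ | 61.18 | 61.00 | 29.41 | 64.58 | 62.82 | 35.58 |
| Residues | Red_end ⁶ | 64.54 | 73.78 | 26.35 | 87.25 | 45.54 | 47.74 |
| | M_residue ⁷ | 54.81 | 66.85 | 37.05 | 75.10 | 48.43 | 41.25 |
| | F_residue ⁸ | 75.55 | 71.21 | 35.08 | 60.33 | 61.44 | 43.06 |
| Chain type | Chain_Type | 81.52 | 88.94 | 77.67 | 91.12 | 89.47 | 87.50 |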

a Average sensitivity of the trisaccharide classes; b 10-Fold cross-validation with CPGNN and out-of-bag estimation with RF on training set; 1 Anomeric configuration of the first unit; 2 Anomeric configuration of the second unit; 3 Anomeric configuration of the reducing end; 4 First linkage type; 5 Second linkage type; 6 Reducing end; 7 Middle residue; 8 First residue.
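The definitions in the footnotes of Table 1 and Table 2 can be illustrated with a short Python sketch. The function names are hypothetical, and the off-diagonal counts in the example are inferred here from the RF test-set row counts of the two-class Ano_F task (class size minus correct predictions); they are not listed in the tables. Note that the ratio the tables call specificity, TP/(TP + FP), is the quantity more commonly called precision.

```python
def per_class_metrics(confusion):
    """confusion[true_class][predicted_class] -> number of compounds.

    Returns {class: (sensitivity, specificity)} using the table footnotes:
    sensitivity = TP / (TP + FN), specificity = TP / (TP + FP).
    A ratio with an empty denominator is reported as None (shown as --- in the tables).
    """
    classes = list(confusion)
    metrics = {}
    for cls in classes:
        tp = confusion[cls].get(cls, 0)
        fn = sum(n for pred, n in confusion[cls].items() if pred != cls)
        fp = sum(confusion[other].get(cls, 0) for other in classes if other != cls)
        sensitivity = tp / (tp + fn) if (tp + fn) else None
        specificity = tp / (tp + fp) if (tp + fp) else None
        metrics[cls] = (sensitivity, specificity)
    return metrics


def mean_predictability(metrics):
    """Average sensitivity over the classes that have a defined sensitivity, in percent."""
    values = [sens for sens, _ in metrics.values() if sens is not None]
    return 100.0 * sum(values) / len(values)


# Ano_F, RF, test set: 12 alpha compounds (6 correct), 15 beta compounds (12 correct).
ano_f_test = {"A": {"A": 6, "B": 6}, "B": {"A": 3, "B": 12}}
ano_f_metrics = per_class_metrics(ano_f_test)
ano_f_mean = mean_predictability(ano_f_metrics)  # 65.0, as in Table 2
```

Under these assumptions the sketch gives sensitivities of 0.50 and 0.80 and a value of 0.67 for both classes in the specificity column, in line with the corresponding RF test-set entries of Table 1.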

The random forest method showed improved prediction performance on the test set compared to a single classification tree and the CPGNN method for the anomeric configurations, type of linkages, residues, and chain type of trisaccharides (Table 2). However, a single tree was able to predict the reducing end anomeric configuration of the training set and the external test set with a mean predictability of 88.0% and 84.4%, respectively (test set, α anomer: sensitivity = 0.69, specificity = 1; β anomer: sensitivity = 1, specificity = 0.69), using unassigned chemical shifts (Table 1 and Table 2). The performance of the reducing end anomeric tree was even slightly superior to RF in terms of correct predictions for the training set, and similar for the predictions of the test set. A graphical representation of the reducing end anomeric tree is presented in Figure 1. Four descriptors were chosen by the tree, one of which, C6 (6th chemical shift), was used twice. All of these descriptors, C6 together with C16 (16th chemical shift), C21 (21st chemical shift) and C23 (23rd chemical shift), are identified among the ten most important spectral descriptors by the RF, as revealed by the Gini parameter.

In addition, it was possible to infer important rules from the reducing end anomeric classification tree: (1) at the first node, for a 16th chemical shift <72.52 ppm, all the trisaccharides are α anomers and almost all lack a D-glucose monomer in any position (only one trisaccharide has a D-glucose unit); otherwise (16th chemical shift ≥72.52 ppm) the reducing end, second unit or first unit of the trisaccharides may or may not be a D-glucose monomer; (2) at the second node, for a 6th chemical shift ≥57.16 ppm, most of the trisaccharides have only D-glucose, D-galactose or D-mannose monomers as reducing, middle and first residues (eight trisaccharides have L-fucose or L-rhamnose monomer units); otherwise (6th chemical shift <57.16 ppm) all trisaccharides have at least one L-fucose or L-rhamnose monomer; (3) at the third node, for a 21st chemical shift ≥96.52 ppm, nearly all the trisaccharides are β anomers without an L-fucose or L-rhamnose monomer; otherwise (21st chemical shift <96.52 ppm) almost all the trisaccharides are non-methoxylated linear α anomers; (4) at the fourth node, for a 23rd chemical shift ≥175.3 ppm, almost all trisaccharides are linear β anomers; otherwise (23rd chemical shift <175.3 ppm) all trisaccharides are methoxylated branched α anomers with at least one L-fucose or L-rhamnose residue.
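The four thresholds quoted in these rules can be checked against a spectrum with a few lines of Python. This is only a sketch: it reports on which side of each described split a descriptor vector falls and does not reproduce the topology of the tree in Figure 1 (the second split on C6 is not described in the text). The function name and the example vector are hypothetical.

```python
# (descriptor rank, threshold in ppm) for the four splits described in the text
DESCRIBED_SPLITS = [(16, 72.52), (6, 57.16), (21, 96.52), (23, 175.3)]


def split_sides(descriptors):
    """Return, for each described split, whether the descriptor is '<' or '>=' the threshold.

    descriptors: the 23 chemical shifts of one compound, sorted in ascending order,
    so that descriptors[k - 1] is the k-th chemical shift (Ck).
    """
    if len(descriptors) != 23:
        raise ValueError("expected 23 descriptors")
    return {
        f"C{rank}": ">=" if descriptors[rank - 1] >= threshold else "<"
        for rank, threshold in DESCRIBED_SPLITS
    }


# Hypothetical descriptor vector, for illustration only (not taken from the data set).
example = ([0.0] * 5 + [61.2]
           + [62.0 + i for i in range(9)]       # C7 to C15
           + [71.0]                             # C16
           + [73.0, 74.0, 75.0, 76.0]           # C17 to C20
           + [101.0, 102.0, 104.5])             # C21 to C23
sides = split_sides(example)
```

For this example, C16 lies below 72.52 ppm, which is the side of the first split on which, according to rule (1), all training-set trisaccharides are α anomers.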

Figure 1. Representation of the classification tree derived with the CART algorithm to distinguish the reducing end anomeric configuration of 92 trisaccharides (training set).

For comparison with the nine classification models built with RF and CT, a single experiment was performed with a counterpropagation neural network (CPGNN) on the basis of unassigned 13C chemical shifts, since one CPGNN provides all nine outputs. In spite of this advantage of the CPGNN over the CT and RF methods, its predictive power was reported to be lower than theirs for all classes, with no exceptions, in both internal validation (i.e., 10-fold cross-validation with the training set) and external validation (i.e., with the test set; Table 1 and Table 2).

After exploring the models derived with trisaccharides, we investigated the incorporation into this new model of our previous model for predicting the structure of 154 pyranosyl disaccharides using unassigned 13C-NMR chemical shifts (Table 3). RF was used because it gave the best performance in the previous experiments.

Table 3. RF predictions of the anomeric configurations, type of linkages, residues and chain type in 204 di- and trisaccharides for the training set and 69 for the test set using 1D 13C-NMR descriptors.
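All values are given as training set / test set.

| Model | Class | Size | Correct pred. | Sensitivity ᵃ | Specificity ᵇ | Mean predictability ᶜ (%) |
|---|---|---|---|---|---|---|
| Ano_F ¹ | A (α) | 105/30 | 89/25 | 0.85/0.83 | 0.88/0.78 | 86.32/82.69 |
| | B (β) | 99/39 | 87/32 | 0.88/0.82 | 0.84/0.86 | |
| Ano_S ² | A (α) | 46/12 | 38/10 | 0.83/0.83 | 0.74/0.83 | 84.78/90 |
| | B (β) | 46/15 | 33/13 | 0.72/0.87 | 0.80/0.87 | |
| | X (NA) | 112/42 | 112/42 | 1/1 | 1/1 | |
| Ano_R ³ | A (α) | 102/39 | 84/31 | 0.82/0.79 | 0.88/0.94 | 85.78/88.08 |
| | B (β) | 102/30 | 91/28 | 0.89/0.93 | 0.83/0.78 | |
| F_Link ⁴ | A (1→2) | 33/13 | 30/10 | 0.91/0.77 | 0.79/1 | 84.99/85.38 |
| | B (1→3) | 27/7 | 21/7 | 0.78/1 | 0.88/0.78 | |
| | C (1→4) | 16/4 | 11/2 | 0.69/0.5 | 0.85/0.67 | |
| | D (1→6) | 16/3 | 14/3 | 0.88/1 | 0.82/0.6 | |
| | X (NA) | 112/42 | 112/42 | 1/1 | 1/1 | |
| S_Link ⁵ | A (1→2) | 36/13 | 22/10 | 0.61/0.77 | 0.88/0.83 | 82.32/85.16 |
| | B (1→3) | 48/21 | 38/15 | 0.79/0.71 | 0.84/0.94 | |
| | C (1→4) | 71/26 | 69/24 | 0.97/0.92 | 0.748/0.77 | |
| | D (1→6) | 49/9 | 45/9 | 0.92/1 | 0.98/0.9 | |
| Red_end ⁶ | A (Glc) | 72/38 | 60/33 | 0.83/0.87 | 0.71/0.75 | 75.70/74.96 |
| | B (Gal) | 58/13 | 39/9 | 0.67/0.69 | 0.72/0.75 | |
| | C (Man) | 44/16 | 37/7 | 0.73/0.44 | 0.86/0.7 | |
| | D (Rha) | 17/2 | 12/2 | 0.70/1 | 0.92/0.67 | |
| | E (Fuc) | 13/0 | 11/0 | 0.87/--- | 0.69/--- | |
| M_residue ⁷ | A (Glc) | 26/6 | 20/5 | 0.77/0.83 | 0.69/0.56 | 62.27/79.25 |
| | B (Gal) | 19/5 | 9/4 | 0.47/0.8 | 0.5/0.8 | |
| | C (Man) | 19/5 | 7/4 | 0.37/0.8 | 0.37/0.8 | |
| | D (Rha) | 12/4 | 6/3 | 0.5/0.75 | 0.67/1 | |
| | E (Fuc) | 16/7 | 10/4 | 0.62/0.57 | 0.59/0.8 | |
| | X (NA) | 112/42 | 112/42 | 1/1 | 1/1 | |
| F_residue ⁸ | A (Glc) | 74/33 | 65/31 | 0.88/0.94 | 0.86/0.76 | 79.64/67.50 |
| | B (Gal) | 44/7 | 31/2 | 0.70/0.28 | 0.74/0.33 | |
| | C (Man) | 50/20 | 39/12 | 0.78/0.6 | 0.78/1 | |
| | D (Rha) | 19/4 | 14/3 | 0.74/0.75 | 0.67/0.75 | |
| | E (Fuc) | 17/5 | 15/4 | 0.88/0.8 | 0.88/0.67 | |
| Chain_Type | A (LT) | 39/8 | 28/7 | 0.72/0.88 | 0.82/0.88 | 86.82/94.08 |
| | B (BT) | 53/19 | 47/18 | 0.89/0.95 | 0.81/0.95 | |
| | X (NA) | 112/42 | 112/42 | 1/1 | 1/1 | |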

a Ratio of true positives to the sum of true positives and false negatives; b Ratio of true positives to the sum of true positives and false positives; c Average sensitivity of the trisaccharides classes; 1 Anomeric configuration of the first unit; 2 Anomeric configuration of the second unit; 3 Anomeric configuration of the reducing end; 4 First linkage type; 5 Second linkage type; 6 Reducing end; 7 Middle residue; 8 First residue.

When the trisaccharide and disaccharide models were taken together, the performance improved for seven of the nine tasks on the test set and for eight of the nine tasks on the training set [i.e., (1) anomeric configuration of the first unit, second unit and reducing end; (2) type of first and second linkages; (3) reducing end, middle and first residue; and (4) chain type of di- and trisaccharides]; see Tables 1–3. In fact, the di- and trisaccharides model showed improved prediction performance compared to the trisaccharides model in terms of correct predictions for the test set: it classified 52% of the trisaccharides (14 of 27) without any error or with only one error across the nine tasks, as compared with 37% for the trisaccharides model (10 of 27). As expected from the increase in system complexity, the new model gave a lower result for the disaccharides of the test set than the disaccharides model (i.e., our previous model): it classified 83% of the disaccharides (35 of 42) without any error or with only one error across the nine tasks, as compared with 95% for the disaccharides model (40 of 42). Despite this decrease in predictive power for the disaccharides, the new model still shows a mean predictability greater than or equal to 67% for all tasks on the test set (93%, 100%, 90%, 100%, 88%, 67%, 100%, 81% and 100% for the anomeric configuration of the first unit, anomeric configuration of the second unit, anomeric configuration of the reducing end, first linkage type, second linkage type, reducing end, middle residue, first residue and chain type of the 42 disaccharides, respectively).
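The compound-level percentages quoted above follow from counting, for each test compound, how many of the nine tasks were misclassified. A minimal Python sketch of that count is given below; the function name is hypothetical and the per-compound error lists are placeholders that only reproduce the totals stated in the text (14 of 27 and 35 of 42), not the actual per-compound results.

```python
def share_with_at_most_one_error(errors_per_compound):
    """Percentage of compounds with zero or one misclassified task out of the nine."""
    n_ok = sum(1 for n_errors in errors_per_compound if n_errors <= 1)
    return 100.0 * n_ok / len(errors_per_compound)


# Placeholder error counts: only the number of compounds with <= 1 error is from the text.
trisaccharides_new_model = [0] * 14 + [2] * 13   # 14 of 27 test trisaccharides
disaccharides_new_model = [0] * 35 + [2] * 7     # 35 of 42 test disaccharides

tri_share = share_with_at_most_one_error(trisaccharides_new_model)
di_share = share_with_at_most_one_error(disaccharides_new_model)
```

Rounded to whole numbers, these placeholder inputs give 52% and 83%, matching the values reported for the di- and trisaccharides model.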

The random forest algorithm for classification can give two measures of importance for the descriptors used in growing trees, the Mean Decrease in Accuracy and the Mean Decrease in Gini. The ten most important descriptors using RF for the trisaccharides model and the oligosaccharides model (i.e., the di- and trisaccharides model), as well as the descriptors selected by the CT in the trisaccharides model, were analyzed (Table 4). From the analysis it was evident that the 23rd, 22nd, 21st and 20th chemical shift descriptors gained importance in the di- and trisaccharides approach as compared with the trisaccharides approach. These descriptors correspond mostly to the chemical shifts of the anomeric carbon atoms, although for the N-acetylated amino sugar derivatives the 23rd descriptor corresponds to the chemical shift of the carbonyl carbon atom in the 2-acetamide group.

Table 4. Comparison of the ten most important descriptors by RF in the trisaccharides and oligosaccharide models and the selected descriptors by CT in the trisaccharides model.
| Model | RF (trisaccharides) | RF (di- and trisaccharides) | CT |
|---|---|---|---|
| Ano_F ¹ | C12; C11; C9; C22; C14; C13; C23; C15; C16; C17 | C23; C19; C20; C12; C15; C18; C21; C13; C11; C22 | C12; C9 (2×); C21 |
| Ano_S ² | C15; C14; C13; C22; C16; C6; C10; C23; C17; C7 | C12; C6; C10; C9; C8; C11; C7; C21; C13; C14 | C14; C6; C22 |
| Ano_R ³ | C16; C6; C21; C10; C14; C18; C11; C17; C5; C8 | C22; C18; C16; C20; C19; C17; C14; C23; C21; C6 | C16; C6 (2×); C21; C2 |
| F_Link ⁴ | C8; C7; C20; C6; C19; C23; C10; C14; C16; C11 | C8; C7; C6; C10; C21; C12; C11; C9; C20; C13 | C8; C20; C7; C11; C6; C17 |
| S_Link ⁵ | C8; C7; C19; C6; C22; C20; C5; C18; C9; C23 | C21; C13; C14; C15; C8; C22; C20; C12; C23; C19 | C8 (2×); C22; C17 |
| Red_end ⁶ | C6; C5; C22; C19; C20; C7; C9; C10; C14; C18 | C22; C15; C14; C16; C20; C13; C19; C21; C18; C17 | C6; C12; C5; C20; C23 |
| M_residue ⁷ | C6; C7; C15; C5; C16; C11; C23; C9; C10; C18 | C10; C7; C6; C9; C8; C12; C11; C21; C15; C5 | C16; C6 (2×); C7; C18; C23 |
| F_residue ⁸ | C7; C5; C6; C10; C15; C16; C8; C9; C11; C21 | C14; C12; C16; C15; C23; C17; C20; C5; C21; C13 | C16; C9 (2×); C7; C8; C5 |
| Chain_Type | C7; C23; C20; C14; C5; C8; C6; C21; C18; C12 | C8; C21; C7; C6; C9; C10; C11; C12; C14; C23 | C7; C5; C23 (2×); C8 |

1 Anomeric configuration of the first unit; 2 Anomeric configuration of the second unit; 3 Anomeric configuration of the reducing end; 4 First linkage type; 5 Second linkage type; 6 Reducing end; 7 Middle residue; 8 First residue.

3. Experimental

3.1. Data Set and Descriptors

A data set of 119 pyranosyl trisaccharides and their corresponding twenty three 13C-NMR chemical shifts were used for establishing spectra–structure relationships. Trisaccharides used are trimers of the α or β anomers of D-glucose, D-galactose, D-mannose, L-fucose or L-rhamnose residues bonded through α or β glycosidic linkages of types 1→2, 1→3, 1→4, or 1→6, as well as methoxylated and/or N-acetylated amino trisaccharides. The 13C-NMR chemical shifts of the test set (27 trisaccharides) were experimental values obtained from the literature [19,20,21,22,23,24,25,26] and chemical shifts of the training set were also experimental values obtained from the literature (57 trisaccharides) [20,21,22,23,24,25,26,27,28,29] as well as chemical shifts calculated by the CASPER program [12,30] (35 trisaccharides).

The training and test sets are listed in Table S1 and S2 of the Supporting Information. The chemical shifts (independent variables) were encoded as a sequence of chemical shifts sorted in ascending order. The input, 1D 13C descriptors, corresponds to the 13C chemical shifts of only one of the epimeric disaccharides (α or β configuration of the reducing end). Therefore, the model that was constructed did not consider the mutarotation process, because 13C chemical shifts corresponding to a single anomer (i.e., a single diastereoisomer) had been used for each one of the input objects. These chemical shifts should, however, be interpreted only qualitatively, since oligosaccharides are flexible molecules, and the measured (for the test and training sets) and calculated (for the training set) chemical shifts represent an average for all existing conformations.
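The encoding of an unassigned peak list into the 23 descriptors can be sketched in Python as follows. The text only states that the shifts are sorted in ascending order; how compounds with fewer than 23 carbon signals are brought to a fixed length is not described here, so the left-padding with 0.0 used below is purely an assumption of this sketch, as are the function name and the example peak list.

```python
N_DESCRIPTORS = 23  # number of 13C chemical shift descriptors used in this study


def encode_unassigned_shifts(shifts_ppm, n_descriptors=N_DESCRIPTORS, pad_value=0.0):
    """Sort an unassigned list of 13C shifts in ascending order and fix its length.

    ASSUMPTION: shorter peak lists are left-padded with pad_value so that the
    largest shift always ends up as the last descriptor (C23). The padding
    scheme is not specified in the text.
    """
    if len(shifts_ppm) > n_descriptors:
        raise ValueError("more chemical shifts than descriptors")
    ordered = sorted(float(s) for s in shifts_ppm)
    return [pad_value] * (n_descriptors - len(ordered)) + ordered


# Hypothetical, incomplete peak list used only to show the call.
descriptors = encode_unassigned_shifts([103.2, 61.5, 76.4, 70.1])
```

No assignment of a signal to a carbon atom is needed at any point, which is what makes the descriptors usable directly from a 1D spectrum.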

The three anomeric configurations of the 119 trisaccharides result in three outputs, each with two classes, A (α) and B (β), corresponding to the stereochemistry of the glycosidic linkage between the first and second units, the glycosidic linkage between the second and third units, and the reducing end of the trisaccharide, respectively. The linkage discrimination of these trisaccharides was achieved using two outputs, each with four classes, A (1→2), B (1→3), C (1→4) and D (1→6), corresponding to the first and second linkage positions of the trisaccharide. The three monomers generate three outputs, each with five classes, A (Glc), B (Gal), C (Man), D (Rha), and E (Fuc), corresponding to the reducing end, middle and first residue of the trisaccharide. The chain type of these trisaccharides was encoded using two classes, A (linear trisaccharide, LT) and B (branched trisaccharide, BT).
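The nine outputs and their classes can be written down as a small lookup table. The Python sketch below uses the model names from Table 1 as keys; the dictionary layout, the helper function and the example label set are illustrative choices, not part of the original work.

```python
ANOMER = {"A": "α", "B": "β"}
LINKAGE = {"A": "1→2", "B": "1→3", "C": "1→4", "D": "1→6"}
RESIDUE = {"A": "Glc", "B": "Gal", "C": "Man", "D": "Rha", "E": "Fuc"}
CHAIN = {"A": "linear trisaccharide (LT)", "B": "branched trisaccharide (BT)"}

# Nine classification tasks for the trisaccharides model.
TASK_CLASSES = {
    "Ano_F": ANOMER,       # anomeric configuration of the first unit
    "Ano_S": ANOMER,       # anomeric configuration of the second unit
    "Ano_R": ANOMER,       # anomeric configuration of the reducing end
    "F_Link": LINKAGE,     # first linkage type
    "S_Link": LINKAGE,     # second linkage type
    "Red_end": RESIDUE,    # reducing end residue
    "M_residue": RESIDUE,  # middle residue
    "F_residue": RESIDUE,  # first residue
    "Chain_Type": CHAIN,
}


def check_labels(labels):
    """Raise ValueError unless every task has a label from its class set."""
    for task, classes in TASK_CLASSES.items():
        if labels.get(task) not in classes:
            raise ValueError(f"invalid or missing label for {task}")
    return True


# Hypothetical label set for one linear trisaccharide, for illustration only.
example_labels = {
    "Ano_F": "A", "Ano_S": "B", "Ano_R": "B",
    "F_Link": "C", "S_Link": "C",
    "Red_end": "A", "M_residue": "A", "F_residue": "B",
    "Chain_Type": "A",
}
```

Each task is then treated as a separate classification problem over the same 23 descriptors.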

Our previous model for predicting the structure of 154 pyranosyl disaccharides using unassigned 13C-NMR chemical shifts [7] was incorporated in this new model in order to evaluate its predictive power, as well as to compare it with the results obtained using the disaccharides model. The partition between training set and test set was maintained. Therefore, this new model (i.e., the di- and trisaccharides model) is built with 204 di- and trisaccharides for the training set and 69 for the test set. The disaccharide training and test sets are listed in Tables S3 and S4 of the Supporting Information.

The main objective of this procedure was to verify whether unassigned 13C-NMR chemical shift values enable machine learning techniques to clearly discriminate between the various anomeric configurations, types of linkages, and residues of trisaccharides as well as disaccharides. A tool built on this basis, which allows users to predict the structure of oligosaccharides from the unassigned 13C-NMR chemical shifts, could be very useful.

The validation of the CASPER program by Loß et al. [31] showed in many cases discrepancies between the calculated and experimental 13C chemical shifts as low as 0.2 ppm for 155 glycan compounds. These values are in the same range as the differences between measurements from different laboratories resulting from slightly dissimilar experimental conditions. Moreover, in our previous work with disaccharides [7], the 13C-NMR chemical shifts of the training set calculated with the CASPER program were validated against experimental values obtained from the literature for the test set, and a maximum error and a mean RMS (root mean square) error of 0.95 ppm and 0.15 ppm, respectively, were obtained.
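The two error measures mentioned above can be computed for one compound as in the following Python sketch. It assumes that calculated and experimental shifts are compared position by position after both lists have been sorted in ascending order; the exact pairing used in the previous work is not stated here. The function name and the numbers in the example are hypothetical.

```python
import math


def shift_errors(calculated_ppm, experimental_ppm):
    """Return (maximum absolute error, RMS error) in ppm for one compound.

    ASSUMPTION: both lists are sorted in ascending order and compared
    position by position.
    """
    if len(calculated_ppm) != len(experimental_ppm):
        raise ValueError("both lists must contain the same number of shifts")
    diffs = [c - e for c, e in zip(sorted(calculated_ppm), sorted(experimental_ppm))]
    max_error = max(abs(d) for d in diffs)
    rms_error = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return max_error, rms_error


# Hypothetical values, for illustration only.
max_err, rms_err = shift_errors([61.4, 70.2, 103.0], [61.5, 70.6, 102.7])
```

Averaging the per-compound RMS values over a set of compounds would then give a mean RMS error of the kind quoted in the text.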

3.2. Selection of Training and Test Sets for the Trisaccharides Model

The whole data set was divided into a training set of 92 compounds and a test set of 27 compounds, which were used for the development and external validation of the quantitative spectra-structure relationship models. The approximate 3:1 partition was assisted by a Kohonen Self-Organizing Map (SOM) [32] in such a way that both sets span the chemical diversity of the data set. The 119 compounds were mapped on a SOM on the basis of the twenty three unassigned 13C-NMR chemical shifts (independent variables). No information concerning the structure was used. A trend for clustering according to structural classes of compounds, such as D-glucan trisaccharides (consisting of three monomers of D-glucose), was observed. Compounds belonging to the various clusters were selected for the test set from singly occupied neurons.

3.3. Random Forest (RF) [33,34,35]

A RF is an ensemble of unpruned classification trees created by using bootstrap samples of the training data. The best split at each node was defined among a randomly selected subset of descriptors. It is a high-dimensional nonparametric method that usually works well on large numbers of descriptors. Prediction is made by a majority vote of the classification trees in the forest. It has been shown that the method is extremely accurate in a variety of applications [34]. Additionally, performance is internally assessed with the prediction error for the objects left out, called out-of-bag (OOB) data, in the bootstrap procedure (internal cross-validation or OOB estimation). The method quantifies the importance of a descriptor by the increase in misclassification occurring when the values of the descriptor are randomly permuted, correlated with the mean decrease in accuracy parameter, or by the decrease in a node’s impurity every time the descriptor is used for splitting, correlated with mean decrease in Gini parameter. RFs also assign a probability to every prediction on the basis of the number of votes obtained by the predicted class. A measure of similarity between two objects can be calculated from the number of trees in the ensemble that classify the two objects in the same terminal node. Therefore, it is a supervised method because such comparison relies on the descriptors that were chosen by the forest to build the model. In this study, RFs were grown with the R program, version 2.12.1, [36] using the Random Forest library [37]. RFs were trained for the classification of: (1) anomeric configurations; (2) type of linkages; and (3) residues of di- and trisaccharides on the basis of their unassigned 13C-NMR chemical shifts (independent variables). The number of trees in a RF was set to 1,000.
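Two of the mechanisms described above, the bootstrap sample with its out-of-bag (OOB) objects and the majority vote with its vote-based probability, can be sketched in plain Python. This is an illustration of the general idea only and not the R implementation used in this study; the function names are hypothetical, and the tree predictions in the example are invented.

```python
import random
from collections import Counter


def bootstrap_split(n_objects, rng):
    """Draw a bootstrap sample (with replacement) and return (in-bag indices, OOB indices)."""
    in_bag = [rng.randrange(n_objects) for _ in range(n_objects)]
    out_of_bag = sorted(set(range(n_objects)) - set(in_bag))
    return in_bag, out_of_bag


def forest_vote(tree_predictions):
    """Majority vote over the trees; the vote share serves as the class probability."""
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_predictions)


rng = random.Random(0)                      # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(92, rng)   # 92 training trisaccharides

# Invented predictions of 1,000 trees for one compound: 700 vote alpha, 300 vote beta.
label, probability = forest_vote(["A"] * 700 + ["B"] * 300)   # "A" with a vote share of 0.7
```

Each tree is grown on its in-bag objects only, so the objects in `out_of_bag` can be used to estimate the prediction error of that tree without a separate validation set.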

3.4. Classification Tree (CT) [35,38]

Nine models of classification were built, and for each model a single classification tree was investigated to predict: (1) anomeric configuration of the first unit, second unit and reducing end; (2) type of first and second linkages; (3) reducing end, middle and first residue; and (4) chain type of trisaccharide. Each tree was grown with the CART algorithm [38] and therefore differs from the trees in the RFs. A classification tree is sequentially constructed by partitioning objects from a parent node into two child nodes. Each node is produced by a logical rule, usually defined for a single descriptor, where objects below a certain value of the descriptor fall into one of the two child nodes, and objects above it fall into the other child node. The prediction for an object reaching a given terminal node is obtained by a majority vote of the objects (in the training set) reaching the same terminal node. The entire procedure comprises three main steps. First, an entire tree is constructed by splitting the data into smaller nodes; each split is evaluated by an impurity function, which decreases as long as the new split makes the content of the child nodes more homogeneous than that of the parent node, which serves to minimize the Gini index. Secondly, a set of smaller, nested trees is obtained by the obliteration (pruning) of certain nodes of the tree obtained in the first step, therefore minimizing the entropy. The selection of the weakest branches is based on a cost-complexity measure that decides which subtree, from a set of subtrees with the same number of terminal nodes, has the lowest (within node) error. In this study, a classification tree was grown with the R program, version 2.12.1 [36], using the RPART library with the default parameters.
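The evaluation of a candidate split by the Gini index can be sketched in a few lines of Python. The sketch searches the best threshold for a single descriptor by minimizing the size-weighted Gini impurity of the two child nodes; it illustrates the principle only and is not the RPART implementation. Function names and the example values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted child impurity) of the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, parent impurity) if no split lowers the impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Invented values of one descriptor (ppm) and anomeric classes, for illustration only.
threshold, impurity = best_threshold([70.1, 71.0, 73.5, 74.2], ["A", "A", "B", "B"])
```

In this toy example the parent node has an impurity of 0.5 and the split at 72.25 ppm separates the two classes completely, so the weighted impurity of the child nodes drops to 0.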

3.5. Counterpropagation Neural Network (CPGNN) [39]

A CPGNN consists of a Kohonen Self-Organizing Map (Kohonen SOM) [32] linked to an output layer of neurons aligned with the Kohonen layer. A Kohonen SOM distributes objects over a 2D surface (a grid of neurons) in such a way that objects bearing similar descriptors are mapped onto the same or adjacent neurons. The input data are stored in the two-dimensional grid of neurons, each containing as many elements (weights) as there are input variables (twenty three 13C-NMR chemical shifts). The nine output data (the three anomeric configurations, the two linkage types, the three residues and the chain type) are stored in the output layer, which acts as a look-up table. CPGNNs of toroidal topology and size 11 × 11 (number of neurons approximately 1.3 times the number of training cases) were trained with default parameters. The training was performed over 100 cycles. CPGNNs were implemented with an in-house developed Java application derived from the JATOON Java applets [40,41].
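The look-up behaviour of a trained CPGNN can be sketched in Python as follows. The sketch assumes that the winning neuron is the one whose input weights have the smallest Euclidean distance to the descriptor vector and that neighbourhoods on the toroidal grid are square; both are common choices but are assumptions here, as are the function names and the data layout (dictionaries keyed by grid position). Training is not shown.

```python
import math

GRID_SIZE = 11  # 11 x 11 neurons, i.e., 121 neurons for 92 training compounds


def toroidal_distance(a, b, size=GRID_SIZE):
    """Grid distance between neurons a and b when opposite map edges are joined.

    ASSUMPTION: square neighbourhoods (the larger of the row and column offsets).
    """
    d_row = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    return max(min(d_row, size - d_row), min(d_col, size - d_col))


def winning_neuron(input_weights, descriptors):
    """Position of the neuron whose input weights are closest to the descriptors."""
    return min(input_weights, key=lambda pos: math.dist(input_weights[pos], descriptors))


def look_up(input_weights, output_weights, descriptors):
    """Return the output-layer entry stored at the winning neuron."""
    return output_weights[winning_neuron(input_weights, descriptors)]


# Neurons on opposite corners of the map are direct neighbours on a torus.
corner_distance = toroidal_distance((0, 0), (10, 10))   # 1
neuron_ratio = GRID_SIZE * GRID_SIZE / 92               # about 1.3
```

Because the output layer is aligned with the Kohonen layer, a single mapping of the 23 descriptors onto the grid yields all nine outputs at once.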

4. Conclusions

The results indicate that machine learning techniques can be trained to predict the three anomeric configurations, the two types of linkages, the three residues and the chain type for linear or branched trisaccharides from the unassigned list of twenty-three 13C chemical shifts with acceptable accuracy. The random forest method showed improved prediction performance compared to a single classification tree and a counterpropagation neural network in the nine prediction tasks for the 119 pyranosyl trisaccharides. Our previous model for predicting the structure of disaccharides was incorporated into this new model, with an improvement of the predictive power.

Without the input of trisaccharide stereochemical data, it was possible to predict from the unassigned 1D 13C chemical shifts the three anomeric configurations, corresponding to the stereochemistry of the two glycosidic linkages and of the reducing end of the trisaccharide, as well as small changes in the stereochemistry of each of the residues in the trimer. On the one hand, the monomers D-mannose and D-galactose are epimers of D-glucose; they differ only in the stereochemistry at C-2 and C-4, respectively. On the other hand, the monomers L-rhamnose and L-fucose are the 6-deoxy sugars of L-mannose and L-galactose, respectively. Thus, we conclude that the 1D 13C chemical shifts can encode important 3D features. Therefore, the nine models built can be an important tool for predicting the structure of oligosaccharides from unassigned 13C-NMR chemical shifts.

Better models to predict the structure of more complex oligosaccharides, as well as of other natural compounds, would probably require more data for calibration and can be an interesting approach in subsequent work. Moreover, additional work has to be done to further investigate the application of unassigned 13C-NMR chemical shifts, and to evaluate their predictive ability, as 3D molecular descriptors in quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR) analysis.

Supplementary Materials

Supplementary materials can be accessed at: http://www.mdpi.com/1420-3049/17/4/3818/s1.

Acknowledgments

The author thanks Fundação para a Ciência e a Tecnologia for the support through programme Ciência 2007.

References and Notes

  1. Herget, S.; Toukach, P.V.; Ranzinger, R.; Hull, W.E.; Knirel, Y.A.; von der Lieth, C-W. Statistical analysis of the bacterial carbohydrate structure data base (BCSDB): Characteristics and diversity of bacterial carbohydrates in comparison with mammalian glycans. BMC Struct. Biol. 2008, 8, 35, doi:10.1186/1472-6807-8-35.
  2. Bubb, W.A. NMR spectroscopy in the study of carbohydrates: Characterizing the structural complexity. Concepts Magn. Reson. 2003, 19A, 1–19.
  3. Duus, J.Ø.; Gotfredsen, C.H.; Bock, K. Carbohydrate structural determination by NMR spectroscopy: Modern methods and limitations. Chem. Rev. 2000, 100, 4589–4614, doi:10.1021/cr990302n.
  4. Vliegenthart, J.F.G.; Woods, R.J. NMR Spectroscopy and Computer Modeling of Carbohydrates—Recent Advances; American Chemical Society: Washington, DC, USA, 2006; pp. 1–19.
  5. Toukach, F.V.; Shashkov, A.S. Computer-assisted structural analysis of regular glycopolymers on the basis of 13C-NMR data. Carbohydr. Res. 2001, 335, 101–114.
  6. Maes, E.; Bonachera, F.; Strecker, G.; Guerardel, Y. SOACS index: An easy NMR-based query for glycan retrieval. Carbohydr. Res. 2009, 344, 322–330.
  7. Pereira, F. Prediction of the anomeric configuration, type of linkage, and residues in disaccharides from 1D 13C-NMR data. Carbohydr. Res. 2011, 346, 960–972, doi:10.1016/j.carres.2011.02.017.
  8. Jansson, P.E.; Kenne, L.; Widmalm, G. CASPER: A computer program used for structural analysis of carbohydrates. J. Chem. Inf. Comput. Sci. 1991, 31, 508–516.
  9. Stenutz, R.; Erbing, B.; Widmalm, G.; Jansson, P.E.; Nimmich, W. The structure of the capsular polysaccharide from Klebsiella type 52, using the computerised approach CASPER and NMR spectroscopy. Carbohydr. Res. 1997, 302, 79–84.
  10. Stenutz, R.; Jansson, P.E.; Widmalm, G. Computer-assisted structural analysis of oligo- and polysaccharides: An extension of CASPER to multibranched structures. Carbohydr. Res. 1998, 306, 11–17.
  11. Jansson, P.E.; Stenutz, R.; Widmalm, G. Sequence determination of oligosaccharides and regular polysaccharides using NMR spectroscopy and a novel web-based version of the computer program CASPER. Carbohydr. Res. 2006, 341, 1003–1010.
  12. CASPER website, Available online: http://www.casperold.organ.su.se/casper/ (accessed on 23 February 2012).
  13. Hiltunen, Y.; Heiniemi, E.; Ala-Korpela, M. Lipoprotein-lipid quantification by neural-network analysis of 1H-NMR data from human blood plasma. J. Magn. Reson. B 1995, 106, 191–194.
  14. Bienfait, B. Applications of high-resolution self-organizing maps to retrosynthetic and QSAR analysis. J. Chem. Inf. Comput. Sci. 1994, 34, 890–898.
  15. Novic, M.; Zupan, J. Investigation of infrared spectra-structure correlation using Kohonen and counterpropagation neural network. J. Chem. Inf. Comput. Sci. 1995, 35, 454–466.
  16. Munk, M.E.; Madison, M.S.; Robb, E.W. The neural network as a tool for multispectral interpretation. J. Chem. Inf. Comput. Sci. 1996, 36, 231–238.
  17. Rufino, A.R.; Brant, A.J.C.; Santos, J.B.O.; Ferreira, M.J.P.; Emerenciano, V.P. Simple method for identification of skeletons of aporphine alkaloids from 13C-NMR data using artificial neural networks. J. Chem. Inf. Comput. Sci. 2005, 45, 645–651.
  18. Emerenciano, V.P.; Alvarenga, S.A.V.; Scotti, M.T.; Ferreira, M.J.P.; Stefani, R.; Nuzillard, J.M. Automatic identification of terpenoid skeletons by feed-forward neural networks. Anal. Chim. Acta 2006, 579, 217–226.
  19. Dominik, M. NeuroCarb: Artificial neural networks for NMR structure elucidation of oligosaccharides. Ph.D. Thesis, University of Basel, Basel, Switzerland, 2006.
  20. Shashkov, A.S.; Nifanťev, N.E.; Amochaeva, V.Y.; Kochetkov, N.K. 1H and 13C-NMR data for 2-O-, 3-O- and 2,3-di-O-glycosylated methyl α- and β-D-galactopyranosides. Magn. Reson. Chem. 1993, 31, 599–605.
  21. Usui, T.; Mizuno, T.; Kato, K.; Tomoda, M. 13C-NMR spectra of gluco-mannooligosaccharides and structurally related glucomannan. Agric. Biol. Chem. 1979, 43, 863–865.
  22. Baumann, H.; Erbing, B.; Jansson, P.E.; Kenne, L. NMR and conformational studies of some 3-O-, 4-O-, and 3,4-di-O-glycopyranosyl-substituted methyl α-D-galactopyranosides. J. Chem. Soc. Perkin Trans. 1989, 1, 2153–2165.
  23. Usui, T.; Yamaoka, N.; Matsuda, K.; Tuzimura, K.; Sugiyama, H.; Seto, S. 13C-Nuclear magnetic resonance spectra of glucobioses, glucotrioses, and glucans. J. Chem. Soc. Perkin Trans. 1973, 1, 2425–2432.
  24. Jansson, P.E.; Kjellberg, A.; Rundlöf, T.; Widmalm, G. Synthesis, NMR spectroscopy and conformational studies of two vicinally disubstituted trisaccharides. J. Chem. Soc. Perkin Trans. 1996, 2, 33–37.
  25. Baumann, H.; Erbing, B.; Jansson, P.E.; Kenne, L. Synthesis, NMR, and conformational studies of some 3,4-di-O-glycopyranosyl-substituted methyl α-D-galactopyranosides. J. Chem. Soc. Perkin Trans. 1989, 1, 2167–2168.
  26. Roslund, M.U.; Säwén, E.; Landström, J.; Rönnols, J.; Jonsson, K.H.M.; Lundborg, M.; Svensson, M.V.; Widmalm, G. Complete 1H and 13C-NMR chemical shift assignments of mono-, di-, and trisaccharides as basis for NMR chemical shift predictions of polysaccharides using the computer program CASPER. Carbohydr. Res. 2011, 346, 1311–1319, doi:10.1016/j.carres.2011.04.033.
  27. Urashima, T.; Bubb, W.A.; Messer, M.; Tsuji, Y.; Taneda, Y. Studies of the neutral trisaccharides of goat (Capra hircus) colostrum and of the one- and two-dimensional 1H and 13C-NMR spectra of 6'-N-acetylglucosaminyllactose. Carbohydr. Res. 1994, 262, 173–184, doi:10.1016/0008-6215(94)84177-2.
  28. Bock, K.; Duus, J.Ø.; Norman, B.; Pedersen, S. Assignment of structures to oligosaccharides produced by enzymic degradation of a β-D-glucan from barley by 1H- and 13C-NMR spectroscopy. Carbohydr. Res. 1991, 211, 219–233.
  29. Flugge, L.A.; Blank, J.T.; Petillo, P.A. Isolation, modification, and NMR assignments of a series of cellulose oligomers. J. Am. Chem. Soc. 1999, 121, 7228–7238.
  30. Jansson, P.E.; Stenutz, R.; Widmalm, G. Sequence determination of oligosaccharides and regular polysaccharides using NMR spectroscopy and a novel web-based version of the computer program CASPER. Carbohydr. Res. 2006, 341, 1003–1010.
  31. Loß, A.; Stenutz, R.; Schwarzer, E.; von der Lieth, C.W. GlyNest and CASPER: Two independent approaches to estimate 1H and 13C NMR shifts of glycans available through a common web-Interface. Nucleic Acids Res. 2006, 34, W733–W737.
  32. Kohonen, T. Self-Organization and Associative Memory; Springer: Berlin, Germany, 1988.
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  34. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
  35. Zhang, Q.Y.; Aires-de-Sousa, J. Random forest prediction of mutagenicity from empirical physicochemical descriptors. J. Chem. Inf. Model. 2007, 47, 1–8.
  36. R Development Core Team. R: A language and environment for statistical computing. ISBN 3-900051-07-0; R Foundation for Statistical Computing: Vienna, Austria, 2004. Available online: http://www.r-project.org/ (accessed on 23 February 2012).
  37. Liaw, A.; Wiener, M. randomForest (R software for random forest). Fortran original (Breiman, L.; Cutler, A.), R port (Liaw, A.; Wiener, M.). Version 4.6-6. Available online: http://stat-www.berkeley.edu/users/breiman/RandomForests/ (accessed on 23 February 2012).
  38. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman & Hall/CRC: Boca Raton, FL, USA, 2000.
  39. Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design; Wiley-VCH: Weinheim, Germany, 1999.
  40. Aires-de-Sousa, J. JATOON: Java tools for neural networks. Chemometr. Intell. Lab. Syst. 2002, 61, 167–173.
  41. JATOON applets. Available online: http://joao.airesdesousa.com/jatoon/ (accessed on 23 February 2012).
  • Sample Availability: Not available.

Supplementary Files

  • Supplementary File 1: XLS-Document (XLS, 274 KB)

Molecules, EISSN 1420-3049. Published by MDPI AG, Basel, Switzerland.