Combinatorial Sequences Approach by Using a Selection of Best Single Indices
This first approach restricts the number of indices used for the encoding in order to obtain a smaller ensemble of Ext_SEQ. We call it the “one phase combinatorial approach” because all possible combinations of Ele_SEQ are generated in a single phase from a limited, preselected list of indices.
First, the dataset is modeled with an encoding based on each of the 566 AAindex indices, one at a time and without concatenation of indices: i.e., only Ele_SEQ, and not Ext_SEQ, are used as modeling inputs. The models are ranked by increasing cvRMSE, which identifies the best index to use alone. The ranking also gives the N best single indices, which are the ones selected for combining Ele_SEQ into Ext_SEQ. Varying N changes the size of the ensemble of Ext_SEQ.
At this stage, all the Ext_SEQ are built in one phase. In our study, we limited N to the top 10 indices and Q ≤ 3 (at most three indices per combination). Thus, one sequence can be represented by 175 Ext_SEQ (10 single indices, 45 pairs, and 120 triplets). Next, each Ext_SEQ is used as modeling input and the resulting models are ranked. The best Ext_SEQ can thus be identified.
To illustrate this approach, let us consider the top 10 indices obtained after a ranking of the 566 indices, and let us consider only the “FFT_Seq”. The outcome after the combinatorial process could be, for example:
FFT_Seqj1--FFT_Seqj2
FFT_Seqj2--FFT_Seqj3
FFT_Seqj1--FFT_Seqj2--FFT_Seqj4
FFT_Seqj1--FFT_Seqj2--FFT_Seqj3--FFT_Seqj4--FFT_Seqj5--FFT_Seqj6--FFT_Seqj7--FFT_Seqj8--FFT_Seqj9--FFT_Seqj10
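The selection and enumeration steps of this one phase approach can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the helper names `top_n_indices` and `enumerate_ext_seq` are hypothetical, and the single-index cvRMSE values are invented placeholders. It assumes that the order of indices within a combination is ignored and that combinations of one, two, or three indices are all counted.

```python
from itertools import combinations


def top_n_indices(cv_rmse_by_index, n):
    """Return the n index identifiers with the lowest cvRMSE, best first."""
    return sorted(cv_rmse_by_index, key=cv_rmse_by_index.get)[:n]


def enumerate_ext_seq(selected, q_max):
    """Yield every unordered combination of 1..q_max of the selected indices."""
    for q in range(1, q_max + 1):
        yield from combinations(selected, q)


# Hypothetical single-index cvRMSE values, invented for illustration only.
toy_scores = {f"j{k}": 1.0 + 0.1 * k for k in range(1, 21)}

selected = top_n_indices(toy_scores, n=10)
candidates = list(enumerate_ext_seq(selected, q_max=3))
labels = ["--".join(f"FFT_Seq{j}" for j in combo) for combo in candidates]

print(len(candidates))  # 10 + 45 + 120 = 175
print(labels[10])       # first pair: FFT_Seqj1--FFT_Seqj2
```

Each label would then be used as one modeling input, and the 175 resulting models ranked by cvRMSE.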
This “one phase combinatorial approach” was applied to our four datasets (GLP-2, TNF alpha, cytochrome P450, and epoxide hydrolase) in order to evaluate the models and identify the best ones.
GLP-2. We applied this approach to the GLP-2 dataset. In a previous study [22], index 449 was shown to be the best after a ranking of the indices with FFT encoding.
Figure 3 shows the results obtained with index 449 alone and the model identified in our previous work [22]. The cvR² and cvRMSE are 0.42 and 2.11, respectively, for index 449.
We used this ranking of the indices and applied the one phase combinatorial approach: combinations of at most three indices were built from the top 10 indices of the previous ranking. As FFT_Seqj1--FFT_Seqj2 is equivalent to FFT_Seqj2--FFT_Seqj1, the order of the indices is ignored and 175 extended sequences are obtained.
Table 1 shows that the best cvR² and cvRMSE obtained with three indices are 0.47 and 1.99, respectively. Thus, in this case the modeling performance appears better, but the improvement is not significant (p-value = 0.531). Nevertheless, interesting findings were obtained. We tested the model using all 10 indices to form the Ext_SEQ FFT_Seqj1--FFT_Seqj2--…--FFT_Seqj10. The cvRMSE of this model rises to 2.48 and the cvR² drops to 0.11. So, it should be noted that the right number of indices has to be found: i.e., a combination of m indices is not always better than a combination of n indices (with m > n). In other words, a larger Ext_SEQ does not guarantee better modeling performance, and adding indices to the encoding step does not necessarily improve the model. Moreover, we note that index 449, the best index when only one index is selected, is not necessarily the best one to use in a combination of three indices from the top 10, as exemplified in Table 1. Indeed, 449 alone appears in position nine in the ranking.
Epoxide Hydrolase
We performed the same operation, with application of FFT, on the epoxide hydrolase dataset; the results are presented in Table 2.
In another study [23], we identified index 303 as the best for epoxide hydrolase after a ranking of the indices.
Figure 4 shows the results obtained with the model based on the index 303 alone.
The performance with the best index, index 303, was already high: 0.96 and 0.12 for cvR² and cvRMSE, respectively. The best performances, seen in Table 2, are 0.969 and 0.105 for cvR² and cvRMSE, respectively, with a p-value of 0.43. Thus, combining multiple indices appears to slightly improve the modeling performance, but the improvement is not significant.
It should be noted that index 303, identified as the best when ranking the 566 indices, is ranked only in position 38 when the combinatorial approach is used: i.e., 37 combinations of indices perform better than 303 alone (considering only the top 10 indices in this example, which include the best single index, 303).
On the other datasets, the combinatorial approach did not yield a significant improvement either. One reason could be that the top 10 single indices are not the best candidates for a concatenation of three indices. We therefore implemented another method, which selects indices differently in order to improve the modeling performance.
Successive Concatenation of a Protein Sequence Encoded by Multiple Indices
The second method is termed “successive concatenation” because the size of the Ext_SEQ is increased incrementally over several iterations: N iterations of innov’SAR modeling are applied to find, one by one, the N best indices and the associated Ext_SEQ.
At each iteration, the index or indices selected in the previous iterations are kept. The Ext_SEQ is thus built incrementally from different indices: at the end of an iteration, the index giving the best modeling performance is determined and kept for the next iteration.
In the first iteration, the 566 indices of AAindex are evaluated one by one, as described for the modeling approach based on one index. The 566 indices are ranked according to their cvRMSE values (from the cross-validation procedure), as detailed in Section “Modeling approach”. The best index, j1, is the one that gives the lowest cvRMSE. Consequently, j1 is the index used to construct the first part of the Ext_SEQ: the protein sequence is encoded according to j1, using either the noFFT_Seq or the FFT_Seq representation. In the second iteration, the process identifies another index, j2, to build an Ext_SEQ of two Ele_SEQ, starting from the sequence encoded by j1 as the base block. The index j2 is identified by a second ranking over all the indices except j1, which is already used in the base block, i.e., over 566 − 1 = 565 indices. The same operation is repeated at each iteration to find the next best index and increase the size of the Ext_SEQ.
Thus, an extended sequence Ext_SEQ such as FFT_Seqj1--FFT_Seqj2--…--FFT_Seqjn is obtained. This could be extended to any number of parts in the Ext_SEQ. Furthermore, a mix of noFFT_Seq and FFT_Seq representations could be used.
This procedure is illustrated in Figure 5.
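The iterative procedure amounts to a greedy forward selection of indices, which can be sketched as follows. This is a minimal sketch under stated assumptions, not the innov’SAR implementation: `successive_concatenation` is a hypothetical helper, and `toy_cv_rmse` is an invented stand-in for the real step that would encode the sequences, build the Ext_SEQ, fit a model, and return its cvRMSE.

```python
def successive_concatenation(all_indices, cv_rmse, n_iterations):
    """Greedy forward selection of indices.

    cv_rmse is a caller-supplied function (hypothetical here) that takes a
    tuple of indices and returns the cvRMSE of the model built on the
    corresponding Ext_SEQ.
    """
    kept = []      # indices retained so far: j1, j2, ...
    history = []   # (indices kept, cvRMSE) after each iteration
    for _ in range(n_iterations):
        remaining = [j for j in all_indices if j not in kept]
        if not remaining:
            break
        # Score every remaining index appended to the block kept so far.
        scored = [(cv_rmse(tuple(kept) + (j,)), j) for j in remaining]
        best_score, best_index = min(scored)  # lowest cvRMSE wins
        kept.append(best_index)
        history.append((tuple(kept), best_score))
    return history


def toy_cv_rmse(indices):
    """Invented scores for illustration only; not real modeling results."""
    single = {1: 2.0, 2: 2.1, 3: 2.5, 4: 2.6}
    score = min(single[j] for j in indices)
    if {1, 3} <= set(indices):
        score -= 0.5  # invented synergy between indices 1 and 3
    return score


history = successive_concatenation([1, 2, 3, 4], toy_cv_rmse, n_iterations=2)
for kept, score in history:
    print(kept, score)
```

In this toy case the second iteration retains index 3 rather than index 2, although index 2 ranks second when used alone. This is the situation the method is designed to capture: the best companion of j1 need not be among the best single indices.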
This procedure is exemplified on the four datasets, using three indices in each case.
GLP-2 Dataset
The modeling performances obtained with the best single index, 449, are 0.42 and 2.11 for cvR² and cvRMSE, respectively (cf. Figure 3).
Figure 6 shows the results obtained using the three indices 449, 341, and 193, gathered in the Ext_SEQ “FFT_Seqj1--FFT_Seqj2--FFT_Seqj3”. The cvR² and cvRMSE are 0.55 and 1.75, respectively. Thus, using the three indices significantly improves the quality of the prediction. This is confirmed by a p-value of 0.008 in Student’s test for the significance of the improvement.
Epoxide Hydrolase Dataset
The best model based on one index is shown in Figure 4, with 0.96 and 0.12 for the cvR² and the cvRMSE, respectively.
Figure 7 shows the result of applying the successive concatenation method to the epoxide hydrolase dataset. The combination with indices 14 and 234 gives better performances, with 0.97 and 0.09 for the cvR² and the cvRMSE, respectively, but the improvement over one index is not significant (p-value of 0.343). Nevertheless, we note that this p-value is lower than the value (0.43) shown in Table 2 above.
TNF Alpha Dataset
The performances with the best index, index 203 (Figure 8a), are 0.85 and 0.32 for the cvR² and cvRMSE, respectively, for the TNF alpha dataset. The combination with indices 504 and 486 (Figure 8b) increases cvR² to 0.88 and decreases cvRMSE to 0.28 (p-value = 0.175).
Cytochrome P450 Dataset
The same method was applied to the cytochrome P450 dataset; the results are shown in Figure 9. The performances with the best index, index 300, are shown in Figure 2a, with a cvR² of 0.83 and a cvRMSE of 1.91. The combination with indices 39 and 226 significantly improves these performances, to 0.88 and 1.63 for the cvR² and the cvRMSE, respectively. This improvement is confirmed by the p-value (0.002).