#### Computational Methods

A nucleic acid is a long, unbranched polynucleotide, that is, a polymer consisting of nucleotides. Each nucleotide has three components: 1) a cyclic five-carbon sugar, 2) a purine or pyrimidine base attached to the 1’-carbon atom of the sugar by an N-glycosidic bond, and 3) a phosphate attached to the 5’-carbon of the sugar by a phosphoester linkage. The nucleotides in nucleic acids are covalently linked by a second phosphoester bond that joins the 5’-phosphate of one nucleotide to the 3’-OH group of the adjacent nucleotide. The purine and pyrimidine bases are not engaged in any covalent bonds with each other. Thus, a polynucleotide consists of an alternating sugar-phosphate backbone, and each nucleotide is characterized by the base attached to it, which can be adenine (A), cytosine (C), guanine (G) or thymine (T) [RNA molecules contain the base uracil (U) instead of T]. Consequently, an RNA molecule is uniquely determined by the sequence of bases along its chain, and it has a definite orientation [20, 21, 22, 23].

In particular, a typical RNA is a single-stranded polyribonucleotide. This macromolecule has a folded 3D conformation that is held together in part by noncovalent base-pairing interactions like those that hold together the two strands of the DNA helix. In the single-stranded RNA molecule, however, the complementary base pairs form between nucleotide residues in the same chain, which causes the RNA molecule to fold up in a unique way that is important for its biochemical activity. The RNA structure therefore also contains several sets of unpaired nucleotide residues. Most of the weak interactions (hydrogen bonds) form between Watson-Crick complementary bases (between pairs of non-consecutive bases), i.e., between A and U and between C and G, but a far from negligible number of bonds also form between other pairs of bases, for example the G·U wobble pairs [20, 21, 22, 23].

The general principles of the molecular quadratic indices of the “molecular pseudograph’s atom adjacency matrix” for small-to-medium sized organic compounds have been explained in some detail elsewhere [15, 16, 17, 18, 19]. Nevertheless, an extended overview of this approach is given here.

First, in analogy to the molecular vector X used to represent organic molecules, we introduce here the macromolecular vector (X_{m}). The components of this vector are numeric values that represent a certain property of the nucleotide residues (DNA-RNA bases). These properties characterize each kind of nucleotide (purine and pyrimidine bases) within the nucleic acid, because the base is the only part that differs from one nucleotide to another. Such properties can be the experimental molar absorption coefficient ε_{260} at 260 nm and pH = 7.0, the first (ΔE_{1}) and second (ΔE_{2}) single excitation energies in eV, the first (f_{1}) and second (f_{2}) oscillator strength values (of the first singlet excitation energies) of the DNA-RNA bases, and so on [24]. For instance, the f_{1(B)} property of the DNA-RNA bases B takes the values f_{1(A)} = 0.28 for adenine, f_{1(G)} = 0.20 for guanine, f_{1(U)} = 0.18 for uracil and so on [24].

Table 1 lists these descriptor properties for the DNA-RNA bases.

**Table 1.**
Five properties of the DNA-RNA bases used as labels to characterize each nucleotide: experimental molar absorption coefficient ε_{260} at 260 nm and pH = 7.0, first (ΔE_{1}) and second (ΔE_{2}) single excitation energies in eV, and first (f_{1}) and second (f_{2}) oscillator strength values (of the first singlet excitation energies) of the DNA-RNA bases [24].

| Purine and pyrimidine bases (RNA/DNA) | f_{1} | f_{2} | ε_{260}/1000 | ΔE_{1} (eV) | ΔE_{2} (eV) |
|---|---|---|---|---|---|
| Adenine (A) | 0.28 | 0.54 | 15.4 | 4.75 | 5.99 |
| Guanine (G) | 0.20 | 0.27 | 11.7 | 4.49 | 5.03 |
| Uracil (U) | 0.18 | 0.3 | 9.9 | 4.81 | 6.11 |
| Thymine (T) | 0.18 | 0.37 | 9.2 | 4.67 | 5.94 |
| Cytosine (C) | 0.13 | 0.72 | 7.5 | 4.61 | 6.26 |

Thus, RNAs having 5, 10, 15,..., n nucleotides can be represented by means of vectors with 5, 10, 15,..., n components, belonging to the spaces ℜ^{5}, ℜ^{10}, ℜ^{15},..., ℜ^{n}, respectively, where n is the dimension of the real space ℜ^{n}. This approach allows us to encode an RNA sequence such as AGUCACGUA through the macromolecular vector X_{m} = [0.28, 0.20, 0.18, 0.13, 0.28, 0.13, 0.20, 0.18, 0.28] in the f_{1}-scale (see Table 1). This vector belongs to the product space ℜ^{9}. The use of other DNA-RNA base properties defines alternative macromolecular vectors.
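The encoding step can be illustrated with a minimal Python sketch. It is not part of the TOMOCOMD-CANAR software; the dictionary `BASE_PROPERTIES`, the function `macromolecular_vector` and the scale names (`f1`, `f2`, `eps260`, `dE1`, `dE2`) are hypothetical names chosen here, and the numbers are transcribed from Table 1.

```python
# Base properties transcribed from Table 1.
# Keys are hypothetical labels: f1, f2, eps260 (= epsilon_260 / 1000), dE1, dE2.
BASE_PROPERTIES = {
    "A": {"f1": 0.28, "f2": 0.54, "eps260": 15.4, "dE1": 4.75, "dE2": 5.99},
    "G": {"f1": 0.20, "f2": 0.27, "eps260": 11.7, "dE1": 4.49, "dE2": 5.03},
    "U": {"f1": 0.18, "f2": 0.30, "eps260": 9.9, "dE1": 4.81, "dE2": 6.11},
    "T": {"f1": 0.18, "f2": 0.37, "eps260": 9.2, "dE1": 4.67, "dE2": 5.94},
    "C": {"f1": 0.13, "f2": 0.72, "eps260": 7.5, "dE1": 4.61, "dE2": 6.26},
}


def macromolecular_vector(sequence, scale="f1"):
    """Map each base of the sequence to its value in the chosen property scale."""
    return [BASE_PROPERTIES[base][scale] for base in sequence.upper()]


# The example sequence from the text, encoded in the f1 scale.
x_m = macromolecular_vector("AGUCACGUA")
print(x_m)  # [0.28, 0.2, 0.18, 0.13, 0.28, 0.13, 0.2, 0.18, 0.28]
```

Choosing another scale, for example `macromolecular_vector("AGUCACGUA", scale="dE1")`, gives one of the alternative macromolecular vectors mentioned above.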

For a given nucleic acid composed of n nucleotides (a vector of ℜ^{n}), the “macromolecular vector” (X_{m}) is constructed and the k^{th} total quadratic indices of the nucleic acid, **q**_{k}(x_{m}), are calculated as quadratic forms, as shown in Eq. 1:

$${q}_{k}({x}_{m})=\sum _{i=1}^{n}\sum _{j=1}^{n}{}^{k}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}\qquad (1)$$

where ^{k}a_{ij} = ^{k}a_{ji} (symmetric square matrix), n is the number of nucleotides of the nucleic acid, and ^{m}X_{1},…,^{m}X_{n} are the coordinates or components of the macromolecular vector (X_{m}) in a system of canonical basis vectors of ℜ^{n}. In this case, the canonical (‘natural’) basis of ℜ^{n}, {e_{1},…,e_{n}}, is used as the basis of the form. Therefore, the coordinates of any vector X_{m} coincide with the components of this vector. For that reason, such coordinates can be considered as weights of the vertices (DNA-RNA bases) of the graph of the nucleic acid’s backbone. The coefficients ^{k}a_{ij} are the elements of the k^{th} power of the macromolecular matrix M(G_{m}) of the nucleic acid’s graph (G_{m}). Here, M(G_{m}) = [a_{ij}] is an n × n matrix, where n is the number of bases (nucleotides) in the sugar-phosphate backbone. The elements a_{ij} are defined as follows:

$$a_{ij}=\begin{cases}P_{ij} & \text{if } i\neq j \text{ and an edge joining } v_{i} \text{ and } v_{j} \text{ exists in } E(G_{m})\\ 0 & \text{otherwise}\end{cases}$$

**Table 2.**
A close-up of the mathematical definition of the total (RNA fragment) and local (nucleotide) nucleic acid quadratic indices of the “macromolecular graph’s nucleotide adjacency matrix” of an RNA fragment.

**Secondary structure of an RNA fragment of the SL 2 motif (see Figure 1)** and its **macromolecular graph G**_{m}, an undirected graph with multiple edges (the two drawings are not reproduced here).

**X**_{m} = [G A C U G G U G A G U A C]; **X**_{m} ∈ ℜ^{13}

In the definition of **X**_{m} as a macromolecular vector, the symbol of each base stands for the corresponding DNA-RNA base property, for instance f_{1}. That is, A means f_{1(A)}, the first oscillator strength value of adenine, or any other base property that characterizes each nucleotide in the nucleic acid molecule. If we use the canonical basis of ℜ^{13}, the coordinates of any macromolecular vector **X**_{m} coincide with the components of that vector.

[**X**_{m}]^{t} = [0.20 0.28 0.13 0.18 0.20 0.20 0.18 0.20 0.28 0.20 0.18 0.28 0.13]

[**X**_{m}]^{t}: transpose of [**X**_{m}], i.e., the vector of the coordinates of **X**_{m} in the canonical basis of ℜ^{13} (a row matrix)

[**X**_{m}]: vector of the coordinates of **X**_{m} in the canonical basis of ℜ^{13} (a column matrix)

**M**^{1}(G_{m}): macromolecular graph’s nucleotide adjacency matrix

| Total quadratic index (quadratic form) | Matrix form | Value |
|---|---|---|
| $q_{0}(x_{m})=\sum_{i=1}^{n}\sum_{j=1}^{n}{}^{0}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}$ | [^{m}X]^{t}**M**^{0}(G_{m})[^{m}X] | 0.5662 |
| $q_{1}(x_{m})=\sum_{i=1}^{n}\sum_{j=1}^{n}{}^{1}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}$ | [^{m}X]^{t}**M**^{1}(G_{m})[^{m}X] | 1.7124 |
| $q_{2}(x_{m})=\sum_{i=1}^{n}\sum_{j=1}^{n}{}^{2}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}$ | [^{m}X]^{t}**M**^{2}(G_{m})[^{m}X] | 6.7533 |
| $q_{3}(x_{m})=\sum_{i=1}^{n}\sum_{j=1}^{n}{}^{3}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}$ | [^{m}X]^{t}**M**^{3}(G_{m})[^{m}X] | 25.3806 |
| $q_{4}(x_{m})=\sum_{i=1}^{n}\sum_{j=1}^{n}{}^{4}a_{ij}\,{}^{m}X_{i}\,{}^{m}X_{j}$ | [^{m}X]^{t}**M**^{4}(G_{m})[^{m}X] | 105.5649 |

| **Nucleotide (N)** | **q**_{0L}(X_{m}, N) | **q**_{1L}(X_{m}, N) | **q**_{2L}(X_{m}, N) | **q**_{3L}(X_{m}, N) | **q**_{4L}(X_{m}, N) |
|---|---|---|---|---|---|
| G285 | 0.04 | 0.134 | 0.666 | 2.154 | 9.654 |
| A286 | 0.0784 | 0.1932 | 1.0668 | 3.5112 | 17.2256 |
| C287 | 0.0169 | 0.1378 | 0.5369 | 2.8223 | 10.1634 |
| U288 | 0.0324 | 0.1602 | 0.5328 | 2.0844 | 8.9226 |
| G289 | 0.04 | 0.076 | 0.254 | 0.748 | 2.738 |
| G290 | 0.04 | 0.076 | 0.156 | 0.422 | 1.136 |
| U291 | 0.0324 | 0.072 | 0.1512 | 0.3492 | 1.0872 |
| G292 | 0.04 | 0.092 | 0.232 | 0.786 | 2.8 |
| A293 | 0.0784 | 0.2128 | 0.8652 | 3.3768 | 12.6308 |
| G294 | 0.04 | 0.17 | 0.996 | 3.604 | 18.342 |
| U295 | 0.0324 | 0.1872 | 0.4572 | 2.6136 | 8.6328 |
| A296 | 0.0784 | 0.0868 | 0.5376 | 1.3608 | 7.4004 |
| C297 | 0.0169 | 0.1144 | 0.3016 | 1.5483 | 4.8321 |
| **RNA fragment** | 0.5662 | 1.7124 | 6.7533 | 25.3806 | 105.5649 |

where E(G_{m}) represents the set of edges of G_{m} and P_{ij} is the number of edges between the vertices (nucleotides) v_{i} and v_{j}. In this adjacency matrix M(G_{m}), row i and column i correspond to vertex v_{i} of G_{m}. The element a_{ij} of this matrix represents the bonds between nucleotide i and nucleotide j. Here, we consider only covalent interactions (phosphodiester bonds) and hydrogen-bond interactions (between complementary bases). As a first approximation, we consider both interactions equivalent. The matrix M^{k}(G_{m}) provides the number of walks of length k linking the nucleotides i and j.

Equation (1) for q_{k}(x_{m}) can be written as the single matrix equation:

$${q}_{k}({x}_{m})=[{}^{m}X]^{t}\,\mathbf{M}^{k}(G_{m})\,[{}^{m}X]$$

where [^{m}X] is a column vector (an n×1 matrix), [^{m}X]^{t} is the transpose of [^{m}X] (a 1×n matrix) and M^{k}(G_{m}) is the k^{th} power of the matrix M(G_{m}) of the macromolecular pseudograph G_{m} (the matrix of the quadratic form).

Table 2 exemplifies the calculation of q_{k}(x_{m}) for the secondary structure of an RNA fragment.
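As a numerical cross-check of these definitions, the quadratic form can be evaluated directly. The following is a minimal sketch in Python with NumPy, not the TOMOCOMD-CANAR implementation. The helper names `adjacency_matrix` and `total_quadratic_indices` are hypothetical. The base pairs and their edge multiplicities (one edge per phosphodiester link, two per A-U pair, three per G-C pair) are an assumption inferred here from the local indices tabulated in Table 2; the text itself does not state them.

```python
import numpy as np

# f1 oscillator strengths from Table 1.
F1 = {"A": 0.28, "G": 0.20, "U": 0.18, "T": 0.18, "C": 0.13}


def adjacency_matrix(n, base_pairs):
    """Build M(G_m) for a chain of n nucleotides.

    One edge joins consecutive nucleotides (backbone); `base_pairs` maps
    0-based index pairs (i, j) to the number of edges between them.
    """
    m = np.zeros((n, n))
    for i in range(n - 1):
        m[i, i + 1] = m[i + 1, i] = 1.0
    for (i, j), multiplicity in base_pairs.items():
        m[i, j] = m[j, i] = float(multiplicity)
    return m


def total_quadratic_indices(x, m, k_max):
    """Return [q_0, ..., q_kmax], with q_k = x^t M^k x."""
    x = np.asarray(x, dtype=float)
    return [float(x @ np.linalg.matrix_power(m, k) @ x) for k in range(k_max + 1)]


sequence = "GACUGGUGAGUAC"  # G285 ... C297, as in Table 2
# ASSUMPTION: pairs and multiplicities inferred from the tabulated q_1L values.
base_pairs = {(0, 12): 3, (1, 10): 2, (2, 9): 3, (3, 8): 2}

x_m = [F1[base] for base in sequence]
M = adjacency_matrix(len(sequence), base_pairs)
q = total_quadratic_indices(x_m, M, k_max=4)
print([round(value, 4) for value in q])
```

Under these assumptions the sketch gives 0.5662, 1.7124, 6.7533, 25.3806 and 105.5649 for k = 0 to 4, which are the totals listed in Table 2.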

In addition to the total quadratic indices, computed for the whole macromolecule, local-fragment (nucleotide and nucleotide-type) formalisms can be developed. These descriptors are termed local nucleic acid quadratic indices, q_{kL}(x_{m}). They are defined as follows:

$${q}_{kL}({x}_{m})=\sum _{i=1}^{m}\sum _{j=1}^{m}{}^{k}a_{ijL}\,{}^{m}X_{i}\,{}^{m}X_{j}$$

where m is the number of nucleotides of the fragment of interest and ^{k}a_{ijL} is the element of row i and column j of the matrix M^{k}_{L}(G_{m}). This matrix is extracted from M^{k}(G_{m}) and contains information on the vertices of the specific nucleic acid fragment (F_{R}) and also on its molecular environment. The matrix M^{k}_{L}(G_{m}) = [^{k}a_{ijL}] is defined as follows:

$${}^{k}a_{ijL}=\begin{cases}{}^{k}a_{ij} & \text{if both } v_{i} \text{ and } v_{j} \text{ are vertices of } F_{R}\\ \tfrac{1}{2}\,{}^{k}a_{ij} & \text{if either } v_{i} \text{ or } v_{j}\text{, but not both, is a vertex of } F_{R}\\ 0 & \text{otherwise}\end{cases}$$

where the ^{k}a_{ij} are the elements of the k^{th} power of M(G_{m}). These local analogues can also be expressed in matrix form:

$${q}_{kL}({x}_{m})=[{}^{m}X]^{t}\,\mathbf{M}^{k}_{L}(G_{m})\,[{}^{m}X]$$

Note that for any partition of a nucleic acid into Z macromolecular fragments there will be Z local macromolecular-fragment matrices. That is to say, if a nucleic acid is partitioned into Z macromolecular fragments, the matrix M^{k}(G_{m}) can be partitioned into Z local matrices M^{k}_{L}(G_{m}), L = 1,..., Z. The k^{th} power of the matrix M(G_{m}) is exactly the sum of the Z local matrices:

$$\mathbf{M}^{k}(G_{m})=\sum _{L=1}^{Z}\mathbf{M}^{k}_{L}(G_{m})$$

In the same way, M^{k}(G_{m}) = [^{k}a_{ij}], where

$${}^{k}a_{ij}=\sum _{L=1}^{Z}{}^{k}a_{ijL}$$

and the total nucleic acid quadratic indices are the sum of the local quadratic indices of the Z macromolecular fragments (see Table 2):

$${q}_{k}({x}_{m})=\sum _{L=1}^{Z}{q}_{kL}({x}_{m})$$
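The additivity of the local indices can be checked numerically. The sketch below, in Python with NumPy, is an illustration rather than the TOMOCOMD-CANAR code. It assumes the usual weighting for local matrices (an element of M^{k} is kept whole when both nucleotides belong to the fragment, halved when exactly one does, and dropped otherwise), and it reuses the Table 2 fragment with base-pair multiplicities that are inferred from the tabulated values, not stated in the text. The function names are hypothetical.

```python
import numpy as np

# f1 oscillator strengths from Table 1.
F1 = {"A": 0.28, "G": 0.20, "U": 0.18, "T": 0.18, "C": 0.13}


def local_matrix(m_k, fragment):
    """Extract the local matrix of a fragment (0-based indices) from M^k."""
    inside = np.zeros(m_k.shape[0], dtype=float)
    inside[list(fragment)] = 1.0
    # Weight 1 if both nucleotides are in the fragment, 0.5 if one, 0 if none.
    weights = (inside[:, None] + inside[None, :]) / 2.0
    return weights * m_k


def local_quadratic_index(x, m, k, fragment):
    """q_kL = x^t M^k_L x for the given fragment."""
    x = np.asarray(x, dtype=float)
    m_k = np.linalg.matrix_power(m, k)
    return float(x @ local_matrix(m_k, fragment) @ x)


sequence = "GACUGGUGAGUAC"  # G285 ... C297, as in Table 2
x_m = np.array([F1[base] for base in sequence])
n = len(sequence)
M = np.zeros((n, n))
for i in range(n - 1):  # backbone: one edge per phosphodiester link
    M[i, i + 1] = M[i + 1, i] = 1.0
# ASSUMPTION: base pairs and multiplicities inferred from Table 2.
for (i, j), mult in {(0, 12): 3, (1, 10): 2, (2, 9): 3, (3, 8): 2}.items():
    M[i, j] = M[j, i] = float(mult)

# Local index of a single nucleotide (G285) and the sum over all nucleotides.
q1_g285 = local_quadratic_index(x_m, M, 1, [0])
q1_total = sum(local_quadratic_index(x_m, M, 1, [i]) for i in range(n))
print(round(q1_g285, 4), round(q1_total, 4))
```

With these assumptions the first printed value matches the G285 entry of Table 2 (0.134) and the second matches the total q_{1} of the fragment (1.7124), as expected from the additivity relation.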

Any local nucleic acid quadratic index has a particular meaning, especially for the first values of k, which contain the information about the structure of the fragment F_{R}. Higher values of k relate to information about the environment of the fragment F_{R} within the macromolecular graph (G_{m}). In any case, a complete series of indices performs a specific characterization of the chemical structure. The generalization of the matrices and descriptors to “superior analogues” is necessary for the evaluation of situations where a single descriptor is unable to provide a good structural characterization [25]. The local macromolecular indices can also be used together with the total ones as variables for QSAR/QSPR (Quantitative Structure-Activity/Structure-Property Relationship) modeling of properties or activities that depend more on a region or a fragment than on the macromolecule as a whole.

#### TOMOCOMD-CANAR Software

TOMOCOMD is an interactive program for molecular design and bioinformatics research [26]. The program is composed of four subprograms, each of which has a drawing mode (for drawing structures) and a calculation mode (for computing 2D and 3D molecular descriptors). The modules are named CARDD (Computed-Aided ‘Rational’ Drug Design), CAMPS (Computed-Aided Modeling in Protein Science), CANAR (Computed-Aided Nucleic Acid Research) and CABPD (Computed-Aided Bio-Polymers Docking).

**Figure 1.**
HIV-1 Ψ-RNA packaging region represented on the TOMOCOMD-CANAR interface. Nucleotides involved in binding and enhancement (structural changes) for RNAse I are shown as filled circles and triangles, respectively (open symbols indicate the use of RNAse T1).


In this paper we outline the salient features of only one of these subprograms: CANAR. This subprogram is built on a user-friendly philosophy and requires no prior programming skills. The calculation of total and local macromolecular quadratic indices for any nucleic acid was implemented in the TOMOCOMD-CANAR software [26]. The following list briefly summarizes the main steps for the application of this method in QSAR/QSPR:

1. Draw the macromolecular graph (G_{m}) for each RNA/DNA of the data set, using the software’s drawing mode. This is done by selecting the active nucleotide symbol. Here, we consider only covalent interactions (phosphodiester bonds) and hydrogen-bond interactions (between complementary bases).

2. Use appropriate purine and pyrimidine base weights in order to differentiate the residues in each nucleotide. This work uses five properties of the DNA-RNA bases as nucleotide weights (see Table 1) [24]. This parametrization is done using the properties of U, T, A, G, and C only, because the base is the only part that differs among these nucleotides.

3. Compute the nucleic acid quadratic indices of the “macromolecular graph’s nucleotide adjacency matrix”. This is done in the software’s calculation mode, in which the DNA-RNA base properties and the descriptor family are selected before the macromolecular indices are calculated. The software generates a table in which the rows and columns correspond to the compounds and the **q**_{k}(x_{m}), respectively.

4. Find a QSPR/QSAR equation by using statistical techniques such as multilinear regression analysis (MRA), Neural Networks (NN), Linear Discriminant Analysis (LDA), and so on. That is to say, we can find a quantitative relation between a property **P** and the **q**_{k}(x_{m}) having, for instance, the following form:

$$P={a}_{0}{q}_{0}({x}_{m})+{a}_{1}{q}_{1}({x}_{m})+\dots +{a}_{k}{q}_{k}({x}_{m})$$

where **P** is the measurement of the property, **q**_{k}(x_{m}) [or **q**_{kL}(x_{m})] is the k^{th} total [or local] macromolecular quadratic index, and the **a**_{k}’s are the coefficients obtained by the statistical analysis.

5. Test the robustness and predictive power of the QSPR/QSAR equation by using internal and external cross-validation techniques.

6. Develop a structural interpretation of the obtained QSAR/QSPR model using macromolecular quadratic indices as molecular descriptors.

#### Statistical Analysis

Based on the discussion above, two simple linear models were proposed, one to discriminate between footprinted and interacting (binding) nucleotides and one to predict drug–nucleotide affinity. Linear Discriminant Analysis (LDA) and Linear Multiple Regression (LMR), respectively, were used to obtain the quantitative models. These statistical analyses were carried out with the STATISTICA software package [27].

For both statistical procedures, the TOMOCOMD-CANAR models used the first 10 **q**_{kL}(x_{m}) [from **q**_{0L}(x_{m}) to **q**_{9L}(x_{m})] of each nucleotide in the RNA.

Forward stepwise was fixed as the strategy for variable selection. The tolerance parameter (proportion of variance that is unique to the respective variable) used was the default value for minimum acceptable tolerance, which is 0.01.

LDA was used to generate the classifier function on the basis of the simplicity of the method [28]. To test the quality of the derived discriminant functions we used the Wilks’ λ and the Mahalanobis distance. The Wilks’ λ statistic for overall discrimination can take values in the range from 0 (perfect discrimination) to 1 (no discrimination). The Mahalanobis distance indicates the separation of the respective groups. It shows whether the model possesses an appropriate discriminatory power for differentiating between the two groups. The classification of cases was performed by means of the posterior classification probabilities, i.e., the probability that a given case belongs to a particular group: footprinted or interacting (binding) nucleotides (see Figure 1). In developing this classification function the values of -1 and 1 were assigned to these groups, respectively. The quality of the LDA model was also determined by examining the percentage of good classification and the ratio between the number of cases and the number of variables in the equation. The discriminant function was validated by means of leave-n-out cross-validation procedures.
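To make the 0-to-1 range of the Wilks’ λ concrete, the sketch below computes it from its textbook definition, the ratio of the determinants of the within-group and total sums-of-squares-and-cross-products matrices. It is written in Python with NumPy and is only an illustration: it is not the STATISTICA routine used in this work, the function name `wilks_lambda` is hypothetical, and the two small groups of points are synthetic toy values, not data from this study.

```python
import numpy as np


def wilks_lambda(groups):
    """Wilks' lambda = det(W) / det(T).

    `groups` is a list of 2D arrays (cases x variables), one per group.
    W is the pooled within-group scatter matrix, T the total scatter matrix.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_cases = np.vstack(groups)
    centered_total = all_cases - all_cases.mean(axis=0)
    total_scatter = centered_total.T @ centered_total
    within_scatter = sum(
        (g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups
    )
    return float(np.linalg.det(within_scatter) / np.linalg.det(total_scatter))


# Synthetic toy groups (two variables each), for illustration only.
group_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
group_far = group_a + 10.0   # well separated from group_a
group_same = group_a.copy()  # identical to group_a, so no separation

print(wilks_lambda([group_a, group_far]))   # close to 0
print(wilks_lambda([group_a, group_same]))  # equal to 1
```

The first call returns a value near 0 (well-separated groups) and the second returns 1 (no discrimination), the two ends of the range described above.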

In addition, external prediction (test) sets were used to assess the robustness and predictive power of the model found. This type of model validation is very important, considering that the predictive ability of a QSAR model can only be estimated using an external test set of compounds that was not used for building the model [29, 30]. The quality of the LMR model was determined by examining the statistical parameters of the multivariable regression and of the cross-validation procedures: the regression coefficient (R), the determination coefficient (R^{2}), the Fisher ratio’s p-level [p(F)], the standard deviation of the regression (s) and the leave-one-out (LOO) press statistics (q^{2}, s_{cv}) [30]. In recent years, the LOO press statistics (e.g., q^{2}) have been used as a means of indicating predictive ability. Many authors consider high q^{2} values (for instance, q^{2} > 0.5) as an indicator or even as the ultimate proof of the high predictive power of a QSAR model.
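The LOO q^{2} statistic can be sketched as follows, using the common definition q^{2} = 1 − PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations of the observed values from their mean. This Python/NumPy sketch is an illustration under that assumption, not the STATISTICA procedure used here; the function name `loo_q2` is hypothetical, the model includes an intercept by choice, and the example values are synthetic.

```python
import numpy as np


def loo_q2(descriptors, response):
    """Leave-one-out q^2 of an ordinary least-squares model with intercept."""
    X = np.asarray(descriptors, dtype=float)
    y = np.asarray(response, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # first column: intercept
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # drop case i, refit, predict it
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        press += (y[i] - design[i] @ coef) ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / ss_total)


# Synthetic toy data: one descriptor, response exactly linear in it.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_linear = 2.0 * x_toy[:, 0] + 1.0
y_zigzag = np.array([0.0, 1.0, 0.0, 1.0, 0.0])  # no linear trend

print(loo_q2(x_toy, y_linear))  # close to 1: every left-out case is predicted
print(loo_q2(x_toy, y_zigzag))  # negative: worse than predicting the mean
```

The second call shows that q^{2} can fall below zero when the model predicts left-out cases worse than the mean of the observed values would.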