This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This paper provides an overview of recently developed two-dimensional (2D) fragment-based QSAR methods as well as other multi-dimensional approaches. In particular, we present recent fragment-based QSAR methods such as fragment-similarity-based QSAR (FS-QSAR), fragment-based QSAR (FB-QSAR), Hologram QSAR (HQSAR), and top priority fragment QSAR (TPF-QSAR), in addition to 3D- and nD-QSAR methods such as comparative molecular field analysis (CoMFA), comparative molecular similarity indices analysis (CoMSIA), Topomer CoMFA, self-organizing molecular field analysis (SOMFA), comparative molecular moment analysis (CoMMA), autocorrelation of molecular surface properties (AMSP), weighted holistic invariant molecular (WHIM) descriptor-based QSAR, grid-independent descriptors (GRIND)-based QSAR, 4D-QSAR, 5D-QSAR and 6D-QSAR methods.

Quantitative structure-activity relationship (QSAR) is based on the general principle of medicinal chemistry that the biological activity of a ligand or compound is related to its molecular structure or properties, and structurally similar molecules may have similar biological activities [

In general, QSAR modeling (

A wide range of QSAR methodologies have been invented since the concept was first introduced by Free, Wilson, Hansch, and Fujita [

Over the years, improved methods based on such traditional QSAR methods have been introduced. 2D methods allow modeling of a wide variety of ligands or compounds, including cases where 3D crystal structures of the receptor or target are not available [

One earlier example of a fragment-based method is HQSAR (Hologram QSAR) from Tripos [

where the subscript i refers to the i^{th} compound, and the subscript ij refers to the i^{th} compound at position or bin j of the hologram.

One drawback of HQSAR is the fragment collision problem, which arises when fragments are hashed. Although hashing reduces the length of the hologram, it can place different fragments in the same bin. The hologram length, a user-definable parameter, controls the number of bins in the hologram, and altering it can change the pattern of bin occupancies. The program provides 12 default lengths which have been found to give good predictive models on different datasets. Each of these default lengths provides a unique set of fragment collisions [
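The collision effect can be illustrated with a minimal sketch. It is not the HQSAR implementation: the fragment strings, the CRC32 hash and the hologram lengths below are assumptions chosen only to show how hashing into a fixed number of bins forces distinct fragments to share a bin, and how the collision pattern changes with the hologram length.

```python
import zlib
from collections import defaultdict


def hologram(fragments, length):
    """Hash fragment strings into `length` bins and count occurrences per bin."""
    bins = [0] * length
    members = defaultdict(set)  # bin index -> distinct fragments hashed there
    for frag in fragments:
        idx = zlib.crc32(frag.encode("utf-8")) % length  # stand-in hash function
        bins[idx] += 1
        members[idx].add(frag)
    return bins, dict(members)


def collisions(members):
    """Return the bins that hold more than one distinct fragment."""
    return {idx: sorted(frags) for idx, frags in members.items() if len(frags) > 1}


# Hypothetical fragment strings; a real program would enumerate substructures.
FRAGS = ["C-C", "C-O", "C=O", "C-N", "c:c", "c:c", "C-C-O", "C-C-N", "N-C=O"]

for length in (5, 53, 59):  # illustrative hologram lengths, not program defaults
    bins, members = hologram(FRAGS, length)
    print(length, "bins ->", len(collisions(members)), "bins with collisions")
```

With eight distinct fragments and only five bins, at least one collision is unavoidable; longer holograms reduce collisions at the cost of more descriptors for the subsequent regression.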

Several HQSAR models have been developed in recent years for different ligand datasets, including cases where the 3D crystal structures of receptor targets or proteins are unavailable [

Recently, Du _{i}^{o}_{i,α}

where Δ_{i,α}_{i,α}_{α}_{i,α}

where _{i,α,l}_{i,α}_{i}_{l}

In their studies, a total of 48 neuraminidase (NA) inhibitor analogs were used to train and test the model. Ten physicochemical properties were calculated for each substituent. Using an iterative double least square (IDLS) procedure, two sets of coefficients, one for fragments (indexed by α) and one for physicochemical properties (indexed by l), were determined, and the resulting model achieved a correlation coefficient r of 0.95 (r^{2} = 0.91). They also tested the Free-Wilson and Hansch-Fujita models, which achieved r values of 0.2488 (r^{2} = 0.06) and 0.9373 (r^{2} = 0.88), respectively. These results suggest that the IDLS procedure enhanced the predictive power; because the method is new, more applications are needed to fully explore its predictive potential.
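The idea of fitting two coefficient sets in alternation can be sketched as follows. This is a minimal illustration under stated assumptions, not the published IDLS code: it assumes a bilinear model in which the activity is a weighted sum over fragments of a weighted sum over fragment properties, and it uses synthetic random data. The names `idls`, `predict`, `prop_coef` and `frag_coef` are hypothetical.

```python
import numpy as np


def predict(P, prop_coef, frag_coef):
    """Assumed bilinear model: y_i = sum_a frag_coef[a] * sum_l prop_coef[l] * P[i, a, l]."""
    return np.einsum("ial,a,l->i", P, frag_coef, prop_coef)


def idls(P, y, n_iter=50):
    """Alternate two least-squares fits, one per coefficient set."""
    _, n_frag, n_prop = P.shape
    prop_coef = np.ones(n_prop)
    frag_coef = np.ones(n_frag)
    for _ in range(n_iter):
        # Fix the property coefficients and solve for the fragment coefficients.
        G = P @ prop_coef                          # shape (n_compounds, n_frag)
        frag_coef, *_ = np.linalg.lstsq(G, y, rcond=None)
        # Fix the fragment coefficients and solve for the property coefficients.
        H = np.einsum("ial,a->il", P, frag_coef)   # shape (n_compounds, n_prop)
        prop_coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return prop_coef, frag_coef


# Synthetic data only: 48 compounds, 3 fragment positions, 4 properties each.
rng = np.random.default_rng(0)
P = rng.normal(size=(48, 3, 4))
y = predict(P, np.array([0.5, -1.0, 2.0, 0.3]), np.array([1.0, 0.7, -0.4]))
prop_coef, frag_coef = idls(P, y)
r = np.corrcoef(y, predict(P, prop_coef, frag_coef))[0, 1]
print(f"r on the synthetic training data: {r:.3f}")
```

Each half-step is an ordinary least-squares problem, so the squared error cannot increase from one step to the next. In this assumed form the two coefficient sets are determined only up to a common scale factor, since multiplying one set and dividing the other by the same constant leaves the predictions unchanged.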

More recently, a fragment-similarity based QSAR (FS-QSAR) method [

j = index of the substituent position.

max = the max function picks the maximum score among the similarity scores.

F_{jk} = the k^{th} fragment (a known fragment in the training set) at the j^{th} substituent position.

F_{jg} = the fragment being evaluated at the j^{th} substituent position.

Sim(F_{jk}, F_{jg}) = the similarity score between F_{jg} and F_{jk}.

MSF = the most similar fragment at the j^{th} substituent position.

The similarity function used in

where EV(F_{jk}) = the lowest or highest eigenvalue of the BCUT matrix of a fragment (F_{jk}).
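A similarity-weighted prediction of this kind can be sketched as below. The sketch rests on two assumptions that are not taken from the text: the similarity is modeled as an exponential decay of the absolute difference between BCUT eigenvalues, and the contribution at each substituent position is the coefficient of the most similar training fragment scaled by that similarity. The function names and all numbers are hypothetical.

```python
import math


def bcut_similarity(ev_known, ev_query):
    """Assumed form: 1.0 for identical eigenvalues, decaying as they move apart."""
    return math.exp(-abs(ev_known - ev_query))


def predict_activity(const, training_sites, query_evs):
    """training_sites[j] lists (eigenvalue, coefficient) pairs for the known
    fragments at position j; query_evs[j] is the eigenvalue of the fragment
    being evaluated at that position."""
    total = const
    for known, ev_q in zip(training_sites, query_evs):
        # Pick the training fragment with the maximum similarity at this position.
        best_ev, best_coef = max(known, key=lambda fc: bcut_similarity(fc[0], ev_q))
        total += bcut_similarity(best_ev, ev_q) * best_coef
    return total


# Hypothetical model: two substituent positions with fitted fragment coefficients.
sites = [[(1.0, 0.5), (2.0, -0.3)], [(0.5, 1.2)]]
print(predict_activity(5.0, sites, [2.0, 0.5]))   # fragments seen in training
print(predict_activity(5.0, sites, [1.9, 0.5]))   # unseen fragment at position 0
```

When every fragment of the query compound already occurs in the training set, all similarities equal 1 and the expression reduces to a plain sum of fragment coefficients; an unseen fragment borrows the coefficient of its nearest neighbor, down-weighted by the similarity.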

The algorithm was developed and then tested on different datasets, including 83 COX2 analogs and 85 triaryl bis-sulfone analogs. For statistical validation, the model was repeatedly tested on five different testing sets generated by random selection of compounds. The average squared correlation coefficient, r^{2}, over the five testing sets was 0.62 for the COX2 analogs and 0.68 for the bis-sulfone analogs. For comparison, the original Free-Wilson method was also tested, achieving average r^{2} values of 0.46 for the COX2 dataset and 0.42 for the bis-sulfone dataset. In addition, when the BCUT-similarity function was replaced by the Tanimoto coefficient (Tc), the traditional 2D molecular similarity function, the average r^{2} was 0.62 for both COX2 and bis-sulfone analogs. FS-QSAR thus showed better predictive power than the traditional 2D-QSAR method, since it addresses the major limitation of the original Free-Wilson method by introducing the similarity concept into the regression equation. Although the predictive accuracy of FS-QSAR may not be as high as that of higher-dimensional QSAR methods, the method provides an objective, unique and reproducible 2D-QSAR model.
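The validation protocol described above, averaging r^{2} over several randomly selected testing sets, can be sketched generically. The one-descriptor linear model, the test fraction and the synthetic data are assumptions made for illustration; any QSAR model with a fit step and a predict step could be substituted for the hypothetical `fit_line`.

```python
import random


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted activities."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def fit_line(train):
    """Hypothetical one-descriptor model: an ordinary least-squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    return lambda x: slope * x + (my - slope * mx)


def average_test_r2(data, fit, n_splits=5, test_fraction=0.2, seed=0):
    """Average r^2 over `n_splits` randomly selected testing sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(2, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train)
        scores.append(r_squared([y for _, y in test], [model(x) for x, _ in test]))
    return sum(scores) / len(scores)


# Synthetic, noise-free (descriptor, activity) pairs.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
print(average_test_r2(data, fit_line))
```

Averaging over several random splits reduces the dependence of the reported r^{2} on any single choice of testing compounds.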

Casalegno ^{2} for the training set was 0.85 and 0.75 for the test set proving the model’s effectiveness.

In recent years, new fragment-based QSAR methods have been developed and applied to problems of biological interest. Zhokhova

The 3D-QSAR methods have been developed to improve the prediction accuracies of 2D methods. 3D methods are computationally more complex and demanding than 2D approaches. In general, there are two families of 3D-QSAR methods: alignment-dependent methods and alignment-independent methods. Both families need experimentally or computationally derived bioactive conformations of ligands as templates for studies. Such 3D conformers are among the most important factors in producing reliable 3D-QSAR models, and the need for them is also a major drawback of 3D methods. Examples of both families are discussed below.

One well-known method is the three-dimensional QSAR method CoMFA, developed by Cramer

Another 3D QSAR method named CoMSIA by Klebe

However, the major drawback of both methods is that all molecules have to be aligned, and such alignment can affect the final CoMFA/CoMSIA model and predictions. A good alignment is necessary, and the quality of such an alignment can be subjective, time-consuming [

Recently, Cramer ^{2} of 0.520 compared to literature average q^{2} of 0.636 [

Robinson

Using such a property master grid, an estimate of the activity of the i^{th} molecule as defined by a certain property can be derived as:

In the final stage, correlations between calculated SOMFA property values (SOMFA_{property, i}) and biological activities are derived via multiple linear regression and a final predictive model is produced. Robinson ^{2}) of 0.5776 (r = 0.76) and 0.5329 (r = 0.73) were achieved, respectively. Compared to other methods such as CoMFA [
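The grid-based estimate can be illustrated with a short sketch. It assumes, as is commonly described for SOMFA but not spelled out above, that the master grid is built by summing each training molecule's property grid weighted by its mean-centered activity, and that SOMFA_{property, i} is the sum over grid points of the molecule's property value times the master grid value. Grids are flattened to plain lists, and all values are synthetic.

```python
def master_grid(property_grids, activities):
    """Sum the property grids, each weighted by its mean-centered activity."""
    mean_activity = sum(activities) / len(activities)
    grid = [0.0] * len(property_grids[0])
    for values, activity in zip(property_grids, activities):
        weight = activity - mean_activity
        for k, value in enumerate(values):
            grid[k] += value * weight
    return grid


def somfa_value(property_grid, grid):
    """Estimate for one molecule: sum over grid points of property * master grid."""
    return sum(v * m for v, m in zip(property_grid, grid))


# Two hypothetical molecules on a two-point grid (e.g. a shape indicator).
grids = [[1.0, 0.0], [0.0, 1.0]]
activities = [1.0, 3.0]
mg = master_grid(grids, activities)
print(mg, [somfa_value(g, mg) for g in grids])
```

In this toy case the molecule occupying the grid point associated with above-average activity receives the larger SOMFA value, which is the quantity that would then enter the regression against biological activity.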

In the last few decades, other 3D-QSAR methods which do not rely on alignments were introduced. Some examples include autocorrelation of molecular surface properties (AMSP) [

Wagener _{lower}, d_{upper}), a vector of autocorrelation coefficients is obtained as follows:

where _{i}_{j}

Therefore, the vector contains a compressed expression of the distribution of a property on the molecular surface. After the autocorrelation vectors were obtained, a multilayer neural network was trained on them to derive a predictive model of the biological activity of 31 steroid compounds. A correlation coefficient, r, of 0.82 (r^{2} = 0.6724) was achieved with a cross-validated r^{2} of 0.63. In summary, the advantages of such autocorrelation vectors are that they are invariant to translation and rotation, since only spatial distances are used, and that they give a condensed description of the molecular surface. However, the original information cannot be reconstructed from such condensed vectors, and the pharmacophore nature of a ligand may not be clear or interpretable [
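A distance-binned autocorrelation of this kind can be sketched as follows. The normalization (mean of the property products over the point pairs in each interval) and the toy surface points are assumptions made for illustration; the function name is hypothetical.

```python
import math
from itertools import combinations


def autocorrelation_vector(points, values, bin_edges):
    """One coefficient per interval (d_lower, d_upper]: the mean product of the
    property values over all point pairs whose distance falls in the interval."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (i, p_i), (j, p_j) in combinations(enumerate(points), 2):
        d = math.dist(p_i, p_j)
        for b in range(n_bins):
            if bin_edges[b] < d <= bin_edges[b + 1]:
                sums[b] += values[i] * values[j]
                counts[b] += 1
                break
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Three hypothetical surface points with a property value at each.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
values = [1.0, 2.0, 3.0]
edges = [0.0, 1.5, 3.0]
print(autocorrelation_vector(points, values, edges))
```

Because only pairwise distances enter the calculation, translating or rotating the point set leaves the vector unchanged, while the positions of the contributing points cannot be recovered from it.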

Silverman ^{2} values ranging from 0.412 to 0.828 were obtained using electrostatic moment descriptors calculated from Gasteiger charges or Gaussian molecular orbital

WHIM descriptors contain 3D molecular information such as molecular size, shape, symmetry and distribution of molecular surface point coordinates [

where w_{i} is the weight of the i^{th} atom, q_{ij} is the j^{th} coordinate of the i^{th} atom, and q̄_{j} is the average of the j^{th} coordinates [

In this expression, atoms can be weighted by mass, van der Waals volume, atomic electronegativity, electrotopological index of Kier and Hall, atomic polarizability and molecular electrostatic potential [

WHIM/MS-WHIM descriptors are invariant to 3D molecular orientation but both methods, like other 3D-QSAR methods, rely on ligand conformation, which may be subjective if ligand-receptor co-crystal structures are not known for the target of interest.
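The orientation invariance follows from building the descriptors on the eigenvalues of a weighted covariance matrix of the atomic coordinates. The sketch below is a minimal illustration under assumptions: coordinates are centered on the weighted centroid, the covariance is normalized by the sum of weights, and the coordinates and weights are made up. It computes only the three eigenvalues, not the full set of WHIM descriptors.

```python
import numpy as np


def whim_eigenvalues(coords, weights):
    """Eigenvalues (descending) of the weighted covariance of atomic coordinates."""
    xyz = np.asarray(coords, dtype=float)
    w = np.asarray(weights, dtype=float)
    centroid = (w[:, None] * xyz).sum(axis=0) / w.sum()   # weighted centroid
    q = xyz - centroid
    cov = (q.T * w) @ q / w.sum()                          # 3 x 3 weighted covariance
    return np.sort(np.linalg.eigvalsh(cov))[::-1]


# Hypothetical four-atom conformation with mass-like weights.
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.3, 0.2, 2.0]]
weights = [12.0, 16.0, 14.0, 1.0]
print(whim_eigenvalues(coords, weights))

# Rotating about z and translating the molecule leaves the eigenvalues unchanged.
t = 0.7
rot = np.array([[np.cos(t), -np.sin(t), 0.0], [np.sin(t), np.cos(t), 0.0], [0.0, 0.0, 1.0]])
moved = np.asarray(coords) @ rot.T + np.array([5.0, -2.0, 1.0])
print(whim_eigenvalues(moved, weights))
```

Changing the conformation, in contrast, changes the covariance matrix itself, which is why the choice of ligand conformation remains a source of subjectivity.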

In an attempt to provide alignment-free descriptors which are easy to understand and interpret, Pastor

where E_{es} is the electrostatic energy, E_{hb} is the hydrogen-bonding energy, and E_{lj} is the Lennard-Jones potential energy [

In this method, electrostatic interactions, hydrophobic interactions, hydrogen bond acceptor and hydrogen bond donor fields are considered to get a set of positions which defines a ‘virtual receptor site’ (VRS). VRS regions are then encoded into GRIND via an auto- and cross-correlation transform so that those regions are no longer dependent upon their positions in the 3D space. In other words, autocorrelation descriptors of the fields are calculated and only the highest products of molecular interaction energies are stored while others are discarded. This difference is responsible for the ‘reversibility’ of GRIND and the descriptors can be back-projected in 3D space using another related program called ALMOND [
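The "keep only the highest product" step can be sketched for a single field. This is an illustration under assumptions, not the ALMOND implementation: node positions and energies are invented, distance bins have a fixed width, and only the auto-correlation within one field is shown. Storing the pair of nodes behind each retained product is what allows the descriptor to be traced back to 3D space.

```python
import math
from itertools import combinations


def max_autocorrelation(nodes, bin_width, n_bins):
    """nodes: list of ((x, y, z), energy). For each distance bin keep the highest
    energy product together with the two node positions that produced it."""
    best = [None] * n_bins
    for (pos_a, e_a), (pos_b, e_b) in combinations(nodes, 2):
        b = int(math.dist(pos_a, pos_b) // bin_width)
        if b >= n_bins:
            continue  # pair is farther apart than the last bin
        product = e_a * e_b
        if best[b] is None or product > best[b][0]:
            best[b] = (product, pos_a, pos_b)
    return best


# Hypothetical favorable-interaction nodes of one molecular interaction field.
nodes = [((0, 0, 0), -2), ((1, 0, 0), -3), ((0, 1, 0), -1), ((4, 0, 0), -5)]
for b, entry in enumerate(max_autocorrelation(nodes, bin_width=2.0, n_bins=3)):
    print(b, entry)
```

A plain autocorrelation would sum or average all products in a bin; retaining a single maximum per bin discards more information but keeps an explicit link between each descriptor and one pair of positions.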

Multi-dimensional (nD) QSAR methods are essentially extensions of 3D-QSAR methods. These methods incorporate additional physical characteristics or properties (or a new dimension) to tackle the drawbacks of 3D-QSAR methods. One example is 4D-QSAR by Hopfinger

where E_{ligand-receptor} is the force field energy of the ligand-receptor interaction, E_{solvation,ligand} is the ligand desolvation energy, TΔS is the change in the ligand entropy upon receptor binding, E_{internal strain} is the change in ligand internal energy upon receptor binding, and E_{induced fit} is the energy uptake required for adapting the receptor surrogate [

The 5D-QSAR method was tested on a set of 65 ^{2} values of 0.837 and 0.832, while 4D-QSAR model resulted in 0.834 and 0.795, respectively [^{2} of 0.885 [

In general, the predictive quality of 3D-QSAR methods depends on several factors, such as the quality of molecular alignments/superimpositions and information on ligand bioactive conformations. Molecular superimpositions in particular are subjective, and ligand bioactive conformations remain unclear when there is no structural information on the corresponding receptor-ligand complexes. Conventional CoMFA results may often be non-reproducible because the model depends on the alignment orientation of the molecules, which can vary and is subjective. Although various improved methods and other procedures, which were discussed earlier in the paper, have been introduced to overcome major limitations of 3D-QSAR methods, ^{2} or r^{2} or SDEP values.

We have provided an overview of different QSAR methods and recent developments in fragment-based approaches, using selected studies as illustrations. Since each QSAR method has its own advantages and disadvantages, researchers should choose appropriate methods for modeling their systems. However, given the wide range of choices, picking an appropriate model for one's studies is a challenging task. This paper outlines the basic principles of new fragment-based QSAR methods as well as other 3D- and nD-QSAR models and illustrates them with examples that may serve as helpful references for researchers.

The authors would like to acknowledge the financial support for our laboratory from the NIH (R01DA025612 and P50 GM067082).

A general scheme of a QSAR model development which includes systematic training and testing processes.

Hologram-QSAR (HQSAR) model development, which includes molecular hologram generation and partial least square analysis to derive a final predictive HQSAR equation.

A general CoMFA workflow.

Summary of different QSAR methods and source information.

Method | nD | Dataset | Statistical model | Performance | Reference/Website |
---|---|---|---|---|---|
HQSAR | 2D | 21 steroids | PLS | q^{2} = 0.71;^{2} = 0.85 [ | [ |
FB-QSAR | 2D | 48 NA analogs | IDLS | r = 0.95 (r^{2} = 0.91) [ | [ |
FS-QSAR | 2D | 85 bis-sulfone analogs; 83 COX2 analogs | MLR | r^{2} = 0.68; r^{2} = 0.62 [ | [ |
TPF-QSAR | 2D | 282 pesticides | PM-based prediction | r^{2} = 0.75 [ | [ |
CoMFA | 3D | 21 steroids | PLS | q^{2} = 0.75; r^{2} = 0.96 [^{2} = 0.68; r^{2} = 0.69 [ | [ |
CoMSIA | 3D | Thermolysin inhibitors | PLS | q^{2} = [0.59, 0.64] [^{2} = 0.65; r^{2} = 0.73 [ | [ |
Topomer CoMFA | 3D | 15 datasets from literature | PLS | average q^{2} = 0.636 [ | [ |
SOMFA | 3D | 31 steroids; 35 sulfonamides | MLR | r^{2} = 0.58; r^{2} = 0.53 [ | [ |
AMSP | 3D | 31 steroids | MNN | q^{2} = 0.63; r^{2} = 0.67 [ | [ |
CoMMA | 3D | 31 steroids | PLS | q^{2} = [0.41, 0.82] [ | [ |
WHIM | 3D | 31 steroids | PCA | SDEP = 1.750 [ | [ |
MS-WHIM | 3D | 31 steroids | PCA | SDEP = 0.742 [ | [ |
GRIND | 3D | 31 steroids | PLS; PCA | q^{2} = 0.64; SDEP = 0.26 [^{2} = 0.41; r^{2} = 0.57; SDEP = 0.72 [63] | [ |
4D-QSAR | 4D | 20 DHFR inhibitors;_{2}a analogs; | PLS | r^{2} = [0.90, 0.95];^{2} = [0.73, 0.86];^{2} = [0.67, 0.76] [^{2} = [0.67, 0.85] [64] | [ |
5D-QSAR | 5D | 65 NK-1 antagonists; | MLR | r^{2} = 0.84;^{2} = 0.83 [ | [ |
6D-QSAR | 6D | 106 estrogen receptor ligands | MLR | q^{2} = 0.90;^{2} = 0.89 [ | [ |

HQSAR = Hologram QSAR
PLS = Partial least squares
q^{2} = cross-validated r^{2}