These authors contributed equally to this work

This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This paper gives an overview of the mathematical methods currently used in quantitative structure-activity/property relationship (QSAR/QSPR) studies. The mathematical methods applied to the regression of QSAR/QSPR models have recently been developing very fast, and new methods, such as Gene Expression Programming (GEP), Projection Pursuit Regression (PPR) and Local Lazy Regression (LLR), have appeared on the QSAR/QSPR stage. At the same time, the earlier methods, including Multiple Linear Regression (MLR), Partial Least Squares (PLS), Neural Networks (NN) and Support Vector Machines (SVM), are being upgraded to improve their performance in QSAR/QSPR studies. These new and upgraded methods and algorithms are described in detail, and their advantages and disadvantages are evaluated and discussed, to show their application potential in future QSAR/QSPR studies.

As a common and successful research approach, quantitative structure-activity/property relationship (QSAR/QSPR) studies are applied extensively in chemometrics, pharmacodynamics, pharmacokinetics, toxicology and other fields. Recently, the mathematical methods used as regression tools in QSAR/QSPR analysis have been developing quickly. Not only are the earlier methods, such as Multiple Linear Regression (MLR), Partial Least Squares (PLS), Neural Networks (NN) and Support Vector Machines (SVM), being upgraded by improving their kernel algorithms or by combining them with other methods, but new methods, including Gene Expression Programming (GEP), Projection Pursuit Regression (PPR) and Local Lazy Regression (LLR), are also appearing in recently reported QSAR/QSPR studies. In light of this, the paper is organized as follows: first, the improved methods based on the earlier ones are described; then the new methods are presented; finally, all these methods are discussed together and their advantages and disadvantages weighed, to show their application potential in future QSAR/QSPR studies.

MLR is one of the earliest methods used for constructing QSAR/QSPR models, and it remains one of the most commonly used to date. The advantage of MLR is its simple form and easily interpretable mathematical expression. Although utilized to great effect, MLR is vulnerable to descriptors that are correlated with one another, which leaves it incapable of deciding which of the correlated sets is the more significant for the model. Some new methodologies based on MLR, aimed at improving this technique, have been developed and reported in recent papers. These methods include Best Multiple Linear Regression (BMLR), the Heuristic Method (HM), Genetic Algorithm based Multiple Linear Regression (GA-MLR), stepwise MLR, factor analysis MLR and so on. The three most important and most commonly used of these methods are described in detail below.

BMLR implements the following strategy to search for the multi-parameter regression with the maximum predictive ability: (1) all orthogonal pairs of descriptors i and j (with R²_ij < R²_min; default R²_min = 0.1) are found in the given data set; (2) the property analyzed is treated by two-parameter regressions with the pairs of descriptors obtained in the first step, and the Nc (default Nc = 400) pairs with the highest regression correlation coefficients are chosen for the higher-order regression treatments; (3) for each descriptor pair obtained in the previous step, a non-collinear descriptor scale k (with R²_ik < R²_nc and R²_kj < R²_nc; default R²_nc = 0.6) is added, and the respective three-parameter regression treatment is performed; if the Fisher criterion at the given probability level, F, is smaller than that of the best two-parameter correlation, the latter is chosen as the final result; otherwise, the Nc (default Nc = 400) descriptor triples with the highest regression correlation coefficients are chosen for the next step; (4) for each descriptor set chosen in the previous step, an additional non-collinear descriptor scale is added and the respective multi-parameter regression treatment is performed, the procedure being repeated, with the number of descriptors increasing one by one, until the Fisher criterion no longer improves.
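The first two steps of this strategy (the orthogonal-pair search and the ranking of pairs by their two-parameter R²) can be sketched in a few lines. The following is a simplified illustration in NumPy, with a pool size far smaller than BMLR's Nc = 400 default; the function names and defaults are illustrative, not the actual BMLR code:

```python
import numpy as np
from itertools import combinations

def r2(X, y):
    # R^2 of an ordinary least-squares fit of y on X (with intercept)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def bmlr_pairs(X, y, r2_max=0.1, n_keep=5):
    """Steps 1-2 of the BMLR strategy: keep near-orthogonal descriptor
    pairs (squared intercorrelation below r2_max) and rank them by the
    R^2 of the corresponding two-parameter regression."""
    corr = np.corrcoef(X, rowvar=False) ** 2
    scored = []
    for i, j in combinations(range(X.shape[1]), 2):
        if corr[i, j] < r2_max:
            scored.append((r2(X[:, [i, j]], y), i, j))
    scored.sort(reverse=True)
    return scored[:n_keep]
```

The retained pairs would then be extended descriptor by descriptor, exactly as steps 3 and 4 describe, with the Fisher criterion deciding when to stop.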

As an improved method based on MLR, BMLR is instrumental for variable selection and QSAR/QSPR modeling [...].

HM, an advanced algorithm based on MLR, is popular for building linear QSAR/QSPR equations because of its convenience and high calculation speed. The advantage of HM rests entirely on its unique strategy for selecting variables. The descriptor selection proceeds as follows: first of all, all descriptors are checked to ensure that values of each descriptor are available for each structure; descriptors for which values are not available for every structure in the data set are discarded. Descriptors having a constant value for all structures in the data set are also discarded. Thereafter, all possible one-parameter regression models are tested and the insignificant descriptors are removed. As a next step, the program calculates the pair correlation matrix of descriptors and further reduces the descriptor pool by eliminating highly correlated descriptors. The intercorrelation is validated as follows: (a) all quasi-orthogonal pairs of structural descriptors are selected from the initial set, two descriptors being considered orthogonal if their squared intercorrelation coefficient r²_ij does not exceed a given threshold.
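The pre-selection described above (drop constant columns, drop descriptors whose one-parameter regression is insignificant, then thin out highly correlated pairs) might be sketched as follows; the thresholds and function names are illustrative assumptions, not HM's actual defaults:

```python
import numpy as np

def hm_prescreen(X, y, r2_min=0.01, r2_intercorr=0.8):
    """Sketch of HM-style descriptor pre-selection: discard constant
    columns, columns whose one-parameter regression with y is
    insignificant, and the later member of any highly correlated pair."""
    n, m = X.shape
    # constant descriptors carry no information
    keep = [j for j in range(m) if X[:, j].std() > 0]
    # one-parameter significance, judged by squared correlation with y
    keep = [j for j in keep if np.corrcoef(X[:, j], y)[0, 1] ** 2 >= r2_min]
    # greedy intercorrelation filter: keep a descriptor only if it is
    # not highly correlated with one already selected
    selected = []
    for j in keep:
        if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 < r2_intercorr
               for k in selected):
            selected.append(j)
    return selected
```

The surviving descriptors would then feed the stepwise regression stage of HM (or any other linear or nonlinear modeling method).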

HM is commonly used in linear QSAR and QSPR studies, and is also an excellent tool for descriptor selection before a linear or nonlinear model is built [...].

Combining a Genetic Algorithm (GA) with MLR, the resulting method, called GA-MLR, is becoming popular in recently reported QSAR and QSPR studies [...].

GA, a well-established method for parameter selection, is embedded in the GA-MLR method so as to overcome the weakness of MLR in variable selection. As in the MLR method, the regression tool in GA-MLR is a simple and classical regression method that provides explicit equations. The two parts complement each other, making GA-MLR a promising method for QSAR/QSPR research.
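As an illustration of the idea, a toy GA-MLR can evolve binary descriptor masks and score each subset by the adjusted R² of its MLR fit. Everything below (population size, mutation rate, the choice of adjusted R² as fitness) is an illustrative assumption rather than any published GA-MLR implementation:

```python
import numpy as np

def fitness(mask, X, y):
    # Adjusted R^2 of the MLR model built on the selected descriptors
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    A = np.column_stack([np.ones(len(y)), X[:, idx]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    n, k = len(y), idx.size
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def ga_mlr(X, y, pop=30, gens=40, p_mut=0.1, rng=None):
    """Minimal GA-MLR sketch: evolve binary masks over the descriptor
    pool; selection keeps the fitter half, offspring are produced by
    one-point crossover plus bit-flip mutation."""
    rng = rng if rng is not None else np.random.default_rng(0)
    m = X.shape[1]
    P = rng.random((pop, m)) < 0.3          # initial random masks
    for _ in range(gens):
        scores = np.array([fitness(p, X, y) for p in P])
        P = P[np.argsort(scores)[::-1]]     # sort fittest first
        children = []
        while len(children) < pop // 2:
            a, b = P[rng.integers(0, pop // 2, 2)]   # parents: top half
            cut = rng.integers(1, m)
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(m) < p_mut           # mutation
            children.append(child)
        P = np.vstack([P[:pop - len(children)], children])
    best = max(P, key=lambda p: fitness(p, X, y))
    return np.flatnonzero(best)
```

Using adjusted R² (or a cross-validated statistic) rather than plain R² keeps the GA from simply selecting every descriptor.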

The basic concept of PLS regression was originally developed by Wold [...].

G/PLS is derived from two QSAR calculation methods: Genetic Function Approximation (GFA) [...] and PLS.

This is the combination of Factor Analysis (FA) and PLS, where FA is used for the initial selection of descriptors, after which PLS is performed. FA is a tool for finding the relationships among variables: it reduces the variables to a few latent factors, from which the important variables are selected for PLS regression. Most of the time, a leave-one-out method is used to select the optimum number of PLS components. Examples of FA-PLS used in QSAR analysis can be found in [...].
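A rough sketch of this two-stage procedure, with FA approximated by a principal-component decomposition and PLS by a bare-bones single-response NIPALS implementation, might look like this; the selection rule (top loadings per factor) and component counts are illustrative assumptions:

```python
import numpy as np

def fa_select(X, n_factors=2, per_factor=2):
    """FA-style pre-selection (here via PCA/SVD loadings, a common
    simplification): keep the descriptors loading most heavily on each
    of the first n_factors latent factors."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    chosen = []
    for f in range(n_factors):
        for j in np.argsort(-np.abs(Vt[f]))[:per_factor]:
            if j not in chosen:
                chosen.append(int(j))
    return chosen

def pls1(X, y, n_comp=2):
    """Bare-bones PLS1 (NIPALS), returning regression coefficients
    for centered data."""
    Xc, yc = X - X.mean(0), y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)              # weight vector
        t = Xc @ w                          # score vector
        p = Xc.T @ t / (t @ t)              # X loading
        q = yc @ t / (t @ t)                # y loading
        Xc = Xc - np.outer(t, p)            # deflation
        yc = yc - q * t
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)
```

In an FA-PLS workflow, `pls1` would be run on `X[:, fa_select(X)]`, with the number of components tuned by leave-one-out cross-validation as the text describes.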

Orthogonal signal correction (OSC) was introduced by Wold and co-workers to remove from the descriptor matrix the systematic variation that is orthogonal, i.e., unrelated, to the property being modeled, before the PLS regression is performed.

As an alternative to fitting data to an equation and reporting the coefficients derived therefrom, neural networks are designed to process input information and build internal models of the relationships. One advantage of neural networks is that they are naturally capable of modeling nonlinear systems. Disadvantages include a tendency to overfit the data and considerable difficulty in ascertaining which descriptors are most significant in the resulting model. In recent QSAR/QSPR studies, RBFNN and GRNN are the most frequently used neural network variants.

The RBFNN consists of three layers: an input layer, a hidden layer and an output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. Each neuron in the hidden layer employs a radial basis function as a nonlinear transfer function to operate on the input data. In general, there are several radial basis functions (RBF): linear, cubic, thin-plate spline, Gaussian, multi-quadratic and inverse multi-quadratic. The most often used RBF is the Gaussian function, characterized by a center c_j and a width r_j: h_j(x) = exp(−‖x − c_j‖²/r_j²). The output of the network is a weighted sum of the hidden-layer activations, y_k(x) = Σ_j w_kj h_j(x) + b_k, where w_kj is the weight connecting hidden neuron j to output neuron k and b_k is the corresponding bias.
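With the centers and widths fixed, training the output layer of such a network reduces to a linear least-squares problem, as the following sketch shows (plain NumPy; a single shared width and centers taken from the training data are illustrative simplifications):

```python
import numpy as np

def rbf_design(X, centers, sigma):
    # Gaussian activations h_j(x) = exp(-||x - c_j||^2 / sigma^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def rbfnn_fit(X, y, centers, sigma):
    """Solve for the output-layer weights (plus bias) of an RBF
    network by linear least squares, the centers and width being
    fixed in advance."""
    H = np.column_stack([rbf_design(X, centers, sigma), np.ones(len(X))])
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def rbfnn_predict(X, centers, sigma, w):
    H = np.column_stack([rbf_design(X, centers, sigma), np.ones(len(X))])
    return H @ w
```

In practice the centers are usually placed by clustering and the width tuned by cross-validation; only the final linear solve is shown here.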

GRNN, one of the so-called Bayesian networks, is a type of neural network that uses kernel-based approximation to perform regression. It was introduced by Specht in 1991 [...].

GRNN consists of four layers: input, hidden, summation and output layers. The greatest advantage of GRNN is its training speed: training a GRNN consists mostly of copying the training cases into the network, and so is as close to instantaneous as can be expected. It is also relatively insensitive to outliers (wild points). The greatest disadvantage is network size: a GRNN network contains the entire set of training cases and is therefore space-consuming, requiring more memory to store the model. Related literature can be found in [...].
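Because prediction is simply a kernel-weighted average of the stored training targets, a minimal GRNN fits in a few lines; the Gaussian smoothing parameter below is an illustrative choice:

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """GRNN prediction: a Gaussian-kernel-weighted average of the
    training targets. 'Training' amounts to storing the samples, which
    is why GRNN trains almost instantly but must keep the entire
    training set in memory."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return (K @ y_train) / K.sum(axis=1)
```

The memory cost the text mentions is visible here: every prediction touches every stored training case.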

SVM, developed by Vapnik [...], is a machine learning method based on statistical learning theory that has become a popular nonlinear regression tool in QSAR/QSPR studies.

The LS-SVM, a modified SVM algorithm, was proposed by Suykens [...]. Given a training set of N samples {(x_k, y_k)}, with input vectors x_k ∈ Rⁿ and outputs y_k ∈ R, LS-SVM formulates the regression as the optimization problem min (1/2)‖w‖² + (γ/2)Σ_k e_k², subject to the equality constraints y_k = wᵀφ(x_k) + b + e_k (k = 1, …, N), where φ(·) maps the inputs into a higher-dimensional feature space, γ is a regularization parameter and e_k are error variables.

From the Lagrangian of this optimization problem, the conditions for optimality give w = Σ_k α_k φ(x_k) and α_k = γ e_k, where α_k are the Lagrange multipliers; substituting these relations back into the equality constraints eliminates w and e_k.

It can be seen that the Lagrange multipliers, together with the bias b, follow from the set of linear equations
[0, 1ᵀ; 1, Ω + I/γ] [b; α] = [0; y],
where Ω is the kernel matrix with entries Ω_kl = φ(x_k)ᵀφ(x_l), 1 denotes a vector of ones and I the identity matrix.

Finding these Lagrange multipliers is very simple, as opposed to the SVM approach, in which a more difficult relation has to be solved to obtain these values. In addition, the linear approach is easily extended to nonlinear regression by introducing a kernel function. This leads to the following nonlinear regression function:
f(x) = Σ_k α_k K(x, x_k) + b, where K(x, x_k) = φ(x)ᵀφ(x_k) is the kernel function. The RBF kernel K(x, x_k) = exp(−‖x − x_k‖²/σ²) is commonly used, and leave-one-out (LOO) cross-validation is then employed to tune the optimized values of the two parameters γ and σ. LS-SVM is a simplification of the traditional support vector machine (SVM): it shares the advantages of SVM and adds some of its own. It only requires solving a set of linear equations, which is much easier and computationally cheaper than the quadratic programming employed by traditional SVM. Therefore, LS-SVM is faster than traditional SVM on the same task. The related literature is presented in [...].
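The core computation, a single linear solve for the bias b and the multipliers α, can be sketched as follows (plain NumPy with an RBF kernel; the γ and σ values are illustrative rather than tuned by LOO as the text describes):

```python
import numpy as np

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """LS-SVM regression sketch: the dual problem reduces to one
    linear system in the bias b and the multipliers alpha, with no
    quadratic programming (unlike classical SVM)."""
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-d2 / sigma ** 2)          # RBF kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                            # first row: sum of alphas = 0
    A[1:, 0] = 1.0                            # bias column
    A[1:, 1:] = Omega + np.eye(n) / gamma     # regularized kernel block
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                    # b, alpha

def lssvm_predict(X_train, b, alpha, X_query, sigma=1.0):
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2) @ alpha + b
```

A single call to `np.linalg.solve` replaces the quadratic program of classical SVM, which is the speed advantage described above.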

Gene expression programming was invented by Ferreira in 1999 and was developed from genetic algorithms (GA) and genetic programming (GP). GEP uses the same kind of diagram representation as GP, but the entities evolved by GEP (expression trees) are the expression of a genome. GEP is simpler than cellular gene expression. It mainly comprises two components: the chromosomes and the expression trees (ETs). The encoding and translation of genetic information are very simple, with, for example, a one-to-one relationship between the symbols of the chromosome and the functions or terminals they represent. The rules of GEP determine the spatial organization of the functions and terminals in the ETs and the type of interaction between sub-ETs. Therefore, the language of the genes and of the ETs represents the language of GEP.

Each chromosome in GEP is a character string of fixed length, which can be composed of genes drawn from the function set or the terminal set. Using elements such as {+, −, *, /, …} as functions and the problem variables as terminals, each chromosome encodes a candidate mathematical expression through its expression tree.

The purpose of symbolic regression or function finding is to find an expression that gives a good explanation of the dependent variable. The first step is to choose the fitness function. Mathematically, the fitness f_i of an individual program i can be expressed as f_i = Σ_j (M − |C_(i,j) − T_j|), where M is the range of selection, C_(i,j) is the value returned by individual chromosome i for fitness case j and T_j is the target value for fitness case j. If the precision |C_(i,j) − T_j| is sufficiently small, it is treated as zero, so that for n fitness cases the maximum fitness is f_max = n·M.
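The absolute-error fitness f_i = Σ_j (M − |C_(i,j) − T_j|) is straightforward to transcribe; the selection range M = 100 below is just an illustrative default:

```python
def gep_fitness(predictions, targets, M=100.0):
    """Absolute-error GEP fitness: f_i = sum_j (M - |C_ij - T_j|),
    where M is the range of selection. A perfect individual reaches
    the maximum fitness n * M, with n the number of fitness cases."""
    return sum(M - abs(c - t) for c, t in zip(predictions, targets))
```

Here `predictions` holds the values C_(i,j) returned by one chromosome across the fitness cases and `targets` the corresponding T_j.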

The second step consists of choosing the set of terminals T and the set of functions F used to create the chromosomes. In this problem, the terminal set obviously consists of the independent variable, while the function set can be chosen from basic arithmetic operators and elementary functions according to prior knowledge of the problem.

PPR, which was developed by Friedman and Stuetzle [...], approximates the regression surface by a sum of ridge functions, ŷ = Σ_i β_i f_i(α_iᵀx), where each smooth function f_i operates on a one-dimensional projection α_iᵀx of the descriptor vector x and the β_i are weights. The projection directions α_i and the functions f_i are estimated iteratively, so that the algorithm "pursues" the most informative projections of the data.
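A crude way to convey one projection-pursuit step is to scan random unit directions and keep the one whose ridge-function fit leaves the smallest residual; real PPR optimizes the direction far more cleverly (and uses a nonparametric smoother rather than a fixed-degree polynomial), so the following is only a sketch:

```python
import numpy as np

def fit_one_ridge(X, y, n_dirs=200, deg=3, rng=None):
    """One projection-pursuit step, crudely: scan random unit
    directions alpha, fit a polynomial ridge function f(alpha^T x)
    along each, and keep the direction with the smallest residual
    sum of squares."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best = (np.inf, None, None)
    for _ in range(n_dirs):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)               # unit projection direction
        t = X @ a                            # one-dimensional projection
        coef = np.polyfit(t, y, deg)         # ridge function as polynomial
        resid = y - np.polyval(coef, t)
        sse = resid @ resid
        if sse < best[0]:
            best = (sse, a, coef)
    return best[1], best[2]
```

Full PPR would subtract the fitted ridge term from y and repeat, accumulating the sum Σ_i β_i f_i(α_iᵀx).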

Most QSAR/QSPR models capture the global structure-activity/property trends present in a whole dataset. In many cases, however, there may be groups of molecules that exhibit a specific set of features relating to their activity or property; such a feature can be said to represent a local structure-activity/property relationship, and traditional models may not recognize such local relationships. LLR is an approach that extracts a prediction by locally interpolating the neighboring examples of the query that are considered relevant according to a distance measure, rather than by considering the whole dataset. The basic core of this approach is the simple assumption that similar compounds have similar activities or properties; that is, the activities or properties of molecules change concurrently with changes in the chemical structure. For one or more query points, the "lazy" learner estimates the value of an unknown multivariate function on the basis of a set of possibly noisy samples of the function itself. Each sample is an input/output pair, where the input is a vector and the output is a number. For each query point, the estimate is obtained by combining different local models. The local models considered for combination are polynomials of zeroth, first and second degree that fit a set of samples in the neighborhood of the query point. The neighbors are selected according to either the Manhattan or the Euclidean distance, and weights can be assigned to the different directions of the input domain to modify their importance in the computation of the distance. The number of neighbors used for identifying the local models is automatically adjusted on a query-by-query basis through a leave-one-out validation of models, each fitting a different number of neighbors.
The local models are identified using the recursive least-squares algorithm, and the leave-one-out cross-validation is obtained through the PRESS statistic. It is assumed that there exists a small region NN(x) around the query point x within which the unknown function can be well approximated by a local polynomial model fitted to the neighbors contained in NN(x).
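The neighbor-based procedure (distance ranking, a local fit for each candidate neighborhood size k, leave-one-out scoring, prediction with the best k) can be sketched as below. For brevity this uses a plain least-squares solve and brute-force LOO instead of the recursive least-squares/PRESS machinery, and fits only first-degree local models:

```python
import numpy as np

def llr_predict(X, y, q, k_range=range(5, 16)):
    """Local lazy regression sketch: for query q, rank training points
    by Euclidean distance; for each candidate neighborhood size k, fit
    a local linear model and score it by its leave-one-out error;
    predict with the best-scoring neighborhood."""
    order = np.argsort(((X - q) ** 2).sum(1))   # nearest first
    best = (np.inf, None)
    for k in k_range:
        idx = order[:k]
        A = np.column_stack([np.ones(k), X[idx]])
        loo = 0.0
        for i in range(k):                      # brute-force leave-one-out
            mask = np.arange(k) != i
            coef, *_ = np.linalg.lstsq(A[mask], y[idx][mask], rcond=None)
            loo += (y[idx][i] - A[i] @ coef) ** 2
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        pred = np.array([1.0, *q]) @ coef
        if loo < best[0]:
            best = (loo, pred)
    return best[1]
```

Note that nothing is "trained" in advance: all work happens at query time, which is exactly the lazy-learning trait described above.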

In this paper, we have focused on the mathematical methods currently used as regression tools in recent QSAR/QSPR studies. These regression methods are so important for QSAR/QSPR modeling that the choice of method will, most of the time, determine whether the resulting model is successful. Fortunately, more and more new methods and algorithms, linear and nonlinear, statistical and machine-learning, have been applied to QSAR/QSPR studies, while the existing methods have been improved. However, it remains a challenge for researchers to choose suitable methods for modeling their systems. This paper may help readers become acquainted with these methods, but more practical applications are needed to reach a thorough understanding of them and thus put them to better use.