A Quantum-Based Similarity Method in Virtual Screening

Similarity searching is one of the most widely used techniques for ligand-based virtual screening. This study adopts concepts from quantum mechanics to present a state-of-the-art molecular similarity method inspired by quantum theory. The representation of molecular compounds in a mathematical quantum space plays a vital role in the development of a quantum-based similarity approach. One of the key concepts of quantum theory is the use of complex numbers. Hence, this study proposes three techniques to embed and re-represent molecular compounds in complex-number form. The quantum-based similarity method developed in this study, which relies on a complex pure Hilbert space of molecules, is called Standard Quantum-Based (SQB). The recall of retrieved active molecules at the top 1% and top 5% of the ranking, together with a significance test, was used to evaluate the proposed methods. The MDL Drug Data Report (MDDR), Maximum Unbiased Validation (MUV) and Directory of Useful Decoys (DUD) data sets were used for the experiments and were represented by 2D fingerprints. Simulated virtual screening experiments show that the effectiveness of the SQB method increased significantly compared to the Tanimoto benchmark similarity measure, owing to the representational power of molecular compounds in complex-number form.


Introduction
Virtual screening refers to the use of a computer-based method to process compounds from a library or database of compounds in order to identify and select the ones that are likely to possess a desired biological activity, such as the ability to inhibit the action of a particular therapeutic target. The selection of molecules with a virtual screening algorithm should yield a higher proportion of active compounds, as assessed by experiment, relative to a random selection of the same number of molecules [1,2].
Many virtual screening (VS) approaches have been implemented for searching chemical databases, such as substructure search, similarity searching, docking and QSAR [3]. Of these, similarity searching is the simplest and one of the most widely used techniques for ligand-based virtual screening (LBVS) [4]. The growing importance of similarity searching is due in particular to its role in lead optimization in drug discovery programs, where the nearest neighbors of an initial lead compound are sought in order to find better compounds.
Many studies in the literature address the measurement of molecular similarity [5][6][7]. Similarity searching scans chemical databases to identify the molecules that are most similar to a user-defined reference structure, using some quantitative measure of intermolecular structural similarity. The most common approaches are based on 2D fingerprints, with the similarity between a reference structure and a database structure computed using association coefficients such as the Tanimoto coefficient [2,8]. Several methods have been used to further optimize the measures of similarity between molecules, including weighting, standardization, and data fusion [9][10][11][12][13].
Similarity measures play a significant role in quantifying pairwise molecular similarity. The similarity method developed in this study is inspired by quantum mechanics theory. Quantum mechanics was recently employed in the information retrieval field [14][15][16] and, because of the many similarities between text retrieval and chemical information retrieval, it was adapted for chemoinformatics as well [17]. These analogies provide the basis for the work in this paper, which introduces a new similarity method inspired by quantum theory to calculate the similarity of chemical database compounds to reference structures. The Standard Quantum-Based (SQB) similarity method requires a re-representation of molecular compounds so that they fit the mathematical quantum space, which is called complex Hilbert space. The molecular compounds are represented in complex-number form, which plays a vital role in the development of the SQB method, since the use of complex numbers is one of the key concepts of the mathematical formalism of quantum theory. This study also developed three different techniques to re-represent and embed the molecular compounds in complex-number form. Finally, the screening experiments were simulated with three popular data sets converted to Pipeline Pilot ECFC_4 2D fingerprints: MDL Drug Data Report (MDDR), Maximum Unbiased Validation (MUV) and Directory of Useful Decoys (DUD).

Related Work
In a broad sense, chemoinformatics is not restricted to molecular searching and drug design, but includes several classical chemical disciplines such as physical chemistry, medicinal chemistry, analytical chemistry, and others. Hence, the interaction among disciplines plays a fundamental role in generating new approaches to chemoinformatic problems, such as quantum-chemistry approaches to the mathematical aspects of chemistry. In general, there is no substitute for quantum-chemistry approaches in describing the behavior of atoms during chemical bonding. This has led several scientists to consider only molecular descriptors derived from quantum chemistry. Quantum-chemical descriptors can be classified by type into orbital-based, energy-based, wave-function-based, and others [18].
The bridge between chemical concepts and quantum chemistry was introduced by the Atoms-in-Molecules (AIM) quantum approach [19]. Quantum-chemical algorithms in QSAR/QSPR play a fundamental role in providing higher accuracy than force-field-based methods. Holder et al. [20] employed quantum-chemical concepts first to verify the graph theory of molecules and second to predict the refractive index of polymer matrices. Quantum-chemistry descriptors in the development of QSAR/QSPR deal with the chemical, physical, biochemical, and pharmacological properties of compounds [21,22]. Combining 2D and 3D connectivity molecular descriptors with quantum-chemistry descriptors proved preferable for QSPR to using each type of connectivity individually [23]. In the same context, quantum-chemical approaches are also used with 3D-QSAR models to calculate stereo-electronic properties [24][25][26]. Another study [27] employed quantum chemistry, based on properties of molecular orbital calculations, to overcome the limitations of classical QSAR approaches.
In addition, quantum mechanics and its computing power have been used in docking and in the improvement of known lead compounds to provide the highest accuracy [28][29][30]. Quantum Mechanics Methods (QMM) have been extensively applied with linear scaling to evaluate the binding enthalpy between ligands and proteins [31][32][33][34]. QMM have also been used to calculate energies and optimize molecular structures at the semi-empirical level [35]. In the drug discovery process, QMM have been used to investigate and describe electronic properties of molecules such as electronic and polarization effects, charge distribution, and bond state (forming/breaking) [36,37].
One approach used for compound similarity searching is the Molecular Quantum Similarity Measure (MQSM), proposed by Carbó [38]. The MQSM approach depends on quantum-chemical descriptors to measure the similarity/diversity of molecules through the analysis of the electron density function, which is calculated by QMM. Consequently, QSAR models were developed based on MQSM [39][40][41]. In addition, Molecular Quantum Self-Similarity Measures (MQSSM) were developed from MQSM by comparing each molecule with itself [37,42,43]. MQSM was also employed to classify the quantum objects of molecules using dendrograms [44].
Moreover, Maldonado et al. [7] presented a comprehensive study of the applications and theories of molecular similarity measures in chemoinformatics. In general, measures of molecular similarity rely on three main factors. The first is the set of molecular structure features used to detect the similarity/diversity of molecular compounds, known as descriptors. Molecular descriptors are derived from 1D, 2D, and 3D molecular structures, and they reflect the chemical, physical and biological properties as well as the order of atoms and chemical bonds [5,7]. The second is the similarity coefficient; coefficients are mostly categorized as distance, correlation and probabilistic coefficients, for instance Tanimoto, Euclidean distance, and Pearson. More details about similarity coefficients are given in the recent comprehensive study by Todeschini et al. [8]. The third is the degree of importance of molecular fragments, represented by weighting schemes. Various techniques have been used for this purpose, such as the Bayesian inference network [11]. As the above studies show, Quantum Mechanics Methods have been used at the semi-empirical level of computational chemistry, whether for quantum-chemical descriptors or quantum similarity measures. In this study, quantum mechanics concepts are adopted as a similarity searching method to find the similarity/dissimilarity between reference and library molecules.

Quantum Model
Each similarity measure is made up of two elements: a mathematical representation of the relevant molecular information (e.g., vectors, graphs, functions) and some form of index or coefficient compatible with the representation. Similarity coefficients can be classified into correlation, association, distance, and probabilistic coefficients [7,10,45]. The quantum-based similarity method is inspired by the quantum probability formalism, which can be considered a geometric generalization of standard probability theory that makes use of Hilbert space, subspaces and unit vectors. Quantum physics, in turn, offers a probabilistic, logical and geometric formalism based on the mathematics of Hilbert space to describe the behavior of matter at atomic and subatomic scales. In this paper, we employ the concepts of quantum mechanics theory at two levels. First, we create a quantum framework for virtual screening by representing molecular compounds and references in complex Hilbert space via three proposed embedding techniques, with real Hilbert space as a special case of the representation. Second, we present a new similarity method, namely Standard Quantum-Based (SQB), based on the components of the quantum model.
The following subsections present the components of the quantum model in more detail.

Creating Molecular Compounds Subspace
The mathematical representation of molecular information is a non-trivial task, and it plays a fundamental role in any similarity measure. The molecular compounds must therefore be reformulated to fit the quantum mechanics formalism. The probabilistic formalism of quantum theory relies on a multidimensional representation of objects. Probabilistic events are represented as subspaces of a Hilbert space, which extends the notion of Euclidean space (possibly to infinitely many dimensions). Each subspace is spanned by basis vectors. Together, these components provide the geometry of the chemical-information space. Moreover, the inner product supplies the geometry from which the probability of similarity between the reference structure and library compounds can be derived. The quantum probability formalism thus establishes a strong connection between probabilities and geometry.
The quantum probability framework, which relies on the linear algebra of Hilbert space, can be thought of as a generalization of classical probability. Working with Hilbert space leads us to use Dirac notation (bra-ket notation), as is customary in quantum theory [46]. A vector of the Hilbert space is denoted by $|\phi\rangle$ and called a ket, while the bra $\langle\phi|$ is the conjugate transpose of the ket (i.e., $\langle\phi| = |\phi\rangle^{\dagger}$). For simplicity, let us assume that the chemical-information space includes three subspaces $\{x, y, z\}$ as shown in Figure 1. The probability distribution over these subspaces is defined through a finite sample space $\delta$, where each $f_i$ corresponds to a distinct fragment and hence to a basis vector of the chemical-information space. The latter requires a basis that is both orthogonal and normalized, which is known as orthonormal (i.e., $\langle f_i | f_j \rangle = \delta_{ij}$, which vanishes where $i \neq j$). For instance, each vector in the orthonormal basis of a subspace has norm 1. The probability distribution $pr(f_i)$ is associated with each fragment in $\delta$. Then, the probability of subspace $x$ can be defined in space $\delta$ as:

Creating Reference Density
The query is better represented by a subspace, which is spanned by the basis vectors mentioned in the query. Gleason's theorem plays a fundamental role in computing probabilities over subspaces of a Hilbert space: it allows these probabilities to be expressed in terms of density operators, which are Hermitian. The density operator is a self-adjoint linear operator that belongs to a certain sub-class of self-adjoint operators. Each distinct fragment $f$ of the sample space is associated with a one-dimensional projector $P_f$ corresponding to the one-dimensional subspace defined by $|f\rangle$, which is equal to $|f\rangle\langle f|$.
Then, all information about the probability distribution is contained in a density operator $D$ defined by $D = \sum_{f} pr(f)\, P_f$, where the probabilities $pr(f)$ add up to 1.
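The construction of the density operator from fragment probabilities can be illustrated with a minimal sketch. It assumes a finite orthonormal fragment basis in which fragment $f$ is the $f$-th standard basis vector, so that each projector is a matrix with a single 1 on the diagonal. The function names (`projector`, `density_operator`, `subspace_probability`) are hypothetical and are not taken from the original work.

```python
from typing import List

Matrix = List[List[float]]


def projector(index: int, dim: int) -> Matrix:
    """One-dimensional projector |f><f| for the index-th basis fragment."""
    return [[1.0 if (r == index and c == index) else 0.0 for c in range(dim)]
            for r in range(dim)]


def density_operator(pr: List[float]) -> Matrix:
    """Weighted sum of fragment projectors, D = sum_f pr(f) |f><f|."""
    dim = len(pr)
    d = [[0.0] * dim for _ in range(dim)]
    for f, p in enumerate(pr):
        pf = projector(f, dim)
        for r in range(dim):
            for c in range(dim):
                d[r][c] += p * pf[r][c]
    return d


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def trace(m: Matrix) -> float:
    """Sum of the diagonal entries."""
    return sum(m[i][i] for i in range(len(m)))


def subspace_probability(d: Matrix, fragment_indices: List[int]) -> float:
    """tr(D P_x), where P_x projects onto the span of the chosen fragments."""
    dim = len(d)
    px = [[0.0] * dim for _ in range(dim)]
    for f in fragment_indices:
        px[f][f] = 1.0
    return trace(matmul(d, px))
```

With fragment probabilities 0.5, 0.3 and 0.2, the resulting operator has trace 1, and the probability of the subspace spanned by the first and third fragments is the sum of their individual probabilities, which is how the quantum formalism recovers classical probability in this diagonal case.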

Complex Hilbert Space
One fundamental aspect of the mathematical framework of quantum physics is the presence of complex numbers. In contrast, traditional ligand-based virtual screening models, such as vector space models and probability models, rely on real numbers only. A complex number consists of two parts, a real part and an imaginary part, and can be expressed in the form $x + yi$, where $i = \sqrt{-1}$. Real numbers, which consist of only a real part, can be considered a special case. Complex numbers, as one of the key concepts of the mathematical formalism of quantum theory, therefore provide more freedom of representation.
Complex numbers have not previously been used in LBVS, where molecular structures are represented only by real-valued vectors of molecular information, whereas quantum models represent information with complex-valued vectors. Although quantum models deal with complex Hilbert space, the result of an interaction between complex vectors is always real. The first attempt to sketch out the use of complex numbers in information retrieval was proposed in [14], which stored the term frequency and the inverse term frequency in the imaginary and real components, respectively.
This study embeds the complex numbers of the quantum model in LBVS by means of weighting functions, in three proposed ways. In all three, the global weight of a fragment, which is a function of how many times the fragment occurs in the entire compound collection, is embedded in the real component. The three techniques differ in the representation of the imaginary part. The first technique uses only a local weight function, where the local weight of a fragment is a function of how many times the fragment occurs in a compound or reference structure. The second technique uses both local and global weight functions. The third technique employs the Okapi weight function, which is composed of three weighting schemes: local, global, and a new scheme produced by integrating the local and global schemes. The Okapi weighting function with its three components was previously developed by our research group as a new weighting function for a molecular similarity method based on the Bayesian network [47]; this study modifies the Okapi function to embed the complex numbers of the quantum model. The complex-valued representation of molecular information is obtained by encoding the fragments of molecules. The real-valued representation is also taken into account as a special case of the complex representation, which produces a real Hilbert space. The different representations given by the three proposed techniques play a vital role when calculating the similarity of molecules. The real part is given by Equation (3).
The imaginary components of the three proposed techniques are given by Equations (4)-(6), respectively, where $ff_{ij}$ and $ff_{ir}$ are the frequencies of the $i$-th fragment within the $j$-th compound and the reference structure $r$, respectively, $cf_i$ is the number of compounds containing the $i$-th fragment, $c_j$ is the size (in terms of the number of fragments) of the $j$-th compound, $c_{avg}$ is the average size of all the compounds in the database, and $m$ is the total number of compounds.
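The general idea of the first embedding technique (global weight in the real component, local weight in the imaginary component) can be sketched as follows. The exact weighting functions of Equations (3)-(6) are not reproduced here, so this sketch substitutes two stand-ins that are purely assumptions: a logarithmic inverse-frequency global weight, $\log(m / cf_i)$, and the raw fragment frequency $ff_{ij}$ as the local weight. The function names are hypothetical.

```python
import math
from typing import Dict


def global_weight(cf_i: int, m: int) -> float:
    """Assumed global weight: log(m / cf_i). A stand-in, not Equation (3)."""
    return math.log(m / cf_i)


def embed_technique_1(ff: Dict[int, int], cf: Dict[int, int], m: int) -> Dict[int, complex]:
    """Map each fragment of one compound to a complex weight.

    ff: fragment index -> frequency of the fragment in this compound (ff_ij)
    cf: fragment index -> number of compounds containing the fragment (cf_i)
    m:  total number of compounds in the collection
    Real part: assumed global weight. Imaginary part: assumed local weight (raw frequency).
    """
    return {i: complex(global_weight(cf[i], m), float(ff_ij))
            for i, ff_ij in ff.items()}


# Illustrative values only: two fragments in a collection of 1000 compounds.
example = embed_technique_1(ff={3: 2, 7: 1}, cf={3: 10, 7: 100}, m=1000)
```

In this sketch a rare fragment (small $cf_i$) receives a larger real part, while a fragment that recurs within the compound receives a larger imaginary part. Setting every imaginary part to zero yields the real-valued special case.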

Tanimoto-Based Similarity Model
This model uses the continuous form of the Tanimoto coefficient, which is applicable to the non-binary data of the fingerprint. $S_{K,L}$ is the similarity between objects or molecules $K$ and $L$, which, using Tanimoto, is given by Equation (7):
$$S_{K,L} = \frac{\sum_{j=1}^{M} w_{jK}\, w_{jL}}{\sum_{j=1}^{M} (w_{jK})^2 + \sum_{j=1}^{M} (w_{jL})^2 - \sum_{j=1}^{M} w_{jK}\, w_{jL}} \qquad (7)$$
For molecules described by continuous variables, the molecular space is defined by an $M \times N$ matrix, where entry $w_{ji}$ is the value of the $j$-th fragment ($1 \leq j \leq M$) in the $i$-th molecule ($1 \leq i \leq N$). The origins of this coefficient can be found in [48].
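The continuous Tanimoto coefficient can be computed directly from two count vectors, as in the minimal sketch below. The function name is hypothetical, and returning 0.0 when both vectors are all zeros is a convention chosen here, not one stated in the text.

```python
from typing import Sequence


def tanimoto_continuous(wk: Sequence[float], wl: Sequence[float]) -> float:
    """Continuous Tanimoto similarity between two fragment-count vectors."""
    if len(wk) != len(wl):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(wk, wl))   # sum of w_jK * w_jL
    sq_k = sum(a * a for a in wk)              # sum of w_jK squared
    sq_l = sum(b * b for b in wl)              # sum of w_jL squared
    denom = sq_k + sq_l - dot
    # Assumed convention: two empty fingerprints have similarity 0.
    return dot / denom if denom else 0.0
```

For the count vectors (1, 2, 0) and (1, 1, 1), the numerator is 3 and the denominator is 5 + 3 - 3 = 5, giving a similarity of 0.6. Identical non-zero vectors give 1.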

Standard Quantum-Based Similarity Model
The probability computed by this model relies on the quantum framework constructed from molecular compounds, as shown in the previous section. The probability that a molecule is relevant to the user's reference structure is determined by the projection of its vector representation onto the corresponding subspace. Let us assume that the probabilistic event of a molecule in Hilbert space $H$ is represented as a subspace $S_m$. A probability measure $\mu$ can first be defined for a pure chemical-information space represented as a unit vector $|\phi\rangle$, by computing the square of the length of the projection of $|\phi\rangle$ onto the molecular subspace $S_m$ (i.e., $\lVert S_m|\phi\rangle\rVert^2$, where $S_m$ also denotes the projector onto that subspace). The probability can be computed by $\mu(S_m) = \lVert S_m|\phi\rangle\rVert^2 = \mathrm{tr}(S_m d)$, where tr is the trace operator and $d = |\phi\rangle\langle\phi|$ is called the density operator. In general, any operator $p$ that is both positive semi-definite (meaning $v^{\dagger} p\, v \geq 0$ for any vector $v$) and of trace 1 defines a probability distribution over the subspaces [14].
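The projection-based probability can be sketched for the simplest case, in which the molecular subspace is one-dimensional and spanned by the normalized complex vector of a single molecule, while the unit vector $|\phi\rangle$ is the normalized complex vector of the reference structure. This pairing, and the function names, are assumptions made for illustration. The sketch also checks numerically that the squared projection length agrees with the trace form $\mathrm{tr}(S_m d)$.

```python
import math
from typing import List

Vector = List[complex]


def normalize(v: Vector) -> Vector:
    """Scale a complex vector to unit length."""
    norm = math.sqrt(sum(abs(z) ** 2 for z in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [z / norm for z in v]


def inner(a: Vector, b: Vector) -> complex:
    """Inner product <a|b>, conjugate-linear in the first argument."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def sqb_probability(reference: Vector, molecule: Vector) -> float:
    """Squared length of the projection of the reference state onto span{molecule}."""
    phi = normalize(reference)
    m = normalize(molecule)
    return abs(inner(m, phi)) ** 2


def trace_form(reference: Vector, molecule: Vector) -> float:
    """Same quantity computed as tr(S d), with S = |m><m| and d = |phi><phi|."""
    phi = normalize(reference)
    m = normalize(molecule)
    total = 0j
    for i in range(len(phi)):
        for k in range(len(phi)):
            s_ik = m[i] * m[k].conjugate()      # entry (i, k) of the projector S
            d_ki = phi[k] * phi[i].conjugate()  # entry (k, i) of the density operator d
            total += s_ik * d_ki
    return total.real
```

Although the vectors are complex, the returned value is a real number between 0 and 1. It equals 1 when the two vectors coincide up to a global phase and 0 when they are orthogonal.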

Experimental Design
The experiments were carried out using the most popular chemoinformatics databases: the MDL Drug Data Report (MDDR) [49], Maximum Unbiased Validation (MUV) [50], and Directory of Useful Decoys (DUD) [51]. All molecules in these databases were converted to Pipeline Pilot ECFC_4 fingerprints (extended connectivity fingerprints folded to a size of 1024 bits) [52]. These data sets have been used recently by our research group [11][12][13][47].
The screening experiments were performed with ten reference structures selected randomly from each activity class. The same reference structures were used for TAN and for the four cases of the SQB similarity method. For the MDDR database, three data sets (MDDR-DS1, MDDR-DS2 and MDDR-DS3) with 102,516 molecules were chosen. The MDDR-DS1 data set contains 11 activity classes, with some of the classes involving actives that are structurally homogeneous and others involving actives that are structurally heterogeneous (i.e., structurally diverse). The MDDR-DS2 data set contains 10 homogeneous activity classes, while the MDDR-DS3 data set contains 10 heterogeneous activity classes. Details of these three data sets are given in Tables 1-3. Each row of a table contains an activity class, the number of molecules belonging to the class, and the diversity of the class, which was computed as the mean pairwise Tanimoto similarity calculated across all pairs of molecules in the class using ECFC_4.
The second data set, MUV, shown in Table 4, was reported by Rohrer and Baumann [50]. It contains 17 activity classes, with each class containing up to 30 actives and 15,000 inactive molecules. The class diversity values for this data set show that its activity classes are highly diverse, that is, more heterogeneous. This data set was also used in previous studies by our research group [12,47].
The third data set used in this study is the Directory of Useful Decoys (DUD), which was recently compiled as a benchmark data set specifically for docking methods. It was introduced in [51] and was used recently in molecular virtual screening [53] as well as molecular docking [54]. The decoys for each target were chosen specifically to fulfil a number of criteria that make them relevant and as unbiased as possible. In this study, twelve subsets of DUD with 704 active compounds and 25,828 decoys were used, as shown in Table 5.
This study presents the SQB similarity method, which deals with complex and real Hilbert spaces. The complex Hilbert space is generated by three different proposed techniques (i.e., SQB-Complex Tech. 1, SQB-Complex Tech. 2, and SQB-Complex Tech. 3), while the real Hilbert space is generated as a special case of the complex space (i.e., SQB-Real). The retrieval results obtained by the four cases of the SQB method were compared with those of Tanimoto (TAN). The TAN coefficient has been used in ligand-based virtual screening for many years and is now considered a reference standard.

Results and Discussion
The experimental results on MDDR-DS1, MDDR-DS2, MDDR-DS3, MUV, and DUD are presented in Tables 6-10, respectively, using cut-offs at 1% and 5%. These tables report the results of the SQB method for four different representations of molecular compounds, compared with the benchmark TAN. The three complex-space techniques (Tech. 1 to Tech. 3) use different weighting schemes to embed molecules in complex-number form, and the real Hilbert space is generated as a special case of the quantum space. Each row in the tables lists the recall for the top 1% and 5% of an activity class, and the best recall rate in each row is shaded. The Mean rows in the tables correspond to the mean when averaged over all activity classes (the best average is in bold), and the Shaded Cells rows correspond to the total number of shaded cells for each technique across the full set of activity classes.
The recall values for MDDR-DS1, MDDR-DS2 and MDDR-DS3 reported in Tables 6-8, respectively, show that the proposed SQB method with complex Hilbert space is clearly superior to TAN, especially at the 1% cut-off for DS1-DS3 and at the 5% cut-off for DS1 and DS2. The complex Hilbert space is generated by re-representing molecular compounds in complex-number form in three different ways. The Okapi technique (Tech. 3) gave better retrieval results than TAN in DS1-DS3 for the top 1%. For the 5% cut-off, Tech. 1 and Tech. 2 provided better retrieval results than TAN for DS1 and DS2, respectively. In contrast, the results of the SQB method with real space are slightly inferior to TAN at both the 1% and 5% cut-offs for DS1 and DS3, while it outperformed TAN for DS2 at the top 5%.
For the MUV data set, Table 9 shows the results of SQB in the complex and real cases as well as TAN. The SQB method with complex representation gave better retrieval results than TAN for nine activity classes at 1% and for eight activity classes at 5%; consequently, its mean recall is better than that of TAN. The real representation of the SQB method provided superior recall values in five activity classes, and its overall mean outperformed TAN for the top 5%, but the best mean achieved by SQB was with the Tech. 3 complex representation for both cut-offs. For the DUD data set, reported in Table 10, the best retrieval results for both cut-offs were also obtained by SQB with the Tech. 3 complex representation, both in the number of activity classes and in the overall mean. In contrast, SQB with real space representation outperformed TAN at the 1% cut-off but was inferior to TAN at the 5% cut-off. Statistical analysis was therefore required to provide a firm judgment about the performance of the proposed methods.
Some activity classes, such as low-diversity activity classes, may contribute disproportionately to the overall value of mean recall. Using the mean recall value as the evaluation criterion could therefore favor some methods over others. To avoid this bias, the performance of the different methods was further investigated based on the total number of shaded cells for each method across the full set of activity classes, shown in the bottom row of Tables 6-10. According to the total number of shaded cells in these tables, the Tech. 3 representation of the SQB method was the best-performing search at the top 1% across the three MDDR data sets. At the top 5%, TAN equaled Tech. 3 of SQB only in the high-diversity data set (MDDR-DS3), while for the other data sets SQB was preferable. Moreover, the SQB method was superior in terms of the total number of shaded cells in the MUV and DUD data sets.
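The recall figures at the 1% and 5% cut-offs can be computed as in the following sketch. It assumes that recall is the percentage of an activity class's actives found among the top-ranked fraction of the database, and that the cut-off size is rounded up to a whole number of molecules. Both the rounding rule and the function name are assumptions rather than details given in the text.

```python
from typing import Sequence


def recall_at_cutoff(scores: Sequence[float], is_active: Sequence[bool], percent: int) -> float:
    """Percentage of actives retrieved in the top `percent`% of the ranked database."""
    n = len(scores)
    n_actives = sum(1 for flag in is_active if flag)
    if n == 0 or n_actives == 0:
        return 0.0
    # Assumed rounding rule: ceiling of percent% of the database, at least one molecule.
    top_n = max(1, (percent * n + 99) // 100)
    # Rank molecule indices by decreasing similarity to the reference structure.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:top_n] if is_active[i])
    return 100.0 * hits / n_actives
```

For a database of 100 molecules with four actives ranked 1st, 4th, 51st and 100th, the top 1% contains one active (recall 25%) and the top 5% contains two (recall 50%).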
Kendall's W test of concordance was used to rank the performance of the similarity methods for the MDDR, MUV and DUD data sets. Here, the recall values for all activity classes (11 classes for DS1, 10 classes each for DS2 and DS3, 17 classes for MUV, and 12 classes for DUD) were treated as judges (raters) ranking the similarity methods (ranked objects). The outputs of this test are the Kendall coefficient ($W$), the chi-square statistic ($\chi^2$) and the significance level ($p$ value). The $p$ value is considered significant if $p < 0.05$, in which case it is possible to give an overall ranking of the similarity methods. For instance, the value of the Kendall coefficient for DS1 in Table 11 is 0.222, the $p$ value is significant ($p < 0.05$), and the overall ranking of the similarity methods is: SQB(C./T3) > SQB(C./T1) > SQB(C./T2) > TAN > SQB(R.). In Table 11, the results of the Kendall W test for the top 1% show that the associated probability ($p$) is less than 0.05 for all data sets used. This indicates that the ranking of the SQB method at the 1% cut-off is significant in all cases. Consequently, the overall ranking of techniques indicates that SQB with the Tech. 3 complex representation is superior to TAN and to SQB with real representation.
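The Kendall coefficient and its chi-square statistic can be computed from a table of recall values as sketched below, where each row is an activity class (rater) and each column a similarity method (ranked object). The sketch uses the textbook formula $W = 12S / (k^2 (n^3 - n))$ with average ranks for tied values and no tie correction. Whether the original analysis applied a tie correction is not stated, so this is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks from 1 (smallest) to n (largest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def kendall_w(recall_table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (W, chi-square) for k raters (rows) ranking n methods (columns)."""
    k = len(recall_table)
    n = len(recall_table[0])
    rank_sums = [0.0] * n
    for row in recall_table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_rank_sum = k * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    w = 12 * s / (k ** 2 * (n ** 3 - n))
    chi_square = k * (n - 1) * w  # compared against chi-square with n - 1 degrees of freedom
    return w, chi_square
```

Under this untied formula, the relation $\chi^2 = k(n-1)W$ with $W = 0.222$, $k = 11$ classes and $n = 5$ methods gives $\chi^2 \approx 9.77$ on 4 degrees of freedom, which lies just above the 5% critical value of 9.488 and so agrees with a significant result for DS1. The chi-square value actually reported in Table 11 may differ if ties were handled differently.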
The Kendall W test results for the similarity methods at the top 5% for the three data sets are reported in Table 12. The p values for the MUV and DUD data sets are 0.031 and 0.0001, respectively, which indicates that the Tech. 3 representation of the SQB method still outperformed the other methods. For the MDDR data sets, in contrast, the Tech. 1 and Tech. 2 representations significantly outperformed the other methods for DS1 and DS2, and only for MDDR-DS3 did TAN achieve a better ranking than the other methods.
The results of the MDDR search in Tables 6-8 show that the use of complex-number form in the Hilbert space of the SQB method at the 1% cut-off produced the highest mean values and number of shaded cells compared with TAN and the real representation of the SQB method. The best p value at the top 1% was 0.0006, for the MDDR-DS3 data set. The technique that employed the Okapi function to embed the molecules in complex-number form provided the best retrieval results compared with the other embedding techniques as well as the real representation and TAN. It was preferable for five activity classes and for the mean recall for DS1-DS3. However, the mean retrieval result of the real SQB method was slightly inferior to TAN for DS1 and DS2. At the 5% cut-off, TAN was preferable to SQB complex Tech. 3 in mean recall for the MDDR-DS3 data set, even though both methods had the same number of shaded cells. For the MDDR-DS1 data set the situation is reversed: the Tech. 1 complex representation of the SQB method outperformed TAN in mean recall, although both methods had the same number of shaded cells. In contrast, for the homogeneous data set, the mean of the real SQB method was superior to that of TAN. While the MDDR-DS2 data set includes highly similar actives, the MUV and DUD data sets have been carefully designed to include sets of highly dissimilar actives.
Most similarity methods, including the methods proposed here, show a very high recall rate for the low-diversity data set and very low recall for the high-diversity data sets, such as MDDR-DS3, MUV and DUD used in this study.
The results for the MUV and DUD data sets are shown in Tables 9 and 10, respectively. For MUV, the results at both the 1% and 5% cut-offs are slightly better for the SQB representation methods than for the benchmark TAN method. The mean recall of the complex Tech. 3 SQB method exceeded that of TAN at both cut-offs, and the proposed methods were also superior to TAN in the number of shaded cells. The real representation of the SQB method, in contrast, outperformed TAN only at the top 5%. For the DUD data set, the complex and real representations of the SQB method outperformed the other methods at both cut-offs on both criteria, namely mean recall and the number of best-performing activity classes (i.e., shaded cells). Moreover, the Kendall W test for the DUD data set shows the superiority of the complex cases of the SQB method over TAN, with p values of 0.014 for the top 1% and 0.0001 for the top 5%.
For the MUV data set, the complex Tech. 3 and Tech. 2 representations of the SQB method outperformed the other methods at a significant level, with p values of 0.0009 and 0.031 for the top 1% and 5%, respectively, while the real SQB method exceeded TAN only at the top 5%.
In summary, the proposed SQB method with four different representations of molecules was investigated in ten cases, covering three popular data sets and both the 1% and 5% cut-offs. The use of complex-number form for molecular representation proved superior to the real representation and the benchmark TAN method, with the complex SQB method outperforming TAN in nine cases. The best of the proposed complex embedding techniques is the one based on the Okapi function, which outperformed the other complex techniques, real SQB, and TAN. In contrast, the real representation of the SQB method was slightly preferable to TAN in three cases: the MDDR-DS2 and MUV data sets at the top 5%, and the DUD data set at the top 1%.

Conclusions
This study is a first attempt to adapt the concepts of quantum theory into a quantum-based similarity method for ligand-based virtual screening. Moreover, the use of complex numbers to re-represent molecular compounds in line with the mathematical quantum space was investigated via three proposed techniques based on weighting functions. The complex-number representation of molecules played a vital role in the effectiveness of the SQB method. The SQB method deals with four different spaces, depending on the representation of the molecular compounds. The results show that similarity searching was improved and that the proposed methods outperformed Tanimoto, which is considered the conventional similarity method. The best results for the three popular chemoinformatics data sets were obtained by the embedding technique based on the Okapi function.