Ligand-Based Virtual Screening Based on the Graph Edit Distance

Chemical compounds can be represented as attributed graphs. An attributed graph is a mathematical model of an object composed of two types of elements: nodes and edges. Nodes are individual components, and edges are relations between these components. In our case, pharmacophore-type features are represented by nodes and chemical bonds by edges. If we want to obtain the bioactivity dissimilarity between two chemical compounds, a distance between attributed graphs can be used. The Graph Edit Distance allows computing this distance, and it is defined as the cost of transforming one graph into another. Nevertheless, to define this dissimilarity, the transformation costs must be properly tuned. The aim of this paper is to analyse structure-based screening methods to verify the quality of the transformation costs proposed by Harper et al. and to present an algorithm that learns these transformation costs such that the bioactivity dissimilarity is properly defined in a ligand-based virtual screening application. The goodness of the dissimilarity is measured by the classification accuracy. Six publicly available datasets—CAPST, DUD-E, GLL&GDD, NRLiSt-BDB, MUV and ULS-UDS—have been used to validate our methodology and show that, with our learned costs, we obtain the highest ratios in identifying bioactivity similarity in a structurally diverse group of molecules.


Introduction
The rapid growth of chemical compound data has created the need to develop computational tools that reduce drug synthesis and drug test cycle runtimes. When activity data are analysed, these tools are required to generate new models for virtual screening techniques [1][2][3]. In the drug discovery process, virtual screening is a common step in which computational techniques are used to search and filter chemical compounds in databases. Basically, there are two main types of virtual screening methods: ligand-based virtual screening (LBVS) [4] and structure-based virtual screening (SBVS) [5]. In this work, we focus only on LBVS applications. The idea of the LBVS method is to predict the unknown activity of new molecules [6] using information about the known activity of some molecules, specifically, their behaviour as ligands that bind to a receptor.
Some LBVS approaches are shape-based similarity [7], pharmacophore mapping [6], fingerprint similarity and machine learning methods [8]. According to [9], structurally similar molecules are presumed to have similar activity properties. In the context of LBVS methods, therefore, the chosen molecular similarity metric is important because it can determine the success of a virtual screening method in discovering proper drug candidates. Various similarity methods are used in several applications [10][11][12][13][14].
To compute molecular similarity, it is necessary to define a distance and a descriptor representing the molecule. Hundreds of molecular descriptors have been reported in the literature [15]. For instance, one-dimensional descriptors include general molecular properties, such as size, molecular weight, logP, dipole moment or BCUT parameters [16][17][18][19]. Two-dimensional descriptors generate an array of representations of the molecules by simplifying the atomic information within them, such as 2D fingerprints [20][21][22]. Finally, three-dimensional descriptors use 3D information, such as molecular volume [23,24]. Other existing methods, instead of representing molecules by an N-dimensional vector, use relational structures, such as trees [25] or graphs [26,27]. Regarding the representation of molecules by graphs, some methods represent compounds using reduced graphs [28][29][30][31], and others use extensions of them, such as extended reduced graphs (ErGs) [28]. Reduced graphs group atomic sub-structures that have related features, e.g., pharmacophoric features, ring systems or hydrogen-bonding groups, among others. ErGs are an extension of reduced graphs that introduce some changes to better represent the shape, size and pharmacophoric properties of the molecules. The method presented in [28] has demonstrated its use as a powerful tool for virtual screening.
To perform reduced graph comparisons, three different similarity measures have been used: in [28,29,32], the reduced graphs are mapped into a 2D fingerprint; in [33], they are mapped into sets of shortest paths; finally, in [34,35], the comparison is performed directly on the graphs using the Graph Edit Distance (GED). The GED considers the distance between two graphs to be the minimum cost of the modifications required to transform one graph into the other. Each modification can be one of the following six operations: insertion, deletion and substitution of both nodes and edges in the graph [36][37][38]. The main goal of this paper is to present an algorithm that learns the edit costs in the GED to improve on the classification ratio obtained when the Harper costs are used.
In an initial paper [34], the edit costs were fixed to the values extracted from [33], given the chemical expertise of the authors and considering the different node and edge types. Later, in [35], the authors presented an algorithm for optimising those edit costs based on minimising the distance between correctly classified molecules and maximising the distance between incorrectly classified molecules. That work was inspired by a similar one carried out by Birchall et al. [39], in which the authors optimise the transformation costs of a String Edit Distance method to compare molecules using reduced graphs.
The main problem of the algorithm in [35] was its huge computational cost, which depends on the number of edit costs to be optimised. Thus, for practical reasons, in the experimental section of [35], they presented four experiments in which only one edit cost was optimised per experiment, fixing the other costs (126 in total) to the values defined in [33]. In contrast, starting from the costs defined by [33], our method learns the whole set of edit costs of the GED at a lower computational cost, obtaining higher classification ratios in the ligand-based screening application, as shown in the experimental section.
The outline of this paper is as follows. First, materials and methods are explained in detail, including the datasets, the GED and the learning algorithm. Second, the results of the practical experiments are described and discussed. Third and finally, a general discussion about the method and the results is presented.

Datasets
In this study, we have used six publicly available datasets: ULS-UDS [40], GLL&GDD [41], CAPST [42], DUD-E [43], NRLiSt-BDB [44] and MUV [45]. All these datasets had been formatted and standardised by the LBVS benchmarking platform developed by Skoda and Hoksza [46]. The datasets are composed of various groups of active and inactive molecules arranged according to their target. Each group is split into two halves, the test and train sets, which are required when using machine learning methods. The train set is used to optimise the transformation costs, and the test set is used to evaluate the classification ratio. The targets of the datasets are shown in Table 1. In our experimentation, we have taken a subset of the first 100 active molecules and the first 100 inactive molecules per target. Some datasets have fewer than 100 active molecules; in this case, all active molecules are taken, together with the same number of inactive molecules.
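The per-target selection described above can be sketched as follows (a minimal illustration with our own function name and list-based representation, not code from the benchmarking platform):

```python
def select_subset(actives, inactives, limit=100):
    """Take the first `limit` active molecules (or all of them when the
    target has fewer) and the same number of inactive molecules."""
    k = min(limit, len(actives))
    return actives[:k], inactives[:k]
```

For a target with 250 actives, this yields 100 actives and 100 inactives; for a target with only 40 actives, it yields 40 of each.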

Molecular Representation
Reduced graphs are compact representations of chemical compounds, in which the main information is condensed in feature nodes to give abstractions of the chemical structures. Different versions of reduced graphs have been presented [26,28,30,32,33], depending on the features they summarise or their intended use. In the virtual screening context, the structures are reduced to track down features or sub-structures that have the potential to interact with a specific receptor while, at the same time, trying to keep the topology and spatial distribution of those features. Figure 1 presents an example of molecule reduction.

The Proposed Method
We explain our proposed method in the next three subsections. The first one explains the classification of compounds based on structural information; in the second one, we explain the learning algorithm; and in the third one, we detail the code of the algorithm.

Classification of Molecules
Once the molecules are represented as ErGs, we can compare them by means of the Graph Edit Distance (GED) [47,48]. The GED is defined as the minimum cost of the transformations required to convert one graph into the other; thus, in our application, it is the cost of transforming one ErG into another. These modifications are called edit operations, and six of them have been defined: insertion, deletion and substitution of both nodes and edges. To classify a molecule, we apply the Nearest Neighbour (NN) strategy, which consists of calculating the GED between this molecule and the molecules whose class is known, and predicting its class (active or inactive) to be the class of the nearest molecule. If the molecule is equidistant from more than one classified molecule, the method arbitrarily selects one of the closest molecules.
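The NN strategy above can be sketched in a few lines (our own rendering; `dist` stands for any graph distance, in the paper the GED between ErGs):

```python
def nn_classify(query, labelled, dist):
    """Nearest-Neighbour prediction: return the class of the labelled
    molecule closest to `query` under the distance `dist`.
    `labelled` is a list of (molecule, label) pairs. Python's min()
    returns the first minimum it encounters, so equidistant molecules
    are resolved arbitrarily, as in the text."""
    _, label = min(labelled, key=lambda pair: dist(query, pair[0]))
    return label
```

For example, with a toy numeric "distance", `nn_classify(8, [(0, "inactive"), (10, "active")], lambda a, b: abs(a - b))` predicts `"active"`, since 8 is closer to 10 than to 0.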
Edit costs have been introduced to quantitatively evaluate each edit operation. The aim of the edit costs is to assign each operation a penalty proportional to the extent to which it modifies the molecule. For instance, when ErGs are compared, it makes sense that substituting a "hydrogen-bond donor" feature with a joint "hydrogen-bond donor-acceptor" feature be less heavily penalised than substituting a "hydrogen-bond donor" feature with an "aromatic ring" system. Similarly, inserting a single bond should have a lower penalisation cost than inserting a double bond, and so on. In a previous work [34], the edit costs proposed by Harper et al. [33] were used. The node and edge descriptions are shown in Table 2, and the specific costs proposed by Harper et al. are shown in Tables 3 and 4.
The final edit cost for a given transformation sequence is obtained by adding up all of the individual edit costs. Figure 2 shows a schematic example of a transformation of a molecule G_1 into another one, G_2. As we can see, the operations executed in this transformation are: a deletion of a node of type [1], a deletion of a single edge, an insertion of a node of type [5], an insertion of a single edge, a substitution of a node of type [7] by a node of type [2], and a substitution of a single edge by a double edge. If we sum the values of the Harper costs associated with these operations in Tables 3 and 4, we obtain that the cost of this transformation equals: 2 + 0 + 2 + 0 + 3 + 3 = 10.

Table 2. Node and edge attributes description in an ErG.
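The worked example above can be checked by simply summing the individual Harper costs quoted in the text:

```python
# Edit operations applied to transform G1 into G2 (Figure 2 example),
# each paired with the Harper cost quoted in the text.
operations = [
    ("delete node of type [1]", 2),
    ("delete single edge", 0),
    ("insert node of type [5]", 2),
    ("insert single edge", 0),
    ("substitute node [7] -> node [2]", 3),
    ("substitute single edge -> double edge", 3),
]
total_cost = sum(cost for _, cost in operations)
print(total_cost)  # 10
```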

Since several transformation sequences can be applied to transform one graph into another, the GED for any pair of graphs is defined as the minimum cost over all those possible transformation sequences. Usually, the final distance is normalised by the number of nodes in both graphs being compared. This is performed in order to make the measure independent of the size of the graphs.
More formally, we define the GED as follows:

GED(G_a, G_b) = (1/L) · min_{(N_1, …, N_n)} Σ_{t=1}^{n} C_t · N_t    (1)

where C_t is the imposed cost of the t-th edit operation on nodes and edges, and N_t is the number of times this edit operation has been applied. Moreover, the combination of N_1, N_2, … is restricted to those that transform G_a into G_b. Finally, L is the sum of the numbers of nodes of both graphs, and n is the number of different edit operations on nodes and edges.
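The normalised GED can be sketched as follows, assuming the candidate transformation sequences are given as operation-count vectors (in practice these are produced by a graph-matching algorithm, not enumerated):

```python
def normalised_ged(costs, candidate_counts, n_nodes_a, n_nodes_b):
    """Minimum total edit cost over candidate transformation sequences,
    normalised by L = |nodes of Ga| + |nodes of Gb|. Each element of
    `candidate_counts` is a vector N giving how many times each edit
    operation is applied in that sequence."""
    L = n_nodes_a + n_nodes_b
    return min(sum(c * n for c, n in zip(costs, N))
               for N in candidate_counts) / L
```

For costs C = (2, 3) and two candidate sequences with counts (1, 2) and (3, 0) between graphs of 2 and 3 nodes, the totals are 8 and 6, so the normalised GED is 6/5 = 1.2.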
Several GED computational methods have been proposed during the last three decades. They can be classified into two groups: those returning the exact value of the GED at exponential computational cost with respect to the number of nodes [49], and those returning an approximation of the GED at polynomial cost [50][51][52][53]. These two groups of GED computational methods have been widely studied [54,55]. In our experiments, we used the fast bipartite graph matching method [50] (polynomial computational cost), although our learning method is independent of the matching algorithm.

Table 4. Substitution, insertion and deletion costs for edges proposed by Harper et al. [33].
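To give a flavour of the underlying optimisation, the following is a brute-force, node-only toy version of the matching problem. Unlike the bipartite method of [50], it ignores edge operations, forces every node of the smaller graph to be matched, and is exponential in graph size, so it is illustrative only:

```python
from itertools import permutations

def node_ged(nodes_a, nodes_b, c_sub, c_del, c_ins):
    """Minimum-cost assignment between two node-label lists.
    Unmatched nodes of the first (longer) list are deleted; if the
    second list is longer, the roles are swapped so its extra nodes
    count as insertions. Brute force: only usable on tiny graphs."""
    if len(nodes_a) < len(nodes_b):
        # Swapping the graphs turns deletions into insertions.
        return node_ged(nodes_b, nodes_a,
                        lambda x, y: c_sub(y, x), c_ins, c_del)
    best = float("inf")
    for assign in permutations(range(len(nodes_a)), len(nodes_b)):
        cost = sum(c_sub(nodes_a[i], nodes_b[j])
                   for j, i in enumerate(assign))
        cost += sum(c_del(nodes_a[i])
                    for i in range(len(nodes_a)) if i not in assign)
        if cost < best:
            best = cost
    return best
```

With a substitution cost of 0 for identical labels and 2 otherwise, and unit insertion/deletion costs, transforming ['A', 'B'] into ['A'] costs 1 (match A, delete B).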

Initially, the edit costs were manually set in a trial-and-error process considering the application at hand [33,34]. (As previously commented, Tables 3 and 4 show their edit cost proposal.) Nevertheless, there has been a tendency to learn these costs automatically, since it has been seen that a proper tuning of them is crucial to achieving good classification ratios in virtual screening [35] and other applications [56][57][58][59][60]. In [35], the authors presented a learning algorithm that is forced to learn only one edit cost at a time due to runtime restrictions. Thus, they performed four different experiments on the same data as [34], in each of which all the costs of [34] are kept fixed except the one being optimised. Table 5 shows their learnt costs. The next section presents our method, which has the advantage of learning the whole set of edit costs at once.

The Learning Method
We present an iterative algorithm in which, in each iteration, the NN strategy is applied and the current edit costs are modified such that one molecule that had been incorrectly classified becomes correctly classified. Modifying the edit costs could cause other incorrectly classified molecules to also become properly classified but, unfortunately, some molecules that were properly classified may become incorrectly classified. This is why we want to generate the minimum modification of the edit costs. To do so, the selected molecule is the one that is easiest to move from the incorrectly classified set to the correctly classified set. In the next paragraphs, our learning algorithm is explained in detail.
Let G_j be a molecule in the learning set that has been incorrectly classified using the NN strategy and the current costs C_1, …, C_n. We define D_j as the minimal GED between G_j and all the molecules of the learning set, restricted to the ones that have a different class:

D_j = min_{G_q : class(G_q) ≠ class(G_j)} GED(G_j, G_q)    (2)

Moreover, we define D̄_j as the minimal GED between G_j and all the molecules of the learning set, restricted to the ones that have the same class:

D̄_j = min_{G_p : class(G_p) = class(G_j)} GED(G_j, G_p)    (3)

Since G_j is incorrectly classified, we can confirm that D̄_j > D_j. Figure 3 schematically shows this situation: G_j and G_q belong to different classes even though the distance between them is smaller than the distance between G_j and its closest molecule of the same class, G_p.

Figure 3. Classification of molecule G_j. The true classes are in solid colours. G_j is classified in the wrong class (blue), but the correct class is the red one. The distance between G_j and G_q is lower than the distance between G_j and G_p.
The main idea of our method is to exchange D_j and D̄_j by modifying the edit costs. With this exchange, the distance between G_j and the closest molecule of its same class (G_p) becomes lower than the distance between G_j and the closest molecule of a different class (G_q); thus, G_j becomes correctly classified. However, considering that adapting these distances affects the classification of all the molecules, we select, among the incorrectly classified molecules {G_j | D̄_j > D_j}, the molecule G_i for which the difference of the distances D̄_j − D_j is minimal:

G_i = argmin_{G_j : D̄_j > D_j} (D̄_j − D_j)    (4)

Note that in Equation (4), all the values of D̄_j − D_j are positive because D̄_j > D_j by definition of G_j. Figure 4 shows this idea. What is crucial to understand is that this modification is performed on the distances, since the molecule representations themselves are not modified; the modification is carried out through the edit costs. Thus, the strategy is to define the new edit costs such that D_i becomes D̄_i and vice versa.
The rest of this section is devoted to explaining how to modify the edit costs. Considering Equation (1), the distance is composed of the edit costs C_1, …, C_n and the numbers of times each edit operation has been applied, N_1, …, N_n. Our method modifies the edit costs without altering the numbers of operations N_1, …, N_n.
Thus, we define D_i and D̄_i as follows (for clarity, we omit the normalisation factor, which is constant for each pair of graphs):

D_i = Σ_{t=1}^{n} C_t · N_t,    D̄_i = Σ_{t=1}^{n} C_t · N̄_t    (5)

where N_t and N̄_t are the numbers of times the t-th edit operation is applied on the edit paths from G_i to G_q and from G_i to G_p, respectively. Then, we exchange the distances D_i and D̄_i by modifying the edit costs with new additive terms:

D̄_i = Σ_{t=1}^{n} (C_t + α_t) · N_t,    D_i = Σ_{t=1}^{n} (C_t + ᾱ_t) · N̄_t    (6)

Note that these new terms α_1, …, α_n and ᾱ_1, …, ᾱ_n are defined such that the new value of D_i is D̄_i instead of D_i and vice versa. Moreover, the edit costs C_1, …, C_n are the same in both expressions. We proceed to explain below how to deduce the terms α_1, …, α_n and ᾱ_1, …, ᾱ_n.
From Equation (6), we obtain:

D̄_i = Σ_{t=1}^{n} C_t · N_t + Σ_{t=1}^{n} α_t · N_t,    D_i = Σ_{t=1}^{n} C_t · N̄_t + Σ_{t=1}^{n} ᾱ_t · N̄_t    (7)

We observe that the first terms in both expressions are D_i and D̄_i, respectively:

D̄_i = D_i + Σ_{t=1}^{n} α_t · N_t,    D_i = D̄_i + Σ_{t=1}^{n} ᾱ_t · N̄_t    (8)

By regrouping the terms again, we have:

Σ_{t=1}^{n} α_t · N_t = D̄_i − D_i,    Σ_{t=1}^{n} ᾱ_t · N̄_t = D_i − D̄_i    (9)

Finally, we divide by D̄_i − D_i and D_i − D̄_i in each expression to arrive at the following normalised expressions:

Σ_{t=1}^{n} (α_t · N_t) / (D̄_i − D_i) = 1,    Σ_{t=1}^{n} (ᾱ_t · N̄_t) / (D_i − D̄_i) = 1    (10)

Note that, as commented in the definition of the GED, not all of the edit operations are used to transform one molecule into another; the unused operations are those for which N_t = 0 or N̄_t = 0. Because of this, in Equation (10), some addends are null. We use m and m̄ to denote the numbers of edit operations that have actually been used, that is, the ones for which N_t ≠ 0 or N̄_t ≠ 0, respectively.
We want to deduce α_1, α_2, … and ᾱ_1, ᾱ_2, … such that Equation (10) is fulfilled. The easiest way is to impose that each non-null term in these expressions equals 1/m or 1/m̄, respectively. Then, we achieve the following expressions:

(α_t · N_t) / (D̄_i − D_i) = 1/m  for N_t ≠ 0,    (ᾱ_t · N̄_t) / (D_i − D̄_i) = 1/m̄  for N̄_t ≠ 0    (11)

From the previous expressions, we arrive at the definitions of α_t that allow the modification from D_i to D̄_i and of ᾱ_t that allow the modification from D̄_i to D_i:

α_t = (D̄_i − D_i) / (m · N_t),    ᾱ_t = (D_i − D̄_i) / (m̄ · N̄_t)    (12)
Note that, considering Equations (5), (6) and (12), we have, on one hand, that the new costs are C′_t = C_t + α_t and, on the other hand, that C′_t = C_t + ᾱ_t. Since, in general, α_t ≠ ᾱ_t, we take the average when both weights are computed:

C′_t = C_t + (α_t + ᾱ_t) / 2    (13)

In the next subsection, we present our algorithm.
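Equations (12) and (13) translate directly into the following update step (a sketch; variable names are ours). A useful sanity check: when the two edit paths use disjoint sets of operations, the averaged update moves both distances to the midpoint (D_i + D̄_i)/2, so the two distances become equal.

```python
def update_costs(costs, N_diff, N_same, D_diff, D_same):
    """One cost update of the learning method (Equations (12)-(13)).
    N_diff, N_same: operation counts on the edit paths from G_i to its
    nearest different-class and same-class molecules; D_diff, D_same:
    the corresponding distances (D_same > D_diff, since G_i is
    misclassified)."""
    m = sum(1 for n in N_diff if n != 0)       # operations used on the diff-class path
    m_bar = sum(1 for n in N_same if n != 0)   # operations used on the same-class path
    new_costs = []
    for c, n, n_bar in zip(costs, N_diff, N_same):
        alpha = (D_same - D_diff) / (m * n) if n else 0.0
        alpha_bar = (D_diff - D_same) / (m_bar * n_bar) if n_bar else 0.0
        new_costs.append(c + (alpha + alpha_bar) / 2)
    return new_costs
```

For example, with costs (1, 1), counts N = (2, 0) and N̄ = (0, 3), and distances D_i = 2 and D̄_i = 3, the update yields costs (1.25, 5/6), and both recomputed distances become 2.5.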

Algorithm
Algorithm 1 consists of an iterative process in which, in each iteration, the edit costs are updated to correct the classification of one selected molecule. The updated costs are used in the next iteration to classify all the molecules again, select a molecule and modify the costs again.

Algorithm 1: Learning the edit costs.
1. Classify all the molecules of the learning set with the NN strategy and the current costs C_1, …, C_n.
2. If all the molecules are correctly classified or the maximum number of iterations is reached, stop.
3. Compute D_j and D̄_j (Equations (2) and (3)) for all G_j incorrectly classified.
4. Select the molecule G_i that minimises D̄_j − D_j (Equation (4)).
5. Compute α_t and ᾱ_t (Equation (12)) and update the costs (Equation (13)).
6. Go to step 1.
End Algorithm
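A minimal end-to-end sketch of the iterative loop follows (our own Python rendering; the published implementation is in Matlab). For simplicity it assumes the operation counts N between each pair of molecules stay fixed while the costs change, and that every class contains at least two molecules:

```python
def ged(costs, N):
    # Unnormalised edit cost for a fixed operation-count vector N.
    return sum(c * n for c, n in zip(costs, N))

def learn_costs(counts, labels, costs, max_iter=50):
    """counts[i][j]: operation-count vector between molecules i and j
    (assumed fixed across iterations); labels[i]: class of molecule i."""
    costs = list(costs)
    for _ in range(max_iter):
        best = None  # (gap, N_diff, N_same, D_diff, D_same)
        for i in range(len(labels)):
            d_diff, q = min((ged(costs, counts[i][j]), j)
                            for j in range(len(labels))
                            if j != i and labels[j] != labels[i])
            d_same, p = min((ged(costs, counts[i][j]), j)
                            for j in range(len(labels))
                            if j != i and labels[j] == labels[i])
            if d_same > d_diff:  # molecule i is misclassified by NN
                gap = d_same - d_diff
                if best is None or gap < best[0]:
                    best = (gap, counts[i][q], counts[i][p], d_diff, d_same)
        if best is None:  # every molecule is correctly classified
            break
        _, N_diff, N_same, D_diff, D_same = best
        m = sum(1 for n in N_diff if n)
        m_bar = sum(1 for n in N_same if n)
        for t in range(len(costs)):
            a = (D_same - D_diff) / (m * N_diff[t]) if N_diff[t] else 0.0
            a_bar = (D_diff - D_same) / (m_bar * N_same[t]) if N_same[t] else 0.0
            costs[t] += (a + a_bar) / 2
    return costs
```

Each iteration selects the misclassified molecule with the smallest gap D̄_j − D_j, so the cost modification is as small as possible, exactly as argued in the method description.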
This algorithm has been coded in Matlab, and it is available at https://deim.urv.cat/francesc.serratosa/SW/, accessed on 12 November 2021.

Table 6 shows the classification ratios obtained in each dataset using different edit cost configurations, algorithms and initialisations. The first row corresponds to the accuracies obtained with the costs proposed by Harper [33]; the second row corresponds to the accuracies obtained by setting all the costs to 1 (no learning algorithm). The next four rows correspond to the accuracies obtained using the costs deduced by García et al. [35] in their four experiments (C1, C2, C3 and C4). Finally, the last two rows present the accuracies obtained by our method: the first by initialising the algorithm with the Harper costs and the second by initialising all the costs to 1. Note that the costs used are the means of the learned costs over all the databases, and that our algorithm performed 50 iterations. In all the datasets except MUV and ULS-UDS, our costs with Harper initialisation obtained the highest classification ratios; in these two datasets, the best accuracy is obtained by the Harper costs. Note that our method initialised with all-ones costs returns lower accuracies than our method initialised with the Harper costs, except for the ULS-UDS dataset. This behaviour suggests that the initialisation point is very important in this type of algorithm. Another highlight is that we achieved better accuracies than the four experiments presented by García et al. [35] in all the tests. In the ULS-UDS dataset, our method returns an accuracy close to that of the Harper costs; however, that is not the case for the MUV dataset. To analyse this behaviour in depth, we computed the accuracy using the costs learned from only the MUV targets. In this case, the accuracy is 64.9%, which is significantly lower than using the mean costs.
This is not the normal behaviour of learning algorithms, since target-specific learning usually increases the classification ratio. We see two possible reasons for this abnormal behaviour: one is the small size of this dataset; the other is that the separability between ligands and decoys in MUV is low, which prevents our algorithm from converging to a proper solution.

Results
In Figure 5, we present the classification ratio obtained in the 127 targets of the six datasets. At a glance, we see that our method achieves the highest accuracy in most of the targets of the CAPST, DUD-E, GLL&GDD and NRLiSt-BDB databases. Specifically, we point out targets 19 to 31 of the GLL&GDD dataset, where the other cost combinations have very low accuracies while our method achieves much higher results. We observe that the targets of the MUV and ULS-UDS datasets, in which our method does not return the highest accuracies, show high variability: the same costs produce very different results.
Note that in [35], the authors computed a set of costs for each of the six datasets and each target. Conversely, we learn the edit costs given the six datasets at once. In general, using several datasets at once makes the learnt parameters less specific for the application at hand, and thus the classification ratios tend to decrease. In spite of this possible disadvantage, our method returns better classification ratios than the one in [35] in all the datasets. Figure 6 shows the percentage of times each cost configuration obtains the highest classification ratio over all 127 targets, given the four configurations proposed in [35], the configuration proposed in [33] and our deduced configuration. Our method obtains the best classification ratio the highest number of times. Tables 7 and 8 show our learned edit costs for nodes and edges, respectively; the ones that differ from those proposed by Harper et al. [33] are marked in red and bold. As we can see, the results are very similar to the Harper costs because we introduce only a very small modification in each step. In addition, many costs have not been modified at all, because they were not involved in correcting the classification of improperly classified molecules, i.e., in minimising D̄_j − D_j.

Discussion
We have presented an iterative algorithm such that, in each iteration, the current costs are modified to properly classify one improperly classified molecule. While updating the costs, other improperly classified molecules could also become properly classified and vice versa. This is why we cannot guarantee the algorithm's convergence. To reduce the impact of non-convergence and possible oscillation of the solution, the algorithm selects the molecule that requires the minimum modification of the costs, with the aim of moving gradually towards the best solution.
The algorithm requires some initial costs. We have initialised the algorithm with arbitrary costs and with the costs proposed by Harper [33]. In all the tests, the highest accuracies appeared when initialising the costs with the Harper proposal. We believe this behaviour appears because the optimisation function of the learning algorithm is highly non-convex; generally, in these situations, the selected initialisation has a high impact on the solution. Finally, we have seen that the classification accuracy is highly dependent on the edit costs. That is, a slight modification of one of the costs could make the classification completely different. Considering that the computational cost of this learning problem is extremely high, sub-optimal algorithms, such as the one presented, are needed to achieve an acceptable classification accuracy. Thus, any proposal that achieves better classification ratios would have to be considered and analysed.

Conclusions and Future Research
In some ligand-based virtual screening (LBVS) methods, molecules are represented by extended reduced graphs. In this case, the Graph Edit Distance can be used to compute the dissimilarity between molecules.
In this article, we have presented a new method that automatically learns the edit costs in the Graph Edit Distance. In each iteration, our method introduces slight modifications in the current costs to correct the classification of a selected molecule that had been incorrectly classified in the previous step.
The obtained costs have been tested in six publicly available datasets and have been compared to previous works published in [34,35]. We achieve better classification ratios than [35] in the six datasets and better classification ratios than [34] in four of them.
In the experimental section, we realised that small modifications in the costs could produce considerable improvement in the classification ratio.
In future work, we will analyse which types of molecules cause the algorithm to converge and which are the best initial values to obtain higher classification accuracy.
Author Contributions: Conceptualization, methodology and investigation: E.R., S.Á. and F.S.; Supervision: S.Á. and F.S. All authors have read and agreed to the published version of the manuscript.