Comparing a Query Compound with Drug Target Classes Using 3D-Chemical Similarity

Sang-Hyeok Lee; Sangjin Ahn; Mi-hyun Kim

doi:10.3390/ijms21124208

Abstract

3D similarity is useful in predicting the profiles of unprecedented molecular frameworks that are 2D dissimilar to known compounds. When comparing pairs of compounds, 3D similarity of the pairs depends on conformational sampling, the alignment method, the chosen descriptors, and the similarity coefficients. In addition to these four factors, 3D chemocentric target prediction of an unknown compound requires compound–target associations, which replace compound-to-compound comparisons with compound-to-target comparisons. In this study, quantitative comparison of query compounds to target classes (one-to-group) was achieved via two types of 3D similarity distributions for the respective target class with parameter optimization for the fitting models: (1) maximum likelihood (ML) estimation of queries, and (2) the Gaussian mixture model (GMM) of target classes. While Jaccard–Tanimoto similarity of query-to-ligand pairs with 3D structures (sampled multi-conformers) can be transformed into query distribution using ML estimation, the ligand pair similarity within each target class can be transformed into a representative distribution of a target class through GMM, which is hyperparameterized via the expectation–maximization (EM) algorithm. To quantify the discriminativeness of a query ligand against target classes, the Kullback–Leibler (K–L) divergence of each query was calculated and compared between targets. 3D similarity-based K–L divergence together with the probability and the feasibility index, (F_m), showed discriminative power with regard to some query–class associations. The K–L divergence of 3D similarity distributions can be an additional method for (1) the rank of the 3D similarity score or (2) the p-value of one 3D similarity distribution to predict the target of unprecedented drug scaffolds.

Keywords:

Kullback–Leibler (K–L) divergence; chemocentric similarity; Jaccard–Tanimoto coefficient; Gaussian mixture model (GMM); expectation-maximization (EM) algorithm; maximum likelihood (ML) estimation; machine learning

1. Introduction

An unprecedented molecular framework such as that in Figure 1a can be investigated in drug space. In the early stages of drug discovery, three-dimensional (3D) similarity between chemicals has been used to find desirable ligands of a chosen therapeutic target in virtual screening (VS; Figure 1b) [1,2]. To our knowledge, chemical similarity is a coarse predictor for filtering out less promising chemicals rather than selecting the most desirable compound. Chemical similarity has also contributed to target screening (in other words, retro-VS) under the chemocentric assumption in Figure 1c. The chemocentric assumption states that, because two similar molecules are likely to possess similar properties, they can share biological targets or show similar pharmacological profiles [3,4]. Remarkably, Jain’s group conducted on-target and off-target prediction through the comparison of two-dimensional (2D) and 3D chemical similarity [5]. In this comparison, while combined 2D and 3D similarity-based predictions were superior to either 2D or 3D predictions alone, 3D predictions did not show dramatic improvement over 2D predictions. In addition, because the number of data points grows with the conformer sampling size, the computing cost of 3D features increases more rapidly than that of 2D features. However, despite being less cost-effective, 3D similarity is the best feature for in silico target screening of unprecedented drug scaffolds and new drug-like molecular frameworks [6] because (1) novel, unprecedented drug scaffolds have very low 2D similarity to known bioactive molecules [7,8,9], (2) novel pharmacological profiles of drugs are more frequently found using 3D similar off-target predictions [5], and (3) realistic drug properties can be generated from their factual and flexible 3D structures [10,11,12].

Figure 1. The problem definition of 3D chemo-centric screening. (a) BNDS-A as a new molecular framework. (b) The role of chemical similarity in virtual screening. (c) The role of chemical similarity in chemo-centric retro-virtual screening. (d) The workflow of this work of an unprecedented drug scaffold.

The internalization of Michelangelo Buonarroti’s quote, “Every block of stone (chemical) has a statue (utility) inside it, and it is the task of the sculptor (chemist) to discover it”, inspired this research on the ‘chemistry-oriented synthesis’ of an unprecedented drug scaffold [7,8,9] and the chemocentric target profiling of this scaffold [7]. For this purpose, we have intensively studied the 3D similarity of unprecedented drug scaffolds (the query compounds) with known molecular frameworks (the reference compounds). When comparing query and reference compound pairs, the 3D similarity of the pairs depends on (1) conformational sampling of the compounds, (2) the alignment method, (3) the chosen descriptors, and (4) the similarity coefficients (e.g., Jaccard–Tanimoto). In addition to these four factors of 3D VS, retro-VS of unprecedented drug scaffolds (query compounds) requires compound–target associations (target class information), as shown in Figure 1. These associations are the source of the substantial difference between VS and retro-VS as data science problems, specifically, (1) one-to-one comparison for VS, as shown in Figure 1b; (2) one-to-group (class) comparison for retro-VS, as shown in Figure 1c; and (3) group-to-group comparison for typical parametric statistics such as ANOVA and the t-test. When we calculated the similarity of compound pairs in retro-VS, the aim was ultimately to identify the primary target of the query through the calculated chemical similarity rather than to find the compound most similar to the query structure. To achieve this, the one-to-group comparison must be quantified. To our knowledge, such measurements have not been properly reported in cheminformatics. Notably, 2D similarity distributions with target annotation have been reported using statistical fitting models by Shoichet’s group [3], Bajorath’s group [13], and Nasr’s group [14].
However, even though the number of studies using 3D similarity is enormous, as covered in the review articles by Zhang et al. [15] and Shin et al. [16], 3D similarity distributions are rarely mentioned in the literature. Other than distributions, network analysis (edge: similarity, node: chemical) such as that by Torres et al. [17] and machine learning-based classifiers have also been used [11,18]. Most classifiers use not only chemical similarity but also other descriptors [18]. Although several studies have treated 3D similarity distributions, such as those by Jain’s group [5], Medina-Franco’s group [19], and Pérez-Nueno’s group [20], the distribution comprised every compound instead of compounds grouped by target [5,19]. In addition, it was either visualized without a fitting model [19] or its statistical model was chosen without parameter optimization [5]. As an exception, Pérez-Nueno’s group reported a Gaussian distribution of 3D similarity; however, the study assumed a Gaussian distribution with only one centroid, and the fitting parameters were not optimized, despite the small number of ligands [20].

In this study, we quantitatively compared a query compound with a target class (one-to-group) using two types of similarity distributions, namely, maximum likelihood (ML) estimation of queries and a Gaussian mixture model (GMM) of target classes (Figure 1d). As the raw data of this study, the Jaccard–Tanimoto similarity coefficients were calculated for (1) query-to-ligand pairs (e.g., the second row on the left of Figure 1d) and (2) ligand pairs within each target class (e.g., the first row on the left of Figure 1d). The query-to-ligand similarity was transformed into a query distribution via ML estimation, and the ligand pair similarity was transformed into a representative distribution of a target class using GMM. The difference between the two distributions was quantified by the Kullback–Leibler (K–L) divergence, which represented the quantitative comparison between a query and a target class. To evaluate whether the K–L divergence accurately achieved one-to-group comparison, a query chosen from a group of known ligands for a target was tested to observe discrimination between the original target and other targets. Subsequently, the target profiles of an unprecedented drug scaffold were explained by K–L divergence.

2. Theoretical Background

Kullback–Leibler divergence: K–L divergence measures the difference between two statistical or probabilistic distributions. In particular, K–L divergence is employed in various machine learning and deep learning algorithms for statistical inference [21,22]. Since K–L divergence implies relative entropy, which is an important concept in understanding statistical phenomena, it applies to statistical physics, chemistry, and social science.

Let us define two probability spaces, $(Ω, F, P)$ and $(Ω, F, Q)$, where $Ω$ is the sample space, $F$ is a $σ$-algebra, and $P$ and $Q$ are probability distributions. Then, to define the Kullback–Leibler divergence, a unique measurable function, $\frac{dQ}{dP} : Ω \to ℝ^{+}$, known as the Radon–Nikodym derivative, is devised so that

$Q(E) = \int_{E} \frac{dQ}{dP} \, dP$

(1)

for any measurable set $E \in F$ [22]. Using the measurable function $\frac{dQ}{dP}$, the Kullback–Leibler divergence, $D(P ∥ Q)$, is defined as either

$D(P ∥ Q) := \int_{Ω} - \ln \left( \frac{dQ}{dP} \right) dP$

(2)

or

$D(P ∥ Q) := \int_{-\infty}^{\infty} \ln \left( \frac{p(x)}{q(x)} \right) p(x) \, dx,$

(3)

where the probability density functions p(x) and q(x) are defined by

$P(x) := \int_{-\infty}^{x} p(s) \, ds$ and $Q(x) := \int_{-\infty}^{x} q(s) \, ds$

(4)

The Kullback–Leibler divergence represents the information for comparing P(x) and Q(x) distributions [23]. Hence, the implication of Kullback–Leibler divergence depends on the definitions of P(x) and Q(x). For example,

Model Inference: If P(x) represents the testing distribution based on the model, and Q(x) represents the distribution from the raw data, the difference is the error between the model and reality [24];
Informatics: If P(x) and Q(x) represent information extracted from two objectives, the divergence is a measurement for the discrimination between two objectives [13,25];
Bayesian Statistics: If P(x) represents a prior distribution and Q(x) represents a posterior distribution, the divergence represents the information gained through updating [26,27].

Next, let us consider a special example. Assume that the probability distributions P(x) and Q(x) are replaced by the Gaussian distributions $G(x; m_{i}, σ_{i})$ and $G(x; m_{j}, σ_{j})$, where

$G(x; m_{i}, σ_{i}) := \int_{-\infty}^{x} g(s; m_{i}, σ_{i}) \, ds$ and $G(x; m_{j}, σ_{j}) := \int_{-\infty}^{x} g(s; m_{j}, σ_{j}) \, ds$

(5)

Using Equations (3) and (5), the Kullback–Leibler divergence between the two Gaussian distributions $G(x; m_{i}, σ_{i})$ and $G(x; m_{j}, σ_{j})$ in Equation (5) is as follows:

$D(G(x; m_{i}, σ_{i}) ∥ G(x; m_{j}, σ_{j})) = \ln \left( \frac{σ_{j}}{σ_{i}} \right) + \frac{σ_{i}^{2} + (m_{i} - m_{j})^{2}}{2 σ_{j}^{2}} - \frac{1}{2}$

(6)

This Kullback–Leibler divergence between univariate normal distributions (Equation (6)) can also be extended to multivariate distributions [28].
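As a quick numerical check of Equation (6), the following minimal Python sketch evaluates the closed form and compares it with a trapezoidal approximation of Equation (3). The function names, the integration window of m_i ± 10σ_i, and the parameters of the second Gaussian (0.5, 0.1) are illustrative assumptions and are not taken from this work; the first pair (0.24055, 0.07472) is borrowed from Equation (18) only to give a realistic magnitude.

```python
import math

def log_gaussian_pdf(x, m, s):
    """Natural log of the Gaussian density g(x; m, s)."""
    return -0.5 * ((x - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

def kl_gaussian_closed_form(m_i, s_i, m_j, s_j):
    """Equation (6): D(G_i || G_j) for two univariate Gaussians."""
    return (math.log(s_j / s_i)
            + (s_i ** 2 + (m_i - m_j) ** 2) / (2.0 * s_j ** 2)
            - 0.5)

def kl_gaussian_numeric(m_i, s_i, m_j, s_j, n_points=20001):
    """Trapezoidal approximation of Equation (3) on m_i +/- 10 s_i (assumed window)."""
    lo, hi = m_i - 10.0 * s_i, m_i + 10.0 * s_i
    h = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        x = lo + k * h
        log_p = log_gaussian_pdf(x, m_i, s_i)
        log_q = log_gaussian_pdf(x, m_j, s_j)
        weight = 0.5 if k in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * math.exp(log_p) * (log_p - log_q) * h
    return total

# (0.5, 0.1) is a hypothetical class distribution chosen for illustration only.
closed = kl_gaussian_closed_form(0.24055, 0.07472, 0.5, 0.1)
numeric = kl_gaussian_numeric(0.24055, 0.07472, 0.5, 0.1)
print(closed, numeric)
```

Working with log densities avoids taking the logarithm of an underflowed value in the tails, which is the main reason the sketch does not evaluate p(x)/q(x) directly.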

Gaussian mixture model: Mixture models are methods for analyzing compositional data. With $Φ$ representing a probability density generated from the unknown compositional data, $p$ representing a well-known probability density, and x representing a random vector, the functional operator, $Ξ(Φ(x) | p, ω, λ, K)$, is defined as

$Ξ(Φ(x) | p, ω, λ, K) := \sum_{k=1}^{K} ω_{k} \, p(x : λ_{k})$

(7)

where, for k = 1, 2, …, K, $ω_{k}$ and $λ_{k}$ are the weights and the vectors of the hyperparameters, respectively, and $p(x : λ_{k})$ is the $k$th component, which is independently and identically distributed (iid) [29]. In this work, GMM was adopted to obtain a representative distribution [30]. Notably, GMM can describe non-Gaussian distributions as well as Gaussian distributions [31]. The probability density $p(x : λ_{k})$ represents the Gaussian density function $g(x; m_{k}, σ_{k})$ in Equation (5). In the Gaussian mixture model, estimations of the weight ($ω_{k}$), the mean ($m_{k}$), and the standard deviation ($σ_{k}$) are essential. Herein, two methods (i.e., the EM algorithm [32] and ML estimation [33]) were chosen to estimate the hyperparameters from sparse and incomplete data. The EM algorithm for GMM consists of an initial guess for the GMM parameters followed by iterations of calculation (E-step) and parameter determination (M-step). The iterative steps continue until the change in the set of hyperparameters, $θ$, is less than a small positive number, $ϵ$, as shown in the mathematical elucidation (Supplementary Materials Equations (S1.6)–(S1.12)) [34]. For convenience, when applying the ML estimation, $Φ(x)$ is transformed into the mixture model and $Ξ(Φ(x) | p, ω, λ, K)$ is replaced by $Ξ_{EM}(Φ(x) | p, ω, λ, K)$.
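The E-step/M-step loop described above can be sketched for a univariate mixture as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `em_gmm_1d`, the quantile-based initial guess, the stopping rule (largest change in any parameter below `eps`), and the synthetic two-component data are all hypothetical choices.

```python
import math
import random

def gaussian_pdf(x, m, s):
    """Gaussian density g(x; m, s)."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, k, eps=1e-6, max_iter=500):
    """Fit a K-component univariate GMM with a plain EM loop."""
    n = len(data)
    ordered = sorted(data)
    mean_all = sum(data) / n
    std_all = math.sqrt(sum((x - mean_all) ** 2 for x in data) / n)
    # Initial guess (assumption): equal weights, means at evenly spaced
    # quantiles of the data, and one shared standard deviation.
    weights = [1.0 / k] * k
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    stds = [std_all] * k
    for _ in range(max_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, s)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and standard deviation.
        new_w, new_m, new_s = [], [], []
        for j in range(k):
            n_j = max(sum(r[j] for r in resp), 1e-12)
            m_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var_j = sum(r[j] * (x - m_j) ** 2 for r, x in zip(resp, data)) / n_j
            new_w.append(n_j / n)
            new_m.append(m_j)
            new_s.append(math.sqrt(max(var_j, 1e-12)))
        change = max(abs(a - b) for a, b in
                     zip(weights + means + stds, new_w + new_m + new_s))
        weights, means, stds = new_w, new_m, new_s
        if change < eps:  # stop once the parameters no longer move
            break
    return weights, means, stds

# Synthetic data (hypothetical): 600 points near 0.3 and 400 points near 0.7.
rng = random.Random(0)
sample = ([rng.gauss(0.3, 0.05) for _ in range(600)]
          + [rng.gauss(0.7, 0.05) for _ in range(400)])
w, m, s = em_gmm_1d(sample, k=2)
print(w, m, s)
```

The quantile-based start keeps the sketch deterministic; a random start is equally valid but may need several restarts on less well-separated data.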

3. Results and Discussion

In this study, a quantitative method was developed to describe discriminative information for target prediction of a query compound using only chemical similarity and known compound–target association information. For this purpose, 3D similarity distributions were acquired from a 3D similarity matrix populated with Jaccard–Tanimoto coefficients [35] for (1) query-to-ligand pairs and (2) ligand pairs within each target class. The Jaccard–Tanimoto coefficients were calculated from two types of features, molecular shape and pharmacophore features, using the OpenEye Toolkit. Query compounds and target classes were compared and quantified according to the following process:

Step 1. An EM algorithm-based GMM was used to obtain a representative distribution (Q-distribution) for each target class, following either a Gaussian or a non-Gaussian distribution;
Step 2. A query-to-ligand similarity distribution was fitted onto a Gaussian distribution using ML estimation;
Step 3. K–L divergence between the two distributions from Step 1 and Step 2 allowed target predictions of the query compound. Greater deviation of K–L divergence values between target classes indicated that the query compound was a more representative ligand of a class than other query compounds. In addition, the probability, $ℙ (ν (l_{m}) = i)$ , derived from the K–L divergence values and the feasibility index, $F_{m}$ , allowed for quantification of discrimination between the target classes.

Dataset: To select example target classes for this study, we focused on an unprecedented scaffold with structural novelty and its targets. From our previous studies, bis-N,N-dimethylaminophenylamino tetrahydropyran (BNDS-A), which was the most potent regulator of in vitro inflammation (IC₅₀ of nitric oxide production = 12 μM), was chosen for this quantitative method (Figure 1a). The association of BNDS-A with two targets, estrogen receptor alpha (ESR) and vitamin D receptor (VDR), was proven by a stepwise approach consisting of (1) 2D similarity search, (2) multiplication of 3D similarity coefficients of every ligand within each target, P(Tc)/C(hits), (3) self/cross-similarity, and (4) western blot analysis in our previous work [7]. However, despite their low predicted probability, cathepsin D (CTSD) and cyclooxygenase-2 (COX2) could also be regulated by BNDS-A in the same study. Neither the most similar compound to BNDS-A (one-to-one comparison) nor an ANOVA test between target pairs (group-to-group comparison) could suggest the primary target of BNDS-A. Therefore, to quantitatively compare them with BNDS-A, the four targets, ESR, VDR, COX2, and CTSD, were selected. In addition, four further targets, HIV-1 protease (HIV1), heat shock protein 90 (HSP90), transient receptor potential cation channel subfamily V4 (TRPV4), and DNA topoisomerase I (TOP1), were randomly selected from the target prediction literature [36] to evaluate our methodology. For convenience, simple numbers denote the target classes, in other words,

$\left\{\begin{array}{l} \text{Estrogen receptor alpha} \to 1, \\ \text{Vitamin D receptor} \to 2, \\ \text{Cyclooxygenase-2} \to 3, \\ \text{Cathepsin D} \to 4. \end{array}\right.$

(8)

Either m or n is called the class number, which is an integer between 1 and 4, as in Equation (8), and $C_{l}(m)$ and $C_{l}(n) \in ℝ^{N}$ represent vectors whose elements are the Tanimoto coefficients of query compounds in the mth and nth classes, respectively. $T_{M} : ℝ^{2N} \to ℝ^{N} \times ℝ^{N}$ was defined as the Tanimoto matrix operator, so that

$(T_{M}[C_{l}(m), C_{l}(n)])_{ij} := T_{c}(\langle e_{i}, C_{l}(m) \rangle, \langle e_{j}, C_{l}(n) \rangle)$

(9)

where $T_{c}(i, j)$ is a scalar operator that calculates the Tanimoto coefficient between the ith and jth queries, $e_{i}$ and $e_{j}$ are the unit vectors for the i-axis and j-axis, and $\langle \cdot, \cdot \rangle$ is the inner product.
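The structure of the matrix in Equation (9) can be illustrated with a small Python sketch. Note the assumptions: the coefficients in this work were computed from 3D shape and pharmacophore features with the OpenEye Toolkit, whereas the sketch below uses the plain set-based Jaccard–Tanimoto coefficient on hypothetical feature sets, so it shows only how the pairwise matrix is assembled, not how the 3D scores are obtained. The names `jaccard_tanimoto` and `tanimoto_matrix` are illustrative.

```python
def jaccard_tanimoto(features_a, features_b):
    """Set-based Jaccard-Tanimoto coefficient |A & B| / |A | B|."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def tanimoto_matrix(class_m, class_n, coeff=jaccard_tanimoto):
    """Entry (i, j) compares the ith compound of class m with the jth of class n."""
    return [[coeff(x, y) for y in class_n] for x in class_m]

# Hypothetical feature sets standing in for three ligands of one class.
ligands = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
self_matrix = tanimoto_matrix(ligands, ligands)  # the m = n case
for row in self_matrix:
    print(row)
```

Passing a different `coeff` callable is the intended extension point, for example a wrapper around an external 3D similarity score.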

Representative distributions Q for target classes: The representative distributions corresponding to each target class were obtained using GMM of ligand pair similarity. First, using the similarity matrix $(T_{M}[C_{l}(m), C_{l}(n)])_{ij}$ in Equation (9), where m = n, the following univariate probability densities, $Φ_{n}(x_{k})$, were defined by

$Φ_{n}(x_{k}) \, δx := ℙ(x_{k} \leq X = (T_{M}[C_{l}(m), C_{l}(n)])_{ij} \leq x_{k+1}),$

(10)

where $ℙ$ is the probability measure; x is the Tanimoto–Jaccard coefficient; $x_{0} = 0$ and the range of x is [0, 2]; and $x_{k+1} = x_{k} + δx$. Therefore, the probability densities, $Φ_{n}(x)$, satisfy the following equation:

$\sum_{k=0}^{999} Φ_{n}(x_{k}) \, δx = 1$

(11)
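Equations (10) and (11) amount to a normalized histogram over [0, 2] with 1000 bins. A minimal Python sketch follows; the function name `histogram_density`, the clamping of out-of-range values into the end bins, and the four example coefficients are assumptions made for illustration.

```python
def histogram_density(values, n_bins=1000, x_min=0.0, x_max=2.0):
    """Return (density, dx) so that sum(density) * dx == 1, as in Equation (11)."""
    dx = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = int((v - x_min) / dx)
        k = min(max(k, 0), n_bins - 1)  # clamp the end points into the outer bins
        counts[k] += 1
    n = len(values)
    density = [c / (n * dx) for c in counts]
    return density, dx

# Hypothetical similarity coefficients, not data from this study.
coefficients = [0.101, 0.501, 0.501, 1.901]
density, dx = histogram_density(coefficients)
print(sum(density) * dx)
```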

Second, to extract representative distributions from $Φ_{n}(x)$, the Gaussian mixture model was utilized, in which the probability densities, $Φ_{n}(x)$, are approximated by $Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)$, which is the weighted sum of K univariate Gaussian distributions. That is,

$Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K) = \sum_{k=1}^{K} ω_{k} \, g(x; m_{k}, σ_{k}),$

(12)

where $ω_{k}$, $m_{k}$, and $σ_{k}$ are shown in Table 1. To estimate the hyperparameters $ω_{k}$, $m_{k}$, and $σ_{k}$, the EM algorithm was used as described in Section 4. Table 1 shows the mean, standard deviation, and weight corresponding to the components of the mixture model. Figure 2 compares the probability densities, $Φ_{n}(x)$, with $Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)$, where K = 1, 3, and 7. Comparing the number of components, K, the mixture models with K = 3 and K = 7 fitted the histograms of the raw data similarly well, whereas the single Gaussian model (K = 1) fitted ESR, COX2, and CTSD insufficiently (Figure 2). In common, the means and modes of the representative distributions were located near 0.5, and every distribution was skewed to the right.

Table 1. Hyperparameters of Q distributions for target classes.

Figure 2. Representative distributions (Q-distributions) of target classes using the EM-based Gaussian mixture model, $Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)$, of ligand pair similarity. (a) Q-distribution of ESR; (b) Q-distribution of VDR; (c) Q-distribution of COX2; (d) Q-distribution of CTSD. The red line: GMM K = 1, blue line: GMM K = 3, black line: GMM K = 7, pink bar: histogram of raw data.

Gaussian distributions for queries: To quantitatively compare the representative distributions corresponding to ESR, VDR, COX2, and CTSD with the query distributions, Kullback–Leibler divergence was introduced and calculated by building each distribution for each query.

For this purpose, $T_{M}[C_{l}(m), C_{l}(n)]$ of Equation (9) was used in a similar way to the method described for the representative distributions of the target classes. When a query was the lth ligand of $C_{l}(n)$, the elements of the lth column of the above matrix were used for the lth column vector, $τ_{m}(m, n, l)$, as in

$τ_{m}(m, n, l) := T_{M}[C_{l}(m), C_{l}(n)] \, E_{l}$

(13)

where $E_{l}$, for l = 1, 2, …, N, are N × N matrices whose elements $(E_{l})_{ij}$ satisfy

$(E_{l})_{ij} := \left\{\begin{array}{l} 1, \text{ if } i = j = l \\ 0, \text{ otherwise} \end{array}\right.$

(14)

Using the vector $τ_{m}(m, n, l)$ from Equation (13), the following univariate probability densities, $Φ_{mn}^{(l)}(x_{k})$, were defined as

$Φ_{mn}^{(l)}(x_{k}) \, δx := ℙ(x_{k} \leq X = (τ_{m}(m, n, l))_{i} \leq x_{k+1})$

(15)

where the probability measure $ℙ$ was derived from Equation (10).

Before obtaining the probability distribution, two assumptions were made. First, it was assumed that the distribution from one query was not a weighted sum of Gaussian distributions, but rather a single Gaussian distribution. It is reasonable that the distribution from one query is simpler than the Q-distribution of a target class with 13,957 queries. Second, to estimate the parameters of the Gaussian distribution, ML estimation was chosen as a general method, in which

$Ξ_{ML}(Φ_{mn}^{(l)}(x_{k}) | g, ω, μ, σ, 1) = g(x; μ_{1}, σ_{1})$

(16)

where μ₁ and σ₁ are the hyperparameters that maximize the log-likelihood function of the normal distribution, in other words,

$(μ_{1}, σ_{1}) := \arg \max_{(μ, σ)} \sum_{k=1}^{100} \left[ - \ln σ - \frac{(x_{k} - μ)^{2}}{2 σ^{2}} \right]$

(17)
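For a normal distribution, the maximizer of the log-likelihood has a well-known closed form: the sample mean and the (uncorrected) sample standard deviation. The sketch below illustrates this with a handful of hypothetical similarity values; the function names and the data are assumptions for illustration and are not taken from this study.

```python
import math

def gaussian_log_likelihood(samples, mu, sigma):
    """Log-likelihood of independent samples under g(x; mu, sigma)."""
    return sum(-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
               - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in samples)

def ml_gaussian(samples):
    """Closed-form ML estimates: sample mean and uncorrected standard deviation."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / n)
    return mu, sigma

# Hypothetical query-to-ligand similarity values.
similarities = [0.18, 0.22, 0.25, 0.21, 0.30, 0.27]
mu_1, sigma_1 = ml_gaussian(similarities)
print(mu_1, sigma_1, gaussian_log_likelihood(similarities, mu_1, sigma_1))
```

Perturbing either estimate lowers the log-likelihood, which is a simple way to confirm that the closed form is indeed the maximizer.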

Using the definitions in Equations (16) and (17), each query resulted in four distributions corresponding to the four classes (i.e., ESR, VDR, COX2, and CTSD). For example, when CHEMBL539392 was chosen as a query (l) among the ligands of ESR (Class 1), the distributions $Φ_{11}^{(l)}(x_{k})$, $Φ_{12}^{(l)}(x_{k})$, $Φ_{13}^{(l)}(x_{k})$, and $Φ_{14}^{(l)}(x_{k})$ were obtained under the definitions of Equations (8) and (15). According to Equations (16) and (17), four representative Gaussian distributions of the query compound CHEMBL539392 were acquired from the column vector between CHEMBL539392 and the 13,957 ligands of each class, which were

$\left\{\begin{array}{l} Ξ_{ML}(Φ_{11}^{(l)}(x_{k}) | g, ω, μ, σ, 1) = g(x; 0.24055, 0.07472), \\ Ξ_{ML}(Φ_{12}^{(l)}(x_{k}) | g, ω, μ, σ, 1) = g(x; 0.21976, 0.06466), \\ Ξ_{ML}(Φ_{13}^{(l)}(x_{k}) | g, ω, μ, σ, 1) = g(x; 0.24389, 0.04857), \\ Ξ_{ML}(Φ_{14}^{(l)}(x_{k}) | g, ω, μ, σ, 1) = g(x; 0.21187, 0.06631), \end{array}\right.$ for $k = 0, 1, \dots, 99.$

(18)

In the same way, univariate normal distributions were obtained for all of the query compounds in each class. Since the number of classes was four and there were 13,957 query compounds in each class, the Gaussian distributions $G(x; μ_{1}, σ_{1})$, derived from $Ξ_{ML}(Φ_{mn}^{(l)}(x_{k}) | g, ω, μ, σ, 1)$, were indexed by the class numbers, m and n, which were integers between 1 and 4, and the query number, l, which was an integer from 1 to 13,957. The resulting frequency distributions of the estimates, namely the means ($μ_{1}$) and standard deviations ($σ_{1}$), are shown in Figure 3 and Supplementary Figures S5–S7. ML estimation did not show any difference between self-query (m = n) and cross-query (m ≠ n) with regard to frequency. Even though cathepsin D (CTSD) showed a slightly lower mean than the other classes, its self-comparison also showed a low mean, as shown in Figure 3. Regardless of whether a class or a query compound was used (self/cross), the 3D similarity of ligand pairs within a class showed a mode near 0.6, thereby confirming the need for quantitative comparison between queries. Notably, the univariate probability distributions of 3D similarity did not discriminate between target classes at all.

Figure 3. Frequency distributions of the $Ξ_{ML}(Φ_{4n}^{(l)}(x_{k}) | g, ω, μ, σ, 1)$ estimates ($μ_{1}$ and $σ_{1}$). Query (l) $\in$ CTSD (class = 4). (a) CTSD-ESR, (b) CTSD-VDR, (c) CTSD-COX2, and (d) CTSD-CTSD. * The color bars (right side of the distribution) indicate frequency (e.g., yellow in 3(a) represents 3500 to 4000 queries for which the mean of the ML estimates varied from 0.45 to 0.5 and the standard deviation varied from 0.08 to 0.1).

Discrimination and K–L divergence: Next, the 3D similarity distributions of target classes and query compounds were quantitatively compared through K–L divergence calculations. First, the information at a specific Tanimoto–Jaccard coefficient, x, was defined as

$\ln \left( \frac{Ξ_{ML}(Φ_{mn}^{(l)}(x) | g, ω, μ, σ, 1)}{Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)} \right)$

(19)

from the two probability density distributions, $Ξ_{ML}(Φ_{mn}^{(l)}(x) | g, ω, μ, σ, 1)$ and $Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)$, which were generated from a query compound and a class, respectively. Hence, taking the expected value of the information in Equation (19) with respect to one query compound, the K–L divergence,

$D(Ξ_{ML}(Φ_{mn}^{(l)}(x) | g, ω, μ, σ, 1) ∥ Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)) = \int Ξ_{ML}(Φ_{mn}^{(l)}(x) | g, ω, μ, σ, 1) \ln \left( \frac{Ξ_{ML}(Φ_{mn}^{(l)}(x) | g, ω, μ, σ, 1)}{Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, K)} \right) dx$

(20)

represented a measurement for the discrimination.
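When the Q-distribution has more than one component (K > 1), Equation (20) has no simple closed form and must be integrated numerically. The following Python sketch does this with the trapezoid rule. It is an illustration under assumptions: the function names, the integration window of ±10 standard deviations around the query mean, and the three-component mixture parameters are hypothetical and are not the values of Table 1; only the query parameters (0.24055, 0.07472) come from Equation (18).

```python
import math

def log_gaussian_pdf(x, m, s):
    """Natural log of the Gaussian density g(x; m, s)."""
    return -0.5 * ((x - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

def log_mixture_pdf(x, weights, means, stds):
    """Log of sum_k w_k g(x; m_k, s_k), computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_pdf(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def kl_gaussian_vs_gmm(m, s, weights, means, stds, n_points=20001):
    """Trapezoidal estimate of D(query Gaussian || class mixture), Equation (20)."""
    lo, hi = m - 10.0 * s, m + 10.0 * s  # assumed integration window
    h = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        x = lo + k * h
        log_p = log_gaussian_pdf(x, m, s)
        log_q = log_mixture_pdf(x, weights, means, stds)
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * math.exp(log_p) * (log_p - log_q) * h
    return total

# Hypothetical three-component Q-distribution (not taken from Table 1).
q_weights, q_means, q_stds = [0.5, 0.3, 0.2], [0.40, 0.55, 0.80], [0.08, 0.10, 0.15]
d_k3 = kl_gaussian_vs_gmm(0.24055, 0.07472, q_weights, q_means, q_stds)
# With K = 1 the estimate should agree with the closed form for two Gaussians.
d_k1 = kl_gaussian_vs_gmm(0.24055, 0.07472, [1.0], [0.5], [0.1])
print(d_k3, d_k1)
```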

In a one-component GMM (K = 1), the K–L divergences between the Gaussian distribution of every query and the Q-distributions (Table 1) were calculated; randomly chosen query compounds are described in Table 2. To show the calculation process in detail, CHEMBL539392 was chosen as an example, using the above equation for the Kullback–Leibler divergence between normal distributions,

$D(G(x; m_{i}, σ_{i}) ∥ G(x; m_{j}, σ_{j})) = \ln \left( \frac{σ_{j}}{σ_{i}} \right) + \frac{σ_{i}^{2} + (m_{i} - m_{j})^{2}}{2 σ_{j}^{2}} - \frac{1}{2}$

(21)

where

$\left\{\begin{array}{l} G(x; m_{i}, σ_{i}) = Ξ_{ML}(Φ_{1n}^{(l)}(x) | g, ω, μ, σ, 1) \\ G(x; m_{j}, σ_{j}) = Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, 1) \end{array}\right.$

(22)

Table 2. K–L divergence of randomly chosen queries between Q distributions and the distributions of queries.

We obtained four K–L divergences for this query against the four classes, namely 2.1493, 4.6939, 2.0810, and 1.6354, respectively (see the calculation procedure in the Supplementary Materials, Equations (S2.1)–(S2.8)). As shown in Table 2 and Supplementary Table S3, the K–L divergence of a query compound was not always the smallest for its original target, as annotated by ChEMBL DB. Even though, for a considerable number of query compounds, the K–L divergence from the original target was smaller than the values from the other target classes, CHEMBL539392 of ESR, CHEMBL1163237 of COX2, and CHEMBL263810 of CTSD diverged less from other targets than from their original targets, therefore giving false predictions (Table 2). When we counted the query compounds that discriminated between the original targets and other targets among the 13,957 query compounds of each of the four classes via GMM (K = 1), the numbers of correct predictions were 6300, 5200, 4100, and 6400 among the 13,957 queries from ESR, VDR, COX2, and CTSD, respectively. When applying GMM (K = 3) and (K = 7) for the Q-distributions, the true positive ratio decreased (K = 3: ESR, 5100; VDR, 4500; COX2, 3200; CTSD, 4900; K = 7: ESR, 4900; VDR, 4500; COX2, 3100; CTSD, 4800).

In order to further evaluate the discriminative power of K–L divergence between target classes, an additional four classes as well as the four classes for BNDS-A were compared using the shared ligands in Table 3 and Supplementary Table S2. In Table 3, ritonavir (CHEMBL163) is a clinically approved drug with HIV1 (human immunodeficiency virus type 1) protease as its primary target. Notably, ritonavir showed a distinct K–L divergence value that discriminated HIV1 from the other targets. In addition, the result can rationalize why ritonavir cannot show a distinct difference between VDR and COX2. In contrast, myricetin (CHEMBL164) showed a very disappointing result, with poor discrimination between K–L divergence values. However, when we checked every target of myricetin, the natural compound did not show specificity for any single protein target, which explains the result. The annotated activities were limited to the known targets (VDR: 31–40 μM; COX2: 100 μM; HSP90: 13.5 μM in a cell-based assay; TOP1: IC₅₀ = 11.9 μg mL⁻¹) in ChEMBL DB. Furthermore, despite the absence of HIV1 data for myricetin, this flavonoid compound with multiple hydroxyl groups showed experimental activity on a ubiquitin-specific protease that has functional similarity (peptidase domain) with HIV1, which may explain the K–L divergence value of 0.0393. Next, because reserpine (CHEMBL772), a clinically approved natural product, has target specificity for vesicular monoamine transporters with trivial activities on the annotated targets (VDR/COX2/TOP1), the annotated targets did not show a difference from the untested targets (ESR/CTSD/HIV1). In addition, even though CHEMBL1813048 was a ligand of COX2 and TRPV4, K–L divergence could not support the finding. However, the result can be explained by the experimental data: (1) Ki against TRPV4 was more than 10 μM and (2) indirect regulation of COX2 was recorded through the Prostaglandin H2 receptor in ChEMBL DB.
When compared with a 2D fingerprint-based Top5 prediction of the additional target classes [36], our method can quantify how different each query is from each target class using the raw data without any refinements such as assay, activity index, and duplicated ligands. This point is very important for investigating unprecedented drug scaffolds that have weak activity and fall outside the Top5 of a target class.

Table 3. K–L divergence of ligands shared with eight target classes *.

After the individual K–L divergence comparisons of each query, comparisons between the target classes were quantified. The K–L divergences between the Gaussian distributions of the 13,957 queries and the Q-distributions (K = 1, 3, and 7) for the four target classes are presented as cumulative distributions in Figures 4–7. To investigate the feasibility of the information, the following distribution was defined:

$ℙ(ν(l_{m}) = i)$ for $i = 1, 2, 3, 4,$

(23)

where $l_{m}$ is the query number in class m and the random variable $ν(l_{m})$ represents a class number, so that

$ν(l_{m}) := \arg \min_{n} \left\{ D(Ξ_{ML}(Φ_{mn}^{(l_{m})}(x) | g, ω, μ, σ, 1) ∥ Ξ_{EM}(Φ_{n}(x) | g, ω, μ, σ, 1)) \mid 1 \leq n \leq 4, \; 1 \leq l_{m} \leq 13{,}957 \right\}$

(24)
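Equations (23) and (24) can be read as "assign each query to the class with the smallest K–L divergence, then count how often each class is chosen". A minimal Python sketch follows. The first row uses the four divergences reported in the text for CHEMBL539392; the other rows and the function names are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter

def predicted_class(divergences):
    """Return the 1-based class number n with the smallest K-L divergence."""
    best = min(range(len(divergences)), key=lambda n: divergences[n])
    return best + 1

def class_probabilities(divergence_rows, n_classes=4):
    """Empirical P(nu = i): fraction of queries assigned to each class i."""
    counts = Counter(predicted_class(row) for row in divergence_rows)
    total = len(divergence_rows)
    return [counts.get(i, 0) / total for i in range(1, n_classes + 1)]

rows = [
    [2.1493, 4.6939, 2.0810, 1.6354],  # CHEMBL539392 (values from the text)
    [0.10, 0.80, 0.60, 0.90],          # hypothetical query
    [0.20, 0.90, 0.50, 0.70],          # hypothetical query
    [0.70, 0.30, 0.40, 0.60],          # hypothetical query
]
probs = class_probabilities(rows)
print(predicted_class(rows[0]), probs)
```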

Figure 4. The cumulative densities of K–L distance between Q-distribution (Target class: ESR) and queries. X-axis: K–L divergence, Y-axis: cumulative density; Q-distribution of ESR through GMM and the distribution of queries were calculated. (a) ESR(Query)-ESR(Class), (b) VDR(Query)-ESR(Class), (c) COX2(Query)-ESR(Class), and (d) CTSD(Query)-ESR(Class).

Figure 5. The cumulative densities of K–L distance between Q-distribution (Target class: VDR) and queries. X-axis: K–L divergence, Y-axis: cumulative density; Q-distribution of VDR through GMM and the distribution of queries were calculated. (a) ESR(Query)-VDR(Class), (b) VDR(Query)-VDR(Class), (c) COX2(Query)-VDR(Class), and (d) CTSD(Query)-VDR(Class).

Figure 6. The cumulative densities of K–L distance between Q-distribution (Target class: COX2) and queries. X-axis: K–L divergence, Y-axis: cumulative density; Q-distribution of COX2 through GMM and the distribution of queries were calculated. (a) ESR(Query)-COX2(Class), (b) VDR(Query)-COX2(Class), (c) COX2(Query)-COX2(Class), and (d) CTSD(Query)-COX2(Class).

Figure 7. The cumulative densities of K–L distance between Q-distribution (Target class: CTSD) and queries. X-axis: K–L divergence, Y-axis: cumulative density; Q-distribution of CTSD through GMM and the distribution of queries were calculated. (a) ESR(Query)-CTSD(Class), (b) VDR(Query)-CTSD(Class), (c) COX2(Query)-CTSD(Class), and (d) CTSD(Query)-CTSD(Class).

If the K–L divergence (Equation (20)) is an ideal measurement for discrimination between target classes, ($ν(l_{m}) = i$) would satisfy the following conditions:

Necessary condition:

$ℙ (ν (l_{m}) = m) \geq \underset{i \neq m}{m a x} ℙ (ν (l_{m}) = i)$

(25)
Sufficient condition: The feasibility index, $F_{m}$ , is defined as

$F_{m} : = \sqrt{\frac{ℙ (ν (l_{m}) = m)}{1 - ℙ (ν (l_{m}) = m)}} \geq 1$

(26)

The above conditions provide a quantitative measure of the discrimination. In particular, $F_{m}$ in the sufficient condition is the square root of the ratio between two probabilities: that a query compound is assigned to its own class, and that it is assigned to any other class. Hence $F_{m} \geq 1$ is equivalent to $ℙ (ν (l_{m}) = m) \geq 0.5$, and a larger value of $F_{m}$ indicates better feasibility or resolution of discrimination. Table 4 lists the probabilities $ℙ (ν (l_{m}) = i)$ for $1 \leq i, m \leq 4$, indicating that, except for m = 3 (COX2), the tested classes met the necessary condition $ℙ (ν (l_{m}) = m) \geq \underset{i \neq m}{\max} ℙ (ν (l_{m}) = i)$ in Equation (25). With respect to the feasibility index in Equation (26), a query compound in the CTSD class (m = 4) was the easiest to distinguish from the other classes (Figure 8). When the feasibility index resulting from the GMM (K = 1) was compared with the indices calculated from the GMM (K = 3) and (K = 7) for the Q-distributions, GMM (K = 1) showed superior feasibility for class discrimination, as shown in Table 4.
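
As a worked check of Equations (25) and (26), the following Python sketch recomputes the feasibility index from the K = 1 probabilities reported in Table 4. The probabilities are copied from the table; the function names are hypothetical and chosen only for this illustration.

```python
import math

# P(nu(l_m) = i) for GMM K = 1, copied from Table 4.
# Rows: class m of the queries; columns: class i in the order of CLASSES.
CLASSES = ["ESR", "VDR", "COX2", "CTSD"]
P_K1 = {
    "ESR":  [0.4623, 0.2172, 0.0082, 0.3123],
    "VDR":  [0.1116, 0.5101, 0.0054, 0.3729],
    "COX2": [0.0882, 0.3216, 0.2046, 0.3856],
    "CTSD": [0.0051, 0.0489, 0.0057, 0.9404],
}

def feasibility_index(p_self):
    """F_m = sqrt(p / (1 - p)) with p = P(nu(l_m) = m), Equation (26)."""
    return math.sqrt(p_self / (1.0 - p_self))

def necessary_condition(row, m_index):
    """Equation (25): the own-class probability is the largest in its row."""
    others = [p for i, p in enumerate(row) if i != m_index]
    return row[m_index] >= max(others)

for m_index, name in enumerate(CLASSES):
    row = P_K1[name]
    f_m = feasibility_index(row[m_index])
    print(name, round(f_m, 4), necessary_condition(row, m_index), f_m >= 1.0)
```

Because the index is a monotonic function of the own-class probability, ranking classes by $F_{m}$ is the same as ranking them by $ℙ (ν (l_{m}) = m)$.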

Table 4. The description on $ℙ (ν (l_{m}) = i)$ and $F_{m}$ according to the number of components of Gaussian Mixture Model K, and the class $ν (l_{m})$ of queries $l_{m}$ ^a.

Figure 8. Feasibility index according to target class and GMM component (K).

Representative ligands for better discriminative predictions: According to the results in Figure 4, Figure 5, Figure 6, Figure 7 and Table 4, the 3D similarity-based K–L divergence, together with $ℙ (ν (l_{m}) = m)$ and $F_{m}$, showed discriminative power for some query–class associations. The remaining question is how to use the 3D-chemocentric approach efficiently, at its current discriminative power, to investigate the novel pharmacology of an unprecedented compound. For this purpose, the K–L divergence of the unprecedented compound should be calculated and compared with those of known ligands across the target classes. In detail, representative ligands within each target class were chosen for the comparison. We selected representative queries by four criteria based on their Tanimoto–Jaccard coefficients (x) and K–L divergence values: (1) x nearest to the mean of the Q distribution (GMM, K = 1), (2) x nearest to an outlier of the Q distribution (mean ± 2SD), (3) the biggest gap of K–L divergence between two target classes, and (4) the highest similarity with the unprecedented compound (Table 5). As an example, BNDS-A, a recently reported in-house compound [7], was used as the unprecedented compound because it is absent from the ChEMBL database. The first query compound, close to the mean of the Q distribution, showed a smaller K–L divergence than the other compounds (Table 5). The initially assumed target class of BNDS-A (in other words, the chosen Q distribution) had a critical effect on the K–L divergence of BNDS-A as a query compound for predicting the target class. When ESR was assumed as the initial target of BNDS-A, BNDS-A was more ESR ligand-like than CHEMBL558943 (at mean − 2SD of the ESR Q distribution) and CHEMBL604989 (which exhibited the biggest K–L divergence gap), and was less ESR ligand-like than CHEMBL499809 (at the mean of the ESR Q distribution) and CHEMBL2 (at mean + 2SD). Under the ESR Q assumption, BNDS-A showed the lowest K–L divergence with the VDR ligands (0.0588 for VDR < 0.2116 for ESR), suggesting that BNDS-A was more VDR ligand-like than ESR ligand-like.
When the assumed target was changed to VDR or COX2, BNDS-A showed the lowest K–L divergence for the assumed class (chosen Q). In the BNDS-A rows of Table 5, the order of K–L divergence of BNDS-A (VDR < ESR < CTSD) was retained under the assumed ESR, VDR, and COX2 target classes, whereas COX2 showed the lowest K–L divergence only under the COX2 Q distribution and therefore did not give a consistent prediction. Therefore, BNDS-A was more VDR ligand-like than COX2 ligand-like. Experimentally, BNDS-A regulated the expression level of targets in a concentration-dependent manner (VDR > CTSD >> ESR) [7]. Notably, the K–L divergence of 3D similarity distributions can complement known methods for predicting the target of a novel compound, such as (1) the rank of the 3D similarity score [7,15,16] or (2) the p-value of one 3D similarity distribution [20]. Whenever the aim is to establish the relevance between a novel query and a target class, K–L divergence can be used for 3D-chemocentric informatics, as seen in the example of BNDS-A.
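
The comparison of BNDS-A across assumed target classes can be reproduced with a short Python sketch. The K–L divergence values are copied from the BNDS-A rows of Table 5; the dictionary layout and the helper name `rank_targets` are assumptions made for this illustration.

```python
# K-L divergence of BNDS-A from Table 5, keyed by the assumed class
# (chosen Q distribution); columns follow the order of TARGETS.
TARGETS = ["ESR", "VDR", "COX2", "CTSD"]
BNDS_A = {
    "ESR":  [0.2116, 0.0588, 0.1139, 0.9704],
    "VDR":  [0.2859, 0.0864, 0.1888, 1.0807],
    "COX2": [0.6987, 0.65378, 0.2273, 2.0276],
    "CTSD": [0.0556, 0.26421, 0.2092, 0.087],
}

def rank_targets(values, targets=TARGETS):
    """Return target names sorted by ascending K-L divergence."""
    return [t for _, t in sorted(zip(values, targets))]

for assumed, values in BNDS_A.items():
    print(assumed, rank_targets(values))
```

Sorting each row in ascending order puts the most ligand-like target class first, which makes it easy to see where the ranking depends on the assumed class.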

Table 5. The comparison between representative queries and unprecedented drug BNDS-A as a query.

4. Materials and Methods

Data collection: All data, except for the in-house compound (BNDS-A), were extracted from the ChEMBL database [37]: (1) ESR, VDR, COX2, and CTSD from version 23 through the KNIME community node, and (2) HIV1, HSP90, TRPV4, and TOP1 from version 25 through MySQL. Version 23 was available both in the ChEMBL community node of KNIME and as an in-house MySQL database built from the dump file on the ChEMBL ftp (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/). The HIV1 protease, HSP90, TRPV4, and TOP1 data were chosen based on the literature [36] and downloaded from ChEMBL version 25.

Conformational sampling: Extracted compounds were converted from 2D structures into 3D conformations using Omega of the OpenEye software [38] under the following conditions: (1) the MMFF94 force field excluding Coulomb interactions and the attractive part of van der Waals interactions (option: mmff94s_Trunc), which retains bond stretching, angle bending, stretch-bend interaction, out-of-plane bending at tricoordinate centers, torsion interaction, and the repulsive part of van der Waals interactions; (2) an energy window of 15 kcal/mol; (3) hydrogen deletion from the input file fragment prior to the substructure search (option: deleteFixHydrogens); (4) permission to generate stereoisomers; and (5) a maximum acceptable number of rotatable bonds of 25 [39]. Because of the computational burden and the storage needed to write similarities into a matrix in subsequent steps, the 3D structures of all compounds were merged into structure files (file extension: sdf) according to target class, and 13,957 3D structures (with duplication due to different conformations) were chosen from these files via stratified sampling in KNIME to produce the dataset for the similarity matrices, as shown in Supplementary Table S1.

Alignment method: The 3D structures of compound pairs were aligned using their centers of mass [40]. The SIMPLEX algorithm for the alignment is reported to be implemented in ROCS [15]. The Shape Toolkit in the OpenEye software [40,41] provides ‘OEBOOrientation’, which is used in OEBestOverlay. To optimize the alignment of each pair of 3D structures, a starting position must be chosen before finding the centers of mass of the two conformers; by default, OEBestOverlay uses an inertial frame alignment method to decide on the starting positions. Under the default condition (‘OEBOOrientation_Inertial’), the first 3D structure (refmol in the python code in the Supplementary Materials) was aligned by its principal moments of inertia, and then the second structure (fitmol in the python code in the Supplementary Materials) was aligned in four positions with the primary and secondary moments of inertia in both possible directions. Therefore, the alignment of a compound pair (A, B) is approximately the same as, but not exactly identical to, the alignment of (B, A).

3D Descriptors: To describe molecular shape, the atom-centered Gaussian sphere model implemented in OE-MPI/ROCS and the Shape Toolkit was used [40,41]. OE-MPI, an MPI (message passing interface) implementation provided by OpenEye, was used for parallel calculation on a large number of CPUs. The Gaussian sphere model describes the 3D shape of a compound as the sum of the Gaussian functions of its individual heavy atoms (hydrogens are excluded). In the following equations, f and g are the characteristic functions representing the 3D atomic structures of two compounds; I_f and I_g are the self-volume overlaps of each molecule, which are independent of orientation; and O_{f,g} is the overlap between the two functions, which depends on the relative orientation of the two molecules.

$\mathrm{Shape} (f, g) = \sqrt{\int {[f (x, y, z) - g (x, y, z)]}^{2} d V}$

(27)

${\mathrm{Shape} (f, g)}^{2} = \int {[f (x, y, z)]}^{2} d V + \int {[g (x, y, z)]}^{2} d V - 2 \int f (x, y, z) g (x, y, z) d V$

${\mathrm{Shape} (f, g)}^{2} = I_{f} + I_{g} - 2 O_{f, g}$

Jaccard–Tanimoto coefficient of $\mathrm{Shape} (f, g) = \frac{O_{f, g}}{I_{f} + I_{g} - O_{f, g}}$
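
To make the relation between the overlap integrals and the Jaccard–Tanimoto coefficient concrete, the following Python sketch evaluates them analytically for molecules modeled as sums of spherical Gaussians. It is not the OpenEye implementation: the Gaussian width `ALPHA`, the unit prefactors, the toy coordinates, and all function names are assumptions for illustration, and no alignment optimization is performed (the molecules are compared in the orientation given).

```python
import math

ALPHA = 0.8  # hypothetical Gaussian width shared by all atoms (1/Angstrom^2)

def gaussian_overlap(center_a, center_b, alpha_a, alpha_b):
    """Analytic overlap integral of two unit-prefactor spherical Gaussians
    exp(-alpha * |r - R|^2) centred at center_a and center_b."""
    d2 = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    s = alpha_a + alpha_b
    return math.exp(-alpha_a * alpha_b * d2 / s) * (math.pi / s) ** 1.5

def overlap_volume(mol_f, mol_g):
    """Integral of f * g when f and g are sums of atomic Gaussians."""
    return sum(gaussian_overlap(cf, cg, af, ag)
               for cf, af in mol_f for cg, ag in mol_g)

def shape_distance_squared(mol_f, mol_g):
    """Shape(f, g)^2 = I_f + I_g - 2 * O_fg."""
    return (overlap_volume(mol_f, mol_f) + overlap_volume(mol_g, mol_g)
            - 2.0 * overlap_volume(mol_f, mol_g))

def shape_tanimoto(mol_f, mol_g):
    """Jaccard-Tanimoto coefficient O_fg / (I_f + I_g - O_fg)."""
    i_f = overlap_volume(mol_f, mol_f)
    i_g = overlap_volume(mol_g, mol_g)
    o_fg = overlap_volume(mol_f, mol_g)
    return o_fg / (i_f + i_g - o_fg)

# Toy heavy-atom coordinates (Angstrom); each atom is (center, alpha).
mol_a = [((0.0, 0.0, 0.0), ALPHA), ((1.5, 0.0, 0.0), ALPHA), ((0.0, 1.5, 0.0), ALPHA)]
mol_b = [((0.2, 0.0, 0.0), ALPHA), ((1.6, 0.3, 0.0), ALPHA)]
print(shape_tanimoto(mol_a, mol_a), shape_tanimoto(mol_a, mol_b))
```

Because $I_{f}$ and $I_{g}$ do not change under rotation or translation, an alignment optimizer only needs to maximize $O_{f, g}$ to maximize the coefficient.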

Color features of every query were generated under the default algorithm of the Shape Toolkit. Color features were defined by pharmacophore types (H-bond donor, H-bond acceptor, negative charge, positive charge, hydrophobic, and ring) in a color force field (Implicit Mills Dean) and color atoms were described by Gaussian functions as being relatively hard with a steep gradient.

3D Similarity matrix: The Jaccard–Tanimoto coefficients of the two features, shape and color, were calculated, combined, and written into 3D similarity matrices using the following functions in the supplementary python script [42].

-: OEOverlay(): optimization of the alignment (overlap) between query and database
-: OEBestOverlayScoreIter(): sorting all scores from the highest Tanimoto coefficient before writing the similarity scores into an empty matrix.

In this study, the 3D similarity matrices for the Q distributions (GMM) had dimensions of 13,957 by 13,957, whereas the 3D similarity matrices for the query distributions (ML estimation) had dimensions of 1 by 13,957. Every sampled compound of the four target classes (13,957 conformers × four target classes) was used as a query to show the performance of the K–L divergence. BNDS-A is the only query that does not belong to any target class.

Script for K–L divergence: To implement (1) the GMM, (2) the ML estimation, and (3) the K–L divergence, python scripts were written using python libraries such as pandas [43], numpy [44], and scipy [45] under an Anaconda installation [46], and all code was uploaded to GitHub [47].
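
The three steps named above can be sketched in plain Python as follows. This is not the code from the GitHub repository [47]: it is a minimal standard-library illustration on synthetic similarity values, in which all function names, the EM initialization, the integration range, and the random data are assumptions.

```python
import math
import random

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def ml_gaussian(xs):
    """ML estimates of one Gaussian: sample mean and population SD."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def em_gmm(xs, k, n_iter=50, min_sd=1e-3):
    """Fit a 1D K-component GMM by EM; returns [(weight, mean, sd), ...]."""
    lo, hi = min(xs), max(xs)
    sd0 = max(ml_gaussian(xs)[1], min_sd)
    # Assumed initialization: means spread evenly over the data range.
    comps = [(1.0 / k, lo + (j + 0.5) * (hi - lo) / k, sd0) for j in range(k)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, mu, sd) for w, mu, sd in comps]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and SDs.
        comps = []
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
            comps.append((nj / len(xs), mu, max(math.sqrt(var), min_sd)))
    return comps

def gmm_pdf(x, comps):
    """Density of the mixture at x."""
    return sum(w * normal_pdf(x, mu, sd) for w, mu, sd in comps)

def kl_numeric(p, q, lo=-1.0, hi=2.0, n=3000):
    """D(P || Q) by trapezoidal integration of p * log(p / q) on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        px, qx = p(lo + i * h), q(lo + i * h)
        if px > 0.0 and qx > 0.0:  # skip points where a density underflows
            total += px * math.log(px / qx) * (0.5 if i in (0, n) else 1.0)
    return total * h

# Synthetic similarity values standing in for real matrices.
random.seed(0)
class_sims = [random.gauss(0.55, 0.15) for _ in range(1000)]  # ligand pairs
query_sims = [random.gauss(0.50, 0.10) for _ in range(300)]   # query row

q_comps = em_gmm(class_sims, k=3)        # Q-distribution of the class
mu_l, sd_l = ml_gaussian(query_sims)     # query distribution
kld = kl_numeric(lambda x: normal_pdf(x, mu_l, sd_l),
                 lambda x: gmm_pdf(x, q_comps))
print(q_comps, (mu_l, sd_l), kld)
```

Numerical integration is used here because the K–L divergence between a Gaussian and a mixture with K > 1 has no closed form; for K = 1 the closed-form Gaussian expression can be used instead.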

5. Conclusions

We developed a quantitative method for comparing query compounds with target classes. The discriminative comparison was achieved by the K–L divergence of 3D similarity distributions. The distributions were generated from 3D structures (sampled multi-conformers) with target annotation, and their parameters were optimized to best fit the frequency histograms. The feasibility index, F_m, and the probability, P(ν(l_m) = i), derived from the K–L divergence demonstrate the discrimination of queries against target classes. The feasibility index resulting from the GMM (K = 1) showed better feasibility for class discrimination than those from the GMM (K = 3) and (K = 7). Among the target classes, CTSD showed the most desirable feasibility and COX2 was the least desirable target for chemocentric informatics. The K–L divergence comparison of an unprecedented compound, BNDS-A, showed a consistent order of K–L divergence (VDR < ESR < CTSD) under different target assumptions, suggesting that our method is applicable to discriminative predictions of unknown query compounds in chemocentric informatics. This study will contribute to 3D chemocentric target deconvolution for unprecedented drug scaffolds. In the near future, this quantitative method should be further studied with regard to chemical optimization between the chemical space and the pharmacological space.

Supplementary Materials

Supplementary Materials can be found at https://www.mdpi.com/1422-0067/21/12/4208/s1.

Author Contributions

M.-h.K. conceived and designed the study. S.-H.L. and S.A. carried out all computational experiments. M.-h.K., S.-H.L., and S.A. analyzed all the data. M.-h.K. and S.-H.L. wrote the manuscript and M.-h.K. revised it. M.-h.K. provided the research work facility and funding. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (No.: 2017R1E1A1A01076642) and by the Agency for Defense Development (Grant ID: PD1806130GD).

Acknowledgments

The authors would like to thank OpenEye Scientific Software for providing the free academic license.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ESR	Estrogen receptor alpha
VDR	Vitamin D receptor
COX2	Cyclooxygenase-2
CTSD	Cathepsin D

References

1. Hawkins, P.C.D.; Skillman, A.G.; Nicholls, A. Comparison of shape-matching and docking as virtual screening tools. J. Med. Chem. 2007, 50, 74–82.
2. Gadhe, C.G.; Lee, E.H.; Kim, M.H. Finding new scaffolds of JAK3 inhibitors in public database: 3D-QSAR models & shape-based screening. Arch. Pharmacal Res. 2015, 38, 2008–2019.
3. Keiser, M.J.; Roth, B.L.; Armbruster, B.N.; Ernsberger, P.; Irwin, J.J.; Shoichet, B.K. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 2007, 25, 197–206.
4. Eckert, H.; Bajorath, J. Molecular similarity analysis in virtual screening: Foundations, limitations and novel approaches. Drug Discov. Today 2007, 12, 225–233.
5. Yera, E.R.; Cleves, A.E.; Jain, A.N. Chemical structural novelty: On-targets and off-targets. J. Med. Chem. 2011, 54, 6771–6785.
6. Taylor, R.D.; MacCoss, M.; Lawson, A.D. Rings in drugs: Miniperspective. J. Med. Chem. 2014, 57, 5845–5859.
7. Venkanna, A.; Kwon, O.W.; Afzal, S.; Jang, C.; Cho, K.; Yadav, D.K.; Kim, K.; Park, H.G.; Chun, K.H.; Kim, S.Y.; et al. Pharmacological use of a novel scaffold, anomeric n,n-diarylamino tetrahydropyran: Molecular similarity search, chemocentric target profiling, and experimental evidence. Sci. Rep. 2017, 7, 12535.
8. Afzal, S.; Venkanna, A.; Park, H.G.; Kim, M.H. Metal-free α-C (sp3)—H functionalized oxidative cyclization of tertiary N,N-diarylamino alcohols: Construction of N,N-diarylaminotetrahydropyran scaffolds. Asian J. Org. Chem. 2016, 5, 232–239.
9. Venkanna, A.; Cho, K.; Dorma, L.P.; Kumar, D.N.; Hah, J.M.; Park, H.G.; Kim, S.Y.; Kim, M.H. Chemistry-oriented synthesis (ChOS) and target deconvolution on neuroprotective effect of a novel scaffold, oxaza spiroquinone. Eur. J. Med. Chem. 2019, 163, 453–480.
10. Hu, G.; Kuang, G.; Xiao, W.; Li, W.; Liu, G.; Tang, Y. Performance evaluation of 2D fingerprint and 3D shape similarity methods in virtual screening. J. Chem. Inf. Model. 2012, 52, 1103–1113.
11. Vilar, S.; Hripcsak, G. Leveraging 3D chemical similarity, target and phenotypic data in the identification of drug-protein and drug-adverse effect associations. J. Cheminf. 2016, 8, 35.
12. Pacureanu, L.; Avram, S.; Bora, A.; Kurunczi, L.; Crisan, L. Portraying the selectivity of GSK-3 inhibitors towards CDK-2 by 3D similarity and molecular docking. Struct. Chem. 2019, 30, 911–923.
13. Vogt, M.; Bajorath, J. Introduction of an information-theoretic method to predict recovery rates of active compounds for bayesian in silico screening: Theory and screening trials. J. Chem. Inf. Model. 2007, 47, 337–341.
14. Baldi, P.; Nasr, R. When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values. J. Chem. Inf. Model. 2010, 50, 1205–1222.
15. Kumar, A.; Zhang, K.Y. Advances in the development of shape similarity methods and their application in drug discovery. Front. Chem. 2018, 6, 315.
16. Shin, W.-H.; Zhu, X.; Bures, M.G.; Kihara, D. Three-dimensional compound comparison methods and their application in drug discovery. Molecules 2015, 20, 12841–12862.
17. Lo, Y.-C.; Senese, S.; Damoiseaux, R.; Torres, J.Z. 3D chemical similarity networks for structure-based target prediction and scaffold hopping. ACS Chem. Biol. 2016, 11, 2244–2253.
18. Seo, S.; Lee, T.; Kim, M.H.; Yoon, Y. Prediction of side effects using comprehensive similarity measures. BioMed Res. Int. 2020, 2020, 1–10.
19. Méndez-Lucio, O.; Kooistra, A.J.; Graaf, C.D.; Bender, A.; Medina-Franco, J.L. Analyzing multitarget activity landscapes using protein–Ligand interaction fingerprints: Interaction cliffs. J. Chem. Inf. Model. 2015, 55, 251–262.
20. Pérez-Nueno, V.I.; Venkatraman, V.; Mavridis, L.; Ritchie, D. Detecting drug promiscuity using gaussian ensemble screening. J. Chem. Inf. Model. 2012, 52, 1948–1961.
21. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
22. Hershey, J.R.; Olsen, P.A. Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing—ICASSP ’07, Honolulu, HI, USA, 15–20 April 2007.
23. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
24. Burnham, K.P.; Anderson, D.R. Kullback-Leibler information as a basis for strong inference in ecological studies. Wildl. Res. 2001, 28, 111–119.
25. Nalewajski, R.F.; Parr, R.G. Information theory, atoms in molecules, and molecular similarity. Proc. Natl. Acad. Sci. USA 2000, 97, 8879–8882.
26. Koller, D.; Sahami, M. Toward Optimal Feature Selection. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 284–292.
27. Kümmerer, M.; Wallis, T.S.; Bethge, M. Information-theoretic model comparison unifies saliency metrics. Proc. Natl. Acad. Sci. USA 2015, 112, 16054–16059.
28. Duchi, J. Derivations for linear algebra and optimization. Berkeley Calif. 2007, 3, 2325–5870.
29. McLachlan, G.J.; McGiffin, D.C. On the role of finite mixture models in survival analysis. Stat. Methods. Med. Res. 1994, 3, 211–226.
30. Singh, R.; Pal, B.C.; Jabr, R.A. Statistical representation of distribution system loads using gaussian mixture model. IEEE Trans. Power Syst. 2009, 25, 29–37.
31. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification and Scene Analysis, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 1995.
32. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22.
33. Hartley, H.O. Maximum likelihood estimation from incomplete data. Biometrics 1958, 14, 174–194.
34. McLachlan, G.; Krishnan, T. The EM Algorithm and Extensions; John Wiley & Sons: Hoboken, NJ, USA, 2007; Volume 382.
35. Bajusz, D.; Rácz, A.; Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminf. 2015, 20, 1–13.
36. Montaruli, M.; Alberga, D.; Ciriaco, F.; Trisciuzzi, D.; Tondo, A.R.; Mangiatordi, G.F.; Nicolotti, O. Accelerating Drug Discovery by Early Protein Drug Target Prediction Based on a Multi-Fingerprint Similarity Search. Molecules 2019, 24, 2233.
37. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME—the Konstanz information miner. ACM SIGKDD Explor. Newsl. 2009, 11, 26–31.
38. OpenEye Scientific. OMEGA Software (ver. 2.4.6); OpenEye Scientific: Santa Fe, NM, USA, 2015. Available online: https://www.eyesopen.com/omega (accessed on 18 May 2020).
39. Kim, H.R.; Jang, C.Y.; Yadav, D.K.; Kim, M.H. The Comparison of Automated Clustering Algorithms for Resampling Representative Conformer Ensembles with RMSD Matrix. J. Cheminf. 2017, 9, 21.
40. Grant, J.A.; Gallardo, M.A.; Pickup, B.T. A Fast Method of Molecular Shape Comparison: A Simple Application of a Gaussian Description of Molecular Shape. J. Comput. Chem. 1996, 17, 1653–1666.
41. OpenEye Scientific. Shape TK Software (ver. 1.9.3); OpenEye Scientific: Santa Fe, NM, USA, 2015. Available online: https://www.eyesopen.com/shape-tk (accessed on 18 May 2020).
42. Shape Toolkit 2.0.4. Available online: https://docs.eyesopen.com/toolkits/python/shapetk (accessed on 18 May 2020).
43. Pandas documentation, Version: 1.0.4. Available online: https://pandas.pydata.org/docs/ (accessed on 18 May 2020).
44. NumPy v1.18 Manual. Available online: https://numpy.org/ (accessed on 18 May 2020).
45. SciPy. Available online: https://www.scipy.org/ (accessed on 18 May 2020).
46. Anaconda Documentation. Available online: https://docs.anaconda.com/anaconda/install/ (accessed on 18 May 2020).
47. GitHub. Available online: https://github.com/college-of-pharmacy-gachon-university/KLD-Pharmacological_Class_Similarity (accessed on 18 May 2020).

Figure 1. The problem definition of 3D chemo-centric screening. (a) BNDS-A as a new molecular framework. (b) The role of chemical similarity in virtual screening. (c) The role of chemical similarity in chemo-centric retro-virtual screening. (d) The workflow of this work of an unprecedented drug scaffold.

Figure 2. Representative distributions (Q-distributions) of target classes using the EM-based Gaussian mixture model ($Ξ_{EM} (Φ_{n} (x) | g, ω, μ, σ, K)$) of ligand pair similarity. (a) Q-distribution of ESR; (b) Q-distribution of VDR; (c) Q-distribution of COX2; (d) Q-distribution of CTSD. The red line: GMM K = 1, blue line: GMM K = 3, black line: GMM K = 7, pink bar: histogram of raw data.

Figure 3. Frequency distributions of the $Ξ_{ML} (Φ_{4 n}^{(l)} (x_{k}) | g, ω, μ, σ, 1)$ estimates ($μ_{1}$ and $σ_{1}$). Query (l) $\in$ CTSD (class = 4). (a) CTSD-ESR, (b) CTSD-VDR, (c) CTSD-COX2, and (d) CTSD-CTSD. * The color bars (right side of the distribution) indicate frequency (e.g., yellow in 3(a) represents 3500 to 4000 queries whose ML-estimated means varied from 0.45 to 0.5 and whose standard deviations varied from 0.08 to 0.1).

Table 1. Hyperparameters of Q distributions for target classes.

GMM	ESR		VDR		COX2		CTSD
No(i)	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$
1	0.5483	0.1458	0.5981	0.1224	0.5941	0.1758	0.4560	0.1320
GMM	HIV1		HSP90		TRPV4		TOP1
No(i)	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$	$m_{i}$	$σ_{i}$
1	0.419	0.123	0.614	0.206	0.667	0.176	0.510	0.222

Table 2. K–L divergence of randomly chosen queries between Q distributions and the distributions of queries.

Class	Query	K–L Divergence
		ESR	VDR	COX2	CTSD
ESR	CHEMBL539392	2.6310	5.2420	2.9952	1.9426
	CHEMBL193280	0.0223	0.1144	0.0685	0.0363
	CHEMBL443605	0.0564	0.1847	0.1638	0.2186
VDR	CHEMBL7162	0.0658	0.0107	0.0795	0.0637
	CHEMBL1322390	0.0488	0.0420	0.2391	0.0682
	CHEMBL1452735	0.0983	0.0849	0.3748	0.1003
COX2	CHEMBL1163237	0.4773	0.7264	0.4693	0.2694
	CHEMBL127560	0.0811	0.0436	0.0326	0.0490
	CHEMBL271614	0.0704	0.0417	0.0684	0.0724
CTSD	CHEMBL263810	0.0889	0.0146	0.2667	0.1014
	CHEMBL252655	0.6800	1.0065	0.9193	0.1174
	CHEMBL436438	0.5331	0.8771	0.8109	0.0766

Table 3. K–L divergence of ligands shared with eight target classes *.

Query	Targets	ESR	VDR	COX2	CTSD	HIV1	HSP90	TRPV4	TOP1
CHEMBL163 (RITONAVIR)	VDR/COX2/HIV1	1.2649	2.2088	1.6702	0.6982	0.3587	1.6040	1.9256	1.2754
CHEMBL164 (MYRICETIN)	VDR/COX2/HSP90/TOP1	0.0718	0.0526	0.1148	0.0475	0.0393	0.1655	0.5684	0.0915
CHEMBL772 (RESERPINE)	ESR/VDR/COX2/TOP1	0.3075	0.4963	0.6972	0.2792	0.1685	0.8460	0.7630	0.5009
CHEMBL1813048	COX2/TRPV4	0.2385	0.3053	0.4731	0.2322	0.1704	0.6374	0.6669	0.5810

* The smallest K–L divergence value among the experimentally tested targets of each query is presented in bold.

Table 4. The description on $ℙ (ν (l_{m}) = i)$ and $F_{m}$ according to the number of components of Gaussian Mixture Model K, and the class $ν (l_{m})$ of queries $l_{m}$ ^a.

K = 1		$ℙ (ν (l_{m}) = i)$				$F_{m}$ ^b
		Class of representative distributions $(i)$
		ESR	VDR	COX2	CTSD
Class $ν (l_{m})$ of queries $l_{m}$	ESR	0.4623	0.2172	0.0082	0.3123	0.9272
	VDR	0.1116	0.5101	0.0054	0.3729	1.0205
	COX2	0.0882	0.3216	0.2046	0.3856	0.5071
	CTSD	0.0051	0.0489	0.0057	0.9404	3.9718
K = 3		$ℙ (ν (l_{m}) = i)$				$F_{m}$ ^b
		Class of representative distributions $(i)$
		ESR	VDR	COX2	CTSD
Class $ν (l_{m})$ of queries $l_{m}$	ESR	0.3289	0.2616	0.0725	0.3370	0.7001
	VDR	0.1653	0.5199	0.0517	0.2631	1.0406
	COX2	0.1024	0.4922	0.1534	0.2520	0.4257
	CTSD	0.1348	0.0741	0.0128	0.7783	1.8738
K = 7		$ℙ (ν (l_{m}) = i)$				$F_{m}$ ^b
		Class of representative distributions $(i)$
		ESR	VDR	COX2	CTSD
Class $ν (l_{m})$ of queries $l_{m}$	ESR	0.3669	0.2553	0.0713	0.3065	0.7613
	VDR	0.2164	0.5005	0.0476	0.2356	1.0009
	COX2	0.1387	0.4891	0.1477	0.2245	0.4164
	CTSD	0.1437	0.0705	0.0084	0.7775	1.8691

^a This table represents the feasibility of discrimination depending on the number of components in GMM, K, and the class ν(l_m) of queries l_m. ^b The larger $F_{m}$, the better the performance of discrimination between one class and the others. Estrogen receptor alpha = ESR, Vitamin D receptor = VDR, Cyclooxygenase-2 = COX2, Cathepsin D = CTSD.

Table 5. The comparison between representative queries and unprecedented drug BNDS-A as a query.

Class	Query	Selection Type	Max. of K–L Divergence
			ESR	VDR	COX2	CTSD
ESR	CHEMBL 499809	Mean of Q	0.0363	0.1991	0.1611	0.2772
	CHEMBL 2	(Mean + 2SD) of Q	0.1180	0.1001	0.1547	0.0883
	CHEMBL 558943	(Mean − 2SD) of Q	2.7919	5.2859	2.9632	2.0501
	CHEMBL 604989	Biggest gap of K–L divergence	6.2458	10.9899	6.1578	5.4983
	CHEMBL 292033	Highest Similarity with BNDS-A	0.0298	0.2570	0.2096	0.1082
	BNDS-A	Unknown	0.2116	0.0588	0.1139	0.9704
VDR	CHEMBL 7463	Mean of Q	0.0237	0.0442	0.1446	0.1262
	CHEMBL 603	(Mean + 2SD) of Q	0.0999	0.2738	0.1257	0.0655
	CHEMBL 1116	(Mean − 2SD) of Q	1.2883	2.1898	1.6169	0.4702
	CHEMBL 486541	Biggest gap of K–L divergence	4.2675	7.2936	3.9890	3.3430
	CHEMBL 62136	Highest Similarity with BNDS-A	0.2090	0.1854	0.4785	0.1086
	BNDS-A	Unknown	0.2859	0.0864	0.1888	1.0807
COX2	CHEMBL 1201356	Mean of Q	0.0963	0.1054	0.2187	0.0948
	CHEMBL 16516	(Mean + 2SD) of Q	0.1445	0.1172	0.0385	0.1205
	CHEMBL 1171450	(Mean − 2SD) of Q	3.2143	5.5460	3.1399	2.4262
	CHEMBL 1171454	Biggest gap of K–L divergence	4.4382	7.8994	4.1848	4.1940
	CHEMBL 942	Highest Similarity with BNDS-A	0.1285	0.0546	0.09018	0.06225
	BNDS-A	Unknown	0.6987	0.65378	0.2273	2.0276
CTSD	CHEMBL 263810	Mean of Q	0.0850	0.0113	0.2512	0.1038
	CHEMBL 504438	(Mean + 2SD) of Q	0.6941	1.1751	1.1002	0.3305
	CHEMBL 567893	(Mean − 2SD) of Q	3.5366	6.1606	3.5399	2.0713
	CHEMBL 567893	Biggest gap of K–L divergence	3.5684	6.1606	3.5399	2.0713
	CHEMBL 387576	Highest Similarity with BNDS-A	0.0835	0.1467	0.0952	0.0129
	BNDS-A	Unknown	0.0556	0.26421	0.2092	0.087

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
