# Application of Information-Theoretic Concepts in Chemoinformatics

## Abstract

## 1. Introduction

## 2. Chemical Descriptors and Shannon Entropy

#### 2.1. Information-Theoretic Concepts for Characterizing Topological Properties of Molecules

**Figure 1.** Abstract representation of a molecule as a mathematical graph. Topological properties of molecules are typically derived from a hydrogen-suppressed representation by reducing molecules to their connectivity information and disregarding atom type and bond order.

**Figure 2.** Partitioning of atoms into sets of indistinguishable atoms. Atoms of a molecule are grouped together if they are topologically indistinguishable. For each molecule, atoms belonging to the same group are shown in the same color.

**(a)** All (non-hydrogen) atoms of benzene are indistinguishable. When a hydroxyl group is added, the carbons in positions 2 and 6, as well as those in positions 3 and 5, are indistinguishable, as indicated by the color code.

**(b)** For benzene-1,4-diol, groups of indistinguishable atoms are shown in the same color and the probability distribution resulting from the partitioning is reported.

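Given a partition of the $n$ non-hydrogen atoms of a molecule into $k$ sets of sizes $n_i$, $i = 1 \dots k$, the values $\frac{n_i}{n}$ can be treated as probabilities of a discrete probability distribution and the topological information index is defined as:

$$I_{TOP} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

For benzene, all six atoms fall into a single set, so $n_1 = 6$, and:

$$I_{TOP}(\text{benzene}) = -\frac{6}{6} \log_2 \frac{6}{6} = 0.$$

For benzene with an added hydroxyl group (phenol, Figure 2a), $n_1 = 1$, $n_2 = 1$, $n_3 = 1$, $n_4 = 2$, $n_5 = 2$ and:

$$I_{TOP}(\text{phenol}) = -3 \cdot \frac{1}{7} \log_2 \frac{1}{7} - 2 \cdot \frac{2}{7} \log_2 \frac{2}{7} \approx 2.24.$$

For benzene-1,4-diol (Figure 2b), $n_1 = 2$, $n_2 = 2$, $n_3 = 4$ and:

$$I_{TOP}(\text{benzene-1,4-diol}) = -2 \cdot \frac{2}{8} \log_2 \frac{2}{8} - \frac{4}{8} \log_2 \frac{4}{8} = 1.5.$$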
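The topological information index can be checked numerically with a few lines of Python. The sketch below assumes the index is the base-2 Shannon entropy of the set-size distribution $\frac{n_i}{n}$; the function name `topological_information_index` is illustrative, and the set sizes are those given above for benzene, benzene with an added hydroxyl group (phenol), and benzene-1,4-diol.

```python
from math import log2


def topological_information_index(set_sizes):
    """Shannon entropy (in bits) of the partition described by the set sizes n_i.

    Written as sum(p * log2(1 / p)), which is algebraically the same as
    -sum(p * log2(p)) and avoids a negative zero for a single set.
    """
    n = sum(set_sizes)
    return sum((n_i / n) * log2(n / n_i) for n_i in set_sizes)


# Set sizes of topologically indistinguishable (non-hydrogen) atoms.
partitions = {
    "benzene": [6],
    "phenol": [1, 1, 1, 2, 2],
    "benzene-1,4-diol": [2, 2, 4],
}

for name, sizes in partitions.items():
    print(f"{name}: {topological_information_index(sizes):.3f}")
```

A molecule whose atoms all fall into one set carries no topological information under this index, while a partition into many small sets raises the value.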

#### 2.2. Shannon Entropy Descriptors (SHED)

## 3. Information Theory-Based Assessment of Feature Significance in Active Compounds

**Figure 3.** Calculation of SHED descriptors. Each atom in a molecule is assigned one or more of the following atomic features: hydrophobic (H), aromatic (R), hydrogen bond acceptor (A), or hydrogen bond donor (D). The pairwise combination of these features results in 10 possible feature pairs. For each pair, the distribution of topological distances between atoms with the respective features is determined. Here, the distribution is shown for the aromatic-donor (RD) feature pair. There is only one donor atom in the molecule. To obtain the RD distribution, the shortest path lengths (number of bonds) from the donor atom to the 14 aromatic atoms present in the molecule are calculated. From this distribution, the Shannon entropy for the feature pair is derived and further transformed into a projected Shannon entropy value, i.e., the E-value. The projected entropies of the 10 feature pair distributions are graphically displayed in a wheel chart. The small circle in the chart indicates the E-value for the RD feature pair.
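The procedure described in the Figure 3 caption can be sketched in plain Python. This is a minimal illustration, not the published SHED implementation: the molecule is a toy graph given as an adjacency dictionary, the feature assignment is supplied by hand, and the projection is assumed to be $E = e^{S}$ with $S$ computed using the natural logarithm. All function names are hypothetical.

```python
from collections import Counter, deque
from math import exp, log


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of bonds from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def feature_pair_distances(adjacency, features, f1, f2):
    """Topological distances between atoms carrying f1 and atoms carrying f2."""
    atoms_f1 = [a for a, fs in features.items() if f1 in fs]
    atoms_f2 = [a for a, fs in features.items() if f2 in fs]
    distances = []
    for a in atoms_f1:
        dist = shortest_path_lengths(adjacency, a)
        distances.extend(dist[b] for b in atoms_f2 if b != a and b in dist)
    return distances


def projected_entropy(distances):
    """Return (S, E): entropy of the distance distribution and E = exp(S).

    The exponential projection and the (0, 0) result for an absent
    feature pair are assumptions of this sketch.
    """
    if not distances:
        return 0.0, 0.0
    counts = Counter(distances)
    total = len(distances)
    s = sum((c / total) * log(total / c) for c in counts.values())
    return s, exp(s)


# Toy graph: a chain 0-1-2-3 with one donor (D) and three aromatic (R) atoms.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: {"D"}, 1: {"R"}, 2: {"R"}, 3: {"R"}}

rd_distances = feature_pair_distances(adjacency, features, "R", "D")
s, e = projected_entropy(rd_distances)
print(sorted(rd_distances), round(s, 3), round(e, 3))
```

Under the assumed projection, E can be read as the effective number of equally populated distance bins: three distinct distances observed once each give a value of about 3.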

**Figure 4.** Schematic substructural fingerprint. A substructural fingerprint consists of a predefined set of substructural features, each represented by a specific position in a binary string. If a feature is present in a molecule, the corresponding bit is set on in its fingerprint; otherwise it is set off. The bit pattern is illustrated by the bar, where each square corresponds to a specific substructural feature, as indicated for five different fragments. The molecule shown at the top contains two of the five fragments. For these fragments, the corresponding bits are set on (filled squares), whereas the bits for the three other substructures that do not occur in the molecule are set off (empty squares).
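The bit-setting rule in the Figure 4 caption amounts to a membership test per feature. The sketch below assumes that substructure perception has already been done elsewhere, so a molecule is represented simply by the set of fragment names it contains; the five feature names are placeholders, not the fragments shown in the figure.

```python
# Hypothetical dictionary of five substructural features (placeholder names).
FEATURES = ["carboxylic acid", "phenyl ring", "primary amine", "nitro group", "thiophene"]


def fingerprint(fragments_in_molecule, features=FEATURES):
    """Return one bit per feature: 1 if the feature occurs in the molecule, else 0."""
    present = set(fragments_in_molecule)
    return [1 if feature in present else 0 for feature in features]


# A molecule containing two of the five fragments sets two bits on.
bits = fingerprint({"phenyl ring", "primary amine"})
print("".join(map(str, bits)))  # prints 01100
```

Because every molecule is encoded against the same ordered feature list, bit positions are directly comparable across compounds.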

#### 3.1. Kullback-Leibler (KL) Divergence
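For two discrete distributions $P$ and $Q$ over the same outcomes, the Kullback-Leibler divergence is $D(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$. The sketch below illustrates this definition and, as an assumption of this example, applies it to a single fingerprint bit by comparing its on/off frequency in a set of active compounds with its frequency in a background database. The function names and the choice of base-2 logarithms are illustrative.

```python
from math import inf, log2


def kl_divergence(p, q):
    """D(P||Q) in bits for two discrete distributions of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_i == 0:
            return inf  # P puts probability mass where Q has none
        total += p_i * log2(p_i / q_i)
    return total


def bit_divergence(freq_active, freq_database):
    """KL divergence between the on/off distributions of one fingerprint bit."""
    return kl_divergence(
        [freq_active, 1 - freq_active],
        [freq_database, 1 - freq_database],
    )


# A bit set in 50% of actives but only 25% of database compounds.
print(round(bit_divergence(0.5, 0.25), 4))
# The divergence is not symmetric in its arguments.
print(round(bit_divergence(0.25, 0.5), 4))
```

Bits whose frequency differs strongly between the two sets receive large values, while bits with identical frequencies receive zero.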

#### 3.2. Mutual Information (MI)
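The mutual information of two discrete variables $X$ and $Y$ is $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$. As a minimal sketch, the function below estimates it from a table of co-occurrence counts, for example a 2×2 table relating a fingerprint bit (off/on) to an activity label (inactive/active). The table layout, the function name, and the use of base-2 logarithms are assumptions of this example.

```python
from math import log2


def mutual_information(joint_counts):
    """I(X;Y) in bits from a table joint_counts[x][y] of co-occurrence counts."""
    total = sum(sum(row) for row in joint_counts)
    row_sums = [sum(row) for row in joint_counts]
    col_sums = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for x, row in enumerate(joint_counts):
        for y, count in enumerate(row):
            if count:
                # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with counts
                mi += (count / total) * log2(count * total / (row_sums[x] * col_sums[y]))
    return mi


# Rows: bit off / bit on; columns: inactive / active.
perfectly_informative = [[50, 0], [0, 50]]
uninformative = [[25, 25], [25, 25]]

print(mutual_information(perfectly_informative))  # 1.0
print(mutual_information(uninformative))  # 0.0
```

A feature that perfectly separates two equally sized classes carries one bit of information about the class label; a feature that is independent of the label carries none.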

#### 3.3. Shannon Entropy-Based Fingerprint Similarity Search Strategy

**Figure 5.** Shannon entropy-based fingerprint similarity. The fingerprint Shannon entropy score (SE) is calculated as the sum of the Shannon entropies of individual bit positions (left). The SE of a reference set of four acetylcholine esterase (ACE) inhibitors is shown on the right. The SE is recalculated each time an individual test compound is added to the set. For active compounds, the SE value changes only slightly (top three compounds), but for inactive compounds, the SE increases notably (bottom three compounds).
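The scoring scheme in the Figure 5 caption can be sketched as follows. Fingerprints are assumed to be equal-length lists of 0/1 values, the per-position entropy is the binary Shannon entropy in bits, and the toy reference set is invented for illustration; the function names are hypothetical.

```python
from math import log2


def bit_entropy(p):
    """Shannon entropy (bits) of one bit position set on with frequency p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def fingerprint_entropy(fingerprints):
    """SE: sum of per-position entropies over a set of equal-length bit lists."""
    n = len(fingerprints)
    return sum(bit_entropy(sum(column) / n) for column in zip(*fingerprints))


def entropy_change(reference_set, candidate):
    """Change in SE when one candidate fingerprint is added to the reference set."""
    return fingerprint_entropy(reference_set + [candidate]) - fingerprint_entropy(reference_set)


# Toy reference set of four identical fingerprints (SE = 0).
reference = [[1, 1, 0, 0]] * 4

similar = [1, 1, 0, 0]     # shares the reference bit pattern
dissimilar = [0, 0, 1, 1]  # differs at every position

print(round(entropy_change(reference, similar), 3))
print(round(entropy_change(reference, dissimilar), 3))
```

Ranking database compounds by increasing entropy change then places compounds that fit the reference bit pattern at the top of the list, in line with the behavior described in the caption.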

## 4. Conclusions


© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Vogt, M.; Wassermann, A.M.; Bajorath, J. Application of Information—Theoretic Concepts in Chemoinformatics. *Information* **2010**, *1*, 60-73.
https://doi.org/10.3390/info1020060
