# Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics


## Abstract


## 1. Introduction

^{13} to be exact). However, if one assumes that all members of the set n are also members of the set n + 1, then an approximate solution sufficiently close to the exact solution can be found in a “greedy” fashion. We introduce similarity downselection (SDS), a heuristic algorithm implemented in Python that finds the subset of the n most dissimilar items in a large population. As described, this type of algorithm is critical for accelerating reference-free metabolomics methods for computing in silico libraries, and it is also applicable to similar subsetting problems in other domains of science. SDS generalizes to any application in which the data can be represented as arrays whose elements are the pairwise relationships between each item and all other items in the population. We briefly describe an example application to molecular conformer selection, and we benchmark SDS against both a Monte Carlo sampling method and the exact solution. In addition, we demonstrate the constraints and efficacy of the algorithm using triangle approximations and ratios (Supplementary Information).

## 2. Application: Molecular Conformer Sampling

^{+}, SDGRG [M+Na]^{+}, and Naringin [M−H]^{−}, and plotted in CCS vs. energy space.

## 3. Similarity Downselection Python Module

#### 3.1. Algorithm Description

_{ij} contains the pairwise relation between items i and j. Since N_{ij} = N_{ji}, the matrix is symmetric across the diagonal.

^{323} in our setup) after about 10,000 items, so log-summing was used instead. Effectively, log-summing (or multiplying) rewards items that have large values across all arrays by making their numerical representation larger, and penalizes items that have even one very small pairwise relation with another item by making their numerical representation smaller. N can be very large: it is unbounded in principle and limited in practice only by machine precision and memory. The population used for the original implementation, as discussed in Section 3, contained 50,000 items.
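The greedy procedure summarized in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the published SDS module: the function name `sds_sketch`, its arguments, and the handling of the diagonal and of zero-valued entries are assumptions made for the example.

```python
import numpy as np


def sds_sketch(pairwise, n_max):
    """Return a greedy selection order of up to n_max dissimilar items.

    `pairwise` is assumed to be a symmetric (N, N) array of non-negative
    dissimilarities (e.g., RMSD) with N >= 2. All names are hypothetical.
    """
    d = np.asarray(pairwise, dtype=float)
    with np.errstate(divide="ignore"):
        log_d = np.log(d)  # natural log; zero entries become -inf
    np.fill_diagonal(log_d, np.nan)  # an item is never compared to itself

    # Seed with the most dissimilar pair (largest off-diagonal entry).
    i, j = np.unravel_index(np.nanargmax(log_d), log_d.shape)
    selected = [int(i), int(j)]

    # Summation array: log-sum of each item's relations to the selected set.
    # The nan diagonal propagates, so selected items stay nan in the sum.
    total = log_d[i] + log_d[j]

    while len(selected) < min(n_max, d.shape[0]):
        # Only items that are not nan (not yet selected) are candidates.
        candidates = np.flatnonzero(~np.isnan(total))
        k = int(candidates[np.argmax(total[candidates])])
        selected.append(k)
        total = total + log_d[k]  # log_d[k, k] is nan, which masks item k
    return selected
```

For four points on a line at 0, 1, 2, and 10, with absolute differences as the dissimilarity, `sds_sketch(dist, 3)` returns `[0, 3, 2]`: the two extremes are chosen first, followed by the point whose log-sum of distances to both is largest. Because each step only adds one row to the running sum, every nested set up to `n_max` is obtained from a single pass.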

#### 3.2. Problem and Algorithm Description Using Graph Theory (Nodes and Edges)

## 4. Benchmarking

#### 4.1. Performance against a Monte Carlo Method

^{+}. MC sampling was run for 1,000,000 iterations for each n-sized set, with each run taking more than 2 h to complete. After loading the data matrix, which required about 3 min, the heuristic algorithm found all sets in <1 min. SDS also had a greater RMSD log-sum (total distance between nodes) for every set size, as shown in Figure 3, indicating that it was closer to the exact solution than the MC method at every set size.

^{+}, with similar results. Here, MC performed better than SDS at n = 3 by a small margin (Figure 3). SDS ran the complete search for every possible set of 1 < n < 50,000 in approximately 7 min, including the approximately 3 min required to load the matrix.
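For orientation, a Monte Carlo baseline of the kind used in this comparison might look like the sketch below. The text specifies only the number of iterations and that the largest RMSD log-sum found is reported, so the uniform sampling without replacement, the function names `log_sum` and `mc_best_subset`, and the `seed` parameter are all assumptions made for illustration.

```python
import numpy as np


def log_sum(pairwise, subset):
    """Sum of natural logs over all unordered pairs of items in `subset`."""
    idx = np.asarray(subset)
    block = pairwise[np.ix_(idx, idx)]  # sub-matrix for the chosen items
    upper = block[np.triu_indices(len(idx), k=1)]  # each pair counted once
    return float(np.sum(np.log(upper)))


def mc_best_subset(pairwise, n, iterations, seed=0):
    """Keep the best of `iterations` random n-subsets (hypothetical baseline)."""
    rng = np.random.default_rng(seed)
    best_score, best_subset = -np.inf, None
    for _ in range(iterations):
        # Assumed sampling scheme: uniform, without replacement.
        subset = rng.choice(pairwise.shape[0], size=n, replace=False)
        score = log_sum(pairwise, subset)
        if score > best_score:
            best_score, best_subset = score, sorted(subset.tolist())
    return best_subset, best_score
```

Each iteration costs on the order of n² lookups to score one candidate set, and the search must be repeated for every set size, whereas the greedy approach reuses its running sum from one set size to the next. That difference is consistent with the runtime gap reported above.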

#### 4.2. Performance against the Exact Solution

#### 4.3. Comparing Computational Costs of Calculating Pairwise Relations

## 5. Conclusions

^{th} most dissimilar set in generalized datasets.

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 1.** Demonstration of SDS choosing the 8 most mutually dissimilar conformers for Harmine [M+H]^{+}, SDGRG [M+Na]^{+}, and Naringin [M−H]^{−}, showing the structure of the three most dissimilar conformers for each. SDS works iteratively, building the set n + 1 from the set n.

**Figure 2.** Illustration of the similarity downselection algorithm. The natural log is taken of a square matrix containing the pairwise-similarity relations of the items in the full population. The two most dissimilar items (i.e., the most dissimilar subset of size n = 2) are found and their arrays summed to find the third most dissimilar item (i.e., the subset of size n = 3). Successive most dissimilar subsets are found iteratively by adding the array of the most recently found item to the summation array and taking the index of the largest (or smallest) value. Items already selected cannot be selected again and are represented as nan in the summation array.

**Figure 3.** SDS benchmarked against a Monte Carlo (MC) sampling method for sphingosine [M+H]^{+} and methyleugenol [M+Na]^{+} with conformer populations of 50,000. Top and middle, the conformer RMSD log-sum (a metric of the dissimilarity of the set) for SDS and the largest RMSD log-sum found via the MC method for set size n. Bottom, search time per node for both methods. Time includes the (approximate) 3 min to load the pairwise RMSD matrix.

**Figure 4.** SDS benchmarked against the exact solution used on randomly generated datasets with population size N, searching for the most dissimilar set of size n = N/2. **Top**, total pairwise dissimilarity for the exact solution, SDS, mean, and minimum (most similar) sets. **Bottom**, search time per node for both methods.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Nielson, F.F.; Kay, B.; Young, S.J.; Colby, S.M.; Renslow, R.S.; Metz, T.O.
Similarity Downselection: Finding the *n* Most Dissimilar Molecular Conformers for Reference-Free Metabolomics. *Metabolites* **2023**, *13*, 105.
https://doi.org/10.3390/metabo13010105
