SimilarityLab: Molecular Similarity for SAR Exploration and Target Prediction on the Web

Abstract: Exploration of chemical space around hit, experimental, and known active compounds is an important step in the early stages of drug discovery. In academia, where access to chemical synthesis efforts is restricted in comparison to the pharma-industry, hits from primary screens are typically followed up through purchase and testing of similar compounds, before further funding is sought to begin medicinal chemistry efforts. Rapid exploration of druglike similars and structure–activity relationship profiles can be achieved through our new webservice SimilarityLab. In addition to searching for commercially available molecules similar to a query compound, SimilarityLab also enables the search of compounds with recorded activities, generating consensus counts of activities, which enables target and off-target prediction. In contrast to other online offerings utilizing the USRCAT similarity measure, SimilarityLab's set of commercially available small molecules is consistently updated, currently contains over 12.7 million unique small molecules, and does not rely on published databases which may be many years out of date. This ensures researchers have access to up-to-date chemistries and synthetic processes, enabling greater diversity and access to a wider area of commercial chemical space. All source code is available in the SimilarityLab source repository.


Introduction
Academic groups running primary screens rely heavily on strong preliminary results to build a case for further funding to progress their drug and medicines discovery efforts. Whilst initial hits provide good starting points, some knowledge of the activity landscape can greatly help in judging medicinal chemistry feasibility and requirements. Activity cliffs [1–4] or areas of flat SAR (structure–activity relationships) [5–7] with little synthetic potential and no options for scaffold hopping [8,9] can quickly discount hits and induce failing fast and early, thereby saving time, money and effort. Such good practice contributes to avoiding the currently disastrously high attrition rates in drug discovery [10,11]. In this short communication, we wish to highlight our most recently developed web-service, SimilarityLab (https://similaritylab.bio.ed.ac.uk, accessed on 5 June 2021) [12], which gives all researchers in the field a quick way to source and purchase similar compounds, which are mostly used to explore SAR around their own hit compounds as well as those from the literature (see Figure 1). SimilarityLab makes extensive use of the USRCAT [13] 3D molecular similarity measure to query a local, processed version of the eMolecules database [14], currently containing over 12.7 million commercially available, unique druglike small molecules. Of crucial importance is the up-to-date nature of the commercial chemical space explorable with SimilarityLab, achieved through consistent addition of new compounds and removal of those no longer available. This is in contrast with existing online offerings such as USR-VS [15], which allows querying of a database last updated with molecules from the 2013 ZINC database [16], with an estimated commercial availability of around 50%. The same applies to comparable tools and websites, many of which utilize out-of-date compound archives [17,18].
Processes 2021, 9, 1520

Integration of new molecules into SimilarityLab requires low-energy 3D conformations to be generated. This step, along with the rebuilding of the commercial chemical space after each update, is handled in a compute- and data-efficient manner, greatly reducing the burden of keeping the commercially available chemical space up to date.
Alongside the main use of SimilarityLab for finding molecules that are 3D-similar to users' input queries, a secondary database can also be queried in a mode which enables prediction of protein targets for small molecules. We believe that the implemented approach, which retrieves known active 3D similars from the ChEMBL [19] database, will have an impact when integrated with phenotypic screening campaigns and used to guide target deconvolution.

Materials and Methods
All code generated for the SimilarityLab website, together with supporting code for dataset preparation, including 3D conformer and descriptor generation, is available within the SimilarityLab source repository under an open-source license on GitHub [20].
Backend technologies used to serve SimilarityLab, which currently runs on the University of Edinburgh's Eleanor cloud service, include the Python Flask web framework (version 1.…), Bootstrap (version 5.0.0, distributed open-source project) for styling, Kekule.js [22] (version 0.9.3, distributed open-source project) for user entry of 2D chemical structures, and SmilesDrawer [23] (version 1.2.0, distributed open-source project) for drawing molecules to HTML canvas elements. When the commercially available compound databases are updated, the QED [24] measure of druglikeness is applied, and molecules scoring below a cut-off of 0.67 are removed as non-druglike. The remaining molecules then have a single low-energy conformer generated, using the protocol outlined by Ebejer [25], from which USRCAT descriptors are generated and stored by the backend (see repository for code listing). The same protocol is followed when a user draws a query molecule (without the druglike filter): a single conformer is generated as an intermediate step before descriptor generation and comparison against the commercially available small molecules, and the top similars are returned. The number of returned similars is user-definable, allowing concise SAR exploration with 100-200 molecules, or larger datasets of up to 2000 molecules to be generated for further use in docking, virtual screening and cheminformatics studies.
Target prediction is achieved using a similar approach to commercial chemical space exploration, whereby the USRCAT molecular similarity technique is applied to "active" molecules within ChEMBL [19] (version 29). Active in this sense is defined as having a recorded IC50 or KD of 10 µM or better (i.e., a value of at most 10 µM) against a protein target. The 100 active molecules most similar to the query then have their activities against all protein targets counted. The protein targets are then sorted by the number of times they are hit by this 100-compound list and returned to the user as a ranked list of likely targets, along with the IDs of the known active compounds for each target, which may be further explored and evaluated as to their similarity to the user's supplied query compound.
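The consensus counting described above can be illustrated with a short Python sketch. It is not taken from the SimilarityLab repository: the function name `rank_targets`, the input layout and the toy compound and target names are all assumptions made for illustration, and the similarity scores are taken as already computed.

```python
from collections import Counter, defaultdict


def rank_targets(similar_actives, activities, top_n=100):
    """Rank targets by how many of the top_n most similar actives hit them.

    similar_actives: list of (compound_id, similarity) pairs, in any order.
    activities: dict mapping compound_id -> set of targets it is active against.
    Returns a list of (target, hit_count, supporting_compound_ids).
    """
    # Keep only the top_n actives with the highest similarity to the query.
    top = sorted(similar_actives, key=lambda pair: pair[1], reverse=True)[:top_n]

    counts = Counter()
    support = defaultdict(list)
    for compound_id, _score in top:
        for target in activities.get(compound_id, ()):
            counts[target] += 1
            support[target].append(compound_id)

    # Most frequently hit targets first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(target, hits, support[target]) for target, hits in ranked]


# Toy, made-up data (not ChEMBL records), only to show the shape of the result.
toy_similars = [("CPD-A", 0.91), ("CPD-B", 0.88), ("CPD-C", 0.80), ("CPD-D", 0.42)]
toy_activities = {
    "CPD-A": {"Target X", "Target Y"},
    "CPD-B": {"Target X"},
    "CPD-C": {"Target Y", "Target Z"},
    "CPD-D": {"Target Z"},
}
ranking = rank_targets(toy_similars, toy_activities, top_n=3)
for target, hits, compounds in ranking:
    print(target, hits, compounds)
```

Returning the supporting compound IDs alongside each count mirrors the description above, in which the known actives behind each predicted target are reported so that they can be inspected individually.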

Results
SimilarityLab presents a user-friendly interface for fast molecular similarity calculations (see Figure 2). With an emphasis on speed and near-instant results, it is envisioned that SimilarityLab will play a major role not only in research but also in teaching, allowing large groups to progress cheminformatics experiments by retrieving compounds which are then used as input to a variety of different tools, models and simulations.

Figure 2. Landing page of the SimilarityLab website, allowing access to further interfaces for searching commercial chemical space for molecules highly similar to an input query compound, as well as target prediction for users' supplied queries.

The educational applications of SimilarityLab are strengthened through an intuitive interface, allowing input of molecules using the 2D drawing capabilities of the Kekule.js editor interface, with live automatic updating of the query in the SMILES molecular format shown below (see Figure 3).
Querying for similar molecules is achieved through the "Find similars" link displayed on the landing page in Figure 2. Following this link leads to the "Find similars" page displayed in Figure 3, which allows drawing of query molecules, such as diclofenac shown above, using the Kekule.js drawing applet. Standard chemical file formats such as SDF are supported by the applet, which translates uploaded files into 2D structures before submission to the SimilarityLab backend as SMILES for 3D conformer generation, using the method outlined by Ebejer [25], and molecular similarity calculations. The database of small molecules assessed against the supplied query is user-selectable, along with the number of top similars to be returned, up to a limit of 2000. A similar process is used to assess the targets of diclofenac and suggest possible modes of action.
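The ranking step behind a "Find similars" query can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SimilarityLab implementation: descriptors are taken to be precomputed 60-value USRCAT vectors (five blocks of 12 moments), the score uses the form commonly given for USR-type measures with equal block weights, and the names `usrcat_score`, `top_similars` and `MAX_RESULTS` are hypothetical.

```python
import heapq

MAX_RESULTS = 2000  # upper limit on returned similars stated in the text


def usrcat_score(query, candidate, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """USRCAT-style similarity between two descriptor vectors.

    Each vector holds one block of 12 moments per weight. The score is
    1 / (1 + sum_k w_k * mean(|q - c|) over block k), so identical
    descriptors give 1.0 and the score falls towards 0 as they diverge.
    """
    if len(query) != len(candidate) or len(query) != 12 * len(weights):
        raise ValueError("expected 12 descriptor values per weight")
    total = 1.0
    for k, w in enumerate(weights):
        block = range(12 * k, 12 * (k + 1))
        total += w * sum(abs(query[i] - candidate[i]) for i in block) / 12.0
    return 1.0 / total


def top_similars(query, library, n=100):
    """Return the n best-scoring (compound_id, score) pairs, best first."""
    n = max(1, min(n, MAX_RESULTS))  # clamp the request to the allowed range
    scored = ((usrcat_score(query, desc), cid) for cid, desc in library.items())
    return [(cid, score) for score, cid in heapq.nlargest(n, scored)]


# Toy descriptors (not real molecules), chosen so the ordering is obvious.
query = [0.0] * 60
library = {"mol-1": [0.0] * 60, "mol-2": [1.0] * 60, "mol-3": [0.5] * 60}
hits = top_similars(query, library, n=2)
print(hits)
```

Using `heapq.nlargest` avoids sorting the whole library when only a few hundred hits are requested. A library of millions of compounds would more plausibly be scored with vectorised array operations, which this sketch leaves out for clarity.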
From the landing page in Figure 2, the "Predict targets" link can be followed to arrive at an interface similar to that shown in Figure 3, without the ability to choose a small-molecule database. Drawing diclofenac again in this interface and clicking "Predict targets" takes the user to a page listing the top targets of the close similars of diclofenac, with the two top targets being Cyclooxygenase-2 and Alpha-1a adrenergic receptor, hit by nine and eight close similars to diclofenac, respectively. This is in agreement with the literature, which documents the role of cyclooxygenase-2 in acute pain and pain relief achieved through its inhibition [26] and the role of adrenergic receptors in pain [27].

Discussion
The public availability of SimilarityLab provides a major resource and fills a need, mainly for academic groups in the early stages of drug discovery. Now more than ever, funding for drug discovery efforts is scarce and difficult to secure consistently without commercial funding, which carries IP restrictions and other constraints. It is hoped that SimilarityLab will be used to capitalize on results from primary screens in academia, allowing SAR exploration by non-specialists without access to computational chemists or cheminformaticians. An SAR landscape that is understood, or at least looks promising, strengthens the case for further funding. The high rates of attrition in drug discovery point to the need for more novel and agile techniques, moving away from industry-standard approaches; the ultimate solution may lie in hits identified by smaller, more specialist groups which are then independently progressed to lead status. It should also be stated here that SimilarityLab holds the potential to become a standard resource of information in basic research, particularly in the field of chemical biology and for the generation of tool compounds. Chemical molecules used as tools to study biological function are now part of the standard repertoire for progressing the fundamental understanding of biology. Researchers might ask what other molecules are available to investigate their biological systems; a quick query on the SimilarityLab website can help answer this.

Conflicts of Interest:
The authors declare no conflict of interest.