RetroTransformDB: A Dataset of Generic Transforms for Retrosynthetic Analysis

Software tools for retrosynthetic analysis are now widely used by organic, medicinal, and computational chemists. Rule-based systems rely extensively on collections of retro-reactions (transforms). While there are many public datasets of reactions in the synthetic direction (usually non-generic reactions), there are no publicly available databases of generic reactions in a computer-readable format that can be used for retrosynthetic analysis. Here we present RetroTransformDB, a dataset of transforms that we compiled and coded in SMIRKS line notation. The collection comprises more than 100 records, each including the reaction name, the SMIRKS linear notation, the functional group to be obtained, and the transform type classification. All SMIRKS transforms were tested syntactically, semantically, and from a chemical point of view on different software platforms. The overall dataset design and the retrosynthetic fitness were analyzed and curated by organic chemistry experts. The RetroTransformDB dataset may be used by open-source and commercial software packages, as well as chemoinformatics tools.
Dataset: https://doi.org/10.5281/zenodo.1209312
Dataset License: CC-BY

Retrosynthetic analysis is one of the main tasks in the planning of organic synthesis and a milestone in computer-aided synthesis design. Different approaches have been proposed [1] and several software systems have been developed to address it, including rule-based expert systems, algorithms that use principles of physical chemistry to predict the energy barriers of a reaction, and machine learning techniques [2]. Among them, rule-based expert systems have been the most widespread approach for predicting retrosynthetic routes [3]. The quality of the results from a rule-based system strongly depends on the reactions available for the retrosynthetic analysis, the so-called transforms [4]. While many retrosynthesis software systems are based on manually coded rules [5][6][7][8], some systems [4,9] attempt to automate the rule (transform) generation process [10] in order to cover more reactions. Such an approach is certainly attractive, but the depth of the predictive models that use it strongly depends on the reaction databases they work with [11]. Large reaction databases can be used for the extraction of reaction rules, but full access to such databases is often strictly limited or expensive [12][13][14][15]. There are some freely available databases of chemical reactions extracted from patents [16], but they describe non-generic reactions, which cannot be employed in a generic fashion right away. Another technique for the automatic extraction of reaction rules is based on so-called matched molecular pairs (MMPs) [17,18]. Typically, MMPs correspond to chemical transformations that lead to changes in molecular property values. Predominantly, MMP-based transformations do not make sense in a synthetic context, but they are useful for describing a molecule transformation by means of a software algorithm. There are also other chemoinformatics tools (analogous to some extent to MMPs) for the automatic extraction of chemical transformations from reaction datasets based on automatic reaction mapping [19]. Although these approaches are promising, they are strongly dependent on the method used to detect the reaction centers (typically done by identifying common or maximal common substructures of the reactant and product structures). Additionally, in most cases there is more than one possible reaction mapping. Furthermore, the automatically generated transformation does not always describe the correct reaction center, or describes it only partially or redundantly. Therefore, manual curation of the obtained transformations is mandatory.
In this context, collections of transforms manually coded by experts are still a reasonable choice, and such approaches are capable of providing very promising results [20,21].
Although several datasets with reactions in the synthetic direction [22][23][24][25] have been published, as far as we know there are no publicly available databases of transforms that can be used for retrosynthesis. The few collections of transforms found in the literature are not presented in a computer-readable form, which makes their implementation in retrosynthetic analysis software quite difficult. This indicates the need for a dataset of transforms that (1) consists of transforms corresponding to generic chemical reactions (or chemical transformations [26]) and (2) is presented in a computer-readable form in which transforms can be easily stored, processed, and applied. In accordance with "Nomenclature for organic chemical transformations" [26] and the terminology used in retrosynthetic analysis [27], in this paper we distinguish the following terms: (i) chemical reaction, typically describing a specific chemical reaction with concrete reactants, products, agents, and conditions; (ii) chemical transformation, describing a generic chemical reaction, i.e., one that can be related to a set of many ordinary reactions which share the same reaction center; and (iii) transform, a retro-reaction, i.e., the reverse transformation of a generic chemical reaction.

Data Description
We present RetroTransformDB (ver. 1.0), a dataset of transforms for retrosynthetic analysis, available as a tabular text file as well as an Excel (Microsoft Corp., Redmond, WA) spreadsheet file. Each row of the *.txt/xlsx file holds a single transform record consisting of five columns describing a generic chemical transformation in a retrosynthetic fashion (Figure 1). The first column of the transform record is used as a formal reaction ID and is intended for technical use, such as fast transform reference, quick storage of reaction sequences, etc. The second column, "NAME", contains the name of the chemical reaction that is associated with the transform. The transform described in a given record corresponds to a transformation of a molecule that is exactly the reverse of the actual generic reaction designated by the record name. Column "NAME" can also be used for transform identification. The third column, designated in the header as "SMIRKS", contains the SMIRKS linear notation of the transform and is the most important field of the record. Column "FunctionalGroup" describes which functional groups will be obtained as a result of transform application, e.g., ALCOHOLS, ETHERS, etc. The last column, "TransformType", designates what type of transformation is performed in the retrosynthetic direction, e.g., FGE (functional group exchange), C-C (disconnecting a C-C bond), etc.
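The record layout described above can be read with a few lines of standard-library Python. The following is only a minimal sketch under stated assumptions: the text file is assumed to be tab-delimited with a header row, "ID" is a hypothetical header name for the first column (the other four names are those quoted above), and the three sample rows are made-up placeholders for illustration, not records copied from RetroTransformDB.

```python
import csv
import io
from collections import defaultdict

# Assumed layout: tab-separated with a header row. "ID" is a hypothetical name
# for the first column. The rows below are illustrative placeholders only;
# they are NOT records taken from RetroTransformDB.
SAMPLE = (
    "ID\tNAME\tSMIRKS\tFunctionalGroup\tTransformType\n"
    "1\tPlaceholder reduction\t[H][C:1][O:2][H]>>[C:1]=[O:2]\tALCOHOLS\tFGE\n"
    "2\tPlaceholder addition\t[C:1][C:2][O:3][H]>>[C:1][Mg]Br.[C:2]=[O:3]\tALCOHOLS\tC-C\n"
    "3\tPlaceholder hydrolysis\t[C:1][O:2][H]>>[C:1][O:2]C(=O)C\tALCOHOLS\tFGE\n"
)


def load_transforms(handle):
    """Read transform records from a tab-delimited text stream into dicts."""
    # QUOTE_NONE keeps every character of the SMIRKS string untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip rows that carry no SMIRKS string (e.g., trailing empty lines).
    return [row for row in reader if row.get("SMIRKS")]


def names_by_type(records):
    """Group reaction names by the value of the TransformType column."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["TransformType"]].append(record["NAME"])
    return dict(grouped)


records = load_transforms(io.StringIO(SAMPLE))
by_type = names_by_type(records)
```

With a real copy of the dataset, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, and the delimiter and header names adjusted to whatever the downloaded file actually uses.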
The RetroTransformDB dataset consists of more than 100 SMIRKS line notations corresponding to a wide range of well-known and frequently used retro-reactions. Each transform was manually created and programmatically tested with the Ambit software platform [28][29][30]. The entire dataset was additionally curated considering all transformations (generic reactions) and their interconnections in a hierarchical fashion. The presented SMIRKS notations can be used by any chemoinformatics system that supports SMIRKS linear notation.

Methods
The SMIRKS linear notation [31] is used for describing the transforms in our collection. The SMIRKS notation is intended to represent generic reactions [32] that involve one or more changes in atoms and bonds. Using the full SMARTS [33] syntax, SMIRKS notation is flexible enough to define a set of structural constraints that each reactant must satisfy for the encoded transformation to be applied. In addition, SMIRKS linear notation is easy to edit, widely used, and implemented in many software packages and toolkits.
Most of the transforms presented in the form of SMIRKS linear notations follow the general model product >> reactant1.reactant2, where the product of a given reaction in the synthetic direction is the target molecule in retrosynthetic analysis, and reactant1 and reactant2 are the precursors, separated by a period ".". The transforms in our collection are mainly in the format product >> reactant1.reactant2 (two-component) or product >> reactant (single-component). Information on solvents and catalysts, if needed, can be described in the SMIRKS itself as product > solvent > reactant1.reactant2 (there are no such transforms in the current version of the collection). In SMIRKS line notations, one should describe only the substructures that directly participate in the transformation (i.e., the reaction center) or those molecule fragments that are considered essential for its reactivity. This description is the basis for the application of generic reactions, illustrated in Figure 2.
The rich SMIRKS syntax provides sufficient functionality for a detailed description of the reaction centers, which is critical to the correct representation of a chemical transformation. The transform for the example illustrated in Figure 1 is as follows:
It should be noted that the SMIRKS standard is quite strict; thus, all small details in the linear notation syntax and the encoded logic of chemical expressions must be taken into account. SMIRKS notations in RetroTransformDB are written with explicit H atoms; therefore, the software used is expected to apply the SMIRKS transforms against molecules with explicit H atoms. Additionally, the results from the application of a particular SMIRKS may vary between software systems depending on the level of implementation and the chemoinformatics treatment of the target molecules.
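The general model product >> reactant1.reactant2 (and the optional product > agents > reactant1.reactant2 form) can be illustrated with a small parser. This is a rough sketch written for this explanation only, not the Ambit-SMIRKS implementation: it performs string-level splitting and does no chemical validation. The function names are hypothetical, and the example string is an illustrative amide transform written by hand, not an entry copied from the dataset.

```python
import re
from typing import Dict, List, Set


def split_components(side: str) -> List[str]:
    """Split one side of a transform on '.' that lies outside (...) and [...]."""
    parts, current, depth = [], [], 0
    for ch in side:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "." and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_transform(smirks: str) -> Dict[str, List[str]]:
    """Split 'target>agents>precursors' (agents may be empty, as in 'a>>b')."""
    fields = smirks.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % smirks)
    target, agents, precursors = fields
    return {
        "target": split_components(target),
        "agents": split_components(agents),
        "precursors": split_components(precursors),
    }


def atom_maps(side_components: List[str]) -> Set[int]:
    """Collect atom-map numbers such as the 1 in '[C:1]'."""
    return {int(n) for part in side_components for n in re.findall(r":(\d+)\]", part)}


# Illustrative hand-written example (amide -> acyl chloride + amine), NOT a dataset record.
EXAMPLE = "[C:1](=[O:2])[N:3]>>[C:1](=[O:2])Cl.[H][N:3]"
parsed = parse_transform(EXAMPLE)
```

Comparing `atom_maps(parsed["target"])` with `atom_maps(parsed["precursors"])` gives a cheap sanity check that every mapped atom of the reaction center appears on both sides; unmapped atoms (here Cl and the explicit H) are the ones added or removed by the transform.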
The widespread applicability of some generic reactions was the reason for including them in the transforms collection. As a basis for our set, we used the reactions published by D'Angelo and Smith, divided into two main groups: carbon-carbon bond formation and functional group exchange [34]. In addition, transforms are grouped according to the functional group obtained by applying them. For the sake of completeness of the set of selected reactions, we also used other publications [22][23][24][25] to identify additional reactions that would be of interest in retrosynthetic analysis. The choice of reactions for compiling the collection of transforms is based on information from several sources, as each one is relevant to a particular concept. For example, the list of reactions published by Hartenfeller et al. [22], focused on in silico molecule design, does not contain a Diels-Alder cycloaddition, which is, however, one of the most powerful simplifying transforms from the point of view of classical retrosynthesis as defined by Corey [27].
The open-source application Ambit-SMIRKS [35], available as a command-line interface and a GUI, was used for testing the transforms in the RetroTransformDB dataset. An example is given in Figure 3.

Conclusions
A dataset of appropriate transforms is one of the most crucial elements of any rule-based software for retrosynthetic analysis. Given that there is no publicly available, computer-readable database of retro-reactions, a new collection of transforms (RetroTransformDB) has been presented. Encoding the transforms selected from the literature in SMIRKS linear notation is considered an appropriate choice, as it allows the transforms (retro-reactions) to be clearly defined in a form suitable for computer processing, direct analysis by experts, and editing by users (no special software required). In such a format, the RetroTransformDB collection may be used by multiple software packages and programming tools for molecular modeling.
Further development of the collection (released as future dataset versions on Zenodo) will include a detailed description of more qualitative and quantitative parameters, such as type of reaction, yield, experimental conditions, reliability, etc., as well as an extension of the dataset with more generic transforms for heterocyclic reactions.

Figure 1. Generic transform records represented as rows in a spreadsheet file.

Figure 2. Example of a generic transform (a) and its application (b).

Figure 3. Screenshot of testing a transform (corresponding to the synthesis of amides from acyl chloride) in the Ambit-SMIRKS GUI.