A Reaction Database for Small Molecule Pharmaceutical Processes Integrated with Process Information

This article describes the development of a reaction database whose objective is to collect data on multiphase reactions involved in small molecule pharmaceutical processes, together with a search engine to retrieve the data needed to investigate reaction-separation schemes, such as the role of organic solvents in improving reaction performance. The focus of this reaction database is to provide a data-rich environment in which process information is available to assist the early-stage synthesis of pharmaceutical products. The database is structured in terms of the classification of reaction types; the compounds participating in the reaction; the use of organic solvents and their function; information for single-step and multistep reactions; target products; and reaction conditions and reaction data. Information on reactor scale-up and separation, other relevant information for each reaction, and references are also available in the database. Additionally, the information retrieved from the database can be evaluated in terms of sustainability using well-known “green” metrics published in the scientific literature. The application of the database is illustrated through the synthesis of ibuprofen, for which data on different reaction pathways have been retrieved from the database and compared using “green” chemistry metrics.


Introduction
Organic chemistry has an important role to play in the development of synthetic routes for new drugs during early stage process development. To pursue synthesis at a high level, access to chemical information is needed, which can be provided by knowledge databases, experience, literature review and/or computer-aided tools [1,2]. The retrieved data are used for similarity searches, reaction data retrieval, synthesis route planning, drug discovery and development, and prediction of physicochemical properties [3]. Methods, algorithms and tools that systematize data collection and the retrieval of chemical information, and that assist in solving many problems related to the synthesis of molecules in organic chemistry, have been developed since the 1970s. The methods and tools for reaction synthesis are based on retrieving chemical information organized in chemical reaction databases, where data for individual reactions and structural information for the different components involved in each reaction are stored.
Computer-aided tools have been developed to solve problems related to "synthesis" and "retrosynthesis." The focus of these tools is to generate a number of possible chemical synthesis paths through possible precursors (a synthesis tree) to achieve the synthesis of a given target compound.
In retrosynthesis, the process of generating the possible pathways starts from the given target compound and, by going backwards, the reactions necessary to synthesize the target compound are identified. In addition, the reactions that produce the reactants of the identified reactions are generated. The process is repeated until commercially available reactants are identified. These approaches are based on heuristics and logical rules, and all of them rely on knowledge databases [4-8]. Recently, computer-aided tools based on algorithmic approaches have been developed, such as Route Designer [9], which automatically extracts rules that capture the essence of the reactions in the chemical reaction database [10]. The tool ICSYNTH utilizes a graph-based approach with available data from the literature to generate the reaction rules [9]. Many other computer-aided methods and tools for reaction synthesis, with different characteristics, have already been developed, for example, tools that perform combinatorial searches, screen the generated alternatives based on information retrieved from knowledge databases and perform extensive reaction assessment calculations [11-15].
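The backward, repeat-until-commercially-available procedure described above can be sketched as a small recursive search. This is only an illustration under stated assumptions: the rule table `RULES`, the set `COMMERCIAL` and all compound names are hypothetical placeholders, and the code does not reproduce how Route Designer, ICSYNTH or any other cited tool works.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule base: compound -> alternative sets of precursors.
# The names are placeholders, not real chemistry.
RULES: Dict[str, List[List[str]]] = {
    "target": [["intermediate_A", "reagent_X"], ["intermediate_B"]],
    "intermediate_A": [["feedstock_1", "feedstock_2"]],
    "intermediate_B": [["intermediate_C"]],  # intermediate_C has no known route
}
COMMERCIAL: Set[str] = {"reagent_X", "feedstock_1", "feedstock_2"}

Step = Tuple[str, Tuple[str, ...]]  # (product, precursors)


def retrosynthesis(compound: str, rules, commercial, max_depth: int = 5) -> List[List[Step]]:
    """Return every route to `compound` that ends in commercial compounds.

    Each route is a list of steps ordered from starting materials to target.
    """
    if compound in commercial:
        return [[]]  # nothing to make: one route with zero steps
    if max_depth == 0 or compound not in rules:
        return []    # dead end: no known way to make this compound
    routes: List[List[Step]] = []
    for precursors in rules[compound]:
        # Combine the sub-routes of all precursors of this alternative.
        partial: List[List[Step]] = [[]]
        for p in precursors:
            sub = retrosynthesis(p, rules, commercial, max_depth - 1)
            partial = [a + b for a in partial for b in sub]
        routes += [r + [(compound, tuple(precursors))] for r in partial]
    return routes


routes = retrosynthesis("target", RULES, COMMERCIAL)
for route in routes:
    for product, precursors in route:
        print(" + ".join(precursors), "->", product)
```

The alternative through `intermediate_B` is discarded because one of its precursors is neither commercial nor covered by a rule, which mirrors the stopping condition in the text.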
Searching for reactions and retrieving the relevant information is a complex problem because it involves searching for chemical structures (complete or partial), transformation information (reaction centers), descriptions of the reactions (reaction type, general comments) and numerical data such as experimental reaction data (including conversion, yield, selectivity and reaction conditions). Reaction databases that help to organize, store and retrieve data continue to be developed (Houben-Weyl [16] and Theilheimer [17]). More recently, the field of reaction databases has evolved further: databases (see Table 1) such as CASREACT [18], ChemReact [17] and REAXYS (previously Beilstein plus Reactions) [19] have been established, while reaction databases such as ChemInform [20] have become well known.

General Databases
In these types of database, the information included is focused on organic reactions and synthetic methods in general. The CASREACT reaction database [18] covers reactions dating back to 1840 and, as it is updated daily, now contains more than 74.9 million reactions. The information is related to organic synthesis, including organometallics, total synthesis of natural products and biocatalytic (biotransformation) reactions. This database can be used to find different ways to produce the same product (single-step or multi-step reactions), applications of a particular catalyst and various ways to carry out specific functional group transformations. The REAXYS reaction database [19], based on data from Elsevier's industry-leading chemistry databases (CrossFire Beilstein, CrossFire Gmelin and Patent Chemistry Database), includes data for more than 40.7 million reactions, dating from 1771 to the present. It includes a large number of compounds (organic, inorganic and organometallic) and experimental reaction details (yield, solvents etc.). It is searchable for reactions, substances, formulas, and data such as physico-chemical properties and spectra. Additionally, the REAXYS database can be used for synthesis route planning. The Current Chemical Reactions (CCR) database [21] includes over one million organic reactions together with reaction diagrams, critical conditions and bibliographic data. The Reference library of synthetic methodology (RefLib) covers reaction data from 1946 to 1992. The database contains information from different sources and the latest version has a comprehensive heterocyclic chemistry database [17].
The ChemReact reaction database [17] is a closed database that covers the period from 1974 to 1998 and includes over 3.5 million reactions. It is searchable by reaction type and provides information on the reaction transformation, classified by type of reaction, together with relevant data (bibliographic, spectra and yield). Chemogenesis is a web-book [22] dealing with chemical reactions and chemical reactivity. It examines the rich science between the periodic table and the established disciplines of inorganic and organic chemistry. The Organic Synthesis database [23] includes more than 6000 organic reactions and is searchable by the reaction type or the structure of the compounds; it provides information for single-step and multi-step organic reactions together with reaction components, conditions and description. The Chemical Synthesis reaction database [24] enables the user to find reactions related to reagents or target products, and it also provides the necessary details of the reagents. The Synthetic Pages reaction database [25] covers 292 reactions and provides information on the optimized reaction procedure. It is searchable by reaction type and/or the structure of the reagent or the target product. The Chemical Thesaurus reaction database [26] contains 4000 reactions classified as organic, inorganic, organometallic, transition metal and biochemical.
The WebReaction reaction database [27] covers over 400,000 reactions; it can be searched by defining the structure of the reactant and the product, and it performs searches based on reaction similarity with a focus on the reaction center. The Science of Synthesis database (previously Houben-Weyl) [16] covers information for organic and organometallic reactions with detailed experimental procedures, methodology evaluation and discussion of the field. Finally, the SPRESI reaction database [28] contains 4.6 million reactions and enables searching of structures, references and reactions.
Synthetic Reaction Updates (previously Methods in Organic Synthesis) lists many organic reactions (in graphical form) and is searchable by reaction type [29].

Specialized Databases
These databases specialize in one class of reaction. The ChemInform reaction database [20] includes more than 2 million reactions, including organic, enzymatic and microbial reactions. The available data cover the application of new reagents and catalysts as well as the preparation of natural and pharmaceutical products. Other aspects covered by the ChemInform database include synthetic procedures, enantio- and diastereoselective syntheses and new protection/de-protection procedures. The Biotage Pathfinder reaction database [30] specializes in verified methods of microwave synthesis.
The e-EROS (Encyclopedia of Reagents for Organic Synthesis) [31] focuses on the reagents and catalysts used in organic chemistry for synthesis. The FlowReact Search [32] covers a range of over 2000 flow chemistry reactions adapted from publications by pharmaceutical, fine chemical and biotech companies. The Protecting Groups reaction database [33] provides information on protection, de-protection and trans-protection methods, stability, lability and reaction conditions, and includes up-to-date information. Recently, a reaction library focused on generic reactions (88 reactions, ~20,000 reactants) with high reliability and reasonable yield has been developed by Masek et al. [34]. The objective of this library is to provide information on synthetically feasible design ideas for de novo drug design.
Representing chemical reactions in a structured way is a complex task. The reaction information contained in a database needs to fulfill several criteria and needs to be categorized with respect to its searchable reaction information. The criteria that a reaction database should fulfill are [17]:
(i) Each reaction is an individual record in the database (detailed and graphical). The reaction must be retrievable from the database as a detailed record (reagents, products, stoichiometry etc.). It can also be extracted as a graphical representation where the reaction scheme is shown. In many databases, the reaction is represented in a graphical form.
(ii) Structural information for the target product as well as the substrates.
(iii) Reaction centers. The reaction center of a reaction is the collection of atoms and bonds that are changed during the reaction [3].
(iv) Reaction components must be searchable. This covers information for the components involved in the reaction, such as reagents, catalysts and solvents.
(v) Multistep reactions. In the case of multistep reactions, all reactions (individual and whole pathway) must be searchable.
(vi) Reaction conditions. Conditions such as pH, temperature and pressure should be searchable by exact value and by a suitable range of values.
(vii) Reaction classification. The type of reaction (e.g., esterification) should be searchable.
(viii) Post-processing of the database contents. The retrieved reaction data can be exported to other tools (e.g., MS Excel).
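To make criteria (vi) and (vii) concrete, the following minimal sketch filters a few in-memory records by reaction type and by an inclusive range of condition values. The record layout, the field names (`reaction_type`, `temperature_c`, `ph`) and all values are assumptions made for illustration; they are not taken from any of the databases discussed here.

```python
# Hypothetical records; identifiers and values are illustrative only.
RECORDS = [
    {"id": "rxn-1", "reaction_type": "esterification", "temperature_c": 60.0, "ph": 4.0},
    {"id": "rxn-2", "reaction_type": "esterification", "temperature_c": 110.0, "ph": 2.0},
    {"id": "rxn-3", "reaction_type": "hydrogenation", "temperature_c": 60.0, "ph": 7.0},
]


def search(records, reaction_type=None, **ranges):
    """Filter by reaction type and by inclusive (low, high) condition ranges.

    An exact-value search is expressed as a range with low == high.
    """
    hits = []
    for rec in records:
        if reaction_type is not None and rec["reaction_type"] != reaction_type:
            continue
        if all(low <= rec[key] <= high for key, (low, high) in ranges.items()):
            hits.append(rec)
    return hits


# Range search: esterifications run between 50 and 80 degrees Celsius.
in_range = search(RECORDS, "esterification", temperature_c=(50.0, 80.0))
# Exact search: any reaction type at exactly 60 degrees Celsius.
exact = search(RECORDS, temperature_c=(60.0, 60.0))
print([r["id"] for r in in_range], [r["id"] for r in exact])
```

Treating an exact match as a degenerate range keeps a single code path for both search modes named in criterion (vi).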
Many reaction databases have been developed over time. Some of them contain a large number of reactions and others a smaller number, and some cover the whole range of organic and/or inorganic reactions. There are also reaction databases that cover more specialized reactions, such as solid reactions and flow reactions. Most of the databases cover the most important criteria defined by Zass [17], such as the need for individual reaction records (criterion i in Table 1). In Table 1, existing reaction databases are listed and classified based on the criteria presented above. The numbers of reactions, as well as online sources, are also listed. The main objective of this article is to assist pharmaceutical process development in the early stages of synthesis route selection and development by providing enhanced process understanding. To achieve this, a data-rich environment where knowledge can be collected, stored and retrieved is a requirement. A database of reactions taking place in pharmaceutical processes, covering information connected to the criteria listed by Zass [17] and additionally covering process information, has been developed to create an environment where process knowledge is available. Connecting individual reactions to criteria such as scalability, cost, expected yield, reaction steps, ease of separation and safety, and to parameters such as reaction conditions, experimental data and models, can improve process understanding and decision making during synthesis route selection. In addition to constraints of high product quality and process economics, a pharmaceutical process needs to fulfill environmental criteria. In particular, for pharmaceutical processes, the environmental sustainability evaluation must be performed during the early stage of process development [36], before approval by the regulatory bodies, as re-approval of the process can be very expensive [37]. Constable et al. [38] have reviewed "green" metrics proposed in the literature; these metrics are used to increase awareness of the waste sources generated by the reaction and to identify opportunities for further improvement. The reviewed "green" metrics are listed in Table 2, where for each metric an explanation and the equation to quantify it are given. This information, in combination with other knowledge databases and computer-aided synthesis design (CASD) tools developed earlier, provides an opportunity for an integrated approach to solving problems related to synthesis route selection and improvement, taking into account important process considerations such as the development time to establish the synthesis route, product quality and cost of manufacture, which are often linked to "green" chemistry metrics, and the final approval of regulatory agencies [1]. This process-related information is not available in the reaction databases listed in Table 1, but is needed for plant-wide design, process-operation simulation and optimization in studies related to the sustainability and economics of processes producing active pharmaceutical ingredients [39-41].
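Table 2 is not reproduced in this section, so the sketch below uses the commonly quoted textbook definitions of four mass-based "green" metrics (atom economy, E-factor, mass intensity and reaction mass efficiency); these may differ in detail from the forms listed in Table 2. The function names and the process masses in the example are assumptions, and waste is taken to be every input mass that does not end up as product.

```python
def atom_economy(product_mw, reactant_mws):
    """Atom economy (%): product molecular weight over the sum of reactant MWs."""
    return 100.0 * product_mw / sum(reactant_mws)


def mass_intensity(total_input_mass, product_mass):
    """Mass intensity: total mass entering the process per mass of product."""
    return total_input_mass / product_mass


def e_factor(total_input_mass, product_mass):
    """E-factor: mass of waste per mass of product.

    Assumes waste = all inputs minus product, so E-factor = mass intensity - 1.
    """
    return (total_input_mass - product_mass) / product_mass


def reaction_mass_efficiency(product_mass, reactant_masses):
    """Reaction mass efficiency (%): product mass over total reactant mass."""
    return 100.0 * product_mass / sum(reactant_masses)


# Atom economy of acetic acid + ethanol -> ethyl acetate + water
# (molecular weights in g/mol: 60.05, 46.07 and 88.11).
ae = atom_economy(88.11, [60.05, 46.07])

# Hypothetical process masses in kg, for illustration only.
mi = mass_intensity(total_input_mass=12.0, product_mass=2.0)
ef = e_factor(total_input_mass=12.0, product_mass=2.0)
rme = reaction_mass_efficiency(product_mass=2.0, reactant_masses=[1.5, 1.0])
print(round(ae, 1), mi, ef, rme)
```

Comparing alternative pathways to the same product then amounts to evaluating these functions on the data retrieved for each pathway.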
In this article, the developed reaction database is presented with a specific focus on reactions (including multiple reactions) taking place in pharmaceutical processes and on connecting them with process information. The reactions in this database have been categorized according to the reaction type, the target product to be produced (when single-step or multistep reactions are considered), the reaction product and the effect of solvent use on the reacting system. Reaction conditions (temperature, pressure etc.), reaction components (reagents, catalysts etc.), reaction data (conversion, selectivity etc.), scaling information and, finally, batch or continuous processing are included in the developed database. For each reaction entry, a description of the process exists and the references are provided. A more detailed description of the database development and structure follows later in this article.
This reaction type database, more specifically, aims to:
1. Identify reactions that are used to produce different types of products (Active Pharmaceutical Ingredients (API), Intermediates).
2. Identify reactions to be utilized, for a given compound availability.
3. Investigate the function of different types of solvents in single/multiphase reactive systems.
4. Facilitate the choice of the reaction conditions.
5. Evaluate the reaction pathway in terms of yield, cost and sustainability metrics.
6. Facilitate the reactor design from available experimental data and kinetic models.
In addition, with the process information that is included in the database and has been mentioned in points 1-6 above, the database fulfills most of the criteria defined by Zass [17] (see Table 3). Table 3 provides a comparison of the available databases with respect to the criteria given by Zass [17]. It can be noted that most of the available databases provide information for individual reactions (criterion i) and molecular structure information on reactants and products (criterion ii). However, the remaining criteria are covered only in some databases (see Table 3).

Reaction Database
The data required to populate a reaction database that satisfies the abovementioned objectives have been acquired from numerous published articles and patents. The collected knowledge from these sources has been structured in the database according to a developed ontology (knowledge representation) and stored for easy data retrieval and re-use in different likely applications. The database consists of classes, sub-classes, instances and objects. A class represents a conceptual grouping of similar terms and describes concepts in the domain; classes are the focus of most ontologies. A class can have subclasses that represent concepts that are more specific than the super class [42]. A simplified flow diagram, which serves as a guide to the reaction database in terms of the knowledge representation system, the classes and instances of data, the available data, and where the information can be found in the article, is shown in Figure 1. Figures 2 and 3 provide details of the knowledge representation system, Table 4 provides information on the classification of the data and Tables 5-10 provide information on the available data.

Knowledge Representation
For the development of the reaction database, classes have been used to represent the main knowledge categories: the reaction type, the reaction, the phases involved, how the phases are created, solvent use, solvent function, type of solvent, reaction conditions, available data and, finally, operation mode (listed in Table 4 and shown in Figure 2). The first knowledge class consists of the different reaction types that are commonly found in pharmaceutical processes (e.g., hydrogenation). The set of these reaction types constitutes the instances of the class. The second class in the knowledge representation system (or data) is the reaction, which is divided into four sub-classes: the reactants, the reaction products, the target product and the reaction information (see Figure 3). The instances of the first three sub-classes are classified in terms of the name of the compound, the type of the compound and the molecular structure, while the fourth sub-class summarizes information for the specific reaction. This type of information is important for identifying the structural changes of the compounds during the reaction. The next class of data consists of instances describing the phases involved in the specific reaction. This class connects the reaction information with the reaction performance class, which is described later, and it has an important role in the database because it allows the advantages of using a multiphase or a single-phase system to be identified. The next two classes of the database consist of instances describing the solvent function, in case an organic solvent has been used in the reactive system (for example, the solvent function "creates a second phase and removes the reaction product"), and the type and name of the organic solvent used. The last three classes of the data consist of instances describing the reaction performance under certain conditions. The reaction conditions class consists of instances related to reaction variables such as reaction temperature, stoichiometric amount, catalyst (type and amount), pH, pressure and the need to use acid or base. The data class consists of four sub-classes: reaction data, dynamic data, kinetic model and scale. The instances of the reaction data sub-class are information related to reaction time (or residence time), conversion, selectivity, reaction yield and overall process yield (usually after isolation and purification). The instances of the dynamic data sub-class are sets of experimental data that can be used to fit or to develop a kinetic model. The next sub-class describes the availability of kinetic models that can be used, either directly or after fitting to the experimental data, for reaction optimization studies. The last sub-class of the data class is a super class that provides important information on the scale at which the reaction has been performed. Finally, the last class of the data is the operation mode; instances of this class are different operational modes such as batch reaction or flow reaction.
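One way to picture the classes and sub-classes described above is as nested record types. The sketch below uses Python dataclasses; every class name, field name and unit is an assumption chosen for readability, the SMILES strings are merely one possible encoding of molecular structure, and the example entry is illustrative rather than a record from the database.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Compound:
    name: str            # name of the compound
    compound_type: str   # type of the compound, e.g. "alcohol"
    structure: str       # molecular structure; SMILES is an assumed encoding


@dataclass
class Reaction:
    reactants: List[Compound]
    products: List[Compound]
    target_product: Optional[Compound] = None
    info: str = ""       # free-text reaction information


@dataclass
class Conditions:
    temperature_c: Optional[float] = None
    pressure_bar: Optional[float] = None
    ph: Optional[float] = None
    catalyst: Optional[str] = None
    acid_or_base: Optional[str] = None


@dataclass
class Data:
    reaction_time_h: Optional[float] = None
    conversion: Optional[float] = None
    selectivity: Optional[float] = None
    reaction_yield: Optional[float] = None
    dynamic_data: List[Tuple[float, float]] = field(default_factory=list)
    kinetic_model: Optional[str] = None
    scale: Optional[str] = None


@dataclass
class ReactionEntry:
    reaction_type: str
    reaction: Reaction
    phases: str
    operation_mode: str
    solvent_function: Optional[str] = None
    solvent: Optional[str] = None
    conditions: Conditions = field(default_factory=Conditions)
    data: Data = field(default_factory=Data)


ethanol = Compound("ethanol", "alcohol", "CCO")
acid = Compound("acetic acid", "carboxylic acid", "CC(=O)O")
ester = Compound("ethyl acetate", "ester", "CCOC(C)=O")

entry = ReactionEntry(
    reaction_type="esterification",
    reaction=Reaction([acid, ethanol], [ester], target_product=ester),
    phases="organic-aqueous",                   # illustrative value only
    operation_mode="batch",
    conditions=Conditions(temperature_c=60.0),  # illustrative value only
)
print(entry.reaction_type, entry.reaction.target_product.name)
```

Keeping the solvent, conditions and data fields optional reflects the fact that not every literature source reports all of them.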

Database Structure
Table 4 lists the classes of the data in the first column; the second column relates each class to the instances it contains; and the third column lists the instances for the different classes. The structure of the database is shown in Figure 2.

Main Classes | Relation with Instances | Instances
Reaction Type, T | T = [T_1, T_2, ..., T_i, ..., T_n] | T_i: reaction type in the knowledge base (e.g., acylation)
Reaction, R | R = [R_1, R_2, ..., R_i, ..., R_n] | R_i: reaction of the ith reaction type; for each reaction, information about the reactants and reaction products is provided, as well as information for the target product and process (for example: 1st step for production of an API)
Phases involved, P | P = [P_1, P_2, ..., P_i, ..., P_n] | P_i: phase of the ith reaction (e.g., organic-aqueous, organic-gas)

In Figure 3, the sub-classes and the values of each instance in the "Reaction" class are illustrated. For example, each reaction has reactants as well as reaction products, and can be used to eventually produce a target product (in the case of multi-step reactions). Each of these sub-classes takes values such as the name of the compound (N), the type of the compound (T, for example, alcohol) and the molecular structure of the compound. The reaction info sub-class takes text values that can be used to give useful insights into the reaction.
Processes 2017, 5, 58

Statistics of the Reaction Database
To determine the range of applications and the capability of the reaction database, statistics of the data stored within it are needed. The statistics are given in terms of the number of reactions, reaction types, list of APIs, reactions where the use of solvent improves the reaction performance and available kinetic models.

General Numbers
In this section, the general statistics of the database are given, for example, the total number of reactions, the total number of APIs, the number of intermediates, reactions that require solvent, multiphase reactions, experimental data and type of reaction operation (batch or continuous; technology, e.g., microwave technology). The general characteristics of the reaction type database are listed in Table 5.

Reaction Types
The different reaction types included in the database are listed in Table 6, together with the number of reactions, the need for a catalyst, the phases (usually) involved and the solvent function if a solvent is used.

Active Pharmaceutical Ingredients (APIs)
In Table 7, the list of the available APIs (or final drugs) in the database is given. The database includes at least one pathway for each API listed in Table 7. In some cases, more than two completely different published reaction pathways exist (for example, for ibuprofen), and these are also listed in the database. Finally, in some cases efforts have focused on improving a certain reaction within the reaction path; these improvements have also been included in the knowledge database.

Improved Reaction Performance When Solvent Is Used
Reaction improvements in terms of reaction time, reaction volume, yield, conversion and/or selectivity, and post-processing improvements in the separation and purification steps related to solvent use, are considered in the database development. The functions of the solvent and the possible process improvements are listed below and summarized in Table 8:
a. Reaction medium.
b. Separation of the main product in order to shift the equilibrium reaction towards the product side, to increase the yield and/or reduce the separation steps required.
c. Separation of an inhibitory product to increase the productivity of the reaction.
d. Controlled release of substrate, which might improve process safety in the case of hazardous compounds or increase selectivity towards the desired product.
e. Reaction volume reduction.
f. Dissolves reactants to increase the reaction rate and/or to avoid process complications when the reaction involves compounds in solid phase at the reaction conditions.
Table 8. Solvent functions in reaction and their possible improvements.

Solvent Functions
Reaction medium --

Catalyst carrier (Phase creation)
In Table 9 below, different reactive systems where solvent has been added in order to improve the reaction performance are listed. Table 9 is organized by reaction type and main product; it also gives the reaction phases, the solvent function and the reaction improvement.

Kinetic Models Available
Table 10 lists the availability of kinetic models (found through literature search) and their inclusion in the kinetic model library of the reaction database. Some of the kinetic models available in the literature have been analyzed and validated against experimental data and, where found acceptable, have then been used for reaction optimization in order to establish the design space. In other cases, a model has been taken directly from the reported reference. For example, the model reported by Thakar et al. [57] for the second hydrogenation step of ibuprofen synthesis has been used successfully, without any modification of the kinetic parameters, to fit the dynamic experimental data published by Cho et al. [58].
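As a hedged illustration of what fitting a kinetic model to dynamic data can involve, the sketch below estimates a rate constant by linear least squares on log-transformed concentrations. The first-order rate law C(t) = C0 exp(-k t) is an assumption chosen for simplicity and is not the model of Thakar et al. [57]; the data are synthetic, generated from a known rate constant so that the fit can be checked.

```python
import math


def fit_first_order(times, concentrations):
    """Least-squares fit of ln(C) = ln(C0) - k * t; returns (k, C0)."""
    ys = [math.log(c) for c in concentrations]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return -slope, math.exp(intercept)


# Synthetic "dynamic data": concentration vs. time generated with
# k = 0.3 (1/h) and C0 = 2.0 (mol/L); not experimental values.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
conc = [2.0 * math.exp(-0.3 * t) for t in times]

k, c0 = fit_first_order(times, conc)
print(f"k = {k:.3f} 1/h, C0 = {c0:.3f} mol/L")
```

For rate laws that cannot be linearized, a nonlinear regression routine would be needed instead; the validation idea (compare model predictions with the stored dynamic data) stays the same.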

Reaction Database Application
The reaction database has multiple features that can assist in creating a data-rich environment in early-stage pharmaceutical process-product development. The knowledge stored in the database is searchable by forward or backward search options. As illustrated in Figure 2, data can be retrieved for a specific search, and the retrieved data are used for reaction improvement studies in subsequent calculation and analysis.
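The text does not define the forward and backward search options at this point, so the sketch below rests on an assumed reading: a forward search starts from an available reactant and returns the reactions that consume it, while a backward search starts from a desired product and returns the reactions that form it. The step records and compound labels are hypothetical.

```python
# Hypothetical reaction steps; compound labels are placeholders.
STEPS = [
    {"reactants": {"A", "B"}, "product": "C", "target": "API-1"},
    {"reactants": {"C", "D"}, "product": "API-1", "target": "API-1"},
    {"reactants": {"A", "E"}, "product": "F", "target": "API-2"},
]


def forward_search(steps, reactant):
    """Forward: which stored reactions can use this available compound?"""
    return [s for s in steps if reactant in s["reactants"]]


def backward_search(steps, product):
    """Backward: which stored reactions produce this compound?"""
    return [s for s in steps if s["product"] == product]


from_a = forward_search(STEPS, "A")
to_api = backward_search(STEPS, "API-1")
print([s["product"] for s in from_a], [sorted(s["reactants"]) for s in to_api])
```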

Reaction Data
Process improvements are usually related to resources such as development cost and time. Establishing the reactions, the experimental procedure and the reaction conditions might require significant resources during the initial reaction screening that is needed to identify the reaction pathway leading to the desired type of product (e.g., chiral alcohols). However, an information-based system that provides information on reaction identification, reaction conditions and experimental procedures can substantially reduce the time and cost of the initial screening. The data-rich environment can also point to reaction improvements related to mass and heat transfer through the use of new technologies, such as flow reactions using, for example, new microwave technologies.
The use of experimental data (dynamic or end-point) can assist the improvement of the reaction system, as the effect of changes in the reaction variables can be understood and quantified. Moreover, experimental data can be used to develop or to fit kinetic models that capture the behavior of the system under different conditions. These kinetic models can be used for validation studies, for optimization studies to identify improved reaction conditions, and to evaluate different operation scenarios and/or different reactor designs and networks.
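As a minimal sketch of how dynamic data can be used to fit a kinetic model, the following Python example estimates a rate constant from concentration-time data. It assumes first-order kinetics, C(t) = C0 exp(-k t), which is an illustrative choice and not a model taken from the database; the function name `fit_first_order` and the data points are hypothetical (the data are generated synthetically, not measured).

```python
import math


def fit_first_order(times, concentrations):
    """Estimate (k, c0) for C(t) = c0 * exp(-k * t).

    Uses ordinary least squares on the linearized form ln C = ln c0 - k * t.
    """
    n = len(times)
    log_c = [math.log(c) for c in concentrations]
    t_mean = sum(times) / n
    y_mean = sum(log_c) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, log_c))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return -slope, math.exp(intercept)


# Synthetic "dynamic data": generated from k = 0.05 1/min and c0 = 1.0 mol/L.
times = [0.0, 10.0, 20.0, 40.0, 60.0]                      # min
concentrations = [1.0 * math.exp(-0.05 * t) for t in times]  # mol/L

k, c0 = fit_first_order(times, concentrations)

# Once fitted, the model can be used to predict conversion at other times.
conversion_at_30_min = 1.0 - math.exp(-k * 30.0)
print(f"k = {k:.4f} 1/min, c0 = {c0:.3f} mol/L, X(30 min) = {conversion_at_30_min:.3f}")
```

With real data, the residuals between model and measurements would also need to be inspected before the fitted model is used for optimization or reactor design studies.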

Organic Solvents
Another class of process improvement is related to the role of the solvent during the synthesis step. There are cases where the use of a solvent enhances the reaction performance. Solvents can have different roles, such as creating a second phase that removes an inhibitory product and shifts the reaction equilibrium towards the product side, or simply creating a second phase that removes the product in order to facilitate the subsequent separation. The solvent can also be used as a carrier for the controlled release of the substrate into the reaction mixture, which can minimize the amount of by-products formed when the substrate concentration is high. The solvent can also act as the reaction medium and broaden the range of feasible reaction conditions in order to improve reaction performance or satisfy other process concerns such as process safety. For example, if a reaction takes place at very low temperatures (<−25 °C), the solvent should be liquid at this condition and able to dissolve the reactants, products and catalyst [70].
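The low-temperature requirement above can be illustrated with a simple screening step. The sketch below filters a small solvent list by liquid range; the solvent list, the rounded melting and boiling points (approximate literature values at atmospheric pressure) and the function name `liquid_at` are assumptions made for illustration and are not taken from the database.

```python
# Approximate melting and boiling points at 1 atm in °C (rounded literature values).
SOLVENTS = {
    "tetrahydrofuran": (-108.0, 66.0),
    "toluene": (-95.0, 111.0),
    "dichloromethane": (-97.0, 40.0),
    "acetic acid": (17.0, 118.0),
    "dimethyl sulfoxide": (19.0, 189.0),
    "water": (0.0, 100.0),
}


def liquid_at(solvents, temperature_c, margin_c=5.0):
    """Return the solvents that are liquid at temperature_c.

    A safety margin is kept from both the melting and the boiling point.
    """
    return sorted(
        name
        for name, (melting, boiling) in solvents.items()
        if melting + margin_c < temperature_c < boiling - margin_c
    )


candidates = liquid_at(SOLVENTS, -25.0)
print(candidates)  # ['dichloromethane', 'tetrahydrofuran', 'toluene']
```

Being liquid at the operating temperature is only a necessary condition; the ability of each candidate to dissolve the reactants, products and catalyst still has to be checked separately.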

Search Options
The search options of the database in terms of both the retrieved data and the use of that data for a defined process are given below.

1. Search for reaction types. Different reaction types can be searched in the reaction database. The retrieved results provide information on the reaction (reactants, product and target product), the solvent role and how it improves the reaction, the reaction conditions (e.g., temperature range, acid/base, different catalysts), quantitative data (e.g., conversion, concentration vs. time) and, finally, applicability information such as scale or batch/continuous mode. The results can be used as a similarity check and to identify reaction conditions, solvents and possibilities for improvement (e.g., equipment, production mode, technology) for quick reaction optimization.

2. Search for main products (such as APIs, intermediates or types of products like chiral alcohols). By searching for main products or types of products, the reactions used to synthesize this type of compound can be retrieved. The results are used to identify different synthesis routes and to evaluate them in terms of reaction performance, cost, scalability and sustainability.

3. Search for reactants. The results obtained by searching for reactants are used to identify ways of further utilizing them, in case they are used or produced during a reaction.

4. Search for multiphase reactions. Multiphase and single-phase reactions in which the use of a solvent has improved the reaction performance can be searched. The retrieved results are used to identify the role of the multiphase system (for example, the solvent creates a second phase to remove an inhibitory by-product) and to quantify the improvement in reaction performance (for example, increased conversion).
To summarize, the information retrieved from the reaction database can be used to:
a. Identify reaction pathways, reaction types, reactants, catalysts, solvents and base/acid.
b. Optimize reaction conditions.
c. Investigate the solvent role in process improvement.
d. Optimize the reactions identified during process development in terms of cost, yield and time.
e. Improve the overall process performance in terms of separation process, overall yield, sustainability, safety, scalability, controllability and utilized mass.
f. Improve reactor design and evaluate different reactor designs.
g. Establish operation procedures for the reactors.
h. Assist in plant-wide design, simulation and techno-economic optimization.
i. Enhance process understanding.
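The search options described in this section can be sketched with a small in-memory record list. This is a hypothetical illustration, not the database's actual implementation: the `Reaction` fields, the function names and the intermediate labels (such as "P1-intermediate-A") are placeholders, and only the reaction types, operation modes and references for ibuprofen pathways 1 and 2 follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    pathway: str         # label of the multistep route
    step: int            # position of the reaction within the route
    reaction_type: str
    reactants: tuple     # compound names (placeholders where not given in the text)
    product: str
    target_product: str
    mode: str            # batch, fed-batch or continuous flow
    reference: str


# Illustrative records only; compound labels starting with "P1-"/"P2-" are placeholders.
DB = [
    Reaction("pathway 1", 1, "Friedel-Crafts acylation", ("isobutylbenzene",), "P1-intermediate-A", "ibuprofen", "batch", "[71]"),
    Reaction("pathway 1", 2, "hydrogenation", ("P1-intermediate-A",), "P1-intermediate-B", "ibuprofen", "fed-batch", "[71]"),
    Reaction("pathway 1", 3, "carbonylation", ("P1-intermediate-B",), "ibuprofen", "ibuprofen", "fed-batch", "[71]"),
    Reaction("pathway 2", 1, "Friedel-Crafts acylation", ("P2-reactant",), "P2-intermediate-A", "ibuprofen", "continuous flow", "[73]"),
    Reaction("pathway 2", 2, "1,2-aryl migration", ("P2-intermediate-A",), "P2-intermediate-B", "ibuprofen", "continuous flow", "[73]"),
    Reaction("pathway 2", 3, "saponification", ("P2-intermediate-B",), "ibuprofen", "ibuprofen", "continuous flow", "[73]"),
]


def by_reaction_type(db, reaction_type):
    """Search option 1: all records of a given reaction type."""
    return [r for r in db if r.reaction_type == reaction_type]


def by_target_product(db, target):
    """Search option 2: steps leading to a target, grouped by pathway in step order."""
    routes = {}
    for r in sorted(db, key=lambda rec: (rec.pathway, rec.step)):
        if r.target_product == target:
            routes.setdefault(r.pathway, []).append(r)
    return routes


def by_compound(db, compound):
    """Search option 3: reactions that consume and reactions that produce a compound."""
    consumed_in = [r for r in db if compound in r.reactants]
    produced_by = [r for r in db if r.product == compound]
    return consumed_in, produced_by


routes = by_target_product(DB, "ibuprofen")
acylations = by_reaction_type(DB, "Friedel-Crafts acylation")
consumed_in, produced_by = by_compound(DB, "P1-intermediate-A")

for pathway, steps in routes.items():
    print(pathway, "->", [s.reaction_type for s in steps])
```

The fourth option (multiphase reactions) would be an analogous filter on a solvent-function field, which is omitted here because the solvent roles of the individual steps are not stated in this section.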

Problem Definition
To illustrate the applicability of the database, the synthesis of ibuprofen is selected as an example. The objectives of this example are to:
1. Retrieve data relevant to the reaction pathways of ibuprofen.
2. Collect data related to the individual reactions.
3. Evaluate the alternatives based on green metrics.

Database Results
The main product sub-class is found in the "Reaction" class, and from there both backward information (reactants, reaction types) and forward information (solvents, reaction conditions, data, etc.) are retrieved. The information as retrieved from the database is shown in Figure 4, which contains three screenshots for the purpose of illustration. Screenshot-1 connects the main product (ibuprofen) to the reaction type data; screenshot-2 connects the main product to the specific reaction information (temperature, pressure, solvent use, etc.); screenshot-3 connects the main product to modelling details (for example, the kinetic model). The information is also given in the text as follows:
a. Summary of the findings (reaction pathways, reaction types, operation mode, available data and reference).
b. For each reaction pathway, each reaction is analyzed in terms of:
   i. Reactants, products, by-products, acids/bases, solvents and catalysts.
   ii. The reaction conditions for each reaction.
   iii. The reaction data.
The retrieved information is used for the evaluation of different pathways to produce ibuprofen using the green chemistry metrics.
The database search gives three different reaction pathways. Pathway 1, proposed by Elango et al. [71], consists of three batch reaction steps: a Friedel-Crafts acylation, a hydrogenation and, finally, a carbonylation. The first reactive step has been improved by Lindley et al. [72] using a continuous counter-flow reaction-separation system, which enables the recovery and recycle of the solvent and the unreacted reactants. The second step (hydrogenation) and the final step (carbonylation) each take place in a fed-batch reactor. Pathway 2, proposed by Snead et al. [73], also consists of three reactive steps (a Friedel-Crafts acylation, a 1,2-aryl migration and a saponification), all of which take place in a continuous flow reactor. Finally, pathway 3, proposed by Bogdan et al. [74], consists of the same three reactive steps as pathway 2, although the intermediates and reactants are different. Table 11 gives a summary of the reaction pathways retrieved from the database. The details for reaction pathways 1 and 3 are given in the supplementary material (see Sections A.1 and A.2 for pathways 1 and 3, respectively), while the retrieved data for reaction pathway 2 are given and analyzed below.

Pathway 2: Ibuprofen Synthesis
The individual reaction details for the reaction pathway proposed by Snead et al. [73] are presented in Table 12, where each step is given in terms of its reactants and reaction product, and the overall reaction pathway is illustrated in Figure 5. The stoichiometric amounts of the reactants, the solvents, the catalyst, acid/base and by-products are also given in Table 12 for the three reaction steps involved in this pathway. The reaction conditions in terms of temperature, pressure, residence time, catalyst amount and solvent amount are listed in Table 13 for all the reaction steps. The retrieved experimental data are given in Table 14 in terms of conversion, selectivity, overall reaction yield, experimental data and model availability.
A simple evaluation based on green metrics [38] has been performed and the results are illustrated in Figure 6. For this analysis, pathway 1 (BHC pathway) with and without recycling of HF and IBB, pathway 3 proposed by Bogdan et al. [74], and pathway 2 proposed by Snead et al. [73] have been considered. The effective mass yield, which is the ratio of the mass of product (kg) to the total mass of non-benign reactants, has been evaluated first. As shown in Figure 6a, step 1 of the BHC synthesis requires larger amounts of non-benign reactants than pathways 2 and 3, whereas reaction steps 2 and 3 require far fewer non-benign reactants. Another metric that has been evaluated is the mass intensity (MI), which gives the total mass required for the reaction per kg of product.
In Figure 6b, it can be seen that the first reaction steps of pathways 2 and 3 require smaller amounts of reactants than the BHC pathway when recycling is not considered. However, when recycle is considered, the MI metric has lower values for the BHC pathway than for the other two pathways, where recycle is not possible. In addition, pathway 2 requires far fewer reactants than pathway 3.
The E-factor metric, which gives the waste generated per kg of product, has been evaluated for all four cases (shown in Figure 6c). The first step of the BHC pathway has been found to be the main contributor to the E-factor: even though step 1 produces a small amount of waste during the reaction, the large value of the E-factor is caused by the large stoichiometric amounts of solvent and reactant needed. When the solvent and the reactant are recycled back into the reactor, the E-factor reduces dramatically, and its small remaining value is caused by the small amount of waste and of non-recovered solvent and reactant (~1%) [72]. The other two pathways (2 and 3) have relatively high E-factor values, which means that larger amounts of waste are generated through the synthesis steps.
The waste generated in pathway 2 has been found to be slightly lower than that in pathway 3. Finally, the atom efficiency has been evaluated for all the pathways and is illustrated in Figure 6d. It can be seen that the atom efficiency for the BHC pathway is very high, so most of the reactant atoms remain in the final product, whereas the atom efficiencies are much lower for the two new pathways, which means that pathways 2 and 3 might generate more waste than the batch process. Note that the interpretation and the analysis of each "green" metric should be performed individually for each reaction pathway, as the metrics represent different aspects of the process (for example, waste generation and total mass used per kg of product). Therefore, an overall conclusion about the "green extent" of the reaction pathways using weighted individual metrics cannot easily be made.
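The four metrics discussed in this section can be written down compactly. The sketch below follows the definitions given in the text (effective mass yield as product mass over non-benign mass, MI as total mass per kg of product, E-factor as waste per kg of product, atom efficiency as the fraction of reactant mass ending up in the product). The class and function names are hypothetical, the step masses are invented round numbers and not values from Tables 12-14, and waste is assumed to be everything charged that is neither product nor recycled. The atom efficiency example assumes the commonly cited overall stoichiometry of the BHC route (isobutylbenzene, acetic anhydride, hydrogen and carbon monoxide), which is general literature knowledge rather than data retrieved from the database.

```python
from dataclasses import dataclass


@dataclass
class StepMasses:
    """Masses for one reaction step, all in kg (illustrative values only)."""
    product: float            # desired product obtained
    total_input: float        # reactants, solvents, catalysts, acids/bases charged
    non_benign_input: float   # part of total_input classed as non-benign
    recycled: float = 0.0     # material recovered and recycled to the reactor


def effective_mass_yield(s):
    """Mass of product per mass of non-benign material used."""
    return s.product / s.non_benign_input


def mass_intensity(s):
    """Net mass required per kg of product (recycled material is credited)."""
    return (s.total_input - s.recycled) / s.product


def e_factor(s):
    """Waste per kg of product; waste = input that is neither product nor recycled."""
    return (s.total_input - s.recycled - s.product) / s.product


def atom_efficiency(product_mw, reactant_mws):
    """Molar mass of product over the summed molar masses of stoichiometric reactants."""
    return product_mw / sum(reactant_mws)


# Hypothetical step, evaluated without and with recycle of solvent and reactant.
no_recycle = StepMasses(product=1.0, total_input=10.0, non_benign_input=4.0)
with_recycle = StepMasses(product=1.0, total_input=10.0, non_benign_input=4.0, recycled=8.0)

print(mass_intensity(no_recycle), e_factor(no_recycle))      # 10.0 9.0
print(mass_intensity(with_recycle), e_factor(with_recycle))  # 2.0 1.0

# Atom efficiency for the assumed BHC stoichiometry (molar masses in g/mol):
# isobutylbenzene + acetic anhydride + H2 + CO -> ibuprofen (+ acetic acid).
bhc_atom_efficiency = atom_efficiency(206.285, [134.222, 102.089, 2.016, 28.010])
print(f"{100 * bhc_atom_efficiency:.1f} %")
```

The recycle case shows why MI and the E-factor drop sharply for the BHC pathway once the solvent and unreacted reactant are recovered, while the atom efficiency, which depends only on stoichiometry, is unaffected by recycling.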

Conclusions
In this article, a reaction database has been developed to assist pharmaceutical process development during the early stages of synthesis route selection and process-product development by providing enhanced process understanding. A data-rich environment is proposed for this task, where knowledge can be collected, stored and retrieved. The focus of this database is on pharmaceutical processes and the multiphase reactions taking place within them. The reactions in this database are represented in terms of reaction type, target product to be produced (when single-step or multistep reactions are considered), reaction product and the effect of the solvent use in the reacting system. The information contained in the database includes reaction conditions (temperature, pressure, etc.), reaction components (reagents, catalysts, etc.), reaction data (conversion, selectivity, dynamic data sets and kinetic models), scaling information and, finally, batch or continuous processing. For each reaction entry, a description of the process together with literature references is provided.
Reaction data collection is a crucial and very challenging task together with the development of an appropriate knowledge representation system.Also, verification of the consistency of the data is necessary but tests for consistency of data are not yet available, except for some phase equilibrium data.
The application of the database has been highlighted by retrieving data for the synthesis of ibuprofen and using the retrieved data to evaluate the identified reaction pathways with "green" metrics. This reaction database can be used to provide important information during the development of pharmaceutical processes at the early stages of process design. The reaction database covers chemical and biochemical reactions, and the future aim is to extend it with further reactions and pathways to cover a wider range of reaction systems and products. Many of the multiphase or single-phase reactions available in the database have been improved through the use of solvents. The solvents are either organic solvents or ionic solvents and, in some cases, the extra phase is created by a resin, especially for biochemical processes.

Figure 1. Simplified flow diagram highlighting the contents of the reaction database. Figures 2 and 3 provide details of the knowledge representation system, Table 4 provides information on the classification of the data and Tables 5-10 provide information on the available data.

Figure 2. Knowledge representation system of the reaction database.

Figure 3. Sub-classes and instances/individuals for the reaction class of the database.

Figure 4. Screenshot of database search. Screenshot-1: main product versus reaction type data. Screenshot-2: continuation of screenshot-1 (main product versus specific reaction data). Screenshot-3: continuation of screenshot-2 (main product versus kinetic model availability).

Figure 5. Reaction pathway proposed by Snead et al. for the continuous flow synthesis of ibuprofen.

Figure 6. "Green" metrics evaluation for the reaction pathways found in the reaction database: (a) effective mass yield (EM) metric, (b) mass intensity (MI) metric, (c) E-factor metric and (d) atom efficiency metric.

Table 1. Database review. All the databases have been summarized with respect to the number of reactions and the focus of the database.

Table 4. Main classes of the reaction type database and the instances.

Table 5. Summary of the information included in the database. N: name of the compound, T: type of the compound and S: molecular structure.

Table 6. Reaction types included in the database, phases involved and function of the solvent used.

Table 7. List of APIs and final drugs (*) in the database, for which the complete reaction pathway and the reactions are provided in the database.

Table 9. List of reactions where the use of the solvent has a specific function that leads to a direct improvement in reaction performance.

Table 10. Kinetic model availability; * indicates models that are included in the kinetic model library.

Table 11. Summary of the data retrieved from the database.

Table 12. Retrieved reaction information from the database.

Table 13. Reaction conditions for the three reactive steps.

Table 14. Available experimental data as retrieved from the database.