SUSSOL—Using Artificial Intelligence for Greener Solvent Selection and Substitution

Solvents come in many shapes and types. Looking for solvents for a specific application can be hard, and looking for green alternatives for currently used nonbenign solvents can be even harder. We describe a new methodology for solvent selection and substitution, by applying Artificial Intelligence (AI) software to cluster a database of solvents based on their physical properties. The solvents are processed by a neural network, the Self-organizing Map of Kohonen, which results in a 2D map of clusters. The resulting clusters are validated both chemically and statistically and are presented in user-friendly visualizations by the SUSSOL (Sustainable Solvents Selection and Substitution Software) software. The software helps the user in exploring the solvent space and in generating and evaluating a list of possible alternatives for a specific solvent. The alternatives are ranked based on their safety, health, and environment scores. Cases are discussed to demonstrate the possibilities of our approach and to show that it can help in the search for more sustainable and greener solvents. The SUSSOL software makes intuitive sense and in most case studies, the software confirms the findings in literature, thus providing a sound platform for selecting the most sustainable solvent candidate.


Introduction
Solvents are frequently used as diluents in chemical production processes and products like paint, cleaning agents, glue, and ink. The global solvent market is about 20 million tons a year [1]. The paints, cleaning, and pharmaceutical industries represent the main sectors with more than 60% of the total consumption of solvents [2]. In 2017, paints and coating (46%) and pharmaceutical industry (9%) combined solvent usage accounted for 55% in Europe [3]. Most classical solvents are nonrenewable, nonbiodegradable, flammable, or toxic and therefore, exhibit various problems in manipulation, recycling, and waste treatment [4].
The European regulation concerning the "Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH)" was adopted to improve the protection of human health and the environment from the risks that can be posed by chemicals [5]. Solvents such as cyclohexane, benzene, toluene, chloroform and dichloromethane can be found in the REACH restriction list [6]. Chemicals that are candidates for future bans or restrictions are listed in the "candidate list of substances of very high concern (SVHC) for authorization" [7]. The candidate list for SVHC contains some solvents including nitrobenzene, o-toluidine, N-methylacetamide, N,N-dimethylformamide, furan, formamide, N,N-dimethylacetamide, N-methylpyrrolidone, 2-(m)ethoxyethanol, trichloroethylene, 1,2,3-trichloropropane, and 2,4-dinitrotoluene. To evaluate whether the MDS analysis accurately represents the proximity values as distances on the map, the normalized stress value (σn) is used. This value is the proportion of variation of the dissimilarities not accounted for by the distances in the MDS map [51]. 1 − is the fitted proportion, thus a coefficient of determination [51]. The normalized stress for the MDS plot ( Figure  1) is 0.027, whereas the coefficient of determination is 0.97. The success in representing the actual data on a 2D map is also reflected by the two dimensions (D1 and D2), which account for 84.5% of the variance.
The axes on the MDS plot are dimensionless. However, a trend can be observed in the datapoints of the solvents, which makes it possible to assign physical properties to the axes. The properties were assigned to the axes by visually observing the solvents in the MDS map and searching for correlations between D1 and D2 and the relevant physical properties. This helps the user to make some intuitive sense of the dimensionless axes but it is intended for that use only. Naming the axes with the correlated physical properties would be wrong since there is no sole contributor to every axis. Dimension 1 (the x-axis) in the MDS plot is a measure of volatility. From left to right in the visualization a gradient from low boiling solvents (2-methylbutane, 1-pentene, n-pentane, diethyl ether, and isopropylamine have boiling points lower than 36 °C) to high boiling solvents (triethanolamine, tetraethylene glycol, oleic acid, and tri(2-ethylhexyl)phosphate have a boiling point higher than 325 °C) is observed. This is confirmed by the strong correlations between the MDS scores from D1 and the boiling point and relative evaporation rate ( Figure 2). To evaluate whether the MDS analysis accurately represents the proximity values as distances on the map, the normalized stress value (σ n ) is used. This value is the proportion of variation of the dissimilarities not accounted for by the distances in the MDS map [51]. 1 − σ n is the fitted proportion, thus a coefficient of determination [51]. The normalized stress for the MDS plot ( Figure 1) is 0.027, whereas the coefficient of determination is 0.97. The success in representing the actual data on a 2D map is also reflected by the two dimensions (D1 and D2), which account for 84.5% of the variance.
The axes on the MDS plot are dimensionless. However, a trend can be observed in the datapoints of the solvents, which makes it possible to assign physical properties to the axes. The properties were assigned to the axes by visually observing the solvents in the MDS map and searching for correlations between D1 and D2 and the relevant physical properties. This helps the user to make some intuitive sense of the dimensionless axes but it is intended for that use only. Naming the axes with the correlated physical properties would be wrong since there is no sole contributor to every axis. Dimension 1 (the x-axis) in the MDS plot is a measure of volatility. From left to right in the visualization a gradient from low boiling solvents (2-methylbutane, 1-pentene, n-pentane, diethyl ether, and isopropylamine have boiling points lower than 36 • C) to high boiling solvents (triethanolamine, tetraethylene glycol, oleic acid, and tri(2-ethylhexyl)phosphate have a boiling point higher than 325 • C) is observed. This is confirmed by the strong correlations between the MDS scores from D1 and the boiling point and relative evaporation rate ( Figure 2). It is less obvious to assign a physical property (or multiple properties) to Dimension 2 (the yaxis). A correlation between the y-scores from the MDS and water solubility can be found ( Figure 3). However, the y-axis is also linked to other physical properties which are mainly linked to solvent polarity, as discussed in the following case studies. The graphic shows horizontal lines at 0 and 1000 g/L because these are the minimum and maximum values in the dataset for water solubility. Alkanes, alkenes, aromatic solvents, and most of the halogenated solvents are plotted on the lower half of Figure 1, whereas amines, short-chain alcohols, and dipolar aprotic solvents are depicted in the upper half. It is less obvious to assign a physical property (or multiple properties) to Dimension 2 (the y-axis). A correlation between the y-scores from the MDS and water solubility can be found ( Figure 3). However, the y-axis is also linked to other physical properties which are mainly linked to solvent polarity, as discussed in the following case studies. The graphic shows horizontal lines at 0 and 1000 g/L because these are the minimum and maximum values in the dataset for water solubility. Alkanes, alkenes, aromatic solvents, and most of the halogenated solvents are plotted on the lower half of Figure 1, whereas amines, short-chain alcohols, and dipolar aprotic solvents are depicted in the upper half.
We can roughly identify three groups of solvents in the visualization ( Figure 1). A group of relatively high boiling, polar compounds, a group of low boiling, slightly water-soluble solvents, and a group of high boiling, hydrophobic solvents. In between the polar solvents and the group of high boiling, hydrophobic solvents, one can find solvents such as diacetone alcohol, triethyl phosphate, and cyclopentanone.
The three groups of solvents correspond to the three clusters found by Tobiszewski et al. [44]. In their work, the authors define the clusters as "rather nonpolar and volatile compounds," "nonpolar and sparingly volatile solvents," and "polar solvents." In the MDS visualization, solvents that are "similar" are plotted closer together in the 2D solvent space as can be seen in Figure 4 where a group of typical dipolar aprotic solvents (DMF, DMAc, NMP, and DMSO) is highlighted in the MDS plot. Homologous series (e.g., alcohols, alkanes, and aromatic compounds with increasing alkyl chain) can be identified as a trend in the MDS plot ( Figure 4). We can roughly identify three groups of solvents in the visualization ( Figure 1). A group of relatively high boiling, polar compounds, a group of low boiling, slightly water-soluble solvents, and a group of high boiling, hydrophobic solvents. In between the polar solvents and the group of high boiling, hydrophobic solvents, one can find solvents such as diacetone alcohol, triethyl phosphate, and cyclopentanone.
The three groups of solvents correspond to the three clusters found by Tobiszewski et al. [44]. In their work, the authors define the clusters as "rather nonpolar and volatile compounds," "nonpolar and sparingly volatile solvents," and "polar solvents." In the MDS visualization, solvents that are "similar" are plotted closer together in the 2D solvent space as can be seen in Figure 4 where a group of typical dipolar aprotic solvents (DMF, DMAc, NMP, and DMSO) is highlighted in the MDS plot. Homologous series (e.g., alcohols, alkanes, and aromatic compounds with increasing alkyl chain) can be identified as a trend in the MDS plot ( Figure 4).   The relative influence of a functional group decreases with increasing chain length and molecular size. Homologous series, therefore, are not always following a straight line in the MDS visualization, as can be seen in the homologous series from alcohols. Due to a drastic increase in boiling point and decrease in water solubility, vapor pressure, and relative evaporation rate, the series of alcohols shows a sudden jump from 1-propanol to 1-butanol. The relative influence of a functional group decreases with increasing chain length and molecular size. Homologous series, therefore, are not always following a straight line in the MDS visualization, as can be seen in the homologous series from alcohols. Due to a drastic increase in boiling point and decrease in water solubility, vapor pressure, and relative evaporation rate, the series of alcohols shows a sudden jump from 1-propanol to 1-butanol.

Solvent Effect in Keto-Enol Tautomerism
The keto-enol equilibrium of a compound in various solvents can be used as a case to test the MDS plot. The keto-enol tautomerization constant for 2-methyl-5-phenyl-3-oxo-4-pentenenitrile ( Figure 5) is visualized through a color scale in the MDS plot. A top-down trend can be observed in the MDS chart ( Figure 6), which indicates the dependency of the amount of enol-tautomers on solvent polarity. Giussi et al. [53] show that π* is the more important term in a linear solvation energy relationship (LSER) suggesting that the solute-solvent dipole-dipole interactions occur preferably. This makes sense as the (calculated) dipole moment of the keto-tautomer is lower than for the enol-tautomers (E-and Zisomers), i.e., 3.34 D and 4.26 D and 4.99 D, respectively [46]. The relative influence of a functional group decreases with increasing chain length and molecular size. Homologous series, therefore, are not always following a straight line in the MDS visualization, as can be seen in the homologous series from alcohols. Due to a drastic increase in boiling point and decrease in water solubility, vapor pressure, and relative evaporation rate, the series of alcohols shows a sudden jump from 1-propanol to 1-butanol.

Solvent Effect in Keto-Enol Tautomerism
The keto-enol equilibrium of a compound in various solvents can be used as a case to test the MDS plot. The keto-enol tautomerization constant for 2-methyl-5-phenyl-3-oxo-4-pentenenitrile ( Figure 5) is visualized through a color scale in the MDS plot. A top-down trend can be observed in the MDS chart ( Figure 6), which indicates the dependency of the amount of enol-tautomers on solvent polarity. Giussi et al. [53] show that π* is the more important term in a linear solvation energy relationship (LSER) suggesting that the solute-solvent dipole-dipole interactions occur preferably. This makes sense as the (calculated) dipole moment of the keto-tautomer is lower than for the enoltautomers (E-and Z-isomers), i.e., 3.34 D and 4.26 D and 4.99 D, respectively [46].

Reaction Solvents for the Synthesis of Butyl Butanoate
Next, the conversion of butanoic anhydride and 1-butanol into butyl butanoate ( Figure 7) is used as a test for the MDS visualization. The second-order reaction rate constants in various solvents are visualized on the MDS chart ( Figure 8). Again, the trend is apparent, although the rate constant for 1,4-dioxane seems too high to fit the trend. Clark et al. [54] make use of an LSER to demonstrate that  Next, the conversion of butanoic anhydride and 1-butanol into butyl butanoate ( Figure 7) is used as a test for the MDS visualization. The second-order reaction rate constants in various solvents are visualized on the MDS chart ( Figure 8). Again, the trend is apparent, although the rate constant for 1,4-dioxane seems too high to fit the trend. Clark et al. [54] make use of an LSER to demonstrate that β (hydrogen bond accepting ability) and δ 2 H (the square of the Hildebrand solubility parameter) combined account for the rate of the esterification reaction. The remaining parameters (α and π*) were found not to be statistically significant.

Reaction Solvents for the Synthesis of Butyl Butanoate
Next, the conversion of butanoic anhydride and 1-butanol into butyl butanoate ( Figure 7) is used as a test for the MDS visualization. The second-order reaction rate constants in various solvents are visualized on the MDS chart ( Figure 8). Again, the trend is apparent, although the rate constant for 1,4-dioxane seems too high to fit the trend. Clark et al. [54] make use of an LSER to demonstrate that β (hydrogen bond accepting ability) and δ²H (the square of the Hildebrand solubility parameter) combined account for the rate of the esterification reaction. The remaining parameters (α and π*) were found not to be statistically significant.
From this data, it becomes clear that not only polarity-polarizability (π*) but also hydrogen-bond basicity (β) contributes to the y-axis in the MDS visualization.

Liquid-Liquid Extraction of Furfural
Not in all test cases, a clear trend can be observed in the MDS plot. In the following example, solvents used for liquid-liquid extraction from furfural from an aquatic medium [55] are depicted in the MDS plot ( Figure 9). The experimental separation factor is used as a measure for extraction efficiency.
x2α and x1α are the mole fractions of solute and diluent in the aqueous phase, respectively. x2β and x1β are the mole fractions of solute and diluent, respectively, in the organic phase.
The trend in this case is not very obvious, however, a region of interest can be observed which may indicate solvents worth testing in laboratory experiments. It should be noted that Xin et al. [55] did not consider the removal of the extraction solvent. Concentrations of solvents and solutes in the organic and aqueous phase were determined using gas chromatography. In our dataset, multiple From this data, it becomes clear that not only polarity-polarizability (π*) but also hydrogen-bond basicity (β) contributes to the y-axis in the MDS visualization.

Liquid-Liquid Extraction of Furfural
Not in all test cases, a clear trend can be observed in the MDS plot. In the following example, solvents used for liquid-liquid extraction from furfural from an aquatic medium [55] are depicted in the MDS plot ( Figure 9). The experimental separation factor is used as a measure for extraction efficiency.
x 2α and x 1α are the mole fractions of solute and diluent in the aqueous phase, respectively. x 2β and x 1β are the mole fractions of solute and diluent, respectively, in the organic phase.

Solvent Substitution
Three cases are analyzed with the SUSSOL software. In the first case study, the replacement of NMP is analyzed using the full dataset. In the second case, the substitution of toluene in a real-world application is studied using a subset of the full dataset. The third case study is performed from a solvent producer or distributor point of view. In the latter case, tetramethyl oxolane (TMO) was selected for a benchmark analysis against common organic solvents as well as other "green," neoteric, and biobased solvents.

Replacement of NMP
The fluorination via a SNAr reaction is typically promoted by conventional dipolar aprotic solvents [56]. As such, the reaction is a good case to study possible substitutes for NMP.
The full dataset with 500 solvents and 21 physical properties was clustered on a map of 15 × 15 (225 centroids), and 2500 runs were executed on an Intel ® Core™ i7-8850H CPU @ 2.60 GHz. The total computation time was approximately 11 min: clustering 3 min and statistical analysis 8 min.
The clustering of the dataset and the following statistical analysis yielded 84 candidates. Since an aprotic solvent was needed and no physical parameter quantifying that property was present in the dataset, protic solvents were removed manually from the candidates. After removal of the protic solvents, 20 solvents remained as a possible substitute for NMP in the fluorination reaction. The top 10 candidate replacement solvents sorted on Hansen distance to NMP are visualized in a graph with The trend in this case is not very obvious, however, a region of interest can be observed which may indicate solvents worth testing in laboratory experiments. It should be noted that Xin et al. [55] did not consider the removal of the extraction solvent. Concentrations of solvents and solutes in the organic and aqueous phase were determined using gas chromatography. In our dataset, multiple parameters are linked to the solvent's evaporation behavior (boiling point, vapor pressure, relative evaporation rate, and Antoine constants), and therefore, it has a great impact on the MDS map.

Solvent Substitution
Three cases are analyzed with the SUSSOL software. In the first case study, the replacement of NMP is analyzed using the full dataset. In the second case, the substitution of toluene in a real-world application is studied using a subset of the full dataset. The third case study is performed from a solvent producer or distributor point of view. In the latter case, tetramethyl oxolane (TMO) was selected for a benchmark analysis against common organic solvents as well as other "green," neoteric, and biobased solvents.

Replacement of NMP
The fluorination via a S N Ar reaction is typically promoted by conventional dipolar aprotic solvents [56]. As such, the reaction is a good case to study possible substitutes for NMP.
The full dataset with 500 solvents and 21 physical properties was clustered on a map of 15 × 15 (225 centroids), and 2500 runs were executed on an Intel ® Core™ i7-8850H CPU @ 2.60 GHz. The total computation time was approximately 11 min: clustering 3 min and statistical analysis 8 min.
The clustering of the dataset and the following statistical analysis yielded 84 candidates. Since an aprotic solvent was needed and no physical parameter quantifying that property was present in the dataset, protic solvents were removed manually from the candidates. After removal of the protic solvents, 20 solvents remained as a possible substitute for NMP in the fluorination reaction. The top 10 candidate replacement solvents sorted on Hansen distance to NMP are visualized in a graph with a normalized y-axis, where 0 represents the lowest value in the 500 solvent dataset and 1, represents the highest value ( Figure 10). Each solvent is represented as a polyline. More overlap between the polylines signifies similarity between solvents. The same visualization is used in the SUSSOL software ( Figure 11). This allows the user to graphically asses the replacement candidates. In this case, there is quite a bit of variation for some properties. For certain properties like surface tension, the variation is not a problem, but other properties are key to the solvent's performance. In this case, the solvent's polarity and water solubility have a high impact on the performance of the solvent in the reaction. The solvents 1-formylpiperidine, furfural, 2,5-hexanedione, and diketene are poorly water soluble and cause the relative standard deviation for water solubility to be 40% ( Figure 11). By giving this property more weight in the clustering process, a better output, tailored to the application under consideration, could be achieved. This feature has been provided in the SUSSOL software. The relative standard deviation drops to 4% after applying more weight to water solubility ( Figure 10). The aforementioned solvents disappear from the candidates list.
Molecules 2020, 25, x FOR PEER REVIEW 12 of 28 the variation is not a problem, but other properties are key to the solvent's performance. In this case, the solvent's polarity and water solubility have a high impact on the performance of the solvent in the reaction. The solvents 1-formylpiperidine, furfural, 2,5-hexanedione, and diketene are poorly water soluble and cause the relative standard deviation for water solubility to be 40% ( Figure 11). By giving this property more weight in the clustering process, a better output, tailored to the application under consideration, could be achieved. This feature has been provided in the SUSSOL software. The relative standard deviation drops to 4% after applying more weight to water solubility ( Figure 10). The aforementioned solvents disappear from the candidates list.    In the SUSSOL software, the solvent names are color coded in accordance with the CHEM21 solvent guide. If the candidate list is sorted according to the CHEM21 health score (Table 1), DMSO appears to be the healthiest option for the replacement of NMP, followed by Cyrene and N-butyl pyrrolidone. Cyrene has the advantage that it is based on renewable feedstock. The authors of the study successfully tested 5 solvents in the S N Ar fluorination: DMSO, DMF, sulfolane, NMP, and Cyrene. All of the solvents are present in the candidates list, with exception of sulfolane, which is not present in the dataset. The default scores as determined by the CHEM21 methodology do not reflect the low occupational limit values of pyridine, therefore, the overall ranking was changed to "hazardous" by the CHEM21 solvent team.

Replacement of Toluene
The Belgian company Soudal [57] produces silicone, caulks, polyurethane-foams, and adhesives. One of their neoprene rubber-based contact adhesives contains 75-80% of solvent. In the past, toluene was used to dissolve the neoprene rubber. Nowadays, due to legislation, toluene is banned from all formulations for the do-it-yourself market. To replace toluene, a mixture of 5 solvents (alkanes and others) was used to produce the adhesive. However, the composition of the mixture changes as the solvents evaporate with different evaporation rates, which may lead to (partial) insolubility of the neoprene rubber. Depending on environmental conditions, this may lead to an inhomogeneous film and/or affect the adhesion in a negative way. The most straightforward solution to overcome this problem would be a one-on-one replacement of toluene. Requirements are the complete dissolution of the polymer/resin mixture, a comparable evaporation profile, and a cost-efficient solution.
A subset of the 500 solvents dataset was created, using the Hansen distance to toluene as a measure. All solvents with a distance less than 4 MPa 1/2 to toluene were included in the subset. The correct radius of the neoprene rubber was unknown, however, the smallest reported radius in HSPiP is 6.2. A rule of thumb by Seymour [58] describes two nonpolar compounds are miscible if they are within a distance of approximately 4 MPa 1/2 . To assure the complete solubility of the rubber in the solvent candidates, a distance of 4 MPa 1/2 was adopted. The resulting set of 42 solvents was clustered on a 7 × 7 map. In total, 500 runs were executed. The generated solvent replacement candidates were ranked according to the health, environmental, and safety score ( Table 2). The two first solvents in the ranking were selected for further laboratory testing. p-Cymene can be synthesized from renewable feedstock and is previously presented as a possible substitute for toluene [54]. Dipentene, limonene, and α-pinene are promising candidates because of their renewable origin. Nonetheless, these solvents were not tested because of their strong odor, which is an important factor in consumer formulations. The two selected solvents were tested against the current solvent blend and toluene as a reference. As a first test, solubility of the neoprene rubber was tested by mixing it with the solvents in a 1:9 mass ratio. The mixtures were mixed at the roller bench for 24 h at 6 rpm. The rubber dissolved in all of the solvents. Visual differences in viscosity of the mixture were observed, and the viscosity increased in the order of solvent blend < toluene < p-cymene < isobutylbenzene.
Next, formulations were made based on the current formula. Viscosity was measured on the completed formulations with resin added. Viscosity was determined using a Brookfield RVDV-E apparatus at 20 rpm and at 23 • C/RH 50%, applying the appropriate spindle (Sp5). The viscosity for the finished formulations is approximately a factor 1.5 times higher compared to the toluene formulation and a factor 4-5 times higher compared to the current solvent blend ( Table 3). The toluene formulation, however, also shows a viscosity 3 times higher compared to the solvent blend. To test the evaporation rate, the formulations were placed in aluminum cups and air dried. The weight was recorded in function of time and plotted as weight percentage (Figure 12). The replacement solvents clearly show a much slower rate of evaporation than the solvent blend and toluene formulation. The solid base of the adhesive formulation is 22.5%. After 72 h, all formulations were completely dried. The drying times do not reflect a real-use situation since the adhesives were not applied and dried as a film. At that time, a satisfactory one-one-one replacement could not be found. Price, odor, or evaporation speed were the limiting factors.

Benchmark of TMO
The SUSSOL software can also be used as a benchmark for new solvents. For example, tetramethyl oxolane (TMO) or 2,2,5,5-tetramethyltetrahydrofuran (TMTHF) is presented as a renewable substitute for toluene amongst others [59]. The Hansen solubility parameters, however, are not very close to the values for toluene (Hansen distance of 5.3 MPa 1/2 ), which makes the solvent a possible substitute for a range of other solvents. Using the clustering analysis in the software, similar (nonbenign) solvents can easily be detected. Not all physical properties in our dataset are available for TMO. The clustering is performed with a dataset containing 13 properties (instead of 22).
From the clustering, it can be concluded that TMO is a valid substitute for toluene and similar aromatic compounds, as published by Byrne et al. [59]. Toluene, benzene, and other alkyl substituted aromatic compounds, such as xylenes and propylbenzene, are significant neighbors of TMO. In Table  4, a selection of the candidates list is presented. All solvents within a distance of 6 MPa 1/2 in the Hansen space that have a high score (>5) for safety, health, or environment are included. Amines were removed from the list because the use of amines is usually related to their basicity. At that time, a satisfactory one-one-one replacement could not be found. Price, odor, or evaporation speed were the limiting factors.

Benchmark of TMO
The SUSSOL software can also be used as a benchmark for new solvents. For example, tetramethyl oxolane (TMO) or 2,2,5,5-tetramethyltetrahydrofuran (TMTHF) is presented as a renewable substitute for toluene amongst others [59]. The Hansen solubility parameters, however, are not very close to the values for toluene (Hansen distance of 5.3 MPa 1/2 ), which makes the solvent a possible substitute for a range of other solvents. Using the clustering analysis in the software, similar (nonbenign) solvents can easily be detected. Not all physical properties in our dataset are available for TMO. The clustering is performed with a dataset containing 13 properties (instead of 22).
From the clustering, it can be concluded that TMO is a valid substitute for toluene and similar aromatic compounds, as published by Byrne et al. [59]. Toluene, benzene, and other alkyl substituted aromatic compounds, such as xylenes and propylbenzene, are significant neighbors of TMO. In Table 4, a selection of the candidates list is presented. All solvents within a distance of 6 MPa 1/2 in the Hansen space that have a high score (>5) for safety, health, or environment are included. Amines were removed from the list because the use of amines is usually related to their basicity. Common hydrocarbon solvents such as n-hexane and traditional ethers such as diethyl ether are also present in the candidates list generated by the software. Additionally, esters and ketones containing 5-8 carbon atoms are clustered together with TMO. More interestingly, some halogenated solvents are replaceable with TMO. For example, 1,1-Dichloroethene, 1,1-dichloroethane, 1,1,1-trichloroethane, and 1,1-dichloropropane are all significant neighbors of TMO. All of these solvents score very high on health and/or environmental impact. In addition, ethylene glycol diethyl ether (1,2-diethoxyethane), which scores a 9 for health could be substituted with TMO. Certainly, the substitution should be tested for each case individually, depending on the application, however, using the software like this supports the search for possible applications where TMO could be a successful substitute and accordingly, could aid solvent producers and distributors.

Materials and Methods
In the following section, the software is described in detail. Algorithms and underlying principles are discussed comprehensively. A version of the software with limited functionality and a small dataset is available on GitHub [48]. The implementation of the SOM algorithm in Java can be found in a GitHub repository [60] and the MDS algorithm used in SUSSOL is available at the website of the University of Konstanz [61].

Sustainable Solvents
A lot has been published on "how green is a solvent" [34]. Furthermore, every industry or company has its own priorities. Therefore, it was decided to integrate the CHEM21 approach [33] in our software. This approach is robust since it requires less data, and it is a recognized methodology. The CHEM21 methodology assigns scores for safety, health, and environment criteria, aligned with the Global Harmonized System (GHS) and European regulations. Every criterion (SH&E) is assigned a score between 1 and 10, 10 representing the highest hazard in each category. A color code is associated with the scores: green for 1-3, yellow for 4-6, and red for 7-10. The combination of the three scores determines the final ranking of the solvent: hazardous, problematic, or recommended.
The safety score is assigned based on the flash point and boiling point. Penalty points are assigned for a low autoignition temperature, a high resistivity, or the ability to form peroxides. A solvent with a high energy of decomposition has a score of 10.
The health score is determined by the compounds H3xx statements. More severe H3xx statements yield higher scores. A penalty point is added to the score for low boiling point (<85 • C) solvents. Solvents with incomplete toxicological data are assigned a score of 5.
The score for environmental impact is based on the solvent's boiling point, the H4xx statements, and the REACH status. An ideal boiling range between 70 and 139 • C is proposed. Solvents with lower boiling points are more likely to give rise to emissions, but a high boiling solvent is more difficult to recycle. Solvents with incomplete environment toxicity data are assigned a score of 5. Solvents with the H420 statement (ozone layer hazard) have a score of 10 by default.
The scores for SH&E and the corresponding color codes are self-explanatory and provide the user of our software with a reliable starting point to decide which solvents to further look into. Whenever the user selects a solvent in the graphical user interface (GUI), the final ranking and the scores for SH&E are shown. It is up to the user to base their decision on comprehensive data on toxicity, regulatory issues, cost, and availability and priorities within the industry. The only drawback in using the CHEM21 methodology [33] in the general-purpose solvent guide subject of this publication is its clear focus on pharmaceutical companies. The importance of the boiling point in the score for environmental impact is inconvenient for applications that do not require the solvent to be separated via distillation (e.g., paints, coatings, and cleaning agents).

Dataset
Common organic solvents were included together with, "green," neoteric, and biobased solvents. A solvent distributor and several solvent users (a paint and coatings producer, a producer of industrial cleaning agents, a pharmaceutical company, and a fine chemicals company) were asked which solvents should be incorporated in the dataset. This produced a list of about 200 "common" organic solvents which was complemented with "new" and alternative solvents. Gasses, supercritical fluids, ionic liquids, and deep eutectic solvents were excluded from the dataset.
The present dataset contains 500 solvents. A list of the compounds in the dataset is available in the Supporting Information. An overview of functional groups present in the dataset is depicted in Figure 13. A solvent may belong to multiple classes, as some solvents contain multiple functional groups (e.g., diacetone alcohol). Alcohols, alkenes, esters, ethers, aromatic compounds, and halogenated compounds together account for 70% of the total functional group count (698).
Specifically, 25 Solvents in the dataset are solvents in the SVHC list and 54 solvents are in the SIN list (Table 5). Following the CHEM21 ranking of solvents with a green, amber, or red color, 349 solvents in the dataset have a green safety score, 276 solvents have a green health score, and only 65 solvents have a green environmental score. This is due to the importance of the boiling point in assigning the environment score. Solvents with a boiling point lower than 70 • C or above 139 • C are scored amber or red [33]. In total, 48 solvents have a boiling point below 70 • C and 303 solvents have a boiling point above 139 • C. Overall, 118 solvents in the dataset are ranked as "recommended" according to the CHEM21 methodology. Molecules 2020, 25, x FOR PEER REVIEW 18 of 28  (Table 5). Following the CHEM21 ranking of solvents with a green, amber, or red color, 349 solvents in the dataset have a green safety score, 276 solvents have a green health score, and only 65 solvents have a green environmental score. This is due to the importance of the boiling point in assigning the environment score. Solvents with a boiling point lower than 70 °C or above 139 °C are scored amber or red [33]. In total, 48 solvents have a boiling point below 70 °C and 303 solvents have a boiling point above 139 °C. Overall, 118 solvents in the dataset are ranked as "recommended" according to the CHEM21 methodology. Table 5. Solvents in the dataset mentioned in Substitute It Now (SIN)-list and/or substances of very high concern (SVHC)-list.  Table 5. Solvents in the dataset mentioned in Substitute It Now (SIN)-list and/or substances of very high concern (SVHC)-list. Every solvent in the dataset is described by 22 physical properties (Table 6). Sources used to compile the dataset include supplier and producer information (Sigma Aldrich [62], Acros Organics [63], Fisher Scientific [64], Merck Millipore [65], VWR [66], Huntsman [67], Dow [68], BASF [69], Shell [70], Solvay [71], and Total [72]), online databases (PubChem [73], ChemSpider [74], Chemicalbook [75], Molbase [76], and ECHA [77]), commercially available database programs (Chemtec [78] and HSPiP [79]), and handbooks (CRC Handbook of Chemistry and Physics [80] and Sustainable Solvents RSC Green Chemistry Series [81]). The solvent dataset is loaded in the software in the form of a .csv file and thus provides the necessary flexibility to the user. Solvents (rows) can be added or removed as well as physical properties (columns). The .csv format allows to export/import solvent subsets to/from other software packages, such as HSPiP [79].

Multidimensional Scaling (MDS) Plot
MDS uses a square symmetrical matrix of the Euclidean distances between all solvents, called the similarity matrix. This matrix is then compared to the original full dataset by evaluating a stress function. Stress is a goodness-of-fit measure, based on differences between predicted and actual distances [51,52]. The coordinates in 2D are then adjusted, if necessary, to reduce stress.
The stress function is defined as: In Equation (2), d ij refers to the Euclidean distance, across all dimensions, between solvents i and j on the map, f(x ij ) is a function of the input data, and scale refers to a constant scaling factor, used to keep stress values between 0 and 1. When the MDS map perfectly reproduces the input data, f(x ij ) − d ij is 0 for all i and j, so stress is zero. The smaller the stress, the better the representation or dimensionality reduction. The transformation of the input values f(x ij ) depends on whether metric or non-metric scaling is used. In metric scaling, which applies to the solvent data set, f(x ij ) = x ij .
For some datasets, it is easy to define the x and y axes of the resulting 2D map, e.g., for distances between cities. For high-dimensional datasets, the dimensions are abstracted, and that is why, the axes in our MDS plot are undefined.

Filter Module
The user can explore the solvent dataset by selecting individual solvents in the MDS plot. In an associated filter module, it is possible to define ranges for the physical properties, thus obtaining a set of solvents that meet the user's criteria. Criteria can be combined, resulting in even more fine-grained filters. This allows the user to conveniently create subsets of solvents with desirable properties for the user's intended application.

Self-Organizing Map
For solvent substitution, we apply a clustering algorithm, the Self-organizing Map (SOM) of Kohonen [49,50]. The SOM is a neural network, based on neurophysiological research by Hubel and Wiesel [82], who established the existence of ocular dominance columns in the visual cortex of a cat. Ocular dominance columns are stripes of neurons in the visual cortex of certain mammals, including humans, that respond preferentially to input from one eye or the other and to lines with a specific orientation.
The SOM can be seen as an artificial implementation of this neuroanatomical structure, consisting of a comparable grid of artificial neurons. It has been used extensively as a visualization tool in exploratory data analysis. It has had plenty of practical applications ranging from industrial process control and finance analyses to the management of very large document collections.
Initially, we tested several other clustering algorithms (K-Means, Canopy, DBSCAN, COBWEB, and Expectancy Maximization) [83] on small datasets to be able to choose the most suited one, from a chemical viewpoint. The datasets were compiled in such a way that several obvious clusters were present in the data. For example, a dataset with 3 short-chain alcohols, 3 alkanes, and 3 ketones was used. Using these datasets, the resulting clusterings could easily be evaluated. The SOM algorithm produced the most chemically relevant clusterings.
Regardless of the choice of algorithm, in AI learning, a specific procedure is followed, see Figure 14. Solvent data has to be entered in the dataset and cleaned according to a specific syntax. The solvent data is then offered to the algorithm for clustering. Data for clustering only include the physical properties (numeric) data from the dataset. The SH&E scores are not used for clustering but are assigned to the solvent succeeding the clustering. The resulting model must be validated both chemically and statistically. If the model is not satisfactory, the cycle repeats. The dataset can be Solvent data has to be entered in the dataset and cleaned according to a specific syntax. The solvent data is then offered to the algorithm for clustering. Data for clustering only include the physical properties (numeric) data from the dataset. The SH&E scores are not used for clustering but are assigned to the solvent succeeding the clustering. The resulting model must be validated both chemically and statistically. If the model is not satisfactory, the cycle repeats. The dataset can be adjusted to contain more or less solvents or to give more or less weight to specific physical properties.
Three hyperparameters highly influence the behavior of the SOM: learning rate, width, and height. The learning rate can be adjusted to change how coarse learning proceeds. By changing the width and height of the map, the total number of neurons is increased or decreased. When finally the model has converged, it can be used to look for alternative, comparable solvents.
During training, all solvents are offered to the SOM grid and allocated to a specific column or neuron. Comparable solvents are grouped together around 1 specific neuron, which results in clustering. In pseudocode, this looks like: Input: a set of solvents, X = {x 1 , ..., x n } Output: a set of clusters, Y = {y 1 , ..., y n } Begin Initialize Y = {y 1 , ..., y n } randomly Repeat select x ∈ X randomly Find y * such that d(x,y * ) = min{d(x,y)| y ∈ Y} For all y ∈ N(y * ) do y = y + γ (x− y) Reduce learning rate γ Until termination condition is true End N(y*) = the neighborhood of y* γ = the learning rate which determines how coarse learning proceeds

Statistical Postprocessing
Results of neural network learning are always subject to some variability due to the sensitivity to initial conditions, to convergence to local minima, and, sometimes more dramatically, to sampling variability [84]. Consequently, the content of clusters (groups of solvents that are similar) may vary for each run. To illustrate this, 250 runs were performed on a small solvent set (57 solvents) containing all solvents in the 500 solvents dataset within a distance of 4 MPa 1/2 (according to the Seymour rule of thumb [58]) from d-Limonene in the Hansen solvent space. In Table 7, the solvents present in the same cluster as d-Limonene are listed for the first 10 runs.
The results may vary for each run, but it is clear that some solvents are more often in the "limonene cluster" than others. Decalin is 6 out of 10 times present in the same cluster as limonene, whereas other solvents are clustered only once together with limonene. If the clustering analysis is performed a sufficient number of times, it is possible to do a statistical postprocessing analysis and determine which solvent pairs are statistically significant clustered in the same cluster. Further, it is explained what "a sufficient number of times" might be. Table 7. Contents of "limonene cluster" for first 10 runs. Count  1  2  3  4  5  6  7  8  9  10 cis-Decalin

Run Number
To implement this assessment, de Bodt et al. [84] introduced the objective measure of neighborhood.
Being neighbors within radius r means that two solvents x i and x j are within the same cluster or in neighboring clusters within the radius. If the radius r is 0 and NEIGH i,j = 1, the two solvents are projected in the same cluster. If r > 0, also surrounding clusters are taken into account. In this work, we adopt r = 0 for all assessments. Superscript b in Equation (3) signifies that the result for the NEIGH i,j value is obtained for a specific clustering (simulation or run) b; of course, it may happen that two solvents x i and x j are neighbors after one simulation and not after another as illustrated in Table 7; this is exactly what we are looking for to evaluate.
As a result of Equation (3), the neighborhood value NEIGH i,j for the solvent pair x i , x j is 0 if they are not in the same cluster and 1 if they are.
Next, a stability function (STAB i,j ) is defined [84] as the average of NEIGH i,j over all simulations where B is the total number of simulations. The STAB i,j function is a measure to evaluate if it is meaningful that a solvent pair x i , x j is clustered within the same cluster. To assess this, we check if NEIGH i,j always has the same value (0 or 1) over all simulations. A perfect stability would lead STAB i,j to be 0 (x i and x j are never neighbors) or 1 (they are always neighbors).
In practice, STAB i,j is a value between 0 and 1. The higher the STAB i,j value, the more certain we are that solvents x i and x j belong to the same cluster and thus, exhibit similar physical properties. As an example, a table with STAB values for the previous clustering of limonene is included (Table 8). In 36% of the runs, α-pinene is present in the same cluster as limonene. α-Pinene is a C10 terpene just like limonene, hence it is the solvent which has the highest neighbor count in the dataset after 250 runs.
Cyclic hydrocarbons (mostly substituted cyclohexanes) also appear to be highly similar to limonene. In addition, some aromatic solvents seem quite alike. o-Diethylbenzene is clustered together with limonene 48 out of 250 times. Finally, it is possible to study the significance of the neighborhood for a specific pair of solvents by comparing STAB i,j to the value it would have if the solvents fell in the same cluster in a completely random way (unorganized map). If this specific pair of solvents only falls in the same cluster by chance, then the STAB i,j statistic will be approximately equal to the value it would take in unorganized maps. Comparing the STAB i,j values in these two situations (organized and unorganized maps) is thus a way to make possible the use of a conventional statistical test to check if the STAB i,j statistic in the organized case is significant or not. If U is the total number of possible clusters (which corresponds to the total number of centroids in the SOM) and ν is the size of the neighborhood (which can be calculated from the radius r by ν = (2r + 1) 2 , but is always 1 when r = 0), the probability of x i and x j being neighbor in the random case is ν/U. With the probability of success being ν/U (in this case, success means x i and x j are neighbors), the number Y of successes in B simulations is distributed as a binomial distribution, with parameters B and ν/U. This allows to test the H 0 hypothesis: x i and x j are random neighbors against H 1 : the fact that x i and x j are (not) neighbors is meaningful. An approximated Gaussian distribution can be used if the conditions Bν/U > 10 and B(1 -ν/U) > 10 are met. Since we use r = 0 (ν = 1), this implies that the number of runs should be at least a factor of 10 higher than the number of centroids in the SOM.
Consequently, if Y is less than B ν U − 1.96 B ν U 1 − ν U or greater than B ν U + 1.96 B ν U 1 − ν U , H 0 is rejected, and we can conclude with a confidence level of 5% that the fact that x i and x j are (not) neighbors is meaningful. If the STAB i,j value is close to 0, x i and x j are significant not neighbors and if STAB i,j is close to 1, x i and x j are significant neighbors. For the analysis in Table 8, 25 centroids were used in the SOM and 250 runs were performed. The upper and lower limit are thus 3.9 and 16.1. Consequently, all solvents with a neighbor count higher than 16 are statistically neighbors of limonene.
This statistical analysis is automated in the software and serves as a postprocessing module. When searching for an alternative for solvent x, the significance of all solvent pairs with solvent x is checked. In this way, we generate a candidate list with all statistically significant substitution candidates for solvent x. The candidate list for d-Limonene is showed in Table 8. From this list, it can be concluded that limonene is a good substitute for longer chain hydrocarbon solvents (C9-C16) and alkylbenzenes, such as isopropylbenzene or the Shellsol A100 solvent, which is a C9-C10 aromatic hydrocarbon solvent.

Conclusions
All-purpose, fully automated, and AI based solvent selection and substitution software has been developed in this study. The SUSSOL software makes intuitive sense, and in most case studies, the software confirms the findings in literature, thus providing a sound platform for selecting the most sustainable solvent candidate. SUSSOL has also been tested in a real-life industrially relevant problem. The clustering and consecutive statistical analysis in software provide an easy-to-use approach, which is tunable for individual applications. Due to the extensive dataset and clustering principle, SUSSOL can serve as a starting point and a source of inspiration in the search for alternative solvents, especially to nonexperts. The software is complementary with other software packages such as HSPiP, since it is easy to make subsets and export them in the .csv format. The dataset is easily extendable with other physical properties and/or solvents.