Chemometrics for Selection, Prediction, and Classification of Sustainable Solutions for Green Chemistry—A Review

Abstract: In this review, we present the applications of chemometric techniques for green and sustainable chemistry. Techniques such as cluster analysis, principal component analysis, artificial neural networks, and multivariate ranking techniques are applied for dealing with missing data, for grouping or classification purposes, and for the selection of green materials or processes. The main areas of application are finding sustainable solutions in terms of solvents, reagents, processes, or process conditions. Another important area is filling the data gaps in datasets to characterize sustainable options more fully. This is significant because many experiments are avoided while the results are obtained with good approximation. Multivariate statistics are tools that support the application of quantitative structure–property relationships, a widely applied technique in green chemistry.


Introduction
The term "chemometrics" was coined by the Swedish scientist Svante Wold in the early 1970s while submitting a grant proposal for the application of statistical methods to chemical data [1]. It appeared as the word "kemometri," a combination of the forms "kemo-" for chemistry and "-metri" for measure [2].
Initially, chemometrics was defined as a "science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods." As the name suggests, the discipline originated in chemistry, where some of the first applications focused on improving the quantitative performance of analytical instruments, such as NIR (near infrared) calibration, HPLC (high-performance liquid chromatography) resolution, and UV-VIS deconvolution [3]. Chemometrics has since taken the form of an interdisciplinary field that uses mathematical and statistical methods to design or select optimal measurement procedures and experiments and to extract maximum chemical information by analysing chemical data. The numerous domains covered by chemometrics are presented by Santos et al. on a bibliometric map generated from the most frequently repeated words in the authors' search of the Science Citation Index Expanded for the period 2014-2018 [4]. However, the breakthrough in chemometrics came in response to the appearance of new software and new high-dimensional hyphenated equipment. In chromatography, these devices have allowed for the determination of various analytes in complex matrices with high resolution and precision. On the other hand, the results, obtained as large datasets, have become more difficult to interpret.
Due to rapid technological advances, the focus has visibly shifted towards multivariate methods. Considering multiple variables simultaneously provides more information than could be obtained by considering each variable individually, and meaningful information may then be extracted chemometrically. As mentioned above, chemometrics is a very important issue in fields

The Outline of Chemometric Tools
Chemometric tools may be divided into two groups: qualitative and quantitative methods. The first group is dedicated to solving problems of classification and pattern recognition. In other words, these methods allow for assigning an individual sample to a given group of samples or finding a sorting pattern in the underlying data structure of a set [5]. They follow two philosophies, which divide them into unsupervised and supervised methods. The aim of unsupervised methods is to reveal the underlying data structure without the potential bias of knowing the group memberships beforehand. Supervised methods, on the other hand, aim to produce the best possible separation of the groups and therefore maximize the capability of the classification method to predict the class membership of samples with unknown membership. Accordingly, it is worth bearing in mind that, depending on the problem, one group of methods may be better suited for a given purpose. However, because the choice is not always unambiguous, several chemometric tools are sometimes applied. Quantitative methods are used to find the connection between the detected signals and the exact concentration values. Modern analytical devices generate huge datasets with thousands of spectral data points (from Fourier transform infrared/near-infrared, mass spectrometry, nuclear magnetic resonance, etc.); therefore, such a correlation is very often unclear and difficult to find. Quantitative analysis is based on regression techniques, whose concept involves exploring a connection (linear or nonlinear) between one or several independent variables and one (or more, but usually one) dependent variable. The simplest case, with only one dependent and one independent variable, is univariate regression. However, sometimes, as in analytical chemistry problems, the situation is more complicated, including a greater number of dependent variables [6].
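The univariate case described above can be illustrated with a short sketch. It is not taken from any of the cited works: the function names and the calibration data are hypothetical, and the example simply fits a straight line by ordinary least squares and then inverts it to estimate an unknown concentration from a measured signal.

```python
from statistics import mean


def fit_univariate(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_concentration(signal, slope, intercept):
    """Invert the calibration line to estimate a concentration from a signal."""
    return (signal - intercept) / slope


# Hypothetical calibration standards: concentration (mg/L) vs. detector signal.
concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
signals = [0.1, 2.1, 4.1, 6.1, 8.1]

slope, intercept = fit_univariate(concentrations, signals)
unknown = predict_concentration(5.1, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, unknown={unknown:.3f}")
```

With thousands of spectral variables, a single predictor is no longer sufficient, which is where multivariate regression techniques such as MLR, PCR, and PLS come in.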
Taking the above into account, the selection of an appropriate chemometric tool is dictated by the purpose of the analysis and the characteristics of a given problem. Moreover, obtaining satisfactory results may require the use of several tools. The most commonly used chemometric tools in chemical analysis are briefly described below [7].
The most commonly used chemometric tools in chemical sciences are principal component analysis (PCA) [8,9] and cluster analysis (CA) [6,10]. These unsupervised techniques are very often applied for reducing the dimension of the original data [11], finding internal patterns in the dataset [12,13], or discovering the dominant factors [14,15]. In element classification, supervised techniques such as linear discriminant analysis (LDA) [16] and partial least squares (PLS) [17,18] are very popular. However, they may also be used for prediction [19,20]. Examples of regression algorithms, which may be similar to each other, are multiple linear regression (MLR) [21] and principal component regression (PCR) [22]. They are mainly used in data analysis for finding the relationship among variables that affect the prediction of variable values (e.g., chemical compounds' properties). Nevertheless, the most widely used prediction tools are mathematical models from the quantitative structure-activity relationship (QSAR) family [23,24]. They allow for finding the physicochemical, biological, and environmental fate properties of compounds (new and existing) from knowledge of their chemical structure, without animal use in, for example, toxicological testing. Nowadays, artificial neural networks (ANN) and genetic algorithms (GA) are gaining more attention in the field of chemical sciences for identifying patterns in data, even complex ones. This is due to their structures and mechanisms, because both of them are modelled on processes in nature, namely, equivalents of genes and chromosomes in GA [25] and the biological (human or animal) central nervous system (including neurons) in ANN [26]. They can be successfully used separately [27] or, often, as a combined tool [28,29]. It is worth noting that these are not all of the techniques that may be used for this purpose.
Other approaches, for instance, sum of ranking differences (SRD) [30], the k-nearest neighbours (KNN) method [31], and support vector machine (SVM) [32,33], may also be successfully applied for alternative data treatment in the context of green chemistry. Details of the mentioned chemometric techniques are described elsewhere (some references are given in brackets); therefore, they are not fully described in this review.
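To make the dimension-reduction role of PCA more concrete, the following is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the procedure of any cited study: the data are autoscaled, the correlation matrix is formed, and only the first principal component is extracted by power iteration (full PCA implementations normally use an eigenvalue or singular value decomposition). The function names and the solvent values are hypothetical.

```python
import math


def autoscale(rows):
    """Centre each column and scale it to unit (sample) standard deviation."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(m)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]


def first_principal_component(rows, iterations=500):
    """Return (loadings, explained variance fraction, scores) for PC1."""
    z = autoscale(rows)
    n, m = len(z), len(z[0])
    # Correlation matrix of the autoscaled data.
    corr = [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    # Power iteration; the uneven start vector is an arbitrary choice and
    # would fail only if it were exactly orthogonal to the first component.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)
    )
    scores = [sum(r[j] * v[j] for j in range(m)) for r in z]
    # The trace of a correlation matrix equals the number of variables m.
    return v, eigenvalue / m, scores


# Hypothetical data: rows are solvents; columns are boiling point (deg C),
# viscosity (mPa s), and logKOW. The values are illustrative only.
solvents = [
    [56.0, 0.31, -0.24],
    [78.0, 1.07, -0.31],
    [111.0, 0.56, 2.73],
    [189.0, 1.99, -1.35],
    [290.0, 934.0, -1.76],
]
loadings, explained, scores = first_principal_component(solvents)
print("PC1 loadings:", [round(x, 3) for x in loadings])
print("explained variance fraction:", round(explained, 3))
```

The scores place each object on the new axis, while the loadings show which original variables dominate it; this is the sense in which PCA is used to discover dominant factors.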

Selection
The problem of selection can be related to the solvents and other chemical reagents (for instance, derivatization agents) used in operations, such as extraction, clean-up, and derivatization. In these cases, the selection of appropriate solvents and chemical reagents for additional chemical activities is extremely important to obtain satisfactory results. Nevertheless, it is worth looking for substitutes for those chemicals mentioned above that are less hazardous to the environment, which correspond to the 5th and 8th of the 12 principles of green chemistry for solvents and derivatization agents, respectively. Considering the above, it is not surprising that the selection of appropriate chemical reagents is a topic of interest in chemometrics.
An approach for fast selection of solvents for a given industrial application with the use of chemometric tools is proposed by García et al. [34]. First, the QSPR (quantitative structure-property relationship) model is developed to find the relationship between the molecular structure and some fundamental solvent properties. Then MLR (multiple linear regression) and PLS (partial least squares) are used for the selection of 62 glycerol-based solvents with respect to three solvent features: the behaviour of the dissolution processes (solvatochromic parameter E_T^N), mechanical aspects (viscosity), and volatility aspects (closely related to safety, toxicity, and air pollution, considered through the boiling point). A comparison of the applied chemometric tools shows that both give good results for the E_T^N solvation parameter. MLR is appropriate only for E_T^N, whereas PLS offers better fitting of two of the three properties considered simultaneously. Viscosity and boiling point do not fit well enough to lead to a fully predictive model; however, PLS provides a higher value of the determination coefficient for boiling point. A solvent selection system based on a combination of chemometrics and multicriteria decision analysis is proposed by Tobiszewski et al. in line with the concept of green chemistry [35]. CA (cluster analysis), together with the TOPSIS (the technique for order of preference by similarity to ideal solution) algorithm, allows for, first, grouping and then ranking within groups of 151 solvents with respect to physicochemical, toxicological, and hazard parameters. Three clusters, as presented in Figure 1, are obtained: nonpolar and volatile (35 solvents), nonpolar and sparingly volatile (35 solvents), and polar (81 solvents).
The results are compared with other SSGs (solvent selection guides) developed by Pfizer [36], GlaxoSmithKline [37], AstraZeneca [38], Sanofi [39], and CHEM21 [40], which are well known in the pharmaceutical industry, confirming a general agreement of solvent rankings within each cluster. Similar results were recently presented by Sels et al. with the application of MDS (multidimensional scaling) [41]. Solvents were assigned to three groups based on their 22 physical properties according to safety, health, and environment scores: polar compounds, slightly water-soluble solvents, and hydrophobic solvents. In the MDS visualization, solvents that were similar were plotted closer together in the 2D solvent space. However, it was noted that the relative influence of a functional group decreased with increasing chain length and molecular size. Consequently, a straight line in the MDS visualization was not visible for the homologous series of alcohols (due to a drastic increase in boiling point and decrease in water solubility, vapour pressure, and relative evaporation rate). Moreover, the application of SUSSOL (Sustainable Solvents Selection and Substitution Software), software specially created using artificial intelligence (AI), is presented for finding solvent replacements for N-methylpyrrolidone (NMP), toluene, and tetramethyl oxolane (TMO). The proposed alternative solvents are as follows: 10 candidate alternative solvents (including dimethyl sulfoxide, Cyrene, N-butyl pyrrolidone, pyridine, acetone, methyl acetoacetate, 1-ethyl pyrrolidone, dimethylacetamide, dimethylformamide, nicotine) for NMP; isobutylbenzene and p-cymene for toluene; and toluene, 1,1-dichloroethene, 1,1-dichloroethane, 1,1,1-trichloroethane, 1,1-dichloropropane, ethylene glycol diethyl ether (1,2-diethoxyethane), and so forth for TMO.
An example of visualization dedicated to possible alternatives for NMP by SUSSOL software is presented in Figure 2.
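The ranking step of the TOPSIS algorithm mentioned above for solvent selection can be sketched as follows. This is a generic textbook form of TOPSIS (vector normalization, weighted distances to the ideal and anti-ideal points), not the exact implementation of ref. [35]; the solvent names, criteria, values, and equal weights are hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) by relative closeness to the ideal solution.

    matrix  : criterion values, one row per alternative
    weights : one weight per criterion
    benefit : True if a larger value is better for that criterion
    """
    m = len(matrix[0])
    # Step 1: vector-normalise each criterion column and apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    v = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    # Step 2: ideal and anti-ideal points, criterion by criterion.
    best = [
        max(r[j] for r in v) if benefit[j] else min(r[j] for r in v)
        for j in range(m)
    ]
    worst = [
        min(r[j] for r in v) if benefit[j] else max(r[j] for r in v)
        for j in range(m)
    ]
    # Step 3: relative closeness (1 = ideal, 0 = anti-ideal). Identical
    # alternatives would give a zero denominator and are not handled here.
    scores = []
    for r in v:
        d_best = math.dist(r, best)
        d_worst = math.dist(r, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical solvents from one cluster, described by a toxicity score
# (lower is better), flash point in deg C (higher is better), and vapour
# pressure in kPa (lower is better). Values are illustrative only.
names = ["solvent_A", "solvent_B", "solvent_C"]
data = [
    [3.0, 12.0, 24.0],
    [1.0, 95.0, 0.1],
    [2.0, 40.0, 5.8],
]
scores = topsis(data, weights=[1 / 3, 1 / 3, 1 / 3], benefit=[False, True, False])
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Applying such a ranking within clusters rather than across the whole dataset keeps the comparison between solvents of similar character, which is the motivation for combining CA with TOPSIS.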
A screening of potential PBT (persistent, bioaccumulative, and toxic) compounds (in an environment based on persistence, bioconcentration, and toxicity data) is another example of chemical selection, but different from solvents [42]. PCA is used to group chemicals representing many classes of pollutants of various chemical structures, such as dioxins, PCBs, PAHs, and pesticides, and various industrial chemicals according to their potential cumulative PBT behaviour. However, due to unavailability of experimental data, an approach combining multivariate analysis and QSAR/QSPR (quantitative structure-activity relationship) was applied, which allowed for the reduction of data gaps in the dataset. The strength of the approach is validated in two sequential steps: first, performed on the available experimental dataset, including 54 chemicals, and then performed on the dataset of 180 chemicals (developed by QSPR). In Figure 3, the analysis of the latter dataset of organic compounds using PCA is presented. According to PBT index values, chemicals are grouped into three regions: region 1-not PBT chemicals, region 2-chemicals with medium PBT properties, and region 3-PBT and vPvB (very persistent and very bioaccumulative) chemicals.

Classification
Classification as a systematic arrangement in groups or categories according to established criteria is sometimes very useful in designing a chemical process or reaction. It allows for recognizing some alternatives with corresponding characterization.
Translating the principle similia similibus solvuntur into the field of chemistry means that solvents belonging to the same group demonstrate similar abilities to dissolve compounds. Therefore, chemometric classification of solvents according to the degree of polarity may provide information about possible substitutes. This kind of grouping of organic solvents is one of the frequently undertaken problems in chemometrics, as summarized in Table 1. Interestingly, these classifications are carried out for various objects (types of solvents) using different chemometric tools, for instance, PCA, KNN (k-nearest neighbours method), Parker-Reichardt classification, CP-ANN (counter-propagation artificial neural network), ANN (artificial neural network), and CA, obtaining similar results. An example may be the study performed by Dutkiewicz [44] using the Parker-Reichardt classification, whose results correspond closely to those obtained by a more complex multivariate statistical method presented by M. Chastrette et al. [43]. Moreover, there are applications in which several tools are applied. The idea is to improve the results of classification, for instance, by making them more chemically interpretable, as in organic solvent classification based on molecular descriptors (theoretical descriptions of the molecular structure), where KNN application is followed by CP-ANN [46].
One of the latest works considers a classification of 72 solvents according to polarity and selectivity based on the Snyder approach (related to different polar interactions), performed using FCM (fuzzy c-means) and FLDA (fuzzy linear discriminant analysis) [49]. The fuzzy chemometric techniques used prove to be highly efficient and informative methods for solvent characterization and classification (an approach for the rational choice of a good solvent). The obtained results (division into eight groups of solvents) are in good agreement with the Snyder classification, especially using FLDA (the highest value of 100% for the solvents corresponding to groups II and V and the lowest value of 66.67% for the solvents of group I).
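The idea behind fuzzy c-means, in which every solvent receives a degree of membership in each group instead of a single hard label, can be shown with a minimal sketch. It assumes the standard FCM update rules with fuzzifier m = 2, a fixed number of iterations, and user-supplied starting centres; it does not reproduce the settings of ref. [49], and the two-descriptor data are hypothetical.

```python
import math


def memberships(points, centres, m):
    """Membership of every point in every cluster; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for p in points:
        # Clamp distances so a point lying on a centre does not divide by zero.
        d = [max(math.dist(p, c), 1e-12) for c in centres]
        table.append([1.0 / sum((dk / dj) ** exponent for dj in d) for dk in d])
    return table


def fuzzy_c_means(points, centres, m=2.0, iterations=100):
    """Alternate membership and centre updates for a fixed number of steps."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships(points, centres, m)
        centres = []
        for k in range(len(u[0])):
            # Each point contributes to a centre with weight u_ik ** m.
            w = [row[k] ** m for row in u]
            total = sum(w)
            centres.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dims)]
            )
    return centres, memberships(points, centres, m)


# Hypothetical two-descriptor data (e.g., a polarity and a selectivity
# value) for six solvents; values are illustrative only.
points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
centres, u = fuzzy_c_means(points, centres=[points[0], points[-1]])
for p, row in zip(points, u):
    print(p, [round(x, 3) for x in row])
```

A solvent lying between two groups would show comparable membership values in both, which is the kind of information a hard classification discards.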
However, the classification does not always take into account a large number of groups/classes. Salahinejad [50] proposed a division of solvents for single-walled carbon nanotube dispersion into two groups: solvents and nonsolvents (solvents with effectively zero nanotube dispersibility). The classification is conducted separately with several tools, such as RF (random forest), SVM (support vector machine), MLP (multilayer perceptron), and QDA (quadratic discriminant analysis). According to the results of the sum of ranking differences (SRD) procedure, the RF classifier based on selected descriptors is the best classification model, while the SVM, MLP, and QDA are ranked as good models.
Moreover, another classification of solvents based on a chemical group of compounds was performed by Katritzky et al. [51] and Tobiszewski et al. [52]. In the first case, a classification of the theoretical molecular descriptors, derived from the chemical structure alone (QSPR model), according to their relevance to specific types of intermolecular interaction (including cavity formation, electrostatic polarization, dispersion, and hydrogen bonding) in liquid media is presented. According to the PCA results, 11 classes of solvents were formed: hydrocarbons; halo-hydrocarbons; saturated, unsaturated, and cyclic ethers; esters and polyesters; aldehydes, ketones, and amides; nitriles and nitro hydrocarbons; hydroxylic compounds; amines and pyridines; thiols, sulphides, sulfoxides, and thio compounds; phosphorus compounds; and compounds with vastly different chemical functionalities. In the latter case, CA and PCA were used to group around 130 potentially green organic solvents according to their similarity based on physicochemical parameters, as well as to identify variables from which missing values of properties such as bioconcentration factors and water-octanol and octanol-air partitioning constants can be predicted. The CA results show that polar solvents are divided into three major groups: (a) less volatile solvents, slightly water soluble, with high values of logKOW and logBCF (alcohols with ether functional groups, aromatic alcohols, and short-chain organic acids apart from formic and acetic); (b) less volatile and very highly water-soluble solvents (lactate esters, formic and acetic acids, glycerol, and some alcohols with other functional groups); and (c) highly volatile solvents with low boiling point and high vapour pressure and Henry's law constant ("traditional" polar solvents, like short-chain alcohols, ketones, aldehydes, and esters). On the other hand, nonpolar solvents were divided into volatile, water-nonsoluble, and slightly water-soluble solvents.
According to a chemometric analysis of the internal relationship between bioconcentration factors and physicochemical parameters, in polar solvents the variable logBCF forms a separate latent factor not directly correlated with other variables (which indicates the specific importance of this parameter as a discriminant for the dataset). In nonpolar solvents, in contrast, a relationship among logBCF, logKOW, and Henry's law constant, as well as a correlation of logKOA with a whole group of physicochemical parameters, like surface tension, density, and boiling and melting points, is visible.
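The core of the SRD procedure used above to compare classifiers can be sketched in a few lines. The sketch assumes the usual formulation: the values in each column (one column per model) are ranked, the same is done for a reference column (here the row-wise mean, unless a reference is supplied), and the absolute rank differences are summed, so that a smaller SRD means a model closer to the reference. Validation steps such as comparison of ranks with random numbers are omitted, and the model names and figures of merit are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srd(columns, reference=None):
    """Sum of absolute rank differences between each column and a reference."""
    names = list(columns)
    n = len(columns[names[0]])
    if reference is None:
        # Assumption: the row-wise mean serves as the reference.
        reference = [
            sum(columns[name][i] for name in names) / len(names)
            for i in range(n)
        ]
    ref_ranks = average_ranks(reference)
    return {
        name: sum(abs(a - b) for a, b in zip(average_ranks(columns[name]), ref_ranks))
        for name in names
    }


# Hypothetical figures of merit (rows) for three models (columns);
# the numbers are illustrative only.
performance = {
    "model_A": [0.91, 0.88, 0.93, 0.85],
    "model_B": [0.84, 0.86, 0.80, 0.83],
    "model_C": [0.89, 0.82, 0.90, 0.81],
}
for name, value in sorted(srd(performance).items(), key=lambda kv: kv[1]):
    print(name, value)
```

The result is an ordering of the models by their distance from the reference ranking, which is how one classifier can be singled out as best while others are still rated as good.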
A different approach for the classification of 259 solvents according to experimentally found and theoretically predicted physicochemical parameters, represented by 15 specific descriptors, is proposed by Nedyalkova et al. (2020) [53]. The variables include parameters such as melting point, boiling point, density, water solubility, vapour pressure, Henry's law constant, octanol-water and octanol-air partition coefficients, and bioconcentration factor, some of which are estimated with the modules of EPI Suite or from SMILES (simplified molecular input line entry system) codes. The fuzzy hierarchical clustering methods allow for checking whether the experimental values of the respective variables correspond to the calculated ones, and the partitioning procedure can determine stable groups of similarity between the variables with highly different degrees of membership. The partitioning with respect to specific descriptors divides solvents into 10 classes (some examples of solvents within each class are given in brackets): chlorinated solvents-class 1 (iodoethane, n-butyl acetate, m-cresol, diethyl carbonate, chloroform), nonpolar and volatile solvents-class 2 (bromoethane, benzonitrile, isobutyl acetate, carbon disulphide), polar and nonpolar solvents mixed-class 3 (benzene, dichloromethane, diethyl ether, triethylene glycol, polyethyleneglycol 200), polar solvents-classes 4-7 (dioctylsuccinate, oleic acid, 2-pyrrolidone, glycerol, water, 1-octanol, nitrobenzene, methyl stearate), high molecular weight polar solvents-class 8 (ethyl laurate, anisole), large group of mostly polar solvents with some exceptions-class 9 (triethylamine, ethanol, 1-butanol, formamide, toluene, o-xylene, aniline, n-heptane, d-limonene, styrene, acetone, phenol, acetonitrile), and outlier-class 10 (perfluorooctane 20). Relationships between solvents of various natures (polar, nonpolar, volatile, etc.) and the physicochemical variables are found, even though missing data for specific descriptors are filled in via theoretical calculation. Moreover, the applied chemometric techniques allow for partitioning solvents with more or less similar characteristics in terms of highest, lowest, or intermediate values of the considered descriptors.
One of the most interesting groups of solvents are ionic liquids (ILs) due to their desirable feature: solvents with particular properties (within certain ranges) can be designed by combining a selected cation and anion. Therefore, characterization of their types is very important for finding an appropriate alternative, for instance, in phases for gas chromatography. This aspect is discussed by González-Álvarez et al. in the classification of three ILs with hexacationic imidazolium, polymeric imidazolium, and phosphonium as cations and halogens, thiocyanate, boron anions, triflate, and bistriflimide as anions [54]. The application of CA, LDA (linear discriminant analysis), D-PLS (discriminant partial least squares), and MLR shows that two main groups of phases may be distinguished: ILs with acidic and with basic character. After the identification of the two natural groups of ILs by CA, the supervised chemometric techniques LDA, D-PLS, and MLR were used to construct models of pattern recognition and classification rules for ILs. All tools showed high prediction capacity and were successfully used for characterizing IL classes. The best results were obtained via LDA with >96% for classification and >92% for prediction, followed by MLR with 96.7% and 92% in the prediction for classes A and B, respectively.
In another study, 227 ionic liquids and their related salts were also classified based on their toxicities towards rat cell lines [55]. Regardless of the chemometric method used (LDA, CA, SVM (support vector machine), or CP-ANNs (counter-propagation artificial neural networks)), ILs were classified into four categories: low, moderate, high, and very high toxicity. In this study, CP-ANN turned out to be more favourable than the other methods in terms of accuracy of classification, underlining that CP-ANNs may extract actual information and knowledge from the dataset.
An interesting approach with a classification map called the Σpider diagram was proposed by Lesellier [56]. Solvents were classified based on physicochemical properties used in other visual presentations, such as the Snyder triangle, Hansen parameters, LSER (linear solvation energy relationships), Abraham descriptors, COSMO-RS (Conductor-like Screening Model for Real Solvents) parameters, and solvatochromic solvent selectivity. Visualization of the last solvent classification is presented in Figure 4. This diagram shows many advantages for solvent classification: a better view of solvents having no acidic character (for the solvatochromic solvent selectivity), easier usage due to the "flattening" of the spherical view down to a single plane (for Hansen parameters), more subtle classification due to the use of five parameters instead of three (for COSMO-RS), and a simple view of the solvent groups having similar or different properties (for Abraham descriptors). This approach may be useful not only for selecting suitable solvents for extraction, separation, or purification and for solubility studies but also for choosing greener solvents.
There are also other fields of interest apart from solvents, for instance, pharmaceutical excipients in reference to their solubility parameters [57]. PCA is used to predict the behaviour of materials in a multicomponent system (e.g., for the selection of the best materials to form stable pharmaceutical liquid mixtures or stable coating formulations). This is important because similarity between the values of the respective components of the solubility parameter allows for the estimation of the compatibility between different materials (solvents, colorants, lubricants, coating components, and powder blends).

Properties (Prediction and Correlation)
Knowledge of the physicochemical properties of compounds is necessary to predict their behaviour under various conditions or factors during chemical reactions, and their behaviour in various media or compartments in the environment (environmental fate). Therefore, this explains the need to obtain information on the solvents' and other chemical reagents' properties. Unfortunately, sometimes there are missing points in chemical characteristics. Thus, some prediction and computational methods for filling the gaps are highly required and successfully applied.
Popular examples of advanced computational modelling approaches are QSAR (quantitative structure-activity relationship) and EPI Suite (Estimation Programs Interface Suite). QSAR models allow for the prediction of the physicochemical, biological, and environmental fate properties of compounds from knowledge of their chemical structure. The concept is based on establishing quantitative relationships between descriptors (referring to the chemical structure) and the target property, which can then be used to predict the activities of novel compounds [58]. On the other hand, EPI Suite may estimate physical/chemical and environmental fate properties such as water solubility, octanol-water partition coefficient, Henry's law constant, melting point, boiling point, and aquatic toxicity, taking chemical structure as input data (depending on the chosen estimation model program) [59]. However, the easiest manner is chemical predictive modelling, which is based on the observation of patterns and correlations between variables in a dataset. In this respect, chemometric tools play an important role.
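The QSAR concept described above, a quantitative relationship between structural descriptors and a target property, can be sketched with multiple linear regression (MLR), one of the simplest model forms. The descriptor values and the property below are synthetic and noise-free, chosen only so that the fit can be checked by hand; the function names are hypothetical, and a real QSAR study would use many more compounds, validated descriptors, and external validation.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(descriptors, target):
    """Fit target = b0 + b1*d1 + b2*d2 + ... by ordinary least squares (normal equations)."""
    rows = [[1.0] + list(d) for d in descriptors]  # leading 1.0 models the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, descriptor):
    """Predict the property of a compound from its descriptor values."""
    return coeffs[0] + sum(c * d for c, d in zip(coeffs[1:], descriptor))


# Synthetic training set: two descriptors per compound (illustrative values only).
train_descriptors = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (4.0, 3.0), (5.0, 2.0)]
# Synthetic property generated as 1.0 + 0.5*d1 - 2.0*d2, without noise.
train_property = [1.5, 0.0, 0.5, -3.0, -0.5]

coeffs = fit_mlr(train_descriptors, train_property)
new_compound = (6.0, 2.0)  # descriptors of a hypothetical "novel" compound
print("coefficients:", [round(c, 6) for c in coeffs])
print("predicted property:", round(predict(coeffs, new_compound), 6))
```

Because the synthetic data are noise-free, the fit recovers the generating coefficients. With real data, the residual error and the performance on compounds outside the training set decide whether the model is usable.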
As mentioned in Section 3, the use of solvents in chemistry is one of the most important issues with respect to environmental aspects. In this context, the type of solvent and its amount are of great importance. ILs are very often described as solvents with remarkable features, such as negligible vapour pressure, high chemical and thermal stability, low flammability, large liquidus range, high ionic conductivity, large electrochemical window, excellent solvation ability for a wide range of compounds, and, most of all, the possibility of being designed for specific demands (through an appropriate selection of cation and anion). However, there are also numerous studies in which the authors draw attention to environmental problems due to poor biodegradability, toxicity, and the methods of preparation and of degradation after use [60][61][62][63][64][65]. Nevertheless, the lack of data for IL characterization in the context of greenness assessment is a serious problem. It may make the evaluation difficult and, in some sense, render flat assertions about ILs as alternative green solvents inaccurate and inappropriate [66]. Hence, a large number of studies on predicting the properties of ionic liquids have been published, as shown in Table 2.
The prediction of IL properties may be successfully conducted using different chemometric tools. This is mostly demonstrated by comparing predicted values with experimental or literature ones, as in the estimation of melting point [68] or viscosity [69]. Moreover, sometimes one technique is applied to select appropriate descriptors, and another is then used to predict a particular property. In some cases, several chemometric methods are compared, as in the examples of carbon dioxide solubility [67], electric conductivity [70], density [71], and toxicity [74].
In the first case, nonlinear models, such as RB (radial basis network) and MLP (multilayer perceptron), turned out to be more adequate when the mathematical complexity of the model is not important or high accuracy is necessary. On the other hand, MQR (multiple quadratic regression) is recommended for faster computation if the operating conditions are stable. Prediction of electric conductivity using an ANN model is more favourable than using an MLR model due to more rational nonlinear modelling. An interesting approach is presented for the last case: toxicity prediction based on molecular descriptors and EC50 concentrations for the inhibition of acetylcholinesterase using a decision tree model. Decision tree models (R = 0.992) significantly outperform other models, such as PCR (principal component regression) and PLS (R = 0.62 and 0.64), for numerical predictions of EC50 concentrations and the classification of ILs into four levels of toxicity. The visualization of this division into four classes is presented in Figure 5.
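Model comparisons of this kind rest on statistics such as the correlation coefficient R between predicted and experimental values. The following minimal sketch computes R and the root-mean-square error (RMSE) for two sets of predictions. The observed values and both prediction sets are hypothetical and are not taken from the cited studies; RMSE is added here as a common companion metric, not one reported above.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (spread_o * spread_p)


def rmse(observed, predicted):
    """Root-mean-square error of the predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical experimental values and predictions from two hypothetical models.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [1.1, 1.9, 3.2, 3.9, 5.1]  # close-fitting model
model_b = [3.0, 2.5, 2.0, 4.5, 3.0]  # poorly fitting model

for name, predicted in (("model A", model_a), ("model B", model_b)):
    print(f"{name}: R = {pearson_r(observed, predicted):.3f}, "
          f"RMSE = {rmse(observed, predicted):.3f}")
```

A high R alone does not guarantee accurate predictions, since a model can correlate well while being systematically offset, which is why an error measure such as RMSE is usually reported alongside it.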
It is not always the case that one of the models used is clearly better than the others. Very often, all or some of them lead to satisfactory results, as described by Huang et al. [71] for density prediction. ER (extended Riedel) and ANN proved to be accurate over a wide range of compositions and temperatures. However, the ER model is a better alternative because it can be used directly without any adjustable parameters or a computer-aided program. Sometimes satisfactory results may be obtained by applying several chemometric tools one after another. Barycki et al. proposed the application of PCA to define the distribution trends of four IL properties depending on their structures. CA is then used to provide more detailed information concerning the IL distribution [72]. It is also worth noting that chemometrics may be the basis for developing other tools. Based on the observed strong relationship between the variance in the observed toxicity and the cations' descriptors, a toxicity ranking index based on the structural similarity of cations (TRIC) has been developed for initial toxicity screening studies of ILs [75]. However, the applicability of TRIC is restricted: it is limited to the prediction of the toxicity endpoints used in its development.
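The grouping step of such a sequential analysis can be sketched as a simple agglomerative cluster analysis. The example assumes that each IL is already described by two numerical coordinates (for instance, scores on the first two principal components); the coordinates are hypothetical, and single linkage with Euclidean distance is just one of several possible choices, not necessarily the one used in [72].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the smallest distance between
    any pair of their members (single linkage).
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[p], points[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-dimensional scores for five ionic liquids (illustrative values only).
names = ["IL_1", "IL_2", "IL_3", "IL_4", "IL_5"]
scores = [(0.1, 0.2), (0.2, 0.1), (2.0, 2.1), (2.2, 1.9), (0.0, 0.3)]

groups = single_linkage(scores, n_clusters=2)
for group in groups:
    print([names[i] for i in group])
```

With these placeholder scores, the five objects fall into two groups of mutually similar ILs. On real data, the number of clusters and the linkage rule are usually chosen by inspecting the dendrogram.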
One of the most frequently predicted environmental parameters is toxicity, as can be seen from the trend in the prediction of IL properties summarized above [74][75][76][77]. It is expressed by different endpoints towards various organisms. Toxicity assessment is very important from the point of view of green chemistry. Some examples of studies concerning the prediction of toxicity for selected chemicals as potential pollutants are summarized in Table 3.
Based on the above studies, methods from the family of QSAR models are readily used for toxicity prediction. They achieve good results, providing more than 95% predictions for agrochemical toxicity towards Daphnia magna [86]. QSAR models are often supported by chemometrics; however, there is no dominant chemometric tool that ensures the best prediction ability. In nitrobenzene toxicity prediction, LS-SVM (least squares-support vector machines) turned out to be a more powerful method than the others [80]. The reason is that LS-SVM (with quantum chemical descriptors) drastically enhances the predictive ability of QSAR studies (prediction of IGC50 toxicity) compared with MLR and PLS.
Other parameters of great importance for the assessment of the environmental risk associated with the use of chemical compounds are the partition coefficients towards different media. They allow for the estimation of the affinity of a particular chemical compound for a selected phase system. Octanol-air or octanol-water partition coefficients may be applied as predictors of the partitioning of semivolatile organic chemicals to aerosols or of the tendency of a chemical compound to dissolve in fats, oils, lipids, and nonpolar solvents, respectively. Moreover, the value of the latter coefficient could provide information on the potential for bioaccumulation and, for persistent compounds, biomagnification [89,90]. In Table 4, a list of studies on the chemometric prediction of partition coefficients is presented.
The information summarized in Table 4 shows that combining a QSPR model with chemometric methods is common. In the estimation of the water-polydimethylsiloxane [92] and n-octanol-water [21] partition coefficients of organic compounds, the best techniques turned out to be ANN and LS-SVM, respectively. Their use results in a significant improvement in prediction quality. Two years later, Goudarzi and Goodarzi [94] conducted a prediction of the n-octanol-water partition coefficient for the same dataset of organic compounds but using different techniques, namely, MLR, PLS, and RBF-PLS (radial basis function-partial least squares). This time, RBF-PLS is considered better than the MLR and PLS models because, unlike regression analysis, it maps the selected features flexibly by implicitly manipulating their functional dependence.
An interesting approach to predicting the n-octanol-water partition coefficient of polychlorinated naphthalene (PCN) congeners is proposed by Gu et al. [95], where QSAR is combined with comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA). These two models are dedicated to 3D-QSAR approaches, where the 3D conformation of compounds has to be taken into account (enabling the exploration and visualization of structural information and the design of new compounds with particular properties). Although the results of both models show good prediction ability, the CoMSIA model is better at designing new types of compound molecules due to its higher number of descriptors. The readiness of chemicals to concentrate in organisms when the compounds are present in the environment may also be described by the bioconcentration factor (BCF). Prediction of this environmental property for some organic compounds using QSAR combined with GA-ANN (for the selection of appropriate descriptors) is proposed by Fatemi et al. [29].

Conclusions
There are various chemometric tools that can bring benefits in terms of green chemistry. Applying even the simplest and best-known techniques for dimensionality reduction and for grouping objects or variables, such as CA or PCA, may bring significant advantages. They can be used to treat missing data, so that chemical parameters are predicted without performing problematic, time-consuming, and material-demanding measurements. Even finding correlations in a dataset can give clues for the selection of proper materials. In this way, the environmental fate of chemical compounds can be estimated if the predicted data points refer to their behaviour in the environment. Reducing the number of elements in a dataset by grouping objects according to their similarities leads to a preselection of objects for more detailed studies. Selecting chemical compounds with similar characteristics by chemometric techniques is helpful in finding greener alternatives, that is, compounds that are less problematic but retain their desired features. Multivariate statistics are successfully applied in green chemistry studies, and their significance is expected to grow.