Prediction of the Solubility of CO2 in Imidazolium Ionic Liquids Based on Selective Ensemble Modeling Method

Abstract: Solubility data are among the essential basic data for CO2 capture by ionic liquids. A selective ensemble modeling method, proposed to overcome the shortcomings of current methods, was developed and applied to the prediction of the solubility of CO2 in imidazolium ionic liquids. Firstly, multiple different sub-models were established based on data, structural, and parameter diversity. Secondly, the fuzzy C-means algorithm was used to cluster the sub-models, and a collinearity detection method was adopted to eliminate the sub-models with high collinearity. Finally, the information entropy method was used to integrate the retained sub-models into the selective ensemble model. Validation of the CO2 solubility predictions against experimental data showed that the proposed ensemble model performed better than its previous alternatives, because more effective information was extracted from different angles and the diversity and accuracy of the sub-models were fully integrated. This work provides not only an effective modeling method for predicting the solubility of CO2 in ionic liquids, but also an effective method for screening ionic liquids for CO2 capture.


Introduction
With the increase of energy consumption in industrial production, reducing CO2 emissions and increasing CO2 absorption have become essential means of alleviating environmental degradation [1]. Room-temperature ionic liquids, which are relatively new compounds, have gained much attention in recent years and have the potential to replace conventional volatile organic solvents in reaction and separation processes. Information about the solubility and the rate of solubility is crucial when considering ionic liquids for potential industrial processes [2,3]. A large number of ionic liquids can be synthesized because of their special ionic structure. Because of the difficulties associated with experimental measurements and the cost of ionic liquids, it is advantageous to develop methods for predicting the phase behavior of such systems [4-6]. Modeling has therefore become an important way to obtain solubility data for CO2 in ionic liquids, and the available approaches are divided into mechanism modeling methods and data-driven modeling methods.
In order to understand and predict the phase behavior of mixtures of CO2 and ionic liquids, the Perturbed Hard Sphere Chain (PHSC) equation of state has been used to simulate CO2 absorption in a series of ionic liquids [7]. CO2 solubility in ionic liquids had been calculated based on two

Sub-Model Training
Zhou et al. [17] proposed the theory of 'many can be better than all', which presupposes that the sub-models have a high degree of diversity and accuracy. The diversity of the sub-models has an essential impact on improving the generalization performance of ensemble models [18,19]. In this work, the sub-models were established based on data, structural, and parameter diversity.

Data Diversity
Multiple data sets were generated from the original data set, based on data diversity, to train different sub-models. The data sets should differ from each other so that the trained sub-models give different results. Bootstrap aggregation (Bagging), Adaptive Boosting (AdaBoost), and the random subspace method are commonly used to achieve data diversity. Here, the bootstrap algorithm was used to generate several training sets with different attributes. When the number of re-sampling draws is large enough, about 36.8% of the samples in the given data set do not appear in a constructed training set, which ensures the diversity of the data.
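The 36.8% figure is the limit of (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The following is a minimal Python sketch of this effect, assuming NumPy; the helper names `bootstrap_indices` and `out_of_bag_fraction` and the random seed are illustrative and do not come from the paper (which used MATLAB).

```python
import math
import numpy as np

def bootstrap_indices(n_samples, n_sets, seed=0):
    """Draw n_sets index arrays of size n_samples, sampling with replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_sets)]

def out_of_bag_fraction(indices, n_samples):
    """Fraction of the original samples that never appear in one bootstrap set."""
    return 1.0 - np.unique(indices).size / n_samples

n_samples = 1176  # training-set size reported in this paper
sets = bootstrap_indices(n_samples, n_sets=30)
fractions = [out_of_bag_fraction(idx, n_samples) for idx in sets]
mean_oob = float(np.mean(fractions))
limit = math.exp(-1)  # (1 - 1/n)^n -> 1/e as n grows
print(f"mean left-out fraction: {mean_oob:.3f}; limit 1/e: {limit:.3f}")
```

Each of the 30 index arrays would define one sub-training set, so every sub-model sees a different roughly 63% subset of the distinct training samples.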

Structural Diversity
Different model structures were used to induce structural diversity: three data-driven algorithms (Back Propagation Neural Network (BPNN), Extreme Learning Machine (ELM), and Radial Basis Function Neural Network (RBFNN)) were trained on the generated sub-training sets. These sub-models vary in size and architecture, and such collections are called heterogeneous ensembles [20]. To control the diversity in the heterogeneous ensemble, the 'overproduce and choose' strategy was followed: first, a large number of models were trained; then, a selection or combination of these models was made to optimize performance, the purpose being to minimize the size of the ensemble model without significantly reducing its accuracy [21,22].

Parameter Diversity
Parameter diversity uses different parameter sets to generate different sub-models. Even if the same training set is used, the output of a sub-model may vary with the parameter set. In this work, the internal parameters of the models were adjusted to ensure parameter diversity. For BPNN and ELM, the adjusted internal parameters are the number of neurons, the activation function, and the number of hidden layers. For RBFNN, they are the center and width of the kernel function.

Sub-Model Discrimination
To improve the prediction ability of the ensemble model or reduce the prediction cost, it is necessary to screen the established sub-models and to avoid multicollinearity among the sub-models as much as possible. As one of the main techniques of unsupervised machine learning, fuzzy clustering analysis has been widely used in large-scale data analysis, data mining, pattern recognition, and other fields [23-25]. Fuzzy C-means clustering was used to screen the sub-models in this study.
Suppose N sub-models are defined, with the parameters of each model denoted by $w_i$ (i = 1, 2, ..., N); c sub-models serve as cluster centers and are denoted by $m_j$ (j = 1, 2, ..., c); and the sample set is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where x is the input variable, y is the output variable, and n is the number of data in the sample set. For all the sub-models to be clustered, the difference between models can be measured by the Euclidean distance between them, calculated as follows:

$$D(w_i, m_j) = \left( \sum_{k=1}^{n} d\big(y(w_i, x_k),\, y(m_j, x_k)\big)^2 \right)^{1/2} \tag{1}$$

where $d(r_1, r_2) = \lVert r_1 - r_2 \rVert_2$ is the distance between two outputs, and $y(w_i, x_k)$ and $y(m_j, x_k)$ represent the outputs for $x_k$ under the parameters $w_i$ and $m_j$, respectively. The results of the sub-models on the input data set were thus adopted to define the difference between the models. The larger the Euclidean distance, the greater the difference between the two sub-models.
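As an illustration of measuring model differences through outputs, the sketch below stacks the predictions of each sub-model on a common sample set and computes all pairwise Euclidean distances. It assumes NumPy; the three toy "sub-models" are hypothetical stand-ins for trained networks, and the helper names are not from the paper.

```python
import numpy as np

def output_matrix(models, X):
    """One row per sub-model: its predictions on every sample in X."""
    return np.vstack([np.asarray(m(X), dtype=float).ravel() for m in models])

def pairwise_euclidean(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy stand-ins for trained sub-models (hypothetical, for illustration only).
models = [
    lambda X: X.sum(axis=1),
    lambda X: X.sum(axis=1) + 0.1,   # almost identical to the first model
    lambda X: 2.0 * X.sum(axis=1),   # clearly different behaviour
]
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
Z = output_matrix(models, X)
D = pairwise_euclidean(Z)
print(np.round(D, 3))
```

In this toy case the first two models end up close to each other and far from the third, which is the property the clustering step relies on.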
To perform cluster analysis on the sub-models, the outputs of each sub-model on the sample points $x_k$ (k = 1, 2, ..., n) are composed into a vector, namely $z_i = (y(w_i, x_1), y(w_i, x_2), \ldots, y(w_i, x_n))$ (i = 1, 2, ..., N), so that N output vectors of dimension n are obtained. To determine the optimal number of clusters for the N sub-model outputs, the Calinski-Harabasz (CH) index and the Davies-Bouldin (DB) index can be used as evaluation indicators. In consideration of computational efficiency, the CH index was used to determine the optimal number of clusters. The CH index uses the intra-class dispersion matrix to describe tightness and the inter-class dispersion matrix to describe separation. It is calculated as follows:

$$CH(k) = \frac{\operatorname{tr}B(k)/(k-1)}{\operatorname{tr}W(k)/(n-k)} \tag{2}$$

where, in this formula, n denotes the number of objects being clustered (here, the sub-model output vectors), k is the number of clusters, trB(k) is the trace of the inter-class dispersion matrix, and trW(k) is the trace of the intra-class dispersion matrix. The larger the CH value, the tighter each class is and the better separated the classes are. When the number of clusters is 1, the CH index cannot be used.
In the clustering process, when the difference between two sub-models is very large, the two sub-models are likely to fall in different clusters; otherwise they may fall in the same cluster. Because sub-models in the same cluster are similar, they give similar outputs for the same input [26]. When the fuzzy C-means algorithm is applied to over-generated sub-models, it is therefore necessary to detect collinearity among the sub-models in each cluster [27]. Belsley et al. [28] argued that collinearity between models not only increases the modeling workload but also affects the actual performance of the model. The variance inflation factor (VIF) was used to judge the collinearity of each sub-model: the larger the VIF, the more serious the collinearity. Values of 10 and 100 are taken as the judgment boundaries: when VIF < 10, there is no multicollinearity; when 10 ≤ VIF < 100, there is strong multicollinearity; when VIF ≥ 100, there is serious multicollinearity. The VIF is calculated as follows:

$$VIF_i = \frac{1}{1 - R_i^2} \tag{3}$$

where $R_i^2$ is the coefficient of multiple determination obtained when the independent variable $x_i$ is regressed on the other independent variables.
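To make the VIF criterion concrete, here is a minimal sketch, assuming NumPy, in which each column of a matrix holds the outputs of one sub-model and is regressed on the remaining columns by ordinary least squares. The function name `vif` and the synthetic data are assumptions for illustration; the paper does not state how the elimination was iterated, so only the VIF computation itself is shown.

```python
import numpy as np

def vif(Z):
    """Variance inflation factor of each column of Z (one column per sub-model).

    Column i is regressed on all other columns plus an intercept, and
    VIF_i = 1 / (1 - R_i^2).
    """
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    out = np.empty(p)
    for i in range(p):
        y = Z[:, i]
        A = np.column_stack([np.ones(n), np.delete(Z, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        out[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

# Synthetic outputs: column c is nearly a copy of column a, column b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

With the threshold VIF < 10, columns a and c would be flagged as collinear with each other, while b would be kept.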

Sub-Model Ensemble
The weight coefficient of a sub-model reflects the degree of influence of that sub-model on the ensemble model. Because the choice of weight coefficients directly affects the prediction accuracy of the ensemble model, an appropriate method is needed to determine the weight coefficient of each sub-model. Information entropy is an effective tool for measuring information content (information structure, uncertainty, etc.). Using the information entropy method to determine the weight coefficient of each sub-model can effectively reduce the impact of weak sub-models on the model performance [29]. In this paper, the information entropy method is used to obtain the weight coefficients of the selected sub-models [11].
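The formulas of the entropy weighting are not reproduced in this paper, which refers to [11] for them. The sketch below therefore shows only one common variant from the combined-forecasting literature, as an assumption: the relative errors of each sub-model are normalized into a distribution over samples, its normalized entropy H_i is computed, and the weight is w_i = (1 - d_i / Σd) / (m - 1) with d_i = 1 - H_i. The function name, the `eps` guard, and the toy data are hypothetical, and the authors' actual formulation may differ.

```python
import numpy as np

def entropy_weights(y_true, preds, eps=1e-12):
    """Entropy-based weights for m sub-models (assumed variant, see lead-in).

    preds has shape (m, n): one row of predictions per sub-model.
    Returns m weights that sum to 1.
    """
    y_true = np.asarray(y_true, dtype=float)
    preds = np.asarray(preds, dtype=float)
    m, n = preds.shape
    # Relative absolute error of each model on each sample, capped at 1.
    err = np.minimum(np.abs(preds - y_true) / np.maximum(np.abs(y_true), eps), 1.0)
    # Share of each sample in the total error of the model (eps avoids log(0)).
    p = (err + eps) / (err + eps).sum(axis=1, keepdims=True)
    # Normalized entropy of each error sequence, in [0, 1].
    H = -(p * np.log(p)).sum(axis=1) / np.log(n)
    d = 1.0 - H  # degree of variation of the error sequence
    if d.sum() <= eps:
        return np.full(m, 1.0 / m)  # all sequences equally uniform
    return (1.0 - d / d.sum()) / (m - 1)

# Toy data (hypothetical): model A has even errors, model B one large outlier.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([
    [1.05, 2.10, 3.15, 4.20],  # A: 5% error everywhere
    [1.00, 2.00, 3.00, 6.00],  # B: exact except one 50% error
    [1.10, 2.00, 3.30, 4.00],  # C: 10% error on half of the samples
])
w = entropy_weights(y_true, preds)
print(np.round(w, 4))
```

Note that this variant rewards sub-models whose errors are evenly distributed; it needs at least two sub-models and is meant as a reading aid rather than a reproduction of the weights reported later.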

Implementation Step
A new selective ensemble modeling method was established, and its implementation process is shown in Figure 1, which mainly includes data collection and grouping, sub-model training, sub-model discrimination, and sub-model ensemble and model performance testing. The specific implementation steps were as follows:

(1) Data collection and grouping. Appropriate auxiliary variables were chosen as the input variables of the model, and the dominant variable to be predicted was taken as the output variable. The collected original sample data set was first normalized, and the preprocessed data set was then randomly divided into a training set, a validation set, and a test set in an appropriate ratio. The training set was required to cover all types of experimental data and operating conditions.
(2) Sub-model training. Firstly, different training sets were generated based on data diversity. Then, multiple BPNN, ELM, and RBFNN sub-models with different structural parameters were established by applying structural diversity and parameter diversity, so that the sub-models were both highly diverse and accurate.
(3) Sub-model discrimination. The validation set was used to evaluate the predictive performance of all sub-models, and the Euclidean distance was used as the standard for evaluating the differences between sub-models. The fuzzy C-means algorithm was adopted to cluster all the sub-models, and the Calinski-Harabasz (CH) method was used to determine the optimal number of clusters. After clustering, the collinearity detection method was used to eliminate sub-models with high collinearity within the same cluster, and only sub-models without collinearity were retained.
(4) Sub-model ensemble and model performance testing. The information entropy method was used to calculate the weight coefficients of the retained sub-models and thereby establish the selective ensemble model. The test set was then used to evaluate the prediction performance of the model.

Data Collection and Grouping
Six essential parameters, namely temperature, pressure, critical temperature (Tc), critical pressure (Pc), molecular weight (M), and acentric factor (ω), were taken as input variables for the CO2 solubility prediction models [30,31]. Temperature and pressure affect the solubility of CO2 in an ionic liquid: for the same ionic liquid, the solubility of CO2 increases when the temperature decreases or the pressure increases. Tc, Pc, M, and ω are essential thermodynamic properties of ionic liquids. They can distinguish the species of ionic liquids and reflect the characteristics of their structures [13,31]. In addition, the input variables of the model were only applicable to imidazolium ionic liquids. The solubility of CO2 in ionic liquids was selected as the output variable of the model.
Data on the critical temperature (Tc), critical pressure (Pc), molecular weight (M), and acentric factor (ω) of nine imidazolium ionic liquids were collected from the literature, as shown in Table 1 [7,32-41]. The names and abbreviations of the imidazolium ionic liquids are shown in Table 2. A large number of data on the solubility of CO2 in the nine imidazolium ionic liquids were also collected, as shown in Table 3. A total of 1468 sets of samples were collected. All the CO2 solubility data in this paper were obtained at phase equilibrium, and the gas/ionic liquid stoichiometry is expressed as a molar ratio. From the sample data of each type of ionic liquid, 80% (1176 sets in total) were randomly selected as the training set for training the sub-models, 10% (146 sets) were randomly selected as the validation set for sub-model discrimination and sub-model ensemble, and the remaining 10% (146 sets) were used as the test set for the performance test of the ensemble model.
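Because the split is drawn within each type of ionic liquid, every liquid is represented in all three subsets. A minimal sketch of such a per-group 80/10/10 split is given below, assuming NumPy. The function name and the per-liquid group sizes are hypothetical (the real counts are those of Table 3), so the subset sizes produced here need not equal 1176/146/146.

```python
import numpy as np

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/validation/test within each label group."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for lab in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == lab))
        n_val = int(round(ratios[1] * idx.size))
        n_test = int(round(ratios[2] * idx.size))
        n_train = idx.size - n_val - n_test
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)

# Hypothetical group sizes for nine ionic liquids, chosen only to total 1468.
group_sizes = [100, 150, 200, 120, 180, 160, 140, 210, 208]
labels = np.repeat(np.arange(9), group_sizes)
train, val, test = stratified_split(labels)
print(train.size, val.size, test.size)
```

Splitting inside each group is one simple way to satisfy the requirement that the training set cover all types of experimental data.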

Sub-Model Training
In order to ensure the diversity of the data, the bootstrap re-sampling technique was used to generate 30 sub-training sets. In order to ensure the diversity of the sub-model structures and parameters, the 30 training sets were randomly assigned to three algorithms (BPNN, ELM, and RBFNN), and 30 sub-models were obtained. All sub-models were implemented in MATLAB (version 2016a, MathWorks, Natick, MA, USA). The structures and parameters of the BPNN, ELM, and RBFNN sub-models were as follows:
(1) BPNN. Ten BP neural network sub-models were established with a single hidden layer. The transfer function of the hidden layer was of the tansig type, and the output layer used a purelin-type transfer function for range expansion. The training termination error was 4 × 10^-4, and the learning rate was 0.05. The Levenberg-Marquardt algorithm was used for training, and the number of hidden-layer nodes ranged from 6 to 15.
(2) ELM. Ten sub-models were established by adjusting the number of hidden-layer nodes and the activation function of the extreme learning machine: 5 sub-models used the sigmoid activation function (113-117 hidden-layer nodes), and 5 sub-models used the sin activation function (115-119 hidden-layer nodes).
(3) RBFNN. Ten sub-models were established by adjusting the number of neurons and the activation function. The activation function was the Gaussian kernel function, the centers of the basis functions were selected by K-means clustering, the learning rate was 0.1, the training termination error was 1 × 10^-4, and the number of neurons ranged from 71 to 80.
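The settings above can be collected into a small configuration table. The Python sketch below does this under one assumption that the text does not state explicitly: each node count in a reported range is used by exactly one sub-model, which happens to match the reported counts of 10, 5 + 5, and 10 sub-models. The class and function names are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class SubModelConfig:
    family: str        # "BPNN", "ELM" or "RBFNN"
    hidden_nodes: int  # hidden-layer nodes (neurons for RBFNN)
    activation: str    # transfer or activation function named in the text

def build_configs():
    """One configuration per node count in each reported range (assumption)."""
    configs = []
    configs += [SubModelConfig("BPNN", n, "tansig") for n in range(6, 16)]
    configs += [SubModelConfig("ELM", n, "sigmoid") for n in range(113, 118)]
    configs += [SubModelConfig("ELM", n, "sin") for n in range(115, 120)]
    configs += [SubModelConfig("RBFNN", n, "gaussian") for n in range(71, 81)]
    return configs

configs = build_configs()
per_family = Counter(cfg.family for cfg in configs)
print(len(configs), dict(per_family))
```

Pairing each configuration with one of the 30 bootstrap training sets would then give the 30 candidate sub-models described above.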

Sub-Model Discrimination
The performance of the 30 established sub-models was evaluated on the validation set. Firstly, the fuzzy C-means algorithm was adopted to cluster all the sub-models; then, the sub-models with high collinearity were eliminated by the collinearity detection procedure. The performance indexes used for model evaluation were the mean absolute error (MAE), the root mean square error (RMSE), and the correlation coefficient (R^2), calculated as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right| \tag{4}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2} \tag{5}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2}{\sum_{i=1}^{N} \left( \hat{x}_i - \bar{x} \right)^2} \tag{6}$$

where N is the number of samples, $x_i$ is the predicted value of sample i, $\hat{x}_i$ is the true value of sample i, and $\bar{x}$ is the average of the true values of all samples. The performance index data of each sub-model were obtained from the validation set. The performance index data of the BPNN, ELM, and RBFNN sub-models are shown in Tables 4-6 and Figure 2. It can be seen from Tables 4-6 that all BPNN, ELM, and RBFNN sub-models performed well.
The fuzzy C-means algorithm was used to cluster all the sub-models in Tables 4-6, and the CH index was used as the criterion for choosing the number of clusters. Formula (2) was applied to calculate the value of CH. The specific results are shown in Table 7.
It can be seen from Table 7 that the value of CH reached its maximum when the number of clusters was 3; thus, the number of clusters was set to 3. With three classes as the clustering target, the following clustering results were obtained: the first class included 7 sub-models (5 BPNN sub-models and 2 ELM sub-models), the second class included 12 sub-models (10 RBFNN sub-models and 2 BPNN sub-models), and the third class included 11 sub-models (3 BPNN sub-models and 8 ELM sub-models). Since sub-models in the same cluster might be collinear, collinearity detection was carried out within each cluster to eliminate the adverse effects of collinearity. The variance inflation factor (VIF) was applied to judge the collinearity, with the criterion that there was no collinearity when VIF < 10. According to this criterion, nine sub-models were retained: 3 BPNN sub-models in the first class, 3 RBFNN sub-models in the second class, and 2 BPNN sub-models and 1 ELM sub-model in the third class.
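The way a CH value is obtained for each candidate number of clusters can be sketched as follows, assuming NumPy. This is a plain textbook fuzzy C-means with fuzzifier m = 2 and random initial memberships, followed by hard assignment of each object to its highest-membership cluster before computing CH; these choices, the function names, and the synthetic 30 × 20 output matrix are assumptions, not details taken from the paper.

```python
import numpy as np

def fuzzy_c_means(Z, c, m=2.0, n_iter=300, tol=1e-8, seed=0):
    """Plain fuzzy C-means on the rows of Z; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # each row of U sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ Z) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        inv = np.maximum(dist, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def ch_index(Z, labels):
    """Calinski-Harabasz index [trB/(k-1)] / [trW/(n-k)]; NaN if undefined."""
    n = Z.shape[0]
    groups = np.unique(labels)
    k = groups.size
    if k < 2 or k >= n:
        return float("nan")  # CH cannot be used with a single cluster
    overall = Z.mean(axis=0)
    tr_b = tr_w = 0.0
    for g in groups:
        members = Z[labels == g]
        center = members.mean(axis=0)
        tr_b += members.shape[0] * ((center - overall) ** 2).sum()
        tr_w += ((members - center) ** 2).sum()
    return (tr_b / (k - 1)) / (tr_w / (n - k))

# Synthetic stand-in: 30 sub-model output vectors on 20 points, in three groups.
rng = np.random.default_rng(1)
base = rng.random(20)
Z = np.vstack([base + shift + 0.02 * rng.normal(size=(10, 20))
               for shift in (0.0, 0.5, 1.0)])
scores = {}
for c in range(2, 6):
    _, U = fuzzy_c_means(Z, c)
    scores[c] = ch_index(Z, U.argmax(axis=1))
valid = {c: s for c, s in scores.items() if np.isfinite(s)}
best_c = max(valid, key=valid.get)
print(scores, best_c)
```

The number of clusters with the largest CH value is then taken as the clustering target, in the same way that Table 7 is read.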

Sub-Model Ensemble
For the nine sub-models retained after discrimination, the information entropy method was used to calculate the weight coefficient of each sub-model. The results were as follows:

$$Y = 0.1130\,y_{11} + 0.0898\,y_{12} + 0.0875\,y_{13} + 0.1755\,y_{21} + 0.1545\,y_{22} + 0.1633\,y_{23} + 0.0763\,y_{31} + 0.0713\,y_{32} + 0.0688\,y_{33} \tag{7}$$

where Y is the output of the selective ensemble model based on information entropy, and $y_{ij}$ (i = 1, 2, 3; j = 1, 2, 3) is the output of the j-th retained sub-model in the i-th cluster.
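Equation (7) is a weighted sum whose nine coefficients add up to 1. A short Python sketch, assuming NumPy, applies these reported weights to a matrix of sub-model outputs; the function name and the demo outputs are hypothetical and are not data from the paper.

```python
import numpy as np

# Weights from Equation (7), ordered y11, y12, y13, y21, y22, y23, y31, y32, y33.
WEIGHTS = np.array([0.1130, 0.0898, 0.0875,
                    0.1755, 0.1545, 0.1633,
                    0.0763, 0.0713, 0.0688])

def ensemble_predict(sub_outputs, weights=WEIGHTS):
    """Weighted sum of sub-model outputs; sub_outputs has shape (9, n_samples)."""
    sub_outputs = np.asarray(sub_outputs, dtype=float)
    if sub_outputs.shape[0] != weights.size:
        raise ValueError("one row of outputs is required per weight")
    return weights @ sub_outputs

# Hypothetical outputs: all nine sub-models agree on two samples.
demo = np.tile(np.array([[0.30, 0.55]]), (9, 1))
Y = ensemble_predict(demo)
cluster_share = WEIGHTS.reshape(3, 3).sum(axis=1)  # total weight per cluster
print(Y, np.round(cluster_share, 4))
```

Summing the weights by cluster gives 0.2903, 0.4933, and 0.2164, so the three RBFNN sub-models of the second cluster carry roughly half of the total weight.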

Model Performance Testing
In order to assess the predictive performance of the selective ensemble model based on information entropy (selective ensemble model), four comparison models were also established: the optimal BPNN sub-model (optimal BPNN), the optimal ELM sub-model (optimal ELM), the optimal RBFNN sub-model (optimal RBFNN), and the fully integrated model based on information entropy (fully integrated model). The test set was used to test the performance of all the above models. The prediction performance of each model is shown in Figure 3; all models predicted the solubility of CO2 in imidazolium ionic liquids well. The histograms of the error distributions of the models are shown in Figure 4. From the perspective of error distribution, the selective ensemble model had smaller errors than the single optimal sub-models and the fully integrated model, which verified the effectiveness of the proposed model.
In order to compare the prediction performance of the five models quantitatively, Table 8 gives the MAE, RMSE, and R^2 of the five models on the test set. It can be seen from Table 8 that all models have good prediction performance, owing to the reasonable selection of relevant physical and chemical parameters and structural parameters as the inputs of the prediction model for the solubility of CO2 in ionic liquids.
Compared with the three optimal sub-models, the fully integrated model and the selective ensemble model made full use of the advantages of data diversity, parameter diversity, and structural diversity. The sub-models with different structures could mine more of the global information contained in the data and extract useful information through their different operating mechanisms. Both ensemble models were effective in reducing the prediction error, thus improving the overall predictive performance. In addition, the information entropy method was used to select the combination weight coefficient of each sub-model, so that models with different predictive abilities were assigned different weight coefficients and the differences among the sub-models were fully considered, which further improved the prediction performance.
Compared with the fully integrated model based on information entropy, the selective ensemble model based on information entropy used the fuzzy C-means algorithm and the collinearity detection method to screen the sub-models, which further ensured the diversity and accuracy of the models in different clusters and removed the interference of some sub-models, thus ensuring the effectiveness of the selective ensemble model. At the same time, the selective ensemble model mined the information inside the models more fully and extracted the useful information in the data from different angles, which further improved the overall predictive performance.
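For reference, the three indexes used in this comparison can be computed as in the Python sketch below, assuming NumPy. R^2 is implemented here as the coefficient of determination (1 minus the ratio of the residual sum of squares to the total sum of squares about the mean of the true values), which is an assumption about the definition used; the numbers in the example are made up and are not results from the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
results = {"MAE": mae(y_true, y_pred),
           "RMSE": rmse(y_true, y_pred),
           "R2": r2(y_true, y_pred)}
print({name: round(value, 4) for name, value in results.items()})
```

Lower MAE and RMSE and an R^2 closer to 1 indicate better agreement between predicted and measured solubility.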

Conclusions
In this paper, a selective ensemble modeling method for predicting the solubility of CO2 in imidazolium ionic liquids was proposed. The implementation process of the method included sub-model training, sub-model discrimination, sub-model ensemble, and model performance testing. Sub-model training made full use of the advantages of data diversity, structural diversity, and parameter diversity. Sub-model discrimination used the fuzzy C-means clustering algorithm and a collinearity detection method to ensure model diversity and reduce model collinearity. Sub-model ensemble adopted the information entropy weighting method to reduce the impact of weak sub-models on model performance. The prediction results for the solubility of CO2 in imidazolium ionic liquids showed that the model established by the selective ensemble modeling method had the best prediction performance among the five models compared.
Although the prediction model established by the selective ensemble modeling method predicted well for the nine imidazolium ionic liquids in this study, it may not be applicable to predicting the solubility of CO2 in other ionic liquids. This work not only provides a feasible method for obtaining solubility data for CO2 in ionic liquids, but also provides an effective means for the further screening of ionic liquids, which has important practical significance.