Computational Screening of Metal–Organic Framework Membranes for the Separation of 15 Gas Mixtures

Monte Carlo and molecular dynamics simulations are employed to screen the separation performance of 6013 computation-ready, experimental metal–organic framework membranes (CoRE-MOFMs) for 15 binary gas mixtures. After a univariate analysis, principal component analysis is used to reduce the 44 performance metrics of the 15 mixtures to a 10-dimensional set. Then, four machine learning algorithms (decision tree, random forest, support vector machine, and back propagation neural network) are combined with k times repeated k-fold cross-validation to predict and analyze the relationships between six structural feature descriptors and the 10 principal components. Based on the linear correlation coefficient R and the root mean square error of the predictions, the random forest algorithm is the most suitable for predicting the separation performance of CoRE-MOFMs. One descriptor, the pore limiting diameter, has the highest importance weight for each principal component. Finally, the 30 best CoRE-MOFMs for each binary gas mixture are screened out. The high-throughput computational screening and the analysis of high-dimensional performance metrics can provide guidance for experimental research through the relationships between multiple structural variables and multiple performance variables.


Introduction
With the rapid development of the social economy, people increasingly depend on energy; however, energy is not inexhaustible. In recent years, the accelerating energy crisis has prompted people to think about how to use cleaner, more environmentally friendly, and more efficient energy. Separation technology plays an indispensable part in the chemical industry, and is widely used in medicine, food, petroleum, chemical engineering, metallurgy, and other fields. However, separation also consumes energy, especially high-throughput gas separation; for example, deep cryogenic separation is used to separate N2/O2 in industry [1], where the energy consumption is much higher and the recovery rate is lower. In addition, chemical and physical absorption methods are used to remove acidic components (H2S and CO2) from CH4 in industry, including the low-temperature methanol, alcohol amine, and alkali methods. Energy consumption is extremely high during the conducted experiments on 15 two-component gas mixtures for separation requirements in production, and sought to find the best separation performance for these gases on CoRE-MOFMs. The 15 binary gas mixtures are as follows: CO2/CH4, CO2/H2S, CO2/N2, H2/CH4, H2/CO2, H2/N2, H2S/CH4, H2/O2, He/CH4, He/CO2, He/H2, He/N2, N2/CH4, O2/N2, and He/O2. Commercial gas membrane separations include O2/N2, CO2/CH4, H2/N2, He/air, and H2/CH4. CO2/CH4 and H2S/CH4 membrane separation is used to purify natural gas. The membrane separation of H2 is employed in hydrocracker purge, hydrotreater purge, and H2 recovery in refineries and ammonia plants [27,28]. Acid gases (CO2 and H2S) need to be removed from natural gas. The separation of N2 from air is also necessary, and N2 has been widely used as a protective gas [29].
The separation of He or pure He gas is applied in various fields, for example: silicon wafer manufacturing, the inerting of hydrogen fuel lines for rockets, arc welding, nuclear magnetic resonance machines, and accelerators [30].
This work consisted of six parts. First, we simulated 6013 CoRE-MOFMs by the Monte Carlo (MC) and molecular dynamics (MD) methods, and calculated six feature descriptors and 44 performance metrics. Second, we analyzed the relationship between the six feature descriptors and the 44 performance metrics. Third, we reduced the dimensions of the 44 performance metrics by principal component analysis (PCA). Fourth, we used four machine learning algorithms to predict and analyze the relationships between the six feature descriptors and the separation performance metrics. Fifth, we analyzed the relative importance of the six feature descriptors. Sixth, we screened out the CoRE-MOFMs with the best separation performance for the gas mixtures.

Model
In this work, the crystal structures of 6013 CoRE-MOFs were computationally screened. The database, derived from experimental data by Chung et al. [31,32], had been refined by removing free solvent molecules, and all calculations used the resulting solvent-free framework structures. The interactions of the framework atoms are described by the Lennard-Jones (LJ) and electrostatic potentials:

$$u_{\mathrm{LJ+elec}}(r) = \sum 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + \sum \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \qquad (1)$$

where ε_ij is the well depth; σ_ij is the collision diameter (the separation at which the LJ potential is zero); r_ij is the distance between atoms i and j; q_i and q_j are the atomic charges of atoms i and j, respectively; and ε_0 = 8.8542 × 10⁻¹² C² N⁻¹ m⁻² is the vacuum permittivity. The LJ potential parameters of all the MOFs come from the universal force field (UFF) [33], and are listed in Table S1. The atomic charges of the MOFs are estimated quickly by the MOF electrostatic-potential-optimized charge scheme (MEPO-Qeq) method [34]. The UFF has been shown to predict adsorption and diffusion in MOFs accurately, as verified by comparison with experimental data in previous work [24]. The structural descriptors of the MOFs were the PLD (Å), the largest cavity diameter (LCD, Å), the volumetric surface area (VSA, m²/cm³), the porosity ϕ, the pore size distribution (PSD% (2.5 to 3.5 Å)), and the MOF density (ρ, kg/m³). We calculated the PLD and LCD with the Zeo++ package [35]. We calculated the VSA using N2 with a diameter of 3.64 Å as the probe, and ϕ using He with a diameter of 2.58 Å as the probe, both in the RASPA software package [36]. We calculated the isosteric heat of adsorption at infinite dilution, Q_st^0, of each gas by NVT-MC (N is the number of particles, V is the volume of the system, and T is the temperature of the system) with a single gas molecule, using 10^5 Widom insertion moves in the RASPA software package [36].
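As an illustration of the pair potential described above, the following minimal sketch evaluates the LJ and Coulomb contributions for a single atom pair. It is not the code used in this work (the simulations were run in RASPA); the function names and the choice of units (kJ/mol for energies, Å for distances, elementary charges for q) are assumptions made only for this example.

```python
import math

# Coulomb prefactor e^2 / (4*pi*eps0) in kJ mol^-1 Å, so that charges can be
# given in elementary-charge units and distances in Å (assumed unit system).
COULOMB_KJ_MOL_ANGSTROM = 1389.35


def lj_energy(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for one pair; result has the units of epsilon."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb_energy(r, q_i, q_j):
    """Bare Coulomb energy for one pair in kJ/mol (no Ewald summation here)."""
    return COULOMB_KJ_MOL_ANGSTROM * q_i * q_j / r


def pair_energy(r, epsilon, sigma, q_i, q_j):
    """Sum of the LJ and electrostatic terms for a single atom pair."""
    return lj_energy(r, epsilon, sigma) + coulomb_energy(r, q_i, q_j)


# The LJ term is zero at r = sigma and reaches -epsilon at r = 2^(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0) * 3.0
well = lj_energy(r_min, epsilon=0.5, sigma=3.0)
```

In a periodic framework the electrostatic sum is conditionally convergent, which is why a production code evaluates it with Ewald summation rather than the bare pair sum sketched here.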
The seven gas components (CH4, N2, H2S, O2, CO2, H2, and He) were modeled with force field parameters taken from the transferable potentials for phase equilibria (TraPPE) force field [37], as listed in Table S2. CH4 was represented by a united-atom model. N2 was a three-site model with an N-N bond length of 1.10 Å. H2S was a four-site model with an S-H bond length of 1.13 Å and LJ sites on the S and H atoms; in addition, a virtual site was located near the S atom, and the H atoms and the virtual site carried partial charges, whereas the S atom was uncharged [38]. O2 was a three-site model. For CO2, the C-O bond length was 1.16 Å, and ∠OCO was 180°. He was a single-site model with a diameter of 2.58 Å.

Methods
The Henry's constants K and diffusivities D of CH4, N2, H2S, O2, CO2, and H2 in the 6013 CoRE-MOFMs were estimated by MC and MD simulations, respectively, both in the NVT ensemble. The MD time step was 1 fs, and the temperature was held at 298 K by the Andersen thermostat. In principle, a single gas molecule should be added to a MOF to mimic infinite dilution. To improve the statistics, 30 gas molecules were used, with the gas-gas intermolecular interactions switched off. The cross-interactions between the MOFs and the adsorbate molecules were calculated with the Lorentz-Berthelot mixing rules. In each simulation, the MOF structure was kept rigid. Periodic boundary conditions were applied in all three dimensions, and the simulation cell was expanded to at least 24 Å in each direction. A spherical cut-off of 12.0 Å with long-range correction was used for the LJ interactions, whereas the electrostatic interactions between the framework and the gas molecules were calculated using Ewald summation. In each MOF, the MC simulation was run for 10^5 cycles, with the first 50,000 cycles for equilibration and the last 50,000 cycles for the ensemble average. The MD duration in each MOF was 2 ns, with the last 1 ns for production. All MC and MD simulations were carried out using the RASPA software package [36]. The permeability was then estimated from the adsorption and diffusion results.
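Two of the setup steps above lend themselves to a short worked example: the Lorentz-Berthelot mixing rules and the expansion of the simulation cell to at least 24 Å (twice the 12.0 Å cut-off). The sketch below is illustrative only; the function names are hypothetical, and the replication count assumes an orthorhombic cell, whereas a general triclinic cell would require the perpendicular cell widths instead of the edge lengths.

```python
import math


def lorentz_berthelot(eps_i, sigma_i, eps_j, sigma_j):
    """Cross LJ parameters: geometric mean for epsilon, arithmetic mean for sigma."""
    eps_ij = math.sqrt(eps_i * eps_j)
    sigma_ij = 0.5 * (sigma_i + sigma_j)
    return eps_ij, sigma_ij


def replicas_for_min_length(cell_lengths, min_length=24.0):
    """Number of unit-cell copies per direction so each box edge is >= min_length.

    Assumes an orthorhombic cell (edge lengths in Å).
    """
    return tuple(max(1, math.ceil(min_length / length)) for length in cell_lengths)


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
eps_ij, sigma_ij = lorentz_berthelot(4.0, 3.0, 9.0, 4.0)   # (6.0, 3.5)
replicas = replicas_for_min_length((10.0, 24.0, 30.0))     # (3, 1, 1)
```

Keeping every box edge at least twice the cut-off ensures that a molecule never interacts with more than one periodic image of the same framework atom under the minimum-image convention.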

Map for Project Structure
The research framework of this project is shown in Figure 1. We combined molecular simulation with several machine learning algorithms to speed up the screening and to predict the gas separation performance of CoRE-MOFMs accurately. We divided the research on gas separation by MOFMs into six parts:
(1) The separation performance for 15 binary gas mixtures was simulated by MC and MD on 6013 CoRE-MOFMs. For each CoRE-MOFM, we calculated six descriptors (LCD, ϕ, VSA, PLD, ρ, and PSD% (2.5 to 3.5 Å)) together with the corresponding permeability and permselectivity.
(2) We analyzed the relationships between the geometrical descriptors and the diffusion coefficients, diffusion selectivities, permeabilities, and permselectivities.
(3) Using PCA dimensionality reduction, we condensed the performance metrics (seven gas diffusion coefficients, 15 diffusion selectivities, seven gas permeabilities, and 15 permselectivities) and selected the top 10 principal components, which cover more than 85% of the data variability, as performance measures.
(4) To obtain a suitable machine learning model for our materials and systems, we applied k times repeated k-fold cross-validation [21] to evaluate the predictive performance of four machine learning methods: decision tree (DT), random forest (RF), support vector machine (SVM), and back propagation neural network (BPNN). We calculated the average linear correlation coefficient R and root mean square error (RMSE) as the predictive criteria.
(5) By comparing the four machine learning algorithms, we found that RF gave the most accurate predictions with the smallest errors. For further analysis, we calculated the relative importance of the six feature descriptors for the separation of the 15 gas mixtures on CoRE-MOFMs.
(6) Finally, based on the diffusion coefficient, diffusion selectivity, permeability, and permselectivity of the 15 gas mixtures on CoRE-MOFMs, we selected the 30 optimal CoRE-MOFMs (the abbreviations are listed in the SI).

Feature Descriptors and Performance Metrics
Figure 2a,b show the relationship between P_O2 (or P_H2) and PLD, where P_i = K_i × D_i; P_i is the permeability, K_i is the Henry's constant, and D_i is the diffusion coefficient of gas component i. The permeabilities of the pure components were considered in this work. Compared with the N2/CO2/CH4 mixture in our previous work [25], the D_i and P_i of each gas are increased, because in a real mixture the adsorbed amount, and hence steric hindrance, is present. When the PLD is between 2.5 and 3.5 Å, P_O2 and P_H2 increase significantly with increasing PLD. However, when PLD > 3.5 Å, P_O2 and P_H2 level off above 100 Barrer. This shows that when PLD > 3.5 Å, both O2 and H2 can enter the MOFMs, because the molecular diameters of O2 and H2 are < 3.5 Å. The other five gases exhibit similar trends, as shown in Figure S4.
For ϕ, the relationship between permeability and ϕ follows a trend similar to that for PLD, except that He and H2 show a monotonic increase, and the slope of H2 versus ϕ is smaller than that of He versus ϕ. In contrast to ϕ, when ρ > 3000 kg/m³, the dispersions of D and P are very high, as shown in Figure S2d(1-7). The reason is that these MOFs do possess a relatively large free space, but they contain very heavy metal atoms (e.g., gold, platinum, uranium), so the ρ value is high. Thus, ρ is a poor indicator of the performance of CoRE-MOFMs. The diffusion coefficient shows a dependence on PLD similar to that of the permeability, but its value remains small. The diffusion coefficient rises sharply when PLD < 3.5 Å and finally stabilizes when PLD > 6.0 Å, as shown in Figure S2. The main reason is the strong steric hindrance when the gas diameter is close to the pore size of the MOFs. This relationship between PLD and D is similar to that in previous work [25].
Figure 2c,d show the relationship between PLD and the permselectivity of two gas mixtures (H2/CH4 and O2/N2), where S_perm(i/j) = S_adsp(i/j) × S_diff(i/j), and S_adsp(i/j) and S_diff(i/j) are the adsorption selectivity and diffusion selectivity of component i over component j, respectively. We found that the permselectivity of H2/CH4 had a maximum when PLD was between 2.5 and 3.5 Å, but a minimum also existed within the same range. When PLD > 3.5 Å, the permselectivity of H2/CH4 first decreased, then increased, and finally leveled off, as shown in Figure 2c. The results indicate that PLD is a key, but not perfect, separation performance index for H2/CH4. The permselectivity of O2/N2 decreased with increasing PLD and finally leveled off, as shown in Figure 2d. The results demonstrate that the permselectivity settles at a low, stable value as PLD increases. The O2 molecules are neither adsorbed nor diffused by CoRE-MOFMs when the gas molecule is similar in size to the pore size of the CoRE-MOFMs. There was also a similar trend between the PLD and the diffusion selectivity, as shown in Figure S3. Moreover, the S_diff values are highly dispersed in the figure, because gas diffusion is influenced by many factors. Take CO2/CH4 as an example: on the one hand, because CO2 has a stronger affinity than CH4 for any MOF, CO2 diffusion is retarded, which leads to S_diff(CO2/CH4) < 1; on the other hand, CO2 has a smaller diameter than CH4, which leads to S_diff(CO2/CH4) > 1.
VSA and ρ showed no clear effect on the permselectivity of any gas mixture. Plotting the diffusion selectivity against ϕ and VSA gave roughly a straight line sloping toward the lower right, as shown in Figure S3. This is attributed to the small free space available for gas molecules to permeate in a MOF with a small ϕ or VSA, which hinders the gas molecule with the larger diameter. Additionally, since the VSA was determined using N2 as a probe with a diameter of 3.64 Å, smaller gases such as CO2 (3.3 Å) and H2S (3.6 Å) can still pass through a MOF whose VSA is zero or small. Taken together, Figure 2a-d reveal that a PLD within 2.5-3.5 Å gives good separation performance for some gas mixtures. They also reveal that the gas components interact strongly with the CoRE-MOFMs on entering the membrane material, which shows up as adsorption selectivity or diffusion selectivity. Therefore, the permselectivity was large when the pore size of the membrane material was small.
Figure 2e,f illustrate the relationship between the permselectivity of H2/CH4 and O2/N2 and the permeabilities P_H2 and P_O2. The red line in Figure 2e,f represents Robeson's upper bound, which is based on permeation data for a wide range of polymer membranes. The membrane materials above the red line were given priority. We found that the relationship between the permselectivity and P_H2 was not monotonic. The permselectivity either increased slightly or decreased slightly, and then increased slightly as the permeability increased. The relationships among the permeability, permselectivity of other gas components, and PLD are shown in Figure S6.
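The two relations used in this section, P_i = K_i × D_i and S_perm(i/j) = S_adsp(i/j) × S_diff(i/j), can be written out as a short sketch. The text does not spell out how the individual selectivities are computed; the example assumes the usual infinite-dilution definitions S_adsp(i/j) = K_i/K_j and S_diff(i/j) = D_i/D_j, under which the permselectivity equals the permeability ratio P_i/P_j. Function names and input values are hypothetical, and no unit conversion to Barrer is attempted.

```python
def permeability(henry_constant, diffusivity):
    """P_i = K_i * D_i; the result carries the product of the input units."""
    return henry_constant * diffusivity


def selectivities(k_i, d_i, k_j, d_j):
    """Selectivities of component i over j (assumed infinite-dilution ratios)."""
    s_adsp = k_i / k_j          # adsorption selectivity, assumed K_i / K_j
    s_diff = d_i / d_j          # diffusion selectivity, assumed D_i / D_j
    return {
        "adsorption": s_adsp,
        "diffusion": s_diff,
        "permselectivity": s_adsp * s_diff,
    }


# Hypothetical inputs in arbitrary but consistent units.
result = selectivities(k_i=2.0, d_i=3.0, k_j=1.0, d_j=1.5)
ratio_of_permeabilities = permeability(2.0, 3.0) / permeability(1.0, 1.5)
```

The CO2/CH4 case discussed above corresponds to an adsorption selectivity above 1 combined with a diffusion selectivity that can fall on either side of 1, so the product is not determined by either factor alone.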

Machine Learning
To screen structural variables with strong effects for all gas components and further predict the performance of new MOFMs, we weighed and analyzed 44 performance metrics using machine learning algorithms. In this work, we used PCA to reduce the dimensions of these 44 performance metrics; the details of PCA are in the SI. We retained the leading principal components that together cover more than 85% of the variation information, and selected the first 10 dimensions (87%) for further analysis. The percentage of variation information corresponding to each principal component is shown in Figure S7. The six structural descriptors and 10 principal components were standardized and then used for training, testing, and analysis. Using k times repeated k-fold cross-validation with k = 5, we randomly divided all of the data into five sets, of which one was the test set and the remaining four were training sets; see details in the Supplementary Information (SI). We repeated this procedure five times, and each group of data was trained and predicted by four machine learning methods (DT, SVM, BPNN, and RF); see details in the SI. We calculated the average of R (the linear correlation coefficient) and the RMSE, which indicated the percentage of each principal component contributing to the data variation, from the trained models. These were taken as the criteria for the algorithm prediction. The details of R and RMSE are shown in Table 1. The evaluation formulas for the algorithm models are listed in Equations (2) and (3). Table S4 shows that, over the 10 principal components, R = 0.619 was larger and RMSE = 0.397 was smaller for RF than for the other models. The DT model performed comparably but somewhat worse, with R = 0.575 and RMSE = 0.435. RF is composed of multiple DTs and is an improvement and optimization of DT. In the RF algorithm, a small change in the independent variable has no appreciable effect on the response variable.
In addition, RF can compensate for the weak generalization of DT. However, the RF model easily over-fits when the noise is large. Although the BPNN does not have better predictive performance for this system, it has nonlinear mapping and self-learning capabilities, which suit systems with complex internal mechanisms. Selecting the algorithm model and setting its parameters accurately are very important for the prediction of different systems. The formulas are given in Equations (4) and (5). The results showed that the RF model had the best prediction performance and was the most suitable for the material system of this project. Figure 3 shows that the RF predictions for the first three principal components (PC1, PC2, and PC3) agree well with the simulated results of CoRE-MOFMs.
The predictions of the other three methods (DT, SVM, and BPNN) are shown in Figure S7. In Figure 3, a principal component carrying more variation information was not necessarily predicted better: R = 0.81 for PC1, which was smaller than R = 0.93 for PC2. However, the values of PC2 were generally larger than those of PC1 after standardization; therefore, the RMSE of PC2 (1.00) was much larger than that of PC1 (0.08). The prediction performance for PC3 was lower than that for PC1 and PC2 (R = 0.72 and RMSE = 0.70). Compared with the three other machine learning algorithms (DT, SVM, and BPNN), RF had the best prediction performance both for the first three principal components and for the 10 principal components combined.
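The evaluation protocol described above (fivefold cross-validation repeated five times, scored by R and RMSE) can be sketched with the standard library alone. This is an illustrative outline rather than the authors' implementation: the function names are hypothetical, the R and RMSE formulas are the standard Pearson correlation and root mean square error, which are assumed to correspond to Equations (2) and (3), and the model fitting step is left out.

```python
import math
import random


def repeated_kfold_indices(n_samples, k=5, repeats=5, seed=0):
    """Yield (train, test) index lists for `repeats` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


def pearson_r(y_true, y_pred):
    """Linear (Pearson) correlation coefficient between simulated and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean square error between simulated and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# With k = 5 and 5 repeats, every model is trained and tested 25 times;
# R and RMSE would then be averaged over these 25 test folds.
splits = list(repeated_kfold_indices(20, k=5, repeats=5, seed=1))
```

Averaging over repeated random partitions reduces the dependence of the reported R and RMSE on any single train/test split.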
These studies showed that machine learning had a good predictive effect on the principal components after dimension reduction. To further analyze the relative importance of the six feature descriptors for the principal components, RF was adopted. Based on the Gini coefficient, we calculated the relative importance of each feature descriptor. Figure 4 shows that the PLD had the largest weight for PC1 and PC2, which contained more variation information. Across the 10 principal components, PLD also had the highest weight. The relative importance of PLD was as follows: 0.33, 0.41, and 0.25. The results showed that PLD was the main factor affecting the separation performance of 15 mixed gases by CoRE-MOFMs; therefore, it should be considered first when researching the separation performance of these mixed components. When considering only the first three principal components (covering about 58% of the variation information), the most important descriptors, in order, were PLD, ϕ, and ρ. When considering the top 10 principal components (covering about 87% of the variation information), the four most important descriptors were PLD, ϕ, VSA, and LCD.
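The coverage figures quoted above (about 58% for three components, about 87% for ten) come from accumulating the fraction of variance carried by each principal component. The sketch below shows how the number of components needed to pass a threshold such as 85% can be determined; the function name is hypothetical and the variance ratios in the example are invented for illustration, not the values of this study.

```python
from itertools import accumulate


def components_for_coverage(explained_ratios, threshold=0.85):
    """Return (n, coverage): the smallest number of leading components whose
    cumulative explained-variance ratio reaches `threshold`.

    `explained_ratios` must be in PCA order (largest variance first).
    """
    for n, covered in enumerate(accumulate(explained_ratios), start=1):
        if covered >= threshold:
            return n, covered
    raise ValueError("threshold not reached by the supplied components")


# Hypothetical ratios chosen only to make the arithmetic easy to check.
n_components, coverage = components_for_coverage([0.5, 0.3, 0.1, 0.05], threshold=0.85)
```

In this toy example three components are needed, because the first two cover only 80% of the variance.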

Best CoRE-MOFMs
To select the best CoRE-MOFMs, benchmarks of permeability and permselectivity were set for each of the 15 binary gas mixtures, as listed in Table S5. In addition, to limit the effect of overestimating K, benchmarks of diffusion selectivity were added: S_diff(CO2/CH4) > 10, S_diff(CO2/N2) > 10, S_diff(H2S/CH4) > 5, and S_diff(CO2/H2S) > 10, as reported in the literature [26]. Based on these benchmarks, the five best CoRE-MOFMs were selected for each binary gas mixture and are listed in Table S6; the best two for each mixture are listed in Table 2. Several CoRE-MOFMs appear repeatedly in Table 2, indicating that these membranes have good separation properties for a variety of gas mixtures. Table 2 shows that the PLD was concentrated between 2.5 and 3.5 Å (80.0%). This indicates that the gas molecules interact more strongly with the membrane materials in this range: molecules of different sizes are easier to distinguish, and CoRE-MOFMs separate better, when the PLD lies in a small range. Furthermore, ϕ lay in a narrow range of 0.1-0.3 for 63.3% of these membranes, indicating that ϕ was also a relatively important factor. Likewise, the LCD was between 3.3 and 5.0 Å for 63.3% of them, indicating a relatively comprehensive separation effect at smaller apertures. These results are consistent with the relative importance of the six feature descriptors obtained from the machine learning analysis. The PLD should be given priority, followed by ϕ and the LCD.
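The benchmark-based selection described above amounts to a filter followed by a ranking. The sketch below illustrates the idea for one gas mixture. The membrane names, property values, and thresholds are hypothetical (the actual benchmarks are in Table S5), and ranking the surviving candidates by permselectivity is an assumption, since the text does not state the ranking criterion.

```python
def screen_membranes(candidates, min_permeability, min_permselectivity,
                     min_s_diff=None, top_n=2):
    """Keep candidates that pass every benchmark, then return the top_n
    ranked by permselectivity (ranking criterion is an assumption)."""
    passed = [
        c for c in candidates
        if c["permeability"] >= min_permeability
        and c["permselectivity"] >= min_permselectivity
        and (min_s_diff is None or c["s_diff"] > min_s_diff)
    ]
    passed.sort(key=lambda c: c["permselectivity"], reverse=True)
    return passed[:top_n]


# Hypothetical candidates for a single gas mixture.
candidates = [
    {"name": "MOF-A", "permeability": 500.0, "permselectivity": 40.0, "s_diff": 12.0},
    {"name": "MOF-B", "permeability": 50.0, "permselectivity": 90.0, "s_diff": 20.0},
    {"name": "MOF-C", "permeability": 800.0, "permselectivity": 60.0, "s_diff": 8.0},
    {"name": "MOF-D", "permeability": 300.0, "permselectivity": 55.0, "s_diff": 15.0},
]

best = screen_membranes(candidates, min_permeability=100.0,
                        min_permselectivity=30.0, min_s_diff=10.0, top_n=2)
best_names = [c["name"] for c in best]
```

Here MOF-B is rejected for low permeability and MOF-C for failing the diffusion-selectivity benchmark, which mirrors the role of the extra S_diff thresholds in guarding against an overestimated K.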

Conclusion
High-throughput computational screening was used to calculate the separation performance of 6013 CoRE-MOFMs for 15 two-component gas mixtures; then, several machine learning algorithms were used to predict and analyze their structure-property relationships. First, we used PCA to reduce 44 performance metrics to 10 dimensions, covering about 87% of all the variation information. The four machine learning algorithms (DT, RF, SVM, and BPNN) were optimized and evaluated using fivefold cross-validation repeated five times for each algorithm. The results show that the RF algorithm gave the best predictions for the data of this project, with the smallest RMSE = 0.397 and the largest R = 0.619. The R values for the first three principal components are 0.81, 0.93, and 0.72, and the RMSE values are 0.08, 1.00, and 0.70, respectively. Furthermore, we calculated the relative importance of the six feature descriptors for the 10 principal components using the Gini coefficient in RF. The results show that the PLD is the most important feature descriptor. Analyzing the permeability and permselectivity of the 15 gas mixtures shows that a CoRE-MOFM with a PLD in a certain range (2.5 to 3.5 Å) has good permselectivity for mixed gas components, which is also suitable for complex multicomponent gas mixtures. In addition, ϕ and LCD also exhibit high relative importance for the separation of two-component gas mixtures. Finally, on the basis of the permeability and permselectivity, 30 optimal CoRE-MOFMs are identified as being suitable for the separation of different gas mixtures. This computational work, combining high-throughput screening and machine learning techniques, provides guidelines for the development of MOF membranes for gas separation.
Supplementary Materials: The following are available online at http://www.mdpi.com/2079-4991/9/3/467/s1. Figure S1: Models of CH 4 , N 2 , H 2 S, O 2 , CO 2 , H 2 and He. Figure S2: Relationships between diffusivity D and MOF descriptors. Figure S3: Relationships between diffusion selectivities S diff and MOF descriptors. Figure S4: Relationships between permeability P and MOF descriptors. Figure S5: Relationships between permselectivity S perm and MOF descriptors. Figure S6: Relationships between permeability P and permselectivity S perm . Figure  S7: Predicted performance of the first three principal components (PC1, PC2, PC3) by three machine learning algorithm model of DT, SVM, and BPNN versus the simulated results of CoRE-MOFMs on the test set. Table S1: Lennard-Jones parameters of MOFs. Table S2: Lennard-Jones parameters and charges of adsorbates. Table S3: Principal component covering the ratio of variation information for 44 performance metrics. Table S4: RMSE and R versus four machine learning by synthesizing 10 principal components. Table S5: Benchmark of permeability and permselectivity for 15 gas mixtures. Table S6: Best CoRE-MOFMs for different gas mixtures. Abbreviation lists. Details of six mathematical analysis methods (k times repeated k-fold cross-validation, principal component analysis, decision tree, random forest, support vector machine, back propagation neural network).
Author Contributions: W.Y. and H.L. conceived the idea. Z.Q. calculated all the materials' structural parameters and obtained valid data about the structure descriptors and performance. Z.L. analyzed the relationship between structure descriptors and performance. W.Y. used principal component analysis to reduce the dimensions of these performance metrics and used machine learning to predict and analyze the materials' performance. J.L., F.P., and W.Y. wrote the original draft. W.Y. and Z.Q. wrote the manuscript with contributions from all authors.
Funding: This research was funded by the National Natural Science Foundation of China (Nos. 21676094, 21576058, 21676060, and 21706197).