Application of Unsupervised Machine Learning for the Evaluation of Aerogels’ Efficiency towards Ion Removal—A Principal Component Analysis (PCA) Approach

Water scarcity is a global problem affecting millions of people. It can lead to severe economic, social, and environmental consequences, and it affects agriculture, industry, and households, lowering the quality of human life. To address water scarcity, governments, communities, and individuals must work together to conserve water resources and implement sustainable water management practices. This need calls for the enhancement of existing water treatment processes and the development of novel ones. Here, we have investigated the applicability of “Green Aerogels” to the ion removal stage of water treatment. Three families of aerogels, originating from nanocellulose (NC), chitosan (CS), and graphene (G), are investigated. To reveal the differences between the aerogel samples in hand, a “Principal Component Analysis” (PCA) has been performed on the physical/chemical properties of the aerogels, on one side, and their adsorption features, on the other. Several approaches and data pre-treatments have been considered to overcome any bias of the statistical method. Across the different approaches, the aerogel samples were located in the center of the biplot and were surrounded by the different physical/chemical and adsorption properties. This would probably indicate a similar efficiency in ion removal for the aerogels in hand, whether they were nanocellulose-based, chitosan-based, or even graphene-based. In brief, PCA has shown a similar efficiency of all the investigated aerogels towards ion removal. The advantage of this method is its capacity to seek similarities/dissimilarities between multiple factors simultaneously, avoiding the tedious and time-consuming bidimensional data visualization.


Introduction
Water scarcity is a growing problem worldwide, affecting millions of people and leading to various social, economic, and environmental challenges. According to the United Nations, over 2 billion people currently live in countries experiencing high water stress, and it is projected that, by 2050, at least one in four people will be affected by recurring water shortages [1,2]. The causes of water scarcity are complex and varied and include factors such as population growth, climate change, the over-extraction of water resources, poor water management practices, and pollution. These factors have led to a decline in the availability and quality of water in many parts of the world, particularly in developing countries. Water scarcity has a serious impact on human health, agricultural production, industrial development, and environmental sustainability. In areas affected by water scarcity, people often have limited access to clean water, which can lead to waterborne diseases and other health problems [1,2]. A lack of water can also reduce crop yields, which in turn can lead to food shortages and economic instability. To address water scarcity, governments, organizations, and individuals need to work together to promote sustainable
Gels 2023, 9, 304
the physical and chemical properties to the adsorption capacities, and even the trade-offs of the manufacturing process and the water treatment conditions' procedure [5][6][7].
Given the multiple constraints at play, such as the aforementioned variables involved in aerogels' fabrication, and the need to merge them in the most systematic way with the fewest decision errors, a multidimensional statistical analysis method could be adopted. Principal Component Analysis (PCA) is an unsupervised machine learning technique that could be employed for this purpose. PCA is a statistical technique used to identify patterns and relationships in large datasets [16]. It is a widely used method for dimensionality reduction, which involves reducing the number of variables in a dataset while retaining as much of the original information as possible. PCA works by creating new variables, called principal components (PCs), which are linear combinations of the original variables in the dataset [8]. The first PC is the linear combination that accounts for the largest amount of variation in the data. Subsequent PCs are created in order, with each component accounting for as much of the remaining variance as possible. By analyzing the PCs, it is possible to identify patterns and relationships in the data that may not be apparent when the original variables are examined one at a time. PCA can also be used to detect outliers and to identify which variables are the most important in explaining the variation in the data [16].
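The way successive PCs account for decreasing shares of the total variance can be illustrated with a minimal sketch. It assumes NumPy and purely synthetic data (the 24 x 8 shape only mirrors the size of the dataset used in this study); the function name `pca_explained_variance` is hypothetical and this is not the software workflow used by the authors.

```python
import numpy as np

def pca_explained_variance(data):
    """Return the fraction of total variance carried by each principal component."""
    centered = data - data.mean(axis=0)            # remove the column means
    # Singular values of the centered data give the PC variances: s**2 / (n - 1)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (data.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(0)                     # synthetic, illustrative data only
samples = rng.normal(size=(24, 8))                 # 24 individuals x 8 variables
ratios = pca_explained_variance(samples)
print(np.round(100 * ratios[:2], 2), "-> % of total variance on PC1 and PC2")
```

The returned ratios are sorted from the largest to the smallest share and sum to one, which is why the cumulative share of the first two PCs is a natural summary of how much of the dataset a biplot represents.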
In this study, the aim is to apply PCA for the investigation of the relevance of physical/chemical properties and adsorption features towards a set of NC-, CS-, and G-based aerogels. The different data have been obtained from the already published investigations of Paul and Ahankari [7].

Results and Discussion
PCA was conducted and plotted based on previously published data (Table 1) from the study of Paul and Ahankari [7].
Table 1. Nanocellulose (NC)-, chitosan (CS)-, and graphene (G) oxide-based aerogels in ion removal: physical/chemical and adsorption parameters. (Adapted from Ref. [7] with permission from Elsevier.)
Figure 1 shows the PCA results for the previously published physical/chemical properties and adsorption parameters of several nanocellulose (NC)-, chitosan (CS)-, and graphene (G)-based aerogels implicated in water treatment, from Paul and Ahankari [7]. The first two PCs explained 45.55% of the total variance (26.95% and 18.60% for PC1 and PC2, respectively; Figure 1a). Even if this variance is considered moderate, the multidimensional investigation could reveal some hidden patterns, in comparison to the conventional bidimensional perspective. The obtained share of explained variance could originate from either of the following scenarios: (1) a low correlation between the physical/chemical properties, on one side, and the adsorption features, on the other, or (2) the presence of other properties that are indispensable to explain the adsorption capacities of the investigated aerogels. The first scenario is to be rejected, since the adsorption capacity of a membrane is the synergetic outcome of different chemical and physical conditions: it depends not only on the structural features and chemical components of the aerogel but also on the operational conditions of the treatment process. The second scenario is to be retained, since multiple physical (tortuosity, permeability, viscosity, molecular diffusivity, etc.) and chemical (molecular weight, % of polar/nonpolar functional groups, H-bonding capacity of the matrix, etc.) properties are not considered [31].
For the variables, the porosity (ε) and removal efficiency (R%) showed the highest contributions to PC1, accounting for 27.40% and 28.27%, respectively (Figure 1b). Apart from the number of reuse/regeneration (Nr) and the BET surface area (12.87% for Nr and 17.51% for BET; Figure 1b), the rest of the variables showed a very minor contribution towards PC1. For PC2, the highest contribution was scored for removal efficiency after regeneration (Rr%), accounting for 49.09% of PC2's variance. The second highest contributor towards PC2 is the pH, accounting for 33.14% (Figure 1b). The other variables had negligible contributions to this PC. These trends indicate an average representativeness of PC1 for R% and Nr as adsorption features; the latter are more likely to be influenced by porosity (ε) and BET surface area, from the physical and chemical properties' side. Interestingly, BET and ε are located at the same position as R%, with a high positive influence along PC1 and a moderate one along PC2 (Figure 1a). As for PC2, it shows a high representativeness for removal efficiency after regeneration (Rr%), from the adsorption features' side, and this outcome is more likely to be influenced by the pH.
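Percentage contributions of variables to a PC, such as those discussed for Figure 1b, are commonly computed as the squared entries of the corresponding unit eigenvector, multiplied by 100. The sketch below illustrates that convention on synthetic data; it assumes a correlation-matrix PCA and is not a reproduction of the XLSTAT output, and the name `variable_contributions` is hypothetical.

```python
import numpy as np

def variable_contributions(data):
    """Percentage contribution of each variable (rows) to each PC (columns)."""
    correlation = np.corrcoef(data, rowvar=False)             # PCA on standardized data
    eigenvalues, eigenvectors = np.linalg.eigh(correlation)   # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]                     # largest variance first
    eigenvectors = eigenvectors[:, order]
    # Each eigenvector has unit length, so its squared entries sum to 1 (i.e., 100%)
    return 100.0 * eigenvectors ** 2

rng = np.random.default_rng(1)          # synthetic values, for illustration only
toy = rng.normal(size=(24, 8))          # 24 individuals x 8 variables
contrib = variable_contributions(toy)
print(np.round(contrib[:, 0], 2))       # contributions of the 8 variables to PC1
```

Under this convention the contributions to any single PC add up to 100%, so a variable scoring far above 100/8 = 12.5% in an eight-variable dataset contributes more than an equal share.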
For the individuals, the biplot interestingly showed one bulk clustering of most of the investigated aerogels around the node (gray cluster, Figure 1a). These trends can give rise to several hypotheses: (1) The individuals of high influence have skewed the trends of the others; in this case, NC11 has shown a high negative influence along PC1, with no significant one along PC2. In contrast, NC8 and NC9 have shown moderate and high influences, respectively, along PC2, with no significant one along PC1 (Figure 1a). (2) Several variables showed a low influence for the first two PCs (shown above). In addition, both Teq (time to reach equilibrium) and AC (adsorption capacity) showed a proximity to the node, indicating their low influence towards both PCs (Figure 1a). (3) The investigated aerogels had a high level of similarity with respect to their application in ion removal. This statement cannot be confirmed, due to the moderate total variance yielded in our case. To validate at least one of the hypotheses, a series of pre-treatments and assumptions is adopted. For hypothesis (1), PCA is applied to the whole dataset, except for the aerogels with a high contribution (Figure 2). For hypothesis (2), PCA is run with the exclusion of Teq and AC (Figure 3). Hypothesis (3) will be supported following the overall trends yielded by the different findings.
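The two exclusion strategies (dropping high-influence individuals, or dropping low-influence variables) amount to re-running PCA on a reduced table and comparing the variance carried by the first two PCs. A minimal sketch follows, assuming NumPy, a correlation-matrix PCA, and synthetic data; the row and column positions are hypothetical placeholders, not the actual positions of NC8, NC9, NC11, Teq, or AC in Table 1.

```python
import numpy as np

def first_two_pc_share(data):
    """Cumulative % of variance on PC1 + PC2 for a correlation-matrix PCA."""
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    eigenvalues = np.sort(eigenvalues)[::-1]        # largest first
    return 100.0 * eigenvalues[:2].sum() / eigenvalues.sum()

rng = np.random.default_rng(2)
table = rng.normal(size=(24, 8))       # synthetic stand-in for a 24 x 8 dataset
drop_rows = [7, 8, 10]                 # hypothetical positions of high-influence samples
drop_cols = [2, 5]                     # hypothetical positions of low-influence variables

share_all = first_two_pc_share(table)
share_without_rows = first_two_pc_share(np.delete(table, drop_rows, axis=0))
share_without_cols = first_two_pc_share(np.delete(table, drop_cols, axis=1))
print(round(share_all, 2), round(share_without_rows, 2), round(share_without_cols, 2))
```

One point worth keeping in mind when comparing such figures: with p standardized variables, the first two PCs always carry at least 2/p of the variance, so removing variables raises this floor (from 25% with eight variables to about 33% with six) independently of any structure in the data.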
Figure 2 shows the PCA findings of the whole dataset, except for NC8, NC9, and NC11. The first two PCs showed a slightly higher total variance of 49.46%, compared to the one yielded when the whole dataset was taken into consideration (Figure 1). The first two PCs accounted for 26.92% and 22.54%, respectively, showing, interestingly, a nearly equal influence on the investigated dataset. These trends allow the attribution of almost equal influences of the first two PCs towards the investigated properties, on one side, and the adopted aerogels, on the other. In addition, a slightly higher dispersion of the dataset can be noticed, since three clusters were obtained, rather than the single cluster in the case of the whole-dataset PCA (Figure 1). For the variables, density showed the highest contribution towards PC1, accounting for 30.05% (Figure 2a). Average contributions were yielded for the rest of the variables along PC1. For PC2, time to reach equilibrium (Teq) and removal efficiency (R%) exhibited the highest contributions, accounting for 38% and 29.32%, respectively (Figure 2). Similar to the PCA of Figure 1, two variables were located near the node, yet they were different variables in this case (Nr and pH; Figure 2). In addition, the agglomeration of removal efficiency after regeneration (Rr%), BET surface area, and porosity (ε) was conserved. This trend indicates the similar influence of these variables on the investigated aerogels, in contrast to their lower influence on the excluded aerogels (NC8, NC9, and NC11 in this case; Figure 2). When the approach of excluding individuals is adopted, a marked change in the different trends can be noticed. The most noticeable changes are the high contribution of Teq, which was negligible previously, and the substantial decrease in the contribution of removal efficiency after regeneration (Rr%).

For the individuals, three different clusters can be distinguished (gray, yellow, and blue; Figure 2a). The gray cluster was centered around the node and was positively correlated with the pH and the number of reuse/regeneration (Nr). Interestingly, this cluster gathered most of the investigated aerogels. The yellow cluster gathered CS3 and CS4, and to a lower extent CS1, and showed a high positive correlation with the most influential variable, the density. CS1 was located at the interface between this cluster and the gray cluster. Interestingly, the yellow cluster gathered all chitosan-based aerogels, except for CS2 (Figure 2a). It showed a moderate-to-high positive correlation along PC1, with a low-to-negligible one along PC2. The blue cluster combined NC1 with NC12 and showed a high positive correlation with the adsorption capacity (AC) and the removal efficiency (R%). It showed a high and a moderate negative correlation along PC1 and PC2, respectively (Figure 2a). NC5 was located individually on the high positive side of PC2, with a negligible influence along PC1. This aerogel peculiarly showed a high positive effect along the time required to reach equilibrium (Teq; Figure 2a). Following the aforementioned results (Figures 1 and 2), the exclusion of some individuals has shown a slightly higher efficiency in presenting the different trends in the bidimensional perspective. Nonetheless, the recurrent agglomeration of most of the investigated aerogels around the node makes this approach likely unreliable in this case.
Figure 3 shows the PCA findings of the whole dataset, with the exclusion of time to reach equilibrium (Teq) and adsorption capacity (AC) from the variables. The first two PCs showed 58.20% of the total variance (34.28% and 23.91% for PC1 and PC2, respectively; Figure 3a). This higher value, in comparison with the two previous investigations (Figures 1 and 2), confirms the efficiency of the adopted approach in revealing a higher variance when variables with minor influence are discarded. For the variables, a profile of contributions similar to that of the whole-dataset PCA was noticed (Figures 1b and 3b). These trends make sense, since the excluded variables (Teq and AC; Figure 3) had a minor contribution towards the investigated aerogels.
For the individuals, three different clusters can be distinguished (gray, yellow, and blue; Figure 3a). The gray cluster was centered around the node and was positively correlated with the density, which is considered a minor contributor to both PCs (Figure 3b). The yellow cluster gathered CS3 and NC1 and showed a moderate contribution along PC1. This cluster was positively influenced by removal efficiency (R%), BET surface area, and porosity (ε), which are considered moderate contributors towards PC1 (Figure 3b). The blue cluster gathered NC10 and NC12 and showed a moderate-to-high positive correlation along PC2, with a slight negative and positive correlation for PC1 (Figure 3a). It was positively influenced by the pH, which is considered a moderate contributor towards PC2 (Figure 3b). CS2 was exclusively located on the negative side of PC1, with a minor influence from PC2. This aerogel was mostly influenced by the number of reuse/regeneration (Nr).

Conclusions
In this work, we have performed, for the first time, "Principal Component Analysis" (PCA) for the purpose of estimating the efficiency of ion removal across several types of aerogels. The three categories in hand are nanocellulose (NC)-based, chitosan (CS)-based, and graphene (G)-based aerogels. In the case of the all-in-one dataset approach (Figure 1), a moderate total variance has been obtained, indicating a low representativeness of the "Total Truth". From the variables' side, a high rate of separation between variables can be noticed between the first two PCs. In fact, density (d), porosity (ε), and BET surface area are the highest influencers of removal efficiency (R%) and, to a lower extent, of the number of reuse/regeneration (Nr). Additionally, pH as a physical/chemical property has been found to be the most influential for removal efficiency after regeneration (Rr%). Nonetheless, the aforementioned findings cannot be confirmed, due to the slightly low variance of the first two PCs. To overcome this issue, two hypotheses have been investigated, requiring two different approaches. The first hypothesis implies that the high influence of a minority of the investigated samples (NC8, NC9, and NC11, in this case) has biased the whole dataset of the PCA biplot. To test the validity of this statement, a PCA investigation was carried out without NC8, NC9, and NC11 (Figure 2). This approach has slightly raised the total variance, indicating its efficiency in distinguishing the different aerogel samples along the adopted properties. The second hypothesis implies that the low influence of some of the variables (AC and Teq, in this case) has skewed the different trends of the multidimensional PCA approach. To overcome this issue, a PCA investigation discarding the low-impact variables has been adopted (Figure 3). This strategy has shown its efficiency in revealing better trends in the investigated dataset, as a more noticeable increase has been scored (around 58% of the total variance; Figure 3). The third hypothesis implied a high similarity of the investigated aerogels with respect to their application in ion removal. The latter hypothesis is the most likely to be true, since a similar trend was obtained when applying both approaches. This is noticeable in the sense that, for the three PCA approaches followed, the different aerogel samples were located in the center of the biplot and were surrounded by the different physical/chemical and adsorption properties. This would probably indicate a similar efficiency in ion removal for the aerogels in hand, whether they were nanocellulose-based, chitosan-based, or even graphene-based.

Materials and Methods
The methodology in this work is similar to that of our previously published works on the application of PCA [32,33]. PCA is a statistical technique that simplifies complex datasets by reducing the number of variables while retaining the most important information. It achieves this by transforming the original variables into a smaller set of uncorrelated variables, known as principal components (PCs), through linear combinations. The first PC captures the direction of the highest variability in the dataset, with subsequent PCs being orthogonal to the previous components and capturing the next highest variability. PCA is commonly used in data analysis and machine learning to extract valuable insights from large datasets, identify patterns and relationships, and detect outliers and anomalies. However, its effectiveness is limited in cases where nonlinear relationships or complex structures exist, and it can be sensitive to outliers [16].
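The two properties stated above, orthogonal component directions and uncorrelated PCs, can be checked numerically. The following sketch assumes NumPy and a small synthetic dataset in which two variables are deliberately correlated; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
raw = rng.normal(size=(50, 4))              # synthetic data, 50 samples x 4 variables
raw[:, 1] += 0.8 * raw[:, 0]                # make two variables correlated on purpose
centered = raw - raw.mean(axis=0)

# Eigen-decomposition of the covariance matrix gives the PC directions (loadings)
eigenvalues, loadings = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]       # sort from highest to lowest variance
eigenvalues, loadings = eigenvalues[order], loadings[:, order]

scores = centered @ loadings                # coordinates of the samples on the PCs
score_cov = np.cov(scores, rowvar=False)    # covariance matrix of the PCs

print(np.round(loadings.T @ loadings, 6))   # identity matrix: orthonormal directions
print(np.round(score_cov, 6))               # diagonal matrix: uncorrelated PCs
```

The diagonal of `score_cov` reproduces the eigenvalues, that is, the variance captured by each PC, while the off-diagonal entries vanish up to rounding error.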

Data Collection and Pre-Treatment
Data have been collected from the published study of Paul and Ahankari [7]. Table 1 presents the inventory of the different investigated NC-, CS-, and G-based aerogels, along with their performance capacity, adsorption parameters, and physical/chemical characteristics. The investigated variables differ in magnitude. To remove any bias arising from these differences in magnitude, a normalization technique, such as the one of Younes et al. (cite), has been adopted, in which $Y_{st}$ represents the standardized dataset values.
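The exact normalization formula is not reproduced in this text. As a hedged illustration only, the sketch below assumes the common z-score standardization (subtract the variable's mean, divide by its sample standard deviation), which may differ from the formula actually used; the numeric values and the name `standardize` are made up for the example.

```python
import numpy as np

def standardize(data):
    """Z-score each column: (value - column mean) / column sample standard deviation.

    Assumption: z-score standardization; the source's own formula is not shown here.
    """
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)          # sample standard deviation
    return (data - mean) / std

# Two made-up variables with very different magnitudes
values = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 900.0]])
y_st = standardize(values)
print(np.round(y_st, 3))
```

After this step every variable has zero mean and unit standard deviation, so a variable measured in large units no longer dominates the PCs merely because of its scale.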

Principal Component Analysis (PCA)
After normalization, the PCA was performed using XLSTAT 2014 software, following an approach similar to the one adopted by Murshid et al. [17]. In this study, missing data were estimated using a built-in feature that replaces a missing value with the "Mode" of the respective variable.
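Mode-based imputation of missing values can be sketched as follows. This is an illustration of the general idea in Python, not the XLSTAT implementation; in particular, how XLSTAT resolves ties between equally frequent values is not stated in the text, and the helper `impute_mode` and the pH values are hypothetical.

```python
import numpy as np
from collections import Counter

def impute_mode(column):
    """Replace NaN entries with the most frequent observed value of the column."""
    observed = column[~np.isnan(column)]
    # most_common(1) returns the value with the highest count
    # (ties: the first one encountered; an assumption of this sketch)
    mode_value = Counter(observed.tolist()).most_common(1)[0][0]
    filled = column.copy()
    filled[np.isnan(filled)] = mode_value
    return filled

ph = np.array([7.0, 5.0, np.nan, 7.0, 6.0, np.nan])   # made-up pH values with gaps
ph_filled = impute_mode(ph)
print(ph_filled)
```

Because the mode repeats an already frequent value, this kind of imputation tends to pull the affected individuals towards the bulk of the data, which is worth keeping in mind when interpreting samples located near the center of a biplot.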
The aim of this study is to apply PCA to the data reported in a previous study by Paul and Ahankari [7] (Table 1). Applying PCA targets the search for any hidden patterns between the physical/chemical properties, on one side, and the adsorption parameters, on the other. If such patterns are found, they will help in the better interpretation, and therefore better understanding, of the different factors that influence the applicability of a given aerogel membrane. The output information yielded by PCA could help in several stages of the water treatment process, from the manufacturing approach, through the experimental conditions, to the removal efficiency of a selected membrane. Here, we have applied PCA to 8 different factors influencing 24 investigated aerogels (Table 1). PCA is a data-driven unsupervised machine learning technique, which works on the reduction of a given dataset. The outcome of such reductions has been applied for a better visualization of a given phenomenon, to seek hidden knowledge through the given correlations (negative or positive) and the representativity of the principal components (PCs) for the population in hand. The jth PC (Fj) is expressed as a linear combination of the original data matrix M, with m x n dimensions (m: number of variables, n: number of datasets), weighted by a unit-weighting vector (Uj) [34][35][36][37], where U is the loading coefficient and M is the data vector of size n.
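The variance-maximization argument behind the PCs can be sketched in its standard textbook form. The block below is an assumption-laden reconstruction rather than the source's own numbered equations: the symbols $\Sigma$ (covariance matrix of the data) and $\lambda$ (Lagrange multiplier) are introduced here for illustration, and the numbering does not correspond to that of the published article.

```latex
\begin{align}
F_j &= U_j^{T} M, \qquad U_j^{T} U_j = 1 \\
\operatorname{Var}(F_j) &= U_j^{T}\,\Sigma\,U_j, \qquad \Sigma = \operatorname{Var}(M) \\
L(U_j, \lambda) &= U_j^{T}\,\Sigma\,U_j - \lambda\,\bigl(U_j^{T} U_j - 1\bigr) \\
\frac{\partial L}{\partial U_j} &= 2\,\Sigma\,U_j - 2\,\lambda\,U_j = 0
  \;\Longrightarrow\; \Sigma\,U_j = \lambda\,U_j
\end{align}
```

Under these assumptions, the weighting vectors are the eigenvectors of $\Sigma$, and the variance carried by the jth PC equals the corresponding eigenvalue, since $\operatorname{Var}(F_j) = U_j^{T}\,\Sigma\,U_j = \lambda\,U_j^{T} U_j = \lambda$.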
The Lagrangian function can be defined by applying the Lagrange multiplier method. For Equation (7), the term $U^{T}U - 1$ is equal to zero, since the weighting vector is a unit vector. Hence, the maximum value of Var(M) can be calculated by setting the derivative of the Lagrangian function (L) with respect to U equal to zero.

Informed Consent Statement:
Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.