1. Introduction
The properties of electrolyte solutions, such as osmotic coefficients, deviate from ideal models due to electrostatic forces and ion–water interactions. Accurately quantifying these properties is not merely a fundamental scientific exercise; it is a prerequisite for optimizing industrial processes with significant sustainability implications [1]. For instance, in water desalination and ion separation, a precise understanding of these properties enables the design of more energy-efficient systems, directly contributing to monitoring and reducing the socio-economic costs associated with water and energy use.
In recent decades, many researchers have focused on the osmotic coefficient because of its importance in various aqueous solutions. Ibrahim et al. [2] studied the thermodynamic properties of various aqueous sugar solutions. They used both a Perturbed Hard Sphere Chain equation of state and an Artificial Neural Network (ANN) model, with a specific focus on accurately predicting the osmotic coefficient. Abedi et al. [3] studied drug–biomolecule interactions in water by measuring osmotic coefficients in mixed solutions of certain drugs and amino acids. Their findings, obtained by vapor pressure osmometry at body temperature (310.15 K), showed that osmotic behavior depends on the specific amino acid and suggested ion-pair formation between drug molecules. Patil et al. [4] measured the osmotic coefficients of two bio-ionic liquid solutions in water. Their analysis showed that these liquids act as electrolytes, with their thermodynamic behavior significantly influenced by hydrophobic hydration effects, similar to other ionic liquids.
Given its critical role in understanding electrolyte equilibrium, the osmotic coefficient has been the subject of considerable study. Xin et al. [5] measured vapor pressure lowering to determine the osmotic coefficients of lithium salts in organic solvents at 298.15 K. The data were modeled to calculate the salts' activity coefficients, which is relevant to lithium-ion battery electrolyte research. Grundl et al. [6] investigated the osmotic coefficients and water activity of binary water/5-(hydroxymethyl)furfural and ternary water/5-(hydroxymethyl)furfural/salt solutions using vapor pressure osmometry at 298.15 K. They used a Pitzer-type model for the binary systems and the Zdanovskii–Stokes–Robinson (ZSR) mixing rule for the ternary systems to calculate the activity coefficients of the components. Meng et al. [7] used molecular dynamics simulations and Raman spectroscopy to study NH4Cl solutions, finding that contact ion pairs form at higher concentrations, altering the hydrogen-bond structure. Their work also established a preliminary link between the solution's osmotic coefficient and the specific configuration of its hydrogen bonds. Wu et al. [8] modeled the activity and osmotic coefficients of rubidium-containing electrolyte solutions using two computational models, the Electrolyte Molecular Interaction Volume Model (eMIVM) and the Electrolyte Molecular Interaction Volume Model–Energy Term (eMIVM-ET). Their results show that eMIVM-ET performed better for predicting the properties of mixed electrolyte solutions, while both models effectively described single-electrolyte systems. Rudakov et al. [9] developed a thermodynamic model focusing on the osmotic coefficient of 2-1 electrolyte solutions. The model, which accounts for hydration and ion pairing, successfully described the osmotic behavior of CaCl2 solutions across a wide temperature range.
Improving prediction accuracy in chemical systems, especially for thermodynamic properties, has received increasing attention through machine learning algorithms [10,11]. Therefore, in this study, focusing on the osmotic coefficient in chemical processes, particularly desalination, the aim is to predict the osmotic coefficient of aqueous electrolyte systems for various chloride, sulfate, and phosphate mineral salts. The approach combines an optimization algorithm with machine learning models. Specifically, the hyperparameters of two machine learning algorithms, Decision Tree (DT) and Gradient Boosting Machine (GBM), are optimized using the Gazelle Optimization Algorithm (GOA), a combination that, to the best of our knowledge, has not been evaluated in previous studies for predicting osmotic coefficients in electrolyte systems.
2. Data and Methods
2.1. GOA-DT Hybrid Approach
The Gazelle Optimization Algorithm (GOA) is a contemporary metaheuristic inspired by the natural behaviors of gazelles, which alternate between escaping predators (promoting exploration) and grazing in safe zones (promoting exploitation) when addressing optimization problems [12]. In this study, the GOA is employed to autonomously identify the optimal hyperparameters of a Decision Tree (DT) model intended to predict osmotic coefficients. The methodology follows a systematic, sequential procedure.
Initially, the dataset is prepared for analysis. It comprises 27 features and 893 samples, which are organized into a feature matrix and a target vector. The data are subsequently partitioned into two subsets, with 70 percent allocated for model training and the remaining 30 percent reserved for final testing. This division ensures that the model is ultimately assessed on unseen data, thereby providing a reliable evaluation of its generalization capability. The GOA is then configured for the optimization task. The algorithm begins by generating a random population of 50 candidate solutions, each representing a potential combination of three Decision Tree hyperparameters: the minimum leaf size, the maximum number of splits, and the minimum parent size. The search space is constrained by predefined minimum and maximum allowable values for each parameter. The core optimization loop subsequently commences. During each iteration, every candidate updates its position according to one of two simple rules selected at random. The first rule governs exploitation, whereby the candidate moves toward the best solution identified thus far and another randomly selected candidate, thereby concentrating the search in promising regions. The second rule governs exploration, whereby the candidate undertakes a randomized step to investigate new areas of the search space and avoid premature convergence. The magnitude of these steps diminishes progressively over time, allowing the search to begin broadly and become increasingly refined as the algorithm progresses. Following the positional update, the quality of each candidate is assessed using a dedicated objective function. This function takes the proposed hyperparameters, trains a Decision Tree on the training data using five-fold cross-validation, and returns the R2 value derived from this validation process.
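The two position-update rules described above can be condensed into a short sketch. This is an illustrative simplification, not the authors' implementation: the function name goa_step, the even 0.5 probability of choosing each rule, and the linear decay of the step weight are assumptions made for clarity.

```python
import random

def goa_step(candidates, best, bounds, t, t_max):
    """One illustrative GOA iteration. Each candidate either exploits
    (moves toward the best-so-far solution and a random peer) or explores
    (takes a randomized step); the step magnitude shrinks over time."""
    w = 1.0 - t / t_max  # decreasing weight: broad search early, refined late
    updated = []
    for x in candidates:
        peer = random.choice(candidates)
        if random.random() < 0.5:
            # Exploitation: drift toward the best solution and a random peer.
            y = [xi + w * random.random() * (bi - xi)
                 + w * random.random() * (pi - xi)
                 for xi, bi, pi in zip(x, best, peer)]
        else:
            # Exploration: randomized step inside the search space.
            y = [xi + w * (random.uniform(lo, hi) - xi)
                 for xi, (lo, hi) in zip(x, bounds)]
        # Keep every hyperparameter within its allowed range.
        updated.append([min(max(yi, lo), hi)
                        for yi, (lo, hi) in zip(y, bounds)])
    return updated
```

In the actual procedure, each updated candidate would then be scored by the cross-validated objective function before selection.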
The incorporation of cross-validation is critical, as it evaluates the performance of each hyperparameter configuration across different subsets of the training data, thereby mitigating the risk of overfitting and steering the GOA toward a robust solution. Selection is then performed: if a candidate's new position yields a superior fitness score compared to its previous position, the updated configuration is retained. Throughout this process, the best solution discovered by any candidate in the entire population is continuously tracked. This iterative procedure continues over numerous cycles, during which the population of candidates gradually converges toward the optimal set of Decision Tree hyperparameters. The best fitness score from each iteration is recorded to illustrate the algorithm's progressive improvement over time.
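The five-fold partitioning at the heart of this cross-validation can be illustrated with a minimal index-splitting sketch; kfold_indices is a hypothetical helper, not the authors' code, and a production version would shuffle the indices first.

```python
def kfold_indices(n_samples, k=5):
    """Split sample indices into k near-equal folds; each fold serves once
    as the validation set while the remaining folds form the training set."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Return (train, validation) index pairs for each of the k rounds.
    return [(sorted(set(range(n_samples)) - set(f)), f) for f in folds]
```

The candidate's fitness would then be the mean R2 over the k validation folds.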
Upon completion of the GOA, the single best set of hyperparameters is extracted. A final Decision Tree model is trained on the entirety of the training dataset using these optimal settings. This model is then applied once to the separate test set, which was not involved in the tuning process, to generate the final predictions. Its performance is quantified using evaluation metrics including R2 and Root Mean Squared Error (RMSE).
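Both evaluation metrics follow directly from their definitions, R2 = 1 - SS_res/SS_tot and RMSE = sqrt(mean squared residual); the sketch below is a plain-Python illustration, not the code used in the study.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination (R2) and root mean squared error."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)
```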
Finally, all results are presented comprehensively. This includes the values of the three optimal hyperparameters, the performance metrics for both the training and test datasets, and several graphical plots. These visualizations depict the GOA convergence trajectory over successive iterations, a comparison between actual and predicted values, and a bar chart illustrating the relative importance of each of the 27 input features as determined by the Decision Tree model.
2.2. GOA-GBM Hybrid Approach
In this hybrid methodology, the Gazelle Optimization Algorithm is applied to automatically determine the optimal hyperparameters for a Gradient Boosting Machine (GBM) model designed to predict the osmotic coefficient. The entire procedure adheres to a clear and structured workflow.
The process commences with data preparation. The feature data and target labels are loaded and transposed into the appropriate format, with the target variable being formatted as a column vector. This yields a feature matrix and a target vector. Subsequently, the data are divided into training and testing subsets. To ensure reproducibility of results, a fixed random seed is established. The total samples are randomly shuffled, with 70 percent allocated to the training set and the remaining 30 percent reserved as an independent test set. This separation creates distinct data partitions for model development and for the unbiased evaluation of final model performance. The GOA optimization framework is then established. Key algorithmic parameters are defined, including a population of 30 candidate solutions and search boundaries for five GBM hyperparameters. These parameters are the number of trees, the learning rate, the minimum leaf size, the maximum number of splits, and the minimum parent size. The GOA population is initialized by randomly assigning each candidate a value for each hyperparameter within the specified ranges, with integer parameters subsequently rounded. The fitness of each candidate is evaluated through a dedicated objective function, and the best solution within the initial population is identified. The principal GOA loop executes for a predetermined number of iterations. During each iteration, every candidate updates its position according to one of two straightforward strategies selected randomly. The first strategy involves exploitation, wherein the candidate moves toward the current best solution and a randomly selected peer. The second strategy involves exploration, wherein the candidate undertakes a randomized step. A decreasing weight factor ensures that exploration is more pronounced in early iterations and gradually gives way to exploitation in later stages. 
Following each positional update, the new configuration is verified to remain within the established bounds. The quality of the updated position is evaluated using the objective function, which accepts the proposed hyperparameters, performs five-fold cross-validation on the training set, and returns the average R2 value. The use of cross-validation in this context prevents overfitting and provides reliable guidance for the search process. If the new position yields an improved fitness score, it replaces the previous configuration. The overall best solution is updated as necessary, and the best fitness score from each iteration is recorded to monitor progress. Upon completion of the GOA, the optimal hyperparameters are extracted from the best solution identified. These five values represent the optimal number of trees, learning rate, minimum leaf size, maximum number of splits, and minimum parent size. A final GBM model is then trained on the complete training set using these optimal settings, employing the least-squares boosting (LSBoost) method. Importantly, this final training phase does not involve cross-validation but instead utilizes all training data to construct the most robust model possible for subsequent testing. This final model is employed to generate predictions on both the training set and the held-out test set. Performance metrics, including R2 and Root Mean Squared Error (RMSE), are calculated for both datasets. These metrics illuminate the model's proficiency in learning the training data and, more importantly, its accuracy in generalizing to new, unseen test data. The analysis also computes feature importance, revealing which of the 27 input features the GBM model found most valuable for prediction. Additionally, learning curves are generated by tracking the reduction in model error (MSE) on both the training and test sets as the number of trees in the ensemble increases.
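The seeded 70/30 shuffle-and-split step used in both hybrid approaches can be sketched as follows. This is a minimal illustration; the specific seed value (42) and helper name split_indices are assumptions, not taken from the study.

```python
import random

def split_indices(n_samples, train_frac=0.7, seed=42):
    """Shuffle sample indices with a fixed seed, then cut them into
    reproducible train and test partitions."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed -> identical split every run
    cut = int(n_samples * train_frac)
    return idx[:cut], idx[cut:]
```

Because the seed is fixed, every optimization experiment evaluates its candidates against exactly the same partitions, which is what makes the model comparisons fair.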
2.3. Data Collection
In this study, 893 samples were collected to evaluate and predict the osmotic coefficient in the equilibrium systems of inorganic materials. The dataset includes 27 parameters: HCl, LiCl, NaCl, KCl, NH4Cl, CsCl, MgCl2, CaCl2, BaCl2, Li2SO4, Na2SO4, K2SO4, (NH4)2SO4, MgSO4, MnSO4, NiSO4, CuSO4, ZnSO4, NaH2PO4, KH2PO4, (NH4)H2PO4, Na2HPO4, K2HPO4, (NH4)2HPO4, Na3PO4, K3PO4, and (NH4)3PO4. The experimental osmotic coefficient data were obtained from published scientific studies and were carefully processed to ensure reliable and consistent results. This preprocessing included unit normalization to maintain consistent measurement scales across all parameters, as well as outlier removal to eliminate unusual data points that could distort the results. For an accurate evaluation of model performance, the dataset was randomly split into training (70%) and testing (30%) subsets using a fixed random seed, ensuring reproducibility and fair comparison across different optimization experiments. Performance metrics such as R2 were calculated exclusively on the test set.
Table 1 provides detailed information about the collected dataset, including the target minerals. The dataset used for training and evaluating the performance of the algorithms in this study is provided in the Supplementary Materials.
3. Results and Discussion
In Figure 1, the complete correlation matrix is presented to evaluate the effect of dissolved mineral components on the osmotic coefficient in an equilibrium system. The figure shows that most features have low correlations with each other, indicating that each mineral contributes relatively independently to the osmotic behavior of the solution. Therefore, the osmotic coefficient is controlled by the combined influence of multiple ions rather than by a single dominant component. Chloride-based salts show moderate positive correlations among themselves. This suggests that minerals such as NaCl, KCl, and NH4Cl affect the osmotic coefficient in a similar way because they have comparable ion–water interactions. These salts mainly control osmotic pressure through ionic strength and the hydration of monovalent and divalent cations. Sulfate salts form another clearly correlated group. The strong correlation between different sulfate minerals indicates that the sulfate anion plays an important role in determining the osmotic coefficient. Sulfate ions carry a higher charge and exhibit stronger electrostatic interactions, which increase non-ideal behavior in the solution and significantly affect osmotic properties. In contrast, phosphate-based salts show very weak correlations with other minerals. This means their effect on the osmotic coefficient is different and more complex. Phosphate ions have higher valence and stronger ion pairing, which leads to nonlinear and system-specific effects under equilibrium conditions. Based on these results, the osmotic coefficient in a mineral equilibrium system is mainly influenced by ion type, ionic charge, and anion group. Multivalent ions generally have a stronger impact than monovalent ions. The low correlation between most features shows that using all mineral components is important for accurate modeling and prediction of the osmotic coefficient.
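Each entry of such a correlation matrix is the Pearson coefficient between two concentration columns; a minimal plain-Python sketch (a hypothetical helper, shown only to make the computation concrete):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation factors.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Applying this function to every pair of the 27 mineral columns reproduces the structure of the matrix: values near zero for independent features and values near one within a strongly co-varying group.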
In Figure 2, the feature importance for predicting the osmotic coefficient is presented using the Out-of-Bag (OOB) method. This chart shows which chemical features have the strongest influence on the osmotic coefficient. The most important feature is LiCl, with the highest importance score, meaning it has the greatest effect on the osmotic coefficient. It is followed by CaCl2 and MgCl2, which also show strong influence. Features such as NiSO4 and HCl have moderate importance, while chemicals such as NaCl, MnSO4, and ZnSO4 have a lower impact. At the bottom of the ranking, features such as K3PO4, Na3PO4, and (NH4)3PO4 have considerably lower effects on the osmotic coefficient. This means that changes in these low-importance features do not significantly change the predicted osmotic coefficient value.
Figure 2 helps identify which chemical factors should be prioritized when modeling or controlling the osmotic coefficient. As shown in Figure 2, LiCl exhibits the highest feature importance score, followed by NaCl and KCl. This dominance of LiCl can be explained by its unique physicochemical properties: the small ionic radius and high charge density of Li+ lead to a strong hydration shell and pronounced ion–dipole interactions with water molecules, which significantly alter the hydrogen bonding network and thus the osmotic coefficient of the solution. In contrast, larger alkali ions (Na+, K+, Cs+) have lower charge densities and weaker hydration effects, resulting in relatively lower importance scores. Among the divalent cations (Mg2+, Ca2+, Ba2+), despite their stronger electrostatic interactions, the overall influence on the osmotic coefficient in the present dataset is moderate, likely due to different ion pairing behavior and solubility constraints.
In Figure 3, the training and test results for the hybrid GOA-DT approach are shown. This approach was used to predict the osmotic coefficient in an equilibrium system containing inorganic materials. The optimal hyperparameters identified were a minimum leaf size of 2, a maximum of 199 splits, and a minimum parent size of 2. The model demonstrated high performance on both datasets. On the training data, the model achieved an excellent R2 score of 0.9670 and a very low Root Mean Squared Error (RMSE) of 0.0604. When evaluated on the independent test set, the model maintained strong predictive accuracy with an R2 of 0.9260 and an RMSE of 0.0947. These results indicate that the GOA-optimized Decision Tree model is highly effective and shows good generalization to unseen data.
In Figure 4, the training and test results for the hybrid GOA-GBM approach are presented. The Gradient Boosting Machine (GBM) model was optimized using the Gazelle Optimization Algorithm (GOA). The optimal hyperparameters identified were 427 trees, a learning rate of 0.3680, a minimum leaf size of 1, and a maximum of 3 splits per tree. The model demonstrated exceptional performance on both datasets. On the training data, the model achieved a near-perfect R2 score of 0.9974 and a very low Root Mean Squared Error (RMSE) of 0.0171. When evaluated on the independent test set, the model maintained outstanding predictive accuracy with an R2 of 0.9734 and an RMSE of 0.0568. These results indicate that the GOA-optimized GBM model is highly effective and shows excellent generalization capability for predicting the osmotic coefficient in equilibrium systems containing inorganic materials.
The results reveal distinct strengths and weaknesses for each of the two hybrid optimization methods. The GOA-GBM model demonstrated superior predictive accuracy, achieving the highest test R2 (0.9734) and lowest test RMSE (0.0568). Its primary strength lies in its exceptional learning capability and generalization power, making it the most reliable model for this application. However, its weakness is increased model complexity, which can make it computationally more expensive and less interpretable than simpler models.
The GOA-DT model showed very strong performance with a test R2 of 0.9260, offering an excellent balance between accuracy and simplicity. Its strength is providing near-state-of-the-art results while maintaining model transparency and faster execution. The noticeable gap between its training R2 (0.9670) and test R2 suggests a potential weakness: slight overfitting to the training data compared to the more stable GBM approach.
In Table 2, the performance results of the two machine learning algorithms without optimization are presented. The comparison between the baseline models (without optimization) and the hybrid GOA-based models clearly highlights the impact of hyperparameter optimization on predictive performance. Among the baseline models, the GBM algorithm exhibited the best performance, achieving a test R2 of 0.9292 and a test RMSE of 0.0927. This indicates that boosting-based methods inherently possess strong learning and generalization capabilities even without optimization, while the Decision Tree showed the lowest accuracy, confirming its sensitivity to model configuration and tendency toward suboptimal generalization. When compared to the optimized models presented earlier (GOA-DT and GOA-GBM), a substantial improvement in performance is observed. The GOA-GBM model significantly outperformed its baseline counterpart, improving the test R2 from 0.9292 to 0.9734 and reducing the RMSE from 0.0927 to 0.0568. This demonstrates the effectiveness of GOA in fine-tuning critical hyperparameters such as the learning rate and tree structure. Similarly, the GOA-DT model showed a remarkable improvement over the standard Decision Tree, increasing the test R2 from 0.8115 to 0.9260, which indicates that optimization plays a crucial role in enhancing simpler models.