1. Introduction
Vapor–liquid equilibrium (VLE) calculations constitute the foundation of a wide range of reservoir engineering applications [1,2,3]. These include phase behavior modeling, compositional reservoir simulation, material balance models, Enhanced Oil Recovery (EOR) studies, and separation processes [3]. For instance, in reservoir simulations, given the pressure ($P$), temperature ($T$), and mole concentration of each component ($z_i$), phase stability analyses [4] and flash calculations [5] predict, at each discretization block of the grid model (Figure 1) and for each time step, the exact number and type of coexisting phases in equilibrium, as well as the concentration of each component in each equilibrium phase.
The common approach to treating the flash problem, which is depicted in Figure 2, is to seek a solution which satisfies the component mass balance [7,8,9] and ensures that each component shares the same fugacity value in both phases [10]. The principle of mass conservation states that for a fluid with composition $z$ that is split into a vapor phase, $y$, and a liquid phase, $x$, the total number of moles of a specific component, $i$, is equal to the number of moles of that component in the vapor phase plus the number of moles of that component in the liquid phase. The equality of fugacities dictates that at equilibrium, the fugacities of each component in the two phases must be equal, i.e., $f_i^V = f_i^L$, such that the Gibbs free energy of the final two-phase system is minimized [11].
By combining the two principles and considering the constraint that the composition of each equilibrium phase sums up to unity, the well-known Rachford–Rice equation [12] is derived, which is frequently written as a function of the vapor phase molar fraction, $\beta$:

$$\sum_{i=1}^{N_c} \frac{z_i (k_i - 1)}{1 + \beta (k_i - 1)} = 0 \quad (1)$$

where $z_i$ and $k_i$ denote the overall molar fraction and equilibrium ratio (k-value or distribution factor) of component $i$ in the mixture, respectively, and $N_c$ is the number of components. Imposing specific k-values essentially ensures the equality of component fugacities.
The Rachford–Rice equation (Equation (1)) demonstrates the direct effect that equilibrium ratios (k-values) have on flash calculations. By numerically solving this equation for a given feed, $z$, and k-values, $k$, the molar fraction of the vaporized feed, $\beta$, can be calculated. The composition of each equilibrium phase is then given by:

$$x_i = \frac{z_i}{1 + \beta (k_i - 1)} \quad (2)$$

$$y_i = \frac{k_i z_i}{1 + \beta (k_i - 1)} \quad (3)$$
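For concreteness, the following minimal Python sketch solves Equation (1) by bisection (the left-hand side is strictly decreasing in $\beta$ between its asymptotes) and then applies Equations (2) and (3). The three-component feed and k-values are hypothetical, chosen only so that a two-phase solution exists:

```python
import numpy as np

def rachford_rice(z, k, tol=1e-12, max_iter=200):
    """Solve Equation (1) for the vapor molar fraction beta by bisection;
    g(beta) is strictly decreasing, so the bracket shrinks monotonically."""
    g = lambda beta: np.sum(z * (k - 1.0) / (1.0 + beta * (k - 1.0)))
    lo, hi = 0.0, 1.0          # physical two-phase window (see Section 2)
    for _ in range(max_iter):
        beta = 0.5 * (lo + hi)
        if g(beta) > 0.0:      # root lies to the right of beta
            lo = beta
        else:
            hi = beta
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def phase_compositions(z, k, beta):
    """Equations (2) and (3): liquid (x) and vapor (y) compositions."""
    x = z / (1.0 + beta * (k - 1.0))
    return x, k * x

# Hypothetical three-component feed, for illustration only
z = np.array([0.6, 0.3, 0.1])
k = np.array([2.5, 0.8, 0.1])
beta = rachford_rice(z, k)
x, y = phase_compositions(z, k, beta)
```

In production codes, bisection is typically combined with Newton–Raphson steps for speed, as is done later in this work.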
The range of $k_i$, which determines the preference of a constituent to remain in the vapor phase or the liquid phase, can vary widely depending on the pressure and temperature conditions under which the flash calculation is performed [13]. A k-value greater than one indicates that the component is volatile and tends to stay mostly in the vapor phase, while a k-value less than one indicates a higher affinity to remain in the liquid phase [14]. As the flash conditions approach the mixture’s critical ones (known as criticality conditions), the vapor and liquid phases become less distinguishable from each other, resulting in the k-values converging to unity [15].
Figure 3 provides an illustration of this variation by depicting isotherms plotted on a k-value versus pressure diagram for three components of different molecular weights: a light component (C1), an intermediate component (i-C5), and a heavy component (C7+). The k-values clearly approach unity as the pressure increases towards the critical one (approx. 4500 psi).
Once the phase compositions are determined, the properties directly incorporated into the flow simulation, such as saturation, density, and viscosity, are computed as functions of $x$ and $y$.
In the compositional reservoir simulation context, in order to simulate the phase behavior of multicomponent fluids, billions of flash calculations need to be carried out. Indeed, once the stability test indicates an unstable feed, $z$, at each iteration of the non-linear solver, at each grid block, and at each time step, a flash calculation needs to follow to estimate $x$ and $y$. Various computational methods are used for flash calculations, including successive substitution iteration (SSI), the quasi-Newton method, the Newton–Raphson (NR) method, the steepest descent method [16,17,18,19], and their respective variations, as well as hybrid approaches [5,20,21,22]. These computational techniques, though CPU-time intensive, are known for their high accuracy, ensuring precise reservoir simulation predictions across a wide range of pressure–temperature conditions. However, challenges arise when dealing with criticality. This is because properties crucial in flow simulation, such as saturation, density, viscosity, and the effects of gravity, are highly sensitive to k-values under such conditions. Consequently, even minor errors in estimating k-values can result in significant inaccuracies in these important properties.
In addition, several non-iterative methods exist for estimating k-values. Wilson’s correlation [23] stands as the most notable, while other correlations include Standing’s correlation [24], Hoffman et al.’s correlation [25], Whitson and Torp’s correlation [26], and the convergence pressure method [2]. Specific correlations have also been developed for the plus fraction [27,28,29] and non-hydrocarbons [30]. However, for demanding phase behavior calculations, the accuracy of these approximations falls short. Therefore, the development of faster and more accurate k-value estimation methods is highly desirable.
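As an illustration of this class of explicit estimates, the sketch below implements Wilson’s correlation, which expresses $k_i$ through each component’s critical pressure, critical temperature, and acentric factor; the property values shown are standard handbook numbers for a hypothetical methane/n-pentane pair, used here for illustration only:

```python
import numpy as np

def wilson_k(p, T, p_c, T_c, omega):
    """Wilson's correlation: k_i = (p_ci/p) * exp[5.373*(1+w_i)*(1-T_ci/T)].
    Pressures share one unit; temperatures are absolute (R or K)."""
    return (p_c / p) * np.exp(5.373 * (1.0 + omega) * (1.0 - T_c / T))

# Methane and n-pentane at 1000 psia and 560 R (illustrative conditions)
p_c   = np.array([667.8, 488.6])    # critical pressures, psia
T_c   = np.array([343.0, 845.4])    # critical temperatures, R
omega = np.array([0.0115, 0.251])   # acentric factors
k0 = wilson_k(1000.0, 560.0, p_c, T_c, omega)   # k0 ~ [5.5, 0.016]
```

Such estimates are reliable only far from the critical region, which is why they usually serve as initial guesses for the iterative methods above rather than as final answers.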
Over the years, several machine learning (ML) techniques have appeared in the literature, aiming to accelerate the time-consuming process of solving flash calculations. A phase stability-targeted support vector machine (SVM) methodology was first proposed by Gaganis et al. [31], who utilized a uniformly drawn stability test dataset to generate a discriminating function that replicates a mixture’s phase boundary. The classifier was trained using stability labels (stable/unstable) in order to obtain fast predictions. Later, they [3,32] expanded their research and solved both the phase stability and phase-split problems by combining SVMs for classification and ANNs for regression in a single prediction system. To further accelerate the calculations, reduced variables were used to shrink the output, back-transforming the ANN predictions to regular k-values. In 2014, Gaganis et al. developed a technique to rapidly solve the multiphase stability problem using SVMs [33]. After that, Gaganis [34] proposed a more efficient treatment of the stability problem utilizing two custom discriminating functions, each single-sided correct, to denote the stability of a mixture. The functions are built so that the ambiguous space, called “the grey area” (where no discriminating function is positive), is as narrow as possible, reducing the need to run a conventional stability test.
Kashinath et al. [35] also treated the stability problem as a binary classification one using SVMs, this time tailoring it to CO2 flooding simulations. In their work, if the classifier predicts an unstable phase, an ANN model is used to predict the prevailing k-values. Zhang et al. [36] introduced a self-adaptive deep learning model capable of predicting the number of phases and their respective properties. Li et al. [37] presented a deep artificial neural network (ANN) model to tackle the iterative flash problem, which is a prevalent issue in phase equilibrium calculations within the moles, volume, and temperature (NVT) framework. Similarly, Poort et al. [38] employed a combination of classification and regression neural networks to address both phase stability and phase property predictions. In addition, Wang et al. [39] built two ANN models to handle the stability and phase-split problems. Similar processes were developed by various other authors [40,41,42,43]. Schmitz et al. [44] developed a classification method using a feed-forward ANN and a probabilistic ANN to extend the previous approaches and solve the phase stability problem. More recently, Samnioti et al. [45] employed ANNs to accelerate complex gas condensate phase behavior calculations. The ANN was trained using an extensive dataset obtained from the simulation of various gas recycling schemes, covering any possible compositional changes that might occur inside a reservoir to account for the large compositional variability in the gas reinjection process. Later, Anastasiadou et al. [46] progressed similarly by addressing the phase stability problem for an even more complex acid gas reinjection system. The authors proposed three classification approaches, ANNs, decision trees (DTs), and SVMs, to solve the phase stability problem, using a large ensemble of training data.
The main drawback of the ML techniques described above is that the error function utilized in the models’ training accounts equally for all datapoints regardless of their proximity to criticality. As a result, when the prevailing conditions are close to the critical ones, as can be the case in gas condensate reservoirs, these models’ k-value estimates may lead to large errors in the fluid properties of real interest in flow simulations. It should be noted that apart from the critical point itself, criticality also appears along the convergence locus (CL), that is, the locus of pressure–temperature conditions where negative flash solutions vanish [47]. In the case of gas condensates, the CL lies very close to the dew point phase boundary and hence to the interior of the phase envelope where the flash calculations are run.
In this paper, we present a novel methodology aimed at enhancing the training quality of ML models addressing the thermodynamic phase-split problem. Our approach focuses on improving the efficiency of ML models, particularly in the vicinity of criticality, by generating uniformly distributed rather than biased deviations across the various flash conditions. Specifically, we propose an approach that fine-tunes the ML model’s learning capacity without altering its structure or training algorithm, modifying only the training dataset, while taking into consideration the impact of k-values on the fluid properties of interest in the subsequent flow simulation.
This technique is directly applicable to a wide range of computational problems where an ML model utilizes an input to predict an output, although the primary focus lies on the accuracy of some property that depends on that output. Our research aligns with the broader field of optimizing regression machines for specific engineering objectives, and it delves into the realm of computational methods designed to improve the performance of these machines in a targeted manner.
The paper is laid out as follows: Section 2 formally establishes the need for a new physics-oriented approach to train ML models that solve the flash problem. Section 3 describes the proposed methodology, while Section 4 discusses the results obtained. Conclusions are presented in Section 5.
2. Proof of Concept
In this section, the significance of obtaining poor-quality k-values when running flash calculations close to critical conditions is demonstrated, firstly by a theoretical analysis and subsequently by numerical calculations. In a regular two-phase flash calculation, the quantity of a phase, such as the vapor molar fraction, represented by $\beta$, always falls within the physical interval [0, 1]. This implies that the recombination of $\beta$ moles of gas with composition $y$ and $(1-\beta)$ moles of liquid with composition $x$ results in the reconstruction of the original feed composition, $z$. Whitson and Michelsen [47] extended the regular flash calculation to conditions under which the fluid is physically monophasic and demonstrated that the phase-split equations can be satisfied even when the $\beta$ values lie outside the physical domain [0, 1]. In a negative flash with $\beta < 0$ (at pressures exceeding the bubble point), $|\beta|$ represents the vapor phase amount that needs to be removed from $(1-\beta)$ moles of liquid to reconstruct one mole of the original fluid feed composition. Similarly, when $\beta > 1$ (at pressures above the upper or below the lower dew point), $(\beta - 1)$ moles of liquid need to be removed from $\beta$ moles of gas. Clearly, the negative flash results are indicative of hypothetical states and lack direct applicability in fluid flow computations. However, such calculations can significantly enhance the convergence properties of regular flash computations near the phase boundary by allowing phase molar fractions at some iteration to temporarily cross the phase boundary.
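Computationally, the negative flash amounts to widening the search window for $\beta$ from [0, 1] to the interval between the asymptotes of Equation (1), where all terms $1 + \beta(k_i - 1)$ remain positive. A minimal sketch of this reading of the Whitson–Michelsen extension is given below; substituting these bounds for the [0, 1] bracket in the earlier bisection sketch lets $\beta$ cross the phase boundary:

```python
import numpy as np

def negative_flash_window(k):
    """Asymptote-bounded beta interval for the negative flash; the window is
    non-trivial only when max(k) > 1 and min(k) < 1."""
    beta_min = 1.0 / (1.0 - np.max(k))   # negative when max(k) > 1
    beta_max = 1.0 / (1.0 - np.min(k))   # greater than 1 when min(k) < 1
    return beta_min, beta_max
```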
The convergence pressure ($P_{conv}$) [27] of a multicomponent mixture refers to the pressure at which the negative flash k-values approach unity at a fixed temperature. Similarly, for a fixed pressure, the convergence temperature ($T_{conv}$) is defined as the temperature at which the negative flash k-values converge to unity. In the pressure–temperature plane, the CL is the line that connects all of the convergence pressures and temperatures. The regular phase envelope and the CL meet at the mixture’s critical point. The negative flash calculations yield non-trivial results, meaning two distinct solutions for the compositions of the liquid and gas phases, only within the region bounded by the regular phase envelope and the convergence locus (CL). This region is often referred to as the shadow region. The regions discussed are shown in Figure 4 for a gas condensate. Note that the diamond marker represents the fluid’s critical point.
Performing flash calculations in the vicinity of a fluid’s CL within the phase boundary is challenging, as even slightly inaccurate k-value estimates may lead to significant phase compositional errors. This can be expressed mathematically through the limit of $\partial x_i / \partial k_j$ as the k-values approach unity. Note that $\partial x_i / \partial k_j$ is the partial derivative of Equation (2) with respect to $k_j$ and represents the sensitivity of $x_i$ (or $y_i$) to inaccuracies in $k_j$:

$$\frac{\partial x_i}{\partial k_j} = -\frac{z_i \left[ \delta_{ij}\,\beta + (k_i - 1)\,\dfrac{\partial \beta}{\partial k_j} \right]}{\left[ 1 + \beta (k_i - 1) \right]^2} \quad (4)$$

where

$$\frac{\partial \beta}{\partial k_j} = \frac{z_j / \left[ 1 + \beta (k_j - 1) \right]^2}{\sum_{i=1}^{N_c} z_i (k_i - 1)^2 / \left[ 1 + \beta (k_i - 1) \right]^2}$$

and $\delta_{ij}$ is Kronecker’s delta. This expression is derived by differentiating the implicit Rachford–Rice equation (Equation (1)). Specifically, the independence of the k-values from each other allows for the cancellation of the summation in Equation (1). By applying the quotient rule and expressing $\partial \beta / \partial k_j$ in terms of the other variables, the partial derivative $\partial x_i / \partial k_j$ is obtained. When the k-values approach unity, as indicated in Equation (5), the limit of $\partial x_i / \partial k_j$ tends towards infinity:

$$\lim_{k \to 1} \left| \frac{\partial x_i}{\partial k_j} \right| = \infty \quad (5)$$
Thus, when the prevailing conditions are close to criticality (either to the critical point itself or to the CL), the need for precise k-value estimates, $k_i$, is particularly high to ensure that dependent properties of real interest in reservoir simulation, such as the saturation, density, and viscosity of each phase, are computed accurately. Densities are directly related to gravitational effects, whereas their derivatives with respect to pressure are directly related to fluid compressibility, which dominates viscous flow.
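This blow-up is easy to verify numerically. The sketch below evaluates the sensitivity $\partial \beta / \partial k_j$ obtained by implicit differentiation of Equation (1) while shrinking the k-values toward unity; the feed and the shrinkage direction are hypothetical, with the direction chosen so that $\sum_i z_i d_i = 0$, which pins the Rachford–Rice root at $\beta = 0$ for every step and isolates the near-critical effect:

```python
import numpy as np

def dbeta_dk(z, k, beta):
    """Sensitivity of the RR root beta to each k_j, by implicit
    differentiation of Equation (1) (cf. Equations (4) and (5))."""
    t = 1.0 + beta * (k - 1.0)
    return (z / t**2) / np.sum(z * (k - 1.0)**2 / t**2)

z = np.array([0.6, 0.3, 0.1])
d = np.array([1.0, -1.0, -3.0])        # satisfies sum(z*d) = 0
for eps in (0.5, 0.1, 0.01):
    k = 1.0 + eps * d                  # k -> 1 as eps shrinks
    print(eps, dbeta_dk(z, k, beta=0.0).max())   # grows like 1/eps**2
```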
After mathematically demonstrating the extent of the k-values’ limited accuracy problem, numerical calculations were also carried out on two real reservoir fluids, a lean gas condensate and a rich gas condensate, at various sets of pressure–temperature (P–T) conditions. The compositions of the utilized fluids are shown in Table 1.
Figure 5 illustrates the phase envelopes (depicted by the blue curves) of the lean gas condensate and the rich gas condensate, respectively, as obtained using the Peng and Robinson cubic equation of state (1978) [2,14,48,49,50,51,52,53]. The diamond-shaped blue points on the phase envelopes represent the critical points of the mixtures, while the purple dashed lines represent the convergence loci (CLs). Five points were selected along the red isotherm of the reservoir temperature of each fluid, each exhibiting a varying distance to criticality.
The k-value norm, $\|\ln k\|$, accounting for the sum of the squares of the natural logarithms of the experimental k-values (Equation (6)), was introduced to serve as an indicator of a point’s proximity to criticality, with lower values denoting a closer proximity:

$$\|\ln k\|^2 = \sum_{i=1}^{N_c} \left( \ln k_i \right)^2 \quad (6)$$

From Table 2 and Table 3, it follows that among the five points selected for each fluid (Figure 5), the proximity to criticality increases gradually between the first and fifth point.
Regular flash calculations were conducted using conventional iterative algorithms to compute the gas phase ratio, $\beta$, and the equilibrium phases’ properties at each point along the isotherm. Subsequently, random noise of a fixed amplitude was added to the k-values to replicate the error of an ML model trained to predict those outputs. This choice stems from the fact that, in this work, an attempt is made to mimic the way an ML model operates. In this case, the absolute error objective function is minimized by the training process instead of a relative error metric. In fact, noise was added to the property predicted by the ML model, which is the logarithm of the k-value. The Rachford–Rice equation was then rerun using the distorted k-values, and the molar ratio and phase properties were reestimated. Finally, the obtained deviations, defined by their average errors, were determined.
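A minimal sketch of this perturbation experiment is shown below, reusing the rachford_rice and phase_compositions functions from the earlier sketch; the Gaussian noise model and its amplitude on $\ln k$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_flash(z, k_exact, amplitude=0.05):
    """Add fixed-amplitude noise to ln(k) -- the quantity the ML model is
    trained on -- then re-solve Equation (1) and recompute Equations (2)
    and (3) with the distorted k-values."""
    k_noisy = np.exp(np.log(k_exact) + amplitude * rng.standard_normal(k_exact.size))
    beta = rachford_rice(z, k_noisy)
    x, y = phase_compositions(z, k_noisy, beta)
    return beta, x, y
```

Repeating this perturbation at each selected P–T point and averaging the resulting deviations of $\beta$, $x$, $y$, and the phase densities reproduces the kind of quantities reported in Tables 4 and 5.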
Table 4 and Table 5 demonstrate that as the CL is approached, there is a significant increase in the absolute errors of the vapor phase molar fraction, $\beta$, the liquid and vapor phase compositions, $x$ and $y$, and the liquid and gas phase densities, $\rho_L$ and $\rho_V$, even though the same level of noise was added at all the selected points. This underscores that highly reliable k-value estimations are especially crucial in close proximity to the CL due to the increased sensitivity of the derived phase properties to k-value errors in that area. Note that, for this specific application, adding noise on a relative basis would cause an unbalanced effect on the k-values due to their very wide span, which covers several orders of magnitude. This, in turn, would not allow for the analysis of the side effect on the properties of interest for the subsequent flow simulations.
To further demonstrate the problem, the efficacy of a classic supervised ML model trained to accurately reproduce the training data associated with the P–T points near the CL of the rich gas condensate was investigated. Firstly, a total of 100,000 random pressure points were uniformly selected across the range of 1500 psi (i.e., a typical abandonment pressure) to 4054.3 psi (i.e., the gas dew point pressure) at the reservoir temperature, and a dedicated MATLAB code that performs regular flash calculations was run for each of these pressures at the reservoir temperature using the feed fluid composition, $z$, to yield the vapor phase and liquid phase compositions, $y$ and $x$, respectively. Subsequently, the corresponding k-values, $k$, for each pressure point were determined by calculating the ratio of the vapor phase composition of a given component, $i$, to its corresponding liquid phase composition. The data collected form the “source dataset”.
A “base case dataset” was generated by randomly selecting 2000 uniformly drawn pressure points, along with their associated k-values, $k$, from the source dataset in order to ensure statistical significance with respect to the population. The pressure histogram of the base case dataset in Figure 6 exhibits uniformly distributed bars with no discernible peaks or valleys, which confirms its uniformity. On the other hand, the frequency distribution of the corresponding k-value norms is highly non-uniform, as shown in the same figure, as the effect of pressure on the k-values of each component of a mixture at a constant temperature is distinct and depends on the unique properties of each component.
Subsequently, a conventional feedforward artificial neural network (ANN) was trained against the base case training dataset to predict the logarithms of the k-values, given the prevailing pressure. Note that these pressure values reflect the dynamic conditions within the reservoir, and the role of the ANN is to establish a functional relationship between these pressure inputs and the resulting k-values. Training was repeated 100 times to mitigate the inherent stochasticity of ANN training, which stems from random weight initialization and the stochastic nature of the training algorithm. The Rachford–Rice equation (Equation (1)) was solved for $\beta$ using a combined Newton–Raphson/bisection method, given the overall mole fractions, $z_i$, and the predicted k-values. Given $\beta$, the phase compositions $x$ and $y$ were obtained using Equations (2) and (3), respectively, and the molecular weights of the two phases in equilibrium were determined based on their respective compositions. Finally, the Peng–Robinson cubic equation of state was used to compute the molar volumes of each phase, which, when combined with the molecular weights, enable the liquid and gas densities to be determined. To gain a more comprehensive understanding of the effectiveness of the classic supervised ML model training approach, the properties of interest were averaged across the 100 training runs, yielding a single representative value for each property.
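The sketch below reproduces the spirit of this workflow with scikit-learn; the network architecture, the synthetic pressure-to-$\ln k$ relationship, and the reduced number of repeated trainings are stand-ins rather than the settings actually used in this work:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; in this work, (pressure, ln k) pairs come from
# flash calculations on the rich gas condensate between 1500 and 4054.3 psi.
p = rng.uniform(1500.0, 4054.3, size=(2000, 1))
ln_k = np.hstack([np.log(4000.0 / p), np.log(p / 6000.0)])  # crude trend only

def train_kvalue_ann(p, ln_k, seed):
    """Feedforward ANN mapping pressure to ln(k); minimizing squared error
    on ln(k) matches the loss discussed next (Equation (7)). The
    architecture below is assumed, not the paper's."""
    ann = MLPRegressor(hidden_layer_sizes=(20, 20), activation="tanh",
                       solver="lbfgs", max_iter=5000, random_state=seed)
    return make_pipeline(StandardScaler(), ann).fit(p, ln_k)

# Average predictions over repeated trainings to damp initialization noise
# (100 repetitions in this work; 10 here to keep the sketch light).
models = [train_kvalue_ann(p, ln_k, seed) for seed in range(10)]
ln_k_pred = np.mean([m.predict(p) for m in models], axis=0)
```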
Conventional ANN training aims at utilizing its flexibility to vary the model parameters (weights) to optimally reproduce the training outputs, i.e., the k-values, by minimizing the loss function, $E$, described by Equation (7):

$$E = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{N_c} \left( \ln \hat{k}_{ij} - \ln k_{ij} \right)^2 \quad (7)$$

where $\hat{k}_{ij}$ and $k_{ij}$ correspond to the estimated and exact k-value, respectively, of the $i$th component of the $j$th training pair. As a result, training focuses on accurately reproducing k-values rather than the dependent properties of interest in a flow simulation.
To evaluate the accuracy of the predicted dependent properties in conjunction with their proximity to criticality, the 2000-datapoint training space was divided into 10 classes based on the k-value norms, where the first class encompasses the P–T points that lie closest to the CL, and the last class comprises the points that are farthest from it. As can be seen from the k-value norm distribution in Figure 6 and the ANN prediction error statistics in Table 6, class 1 contains 246 points, whereas class 10 has only 169 points. From Table 6, it can further be seen that the errors associated with points in close proximity to criticality are significantly greater than those for points that lie farther away, even though the former dominate the training process due to their abundance in the dataset and hence drive the minimization of the error function during learning. This is visually confirmed by the decaying value of the average absolute error of all properties with increasing class number, i.e., while departing from criticality (Figure 7). These findings confirm the need for a new, focused approach to training an ANN to solve the thermodynamic phase-split (flash) problem. Moreover, these findings can be reproduced using any alternative machine learning model, rather than an ANN, to predict k-values given the flash conditions.
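The per-class analysis above can be reproduced with the short sketch below; binning by equal-width intervals of the k-value norm is an assumption consistent with the unequal class populations reported in Table 6:

```python
import numpy as np

def class_labels(ln_k, n_classes=10):
    """Assign each datapoint to a class by its k-value norm (Equation (6));
    class 1 lies closest to criticality (smallest norm)."""
    norm = np.sum(ln_k ** 2, axis=1)
    edges = np.linspace(norm.min(), norm.max(), n_classes + 1)
    return np.digitize(norm, edges[1:-1]) + 1   # labels 1..n_classes

def per_class_error(abs_err, labels, n_classes=10):
    """Average absolute error of a derived property within each class."""
    return np.array([abs_err[labels == c].mean()
                     for c in range(1, n_classes + 1)])
```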
3. Methodology
The previous section established the necessity of focusing on the physical properties of interest, rather than solely on the model output, while training machine learning models. Therefore, ML model training must be modified to prioritize the primary objective, that is, to accurately predict a dependent property of interest based on a given input, rather than merely the intermediate quantity that constitutes the original ML model output.
To attain a physically sound predictive model, the loss function, $E$, in Equation (7) can be modified to incorporate weighting factors for individual training points or groups of points. These weights assign varying levels of importance to the datapoints based on their significance in predicting the dependent properties of interest. In the context of this research, highly important datapoints are those in the vicinity of the mixture’s convergence locus (CL), where the uncertainty in the indirectly derived properties is maximized. However, implementing this approach requires significant modifications to the training algorithm, such as adjusting the loss function, its gradient, and the Hessian matrix, all of which are indispensable parts of the training algorithm. Instead of introducing individual weights for each datapoint or group of points, which can be complex and significantly alter the native training formulation, this work proposes a resampling technique to enhance the flash ML models’ predictions near criticality while simultaneously reducing the average error and standard deviation for each property ($\beta$, $x$, $y$, $\rho_L$, and $\rho_V$).
The proposed resampling technique is designed to improve the performance of an ML model by balancing the datapoint population within a training dataset derived from a source dataset that encompasses a substantial volume of datapoints. This is accomplished by considering the datapoints’ proximity to criticality ($\|\ln k\|$) and the prediction errors associated with a specific property of interest within each class of the training dataset. The population of datapoints belonging to classes which exhibit poor performance is enhanced by picking more training samples from the source dataset, corresponding to a stronger contribution of such points to the training error function. This way, the training algorithm that minimizes $E$ is forced to pay more attention to those points, thus improving their prediction over the other classes and recovering the required accuracy of the physical properties which follow. A hyperparameter, denoted by $D$, controls the level of adjustment made to the number of datapoints in each class during the resampling process. A high value of $D$ results in more intense adjustments, while a lower value results in more subtle changes. $D$ can be considered analogous to the weight given to each class’s average logarithmic error of the selected dependent property when balancing the training data. Once the resampling step from the source dataset has been completed, a balanced dataset emerges, and the machine training is run regularly.
Figure 8 outlines the resampling algorithm used to improve the ML model performance against some specific dependent property. Firstly, regular training is performed using the initially available training population to obtain the training error, $e_j$, of each datapoint. The algorithm utilizes the average absolute error per class, $\bar{e}_c$ (Equation (8)), for the derived property of interest, obtained by considering the total number of datapoints in each class, $c$, of the training dataset, denoted by $N_c$. Subsequently, the algorithm calculates the log absolute error per class, $E_c$ (Equation (9)), from which the extent, $d_c$, to which these errors deviate from the corresponding error of the best-trained class can be determined, as described in Equation (10). Ideally, all classes would share the same average absolute error, leading to $d_c$ values equal to zero and hence no need to modify the overall balance of the dataset. For non-zero $d_c$ values, the algorithm increases the number of samples in each class, as outlined in Equation (11). Clearly, the bigger the spread between a class error, $E_c$, and that of the optimally learned class ($\min_c E_c$), the bigger the increase in the datapoints of that class, from $N_c$ to $N_c'$. In the final step, the resampling algorithm employs uniform down-sampling to reduce the number of datapoints in the resampled training dataset while maintaining its new class distribution. This process ensures that the resampled training dataset is of equal size to the base training dataset, preventing any bias introduced by the resampling process.
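Since the bodies of Equations (8)–(11) are not reproduced here, the sketch below encodes one plausible reading of the balancing step, with the base-10 logarithm in Equation (9) and the multiplicative growth rule in Equation (11) taken as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_counts(mean_abs_err, counts, D):
    """Grow each class in proportion to how far its log average absolute
    error (Equation (9), assumed log10) exceeds that of the best-trained
    class (Equation (10)), scaled by the hyperparameter D (Equation (11),
    assumed multiplicative form)."""
    spread = np.log10(mean_abs_err) - np.log10(mean_abs_err).min()
    grown = np.round(counts * (1.0 + D * spread))
    # Uniform down-sampling back to the original dataset size, preserving
    # the new class proportions (final step of Figure 8).
    return np.round(grown * counts.sum() / grown.sum()).astype(int)

def draw_balanced_dataset(pools, new_counts):
    """Redraw the training set class by class from the source dataset."""
    return [pool[rng.choice(len(pool), size=n, replace=n > len(pool))]
            for pool, n in zip(pools, new_counts)]
```

Here, pools is a list of per-class index arrays into the source dataset; with a 100,000-point source and 2000-point training sets, sampling without replacement is the normal case.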
4. Results and Discussion
The proposed resampling technique was applied to the base case training dataset of the rich gas condensate presented in Section 2. Firstly, the liquid density error was selected as the balancing metric to resample the datapoint population. This selection was motivated by the significant liquid dropout that occurs when the pressure of the rich gas condensate falls below the dew point, which needs to be modelled accurately due to its high commercial value. In addition, liquid density is included in the expression of isothermal compressibility, as described in Equation (12), which, in turn, governs fluid flow:

$$c = \frac{1}{\rho_L} \left( \frac{\partial \rho_L}{\partial P} \right)_T \quad (12)$$
To determine the optimal value for the hyperparameter $D$, which controls the level of adjustment imposed, a sensitivity analysis was performed. This involved varying the value of $D$ across a wide range and evaluating its impact on predicting the five dependent properties of interest ($\beta$, $x$, $y$, $\rho_L$, and $\rho_V$) using the ML-model-predicted k-values. As shown in Figure 9, higher values of $D$ significantly adjust the datapoint distribution in each class during resampling, focusing more on classes with poorer performance (i.e., class 1), as reflected by the stronger contribution of such points to the training error function. As an example, consider the distribution of the balanced dataset’s population across the 10 classes for $D$ values ranging between 0.8 and 2.4, as obtained by considering the liquid density error (i.e., the error in $\rho_L$) to control the resampling process. Since the points in each class exhibit varying degrees of proximity to criticality, their cardinality is affected differently by the resampling process. Class 1 contains the datapoints which lie closest to criticality conditions; they exhibit the maximum error, and hence their number is the most severely affected by the correction factor $D$. Specifically, this class originally contained 246 points, which increased to 435 and 697 for $D = 0.8$ and $D = 2.4$, respectively, whereas the number of points in the remaining classes decreased accordingly.
Figure 10 depicts the overall improvement (or decline) in the average error and standard deviation for the combined dependent properties of interest. The k-values utilized in this analysis were generated from the ML model trained with the balanced training dataset of the $\rho_L$ error resampling algorithm. Subsequently, the standard deviation and absolute error of each property for various $D$ values were recorded. The sum of the standard deviations and absolute errors of all five dependent properties at each $D$ was then calculated to obtain a measure of the improvement in predictions over all the properties of interest. Comparing those values to the performance of the original, unbiased model led to the datapoints plotted in Figure 10. The analysis covers a range of $D$ values spanning from 0.8 to 2.4, presenting a comparative evaluation against the results obtained from the base ML training. Clearly, the overall improvement (or decline) in the average error and standard deviation is equal to zero when $D = 0$, i.e., when no resampling takes place.
Exaggerated values of $D$ may lead to overfitting in favor of the formerly weak classes, ultimately deteriorating the overall model performance, as reflected by negative values of the mean absolute error improvement. In that regime, the errors for classes near criticality still improve, but the deterioration in the other classes yields a negative overall trade-off. For the rich gas condensate base training dataset considered in this work, the optimal value of $D$ was found to be equal to two. At this value, the average error and standard deviation per property exhibit substantial improvements, reaching a cumulative improvement of 48% and 250%, respectively, compared to the base case training scenario.
Figure 11 provides a visual representation of the frequency distribution of k-value norms within the optimally resampled training dataset ($D = 2$). As expected, the histogram of k-value norms exhibits a right-skewed pattern, confirming that a significant portion of the datapoints in the dataset corresponds to pressures close to the CL of the fluid. Likewise, the histogram of pressures displays a left-skewed pattern.
Table 7 illustrates the absolute average errors within each class obtained from training the ANN using the resampling approach, while Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 provide a comparison of the class errors obtained by conventional and balanced dataset training. Based on these results, it is safe to say that the proposed methodology leverages the learning capacity of the ANN more efficiently, leading to significant improvements in the errors of the underperforming classes while reducing, at the same time, the average error and standard deviation for each property. Eventually, the model prediction error in the dependent properties of interest is much more uniformly distributed, thus ensuring a similar performance of the ML model across all flash conditions, only weakly related to their proximity to criticality. Note that the model trained with the resampled training dataset exhibits slightly worse performance for classes 5–10 compared to the model trained with the base training dataset. However, this discrepancy is a strategic trade-off in the approach. Emphasis is put on optimizing the model’s learning capacity across various flash conditions, particularly in classes closer to criticality, where inaccuracies in predicted k-values have a more pronounced impact on the results, i.e., the dependent properties. In other words, it is preferable to “sacrifice” some of the performance of the well-performing classes while improving that of the classes performing poorly, as is the case with the near-critical points.
By reducing the average error, the ANN improves its ability to make predictions that are, on average, closer to the exact values. Additionally, by reducing the standard deviation, the model ensures that the datapoints are less spread out, resulting in more concentrated and reliable predictions. It is interesting to note that although the resampling process was carried out against the liquid phase density solely, the performance of all the dependent properties was positively affected due to their natural correlation.
A second attempt was carried out to resample the dataset, this time based on the per-class error of the vapor phase molar fraction, $\beta$, which determines the saturation of the coexisting phases in equilibrium. Figure 17 depicts the corresponding improvement or decline in the sum of the absolute average errors and standard deviations per property when varying the hyperparameter $D$ in the range of 0.8 to 2. The maximum improvement in the overall average error and standard deviation combined occurs when $D$ equals 1.2, corresponding to a 19% reduction in the overall average error and a 245% improvement in the standard deviation. It is important to note that when the error in $\rho_L$ was the basis for the resampling algorithm, the overall improvement in the average error and standard deviation was more pronounced. This indicates that for each fluid, there is a specific dependent property whose error should be incorporated into the resampling algorithm to optimize predictions using an ML model. Nevertheless, the whole process can be easily automated in a computer program, thus minimizing the use of human resources and the time required to optimize the ML training step.
was the basis for the resampling algorithm, the overall improvement in the average error and standard deviation was more pronounced. This indicates that for each fluid, there is a specific dependent property whose error should be incorporated into the resampling algorithm to optimize predictions using an ML model. Nevertheless, the whole process can be easily automated in a computer program, thus minimizing the use of human resources and the time required to optimize the ML training step.