Use of Synthetic Data in Maritime Applications for the Problem of Steam Turbine Exergy Analysis

Abstract: Machine learning (ML) applications have demonstrated the potential to generate precise models in a wide variety of fields, including marine applications. Still, the main issue with ML-based methods is the need for large amounts of data, which may be impractical to come by. To assure the quality of the models and their robustness to different inputs, synthetic data may be generated using other ML-based methods, such as a Triplet Encoded Variable Autoencoder (TVAE), copulas, or a Conditional Tabular Generative Adversarial Network (CTGAN). With this approach, models can be trained on the enlarged dataset using ML methods such as the Multilayer Perceptron (MLP) or Extreme Gradient Boosting (XGB) to improve general performance. The methods are applied to a dataset containing mass flow, temperature, and pressure measurements at seven points of a marine steam turbine as inputs, along with the exergy efficiency (η) and exergy destruction (Ex) of the whole turbine (WT), low-pressure cylinder (LPC), and high-pressure cylinder (HPC) as outputs. The achieved results show that models trained on synthetic data achieve slightly worse results than the models trained on original data in previous research, but allow for the use of as little as two-thirds of the dataset to achieve these results. Using R² as the main evaluation metric, the best results achieved are 0.99 for η_WT using 100 data points and MLP, 0.93 for η_LPC using 100 data points and an MLP-based model, 0.91 for η_HPC with the same method, and 0.97 for Ex_WT, 0.96 for Ex_LPC, and 0.98 for Ex_HPC using an XGB-trained model with 100 data points.


Introduction
The number of applications of machine learning (ML) and other artificial intelligence (AI) methods in the maritime domain has been growing in recent years. Many authors, including Aylak (2022) [1], comment on the wide-reaching impacts of such models applied in logistics, but such models are present in maritime engineering as well. Karatug and Arslanoglu (2022) comment on the application of machine learning for condition-based maintenance and fault diagnosis in ship engine systems, demonstrating that high-precision models can be developed for these tasks. Gonca et al. (2022) [2] use artificial neural networks to model the maximum performance characteristics in the case of a seven-process cycle engine. Other applications also include the prediction of fuel efficiency in the case of a cargo vessel, where Fam et al. (2022) [3] achieve high precision using kernel probability density functions and artificial neural networks, ship speed prediction by Bassam et al. (2022) [4] using five different ML-based regression algorithms, and ship fatigue damage prediction by Wang (2022) [5], where the authors utilize artificial neural networks. The existing research shows that authors prefer artificial neural networks as the main regression algorithm of choice. While high-precision models generated with such methods can be extremely beneficial to many operations in maritime environments, the underlying issue of data availability is a common problem for researchers and data engineers who wish to apply these types of methods. ML methods require large amounts of data, with the data points necessary for a quality model numbering at least in the hundreds, if not thousands [6]. Collecting this amount of data can present an issue.
Many ship systems may not be equipped for automated data collection by default [7], requiring either time investment by operators or financial investment by ship owners, neither of which is looked upon kindly without explicit proof of the performance improvements that can be obtained by the developed models. This commonly results in relatively small "proof-of-concept" datasets containing a smaller number of data points than desired by the data engineers. This issue is common within many environments. A common tactic in image-based ML was the use of deterministic or stochastic image augmentation processes to generate artificial data points for training [8]. Still, large amounts of the data collected in maritime and engineering applications do not come in the form of images, but numerical data tables. Recently, synthetic data generation has become a common tactic for artificially increasing datasets. This refers to the generation of synthetic data points which share the statistical parameters of the original data used to generate them, while providing new data points. As most ML-based methods are designed to determine models based on exactly these statistical intricacies of the data, high-performance models have been achieved in multiple applications by training the models on such data. Most of the examples of synthetic data in maritime applications focus on the use of so-called simulation synthetic data, such as Kastner et al. (2022) [9], who use simulations for container flow data generation, mainly targeting maritime container terminals, presenting a conceptual generation model. Bruns et al. (2023) [10] use enhanced CEP rule learning to develop maritime data streams and data generators; their research primarily focuses on ship activity patterns and is validated on real maritime data streams. Synthetic data has also been applied to ship wake detection, for data fusion from multiple sensors, by Higgins et al.
(2022) [11], showing an improvement in results when real data is bolstered by synthetically generated data. While the wider application of synthetic data in maritime applications focuses on images [12,13], there is a clear lack of application of synthetic data generators which create additional data points based on statistical methods. The current state of research into applying ML for exergy/energy analysis shows that the field is active. Taghavifar and Perera (2023) [14] demonstrate the use of supervised ANNs for the problem of data-driven modeling of exergy and energy in marine engines. The authors use fuel type and injection angle to classify different operation modes and achieve a high-quality model, with R² regression scores above 0.95. Strušnik (2022) [15] shows the application of ML for the tuning of a steam turbine condenser vacuum, demonstrating that the process efficiency can increase by over 2% when an ML model is used for system control. Kartal and Ozveren (2022) [16] demonstrate the chemical exergy calculation for torrefied biomass using ML, namely feed-forward ANNs, achieving an R² score of over 0.79 and a MAPE below 4% on the training set. Arslan et al. (2023) [17] demonstrate the application of a tree-based regressor and Pace for obtaining exergy and efficiency models, specifically mathematical equations, with the model errors ranging from below 2% to slightly above 8%. While the existing research shows that the application of ML in exergy analysis is a common research topic, most authors note that the data collection process is complex and that it is hard to obtain the large amount of data necessary for model training, especially in engineering-focused papers. This is a research gap that could be addressed with the use of synthetic data.
If this process is applicable, it could allow the provision of relatively low amounts of data which can be used in part for data generation (which will, in turn, be used for model training) and in part for validation, allowing for a full validation process to be performed on "unseen" data while still having enough data for training [18]. The main goal of the presented research is to test the possibility of applying such data generation methods to a dataset describing the efficiency and exergy of the marine steam turbine described below. The dataset was previously used for ML modeling on the original data, allowing for a direct performance comparison between models trained on synthetic data and on the original data. The authors will test dataset generation with three methods: TVAE, copula, and CTGAN. Then, two regression techniques will be used for the performance testing: MLP and XGB. MLP was selected due to being used in the previous research, allowing for a direct performance comparison. XGB has been used because it has shown significant performance in similar tasks, with the benefit of generating comparatively simple models. The research aims to address the following research questions:
• RQ1: can synthetic data be used for modeling the steam turbine using XGB and MLP?
• RQ2: what is the performance of these models compared to a model trained on purely original, collected data?
• RQ3: how much data needs to be used to generate synthetic data that yields satisfactory results?
• RQ4: what is the performance impact of using less data for synthetic dataset generation?
To address these questions, the authors will provide information on the data used and the methods applied, followed by the presentation of the results. The following sections will describe the general methodology of the research, along with the process of data generation and regression modeling. As mentioned, the main idea of the paper is to research the possibility of using synthetic data to generate larger datasets in marine applications. For this reason, the original collected dataset is used to develop synthetic data, based on different data point amounts. This data is then used in model training based on two methods: XGB and MLP. An overview of the methodological approach used is given in Figure 1.

Dataset
The dataset was measured on a main marine steam turbine. The turbine in question is in operation on an LNG carrier with a gross tonnage of 100,450 tons. The maximum power of the turbine is 29,420 kW [19]. A schematic overview of the operating points used in the measurement is given in Figure 2. As shown, the main marine steam turbine consists of two cylinders: the high-pressure cylinder (HPC) and the low-pressure cylinder (LPC). The HPC consists of one Curtis and seven Rateau stages, while the LPC consists of eight Rateau stages. The propulsion system includes two operating steam generators, which deliver the majority of the cumulatively produced steam mass flow rate to the HPC inlet [20]. The HPC has one steam extraction for steam delivery to auxiliary processes. Both steam generators in this plant use a Heavy Fuel Oil (HFO) Heater and a Boil-Off Gas (BOG) Heater. After the extraction, the remainder of the steam mass flow rate expands through the HPC. Before the LPC, one additional extraction exists for steam delivery to a high-pressure feed water heating system, consisting of one high-pressure feed water heater and a deaerator [21]. In some operating regimes, part of the steam extracted here (operating point 4) is delivered to the air heaters used for heating at the steam generators' entrance. The remaining steam mass flow rate (operating point 5) expands through the LPC, which has one steam extraction delivering steam to a low-pressure condensate heating system, consisting of one low-pressure condensate heater and an evaporator. The remaining steam mass flow rate is delivered to the main marine steam condenser for condensation [22]. The analyzed main turbine is designed without steam reheating, meaning it has neither the additional intermediate-pressure cylinder (IPC) nor the steam reheating common on newer variants of marine propulsion steam turbines [20,23].
The three steam extractions are not necessarily all open during main turbine operation. Regulation valves can open and close these extractions to regulate the extracted steam mass flow rate in each of them, according to the predefined regulation procedure. Both cylinders of the observed turbine are connected to the main marine gearbox, which drives one or two propellers (P1 and P2).
It should be noted that some additional losses which occur in the plant are neglected, either for simplicity or due to the impossibility of measurement/calculation, such as steam mass flow water leakage through the gland seals of each cylinder [24], heat losses in the pipelines and through the cylinder housings, mechanical losses [25], and similar. These losses have a minor impact on the exergy analysis. Figure 3 shows the histograms of the data, which allows us to discuss the sparsity and the distribution of the data. Some of the variables, such as P2, P3, P4, P5, M1, M5, and M7, have a good distribution of data across their entire range, which means that they do not present a concern. For other variables, there are areas of the range that are lacking. Examples are M3 and P7, which have a good amount of data for most of the variable range, with some missing values. Other variables, such as the temperature data, P1, M2, M4, and M6, have histograms showing that there is no data in certain larger parts of their ranges. This is caused by the operating regimes of the turbine, as the measurements were performed on a working turbine. As these gaps cannot be easily filled through additional measurements, it has to be noted that the performance of the models may be affected for the types of data which are shown to be sparsely distributed in the dataset.

Physical Model of the Exergy Destruction and Exergy Efficiency
The dataset itself consists of the values of mass flow rate, temperature, and pressure at seven points, as marked in Figure 2. These values allow the calculation of exergy efficiency and exergy destruction for the LPC, HPC, and whole turbine (WT), according to the overall steady-state exergy balance equation [26]:

Σ Ėx_IN = Σ Ėx_OUT + P + Ėx_DES

In the above, P represents the mechanical power (used/produced) and Ėx_DES represents the exergy loss (destruction).
The values of mass flow rate (m), pressure (p), and temperature (t) allow for the calculation of the specific exergy ex and enthalpy h at each of the points, with the specific exergy obtained from the standard relation ex = (h − h_0) − T_0·(s − s_0), where s is the specific entropy and the subscript 0 denotes the ambient (dead) state. These are intermediate values and are not included in the dataset. The values of the exergy loss and exergy efficiency can then be determined [27]. These equations are given below.

HPC
The developed mechanical power can be defined as:

P_HPC = ṁ1·h1 − ṁ2·h2 − ṁ3·h3

where points 1, 2, and 3 are the HPC inlet, extraction, and outlet, respectively. Exergy destruction can then be defined via:

Ėx_DES,HPC = (ṁ1·ex1 − ṁ2·ex2 − ṁ3·ex3) − P_HPC

with the exergy efficiency being calculated as:

η_HPC = P_HPC / (ṁ1·ex1 − ṁ2·ex2 − ṁ3·ex3)

LPC
The developed mechanical power is expressed as:

P_LPC = ṁ5·h5 − ṁ6·h6 − ṁ7·h7

where points 5, 6, and 7 are the LPC inlet, extraction, and outlet, respectively. Exergy destruction can be calculated with the expression:

Ėx_DES,LPC = (ṁ5·ex5 − ṁ6·ex6 − ṁ7·ex7) − P_LPC

and the efficiency with:

η_LPC = P_LPC / (ṁ5·ex5 − ṁ6·ex6 − ṁ7·ex7)

WT
The whole turbine mechanical power can be calculated as the sum of the LPC and HPC powers:

P_WT = P_HPC + P_LPC

Exergy destruction of the WT is then expressed as:

Ėx_DES,WT = Ėx_DES,HPC + Ėx_DES,LPC

and the exergy efficiency as:

η_WT = P_WT / (ṁ1·ex1 − ṁ2·ex2 − ṁ3·ex3 + ṁ5·ex5 − ṁ6·ex6 − ṁ7·ex7)

The above values (exergy destruction and exergy efficiency) are used as the outputs of the dataset for further regression modeling. As there are six outputs and the methods used can only regress a single value at a time, six separate models will be created and evaluated separately, one for each output.
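As a numerical illustration, the balances above can be sketched in a few lines of Python, assuming points 1-3 correspond to the HPC inlet/extraction/outlet and points 5-7 to the LPC, per Figure 2. All flow rates, enthalpies, and specific exergies below are invented example values, not measurements from the analyzed turbine:

```python
# Hypothetical illustration of the HPC/LPC/WT exergy balances described above.
# Mass flow rates (kg/s), enthalpies (kJ/kg), and specific exergies (kJ/kg)
# are invented example numbers, not values from the paper's dataset.
m = {1: 30.0, 2: 3.0, 3: 27.0, 5: 25.0, 6: 2.5, 7: 22.5}
h = {1: 3400.0, 2: 3000.0, 3: 2900.0, 5: 2900.0, 6: 2600.0, 7: 2300.0}
ex = {1: 1450.0, 2: 1000.0, 3: 900.0, 5: 900.0, 6: 600.0, 7: 250.0}

# Developed mechanical power (kW): inlet flow minus extraction and outlet flows
P_hpc = m[1] * h[1] - m[2] * h[2] - m[3] * h[3]
P_lpc = m[5] * h[5] - m[6] * h[6] - m[7] * h[7]
P_wt = P_hpc + P_lpc

# Net exergy delivered to each cylinder (kW)
ex_in_hpc = m[1] * ex[1] - m[2] * ex[2] - m[3] * ex[3]
ex_in_lpc = m[5] * ex[5] - m[6] * ex[6] - m[7] * ex[7]

# Exergy destruction (kW): delivered exergy minus mechanical power produced
Ex_des_hpc = ex_in_hpc - P_hpc
Ex_des_lpc = ex_in_lpc - P_lpc
Ex_des_wt = Ex_des_hpc + Ex_des_lpc

# Exergy efficiency: fraction of delivered exergy converted to mechanical power
eta_hpc = P_hpc / ex_in_hpc
eta_lpc = P_lpc / ex_in_lpc
eta_wt = P_wt / (ex_in_hpc + ex_in_lpc)
```

Note how the whole-turbine quantities follow from the per-cylinder ones: the powers and destructions add, while the efficiency is re-normalized by the total delivered exergy.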

Synthetic Data Generation
Three different methods are utilized on the described dataset to generate statistical synthetic data. As the total dataset has 150 data points, four data thresholds were used for generator training: 100, 50, 25, and 10 randomly selected data points. Then, the three methods are used to generate a total of 1000 data points each, with each generated dataset being evaluated separately. The best-performing datasets are then selected for modeling.
To leverage machine learning, a generative adversarial network (GAN) is trained to perform the data generation process for each method described below. A GAN consists of two networks: a generator and a discriminator. The discriminator is trained on real data and randomly generated data. The goal of the discriminator is to determine whether data is real or generated [28]. This is performed by using a data point X_i as an input and processing it through a network consisting of multiple convolutional layers. These layers perform convolution between the output of the previous layer and the filter values F, returning a predicted value Ŷ_Xi, which is the predicted class (1 for real data, and 0 for random, "fake" data). The error, called loss, is then calculated as the binary cross-entropy [29]:

L = −[Y·log(Ŷ_Xi) + (1 − Y)·log(1 − Ŷ_Xi)]

Based on this error, the values contained in F are adjusted via the gradient step F ← F − α·(∂L/∂F), where α is the learning rate. This means that a larger error results in a larger adjustment to the parameters. This training is repeated on multiple data points until the error of the discriminator is lowered. When the discriminator is trained to a point where it can differentiate data somewhat well, generator training starts. The generator network is trained to take random noise as an input and provide data in the shape of real data as an output. The output of this process is then fed into the discriminator, which classifies it as "real" or "fake" data. This result is then used to adjust the internal parameters F of the generator until it generates data that the discriminator classifies as real.
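A minimal sketch of the discriminator update loop described above, assuming a binary cross-entropy loss and a one-parameter sigmoid "discriminator" standing in for the multi-layer network (all values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_pred):
    # L = -[Y log(Yhat) + (1 - Y) log(1 - Yhat)]
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

F = 0.1            # discriminator parameter (stand-in for the filter values)
alpha = 0.5        # learning rate
x, y = 2.0, 1.0    # a "real" data point, labelled 1

for _ in range(200):
    y_hat = sigmoid(F * x)
    # d(loss)/dF for the sigmoid + BCE combination simplifies to (y_hat - y) * x
    grad = (y_hat - y) * x
    F -= alpha * grad          # larger error -> larger parameter adjustment

final_loss = float(bce_loss(y, sigmoid(F * x)))
```

Each step moves F in the direction that reduces the loss, with the step size proportional to the current error, which is exactly the behavior the gradient update above describes.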
The type of data input to this network varies, with three different approaches used in this research: copula, TVAE, and CTGAN. The methods at hand were selected as the synthetic dataset generation methods for multiple reasons. First, they are all available as part of the Synthetic Data Vault library [30]. This allowed the authors to use a set of verified methods and avoid the possible errors of implementing the methods manually. The methods were also selected because previous research has shown them to be high-performing on numeric datasets of various types [31][32][33][34]. All of the methods are trained in such a way that they generate data within the ranges of the original dataset and that they generate non-rounded (decimal) values. All models are trained for 500, 1000, and 4000 training iterations [35].

Copula
Copulas are statistical tools that allow for the modeling of the dependence structure between variables. For a data vector X = [X_1, X_2, X_3, ..., X_n], with the appropriate cumulative distribution functions F_i, ∀i ∈ [1, n], a copula C can be defined. This copula is in essence a multivariate distribution function defined on a hypercube H, with the following property [36]: each of the transformed variables is uniform, F_i(X_i) = U_i, with U_i being the corresponding unit uniform random variable. Then, the copula can be defined as the function that yields the joint cumulative distribution function according to [37]:

F(X_1, X_2, ..., X_n) = C(F_1(X_1), F_2(X_2), ..., F_n(X_n))

In other words, a copula defines the transition from a dataset's cumulative distribution functions into a uniform variable distribution defined in n dimensions. This function can then be used to generate data fitting the original cumulative distribution functions, by generating random data that satisfies the condition of having a uniform distribution in the hypercube H space. Various functions are tested as copula candidates and the best-performing one is selected, with the function parameters based on the original data. In the approach used, the copula function is defined with a GAN which performs the transformation process. The benefit of this is that neural networks can easily be inverted, with the output and input replaced, meaning that no complex evaluation of the inverse copula has to be performed.

TVAE
Triplet-based encoding is a process in which the data is modeled in an n-dimensional space, where n is the number of variables in the tested dataset. Each data point is defined with an anchor (a real data point), a similar value called a positive instance, and a dissimilar value called a negative instance. Then, the model is trained with the goal of achieving an encoding that minimizes the distance between anchors and the positive instance points. In TVAE, this model is developed using a GAN, in the same manner as the aforementioned copula. If we denote the anchor data point as A, the positive instance as P, and the negative instance as N, then the loss function of the trained GAN can be defined as [38]:

L = max(d(A, P) − d(A, N) + α, 0)

where α is a margin parameter. The distance calculation function d is the Euclidean distance between two data points. The trained encoder can then also be inverted, as in the case of the copula, and used to generate data by generating points with a small L value and then transforming them back into the real data space [39].
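The triplet loss can be computed directly; the margin value and data points below are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # d(A, P) should be small, d(A, N) large; loss clamps at zero when the
    # negative is already at least `margin` farther away than the positive.
    d_ap = np.linalg.norm(anchor - positive)   # Euclidean distance to positive
    d_an = np.linalg.norm(anchor - negative)   # Euclidean distance to negative
    return max(d_ap - d_an + margin, 0.0)

A = np.array([0.0, 0.0])
P = np.array([0.1, 0.0])     # similar point   -> small d(A, P) = 0.1
N = np.array([3.0, 4.0])     # dissimilar one  -> large d(A, N) = 5.0

good = triplet_loss(A, P, N)   # 0.1 - 5.0 + 1.0 < 0, clamped to 0.0
bad = triplet_loss(A, N, P)    # roles swapped: 5.0 - 0.1 + 1.0 = 5.9
```

A well-trained encoding drives the loss of every triplet toward zero, which is exactly the condition the generation step exploits when it searches for points with a small L value.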

CTGAN
The final method used is CTGAN, which is a GAN whose architecture has been tuned to work on tabular data. In other words, this GAN uses one-dimensional data points as input/output vectors, training on the two-dimensional dataset [40]. Unlike the copula or TVAE, no data transformation is used, and the data from the dataset is fed into the network as-is (assuming the data is given in tabular form, as is the dataset used in this research).

Synthetic Dataset Evaluation
The synthetic data developed needs to have its quality evaluated, which is performed by comparing the correlations between the two datasets [41].
Correlation evaluation is performed using Pearson's coefficient of correlation, defined for data columns X_i and X_j as:

ρ(X_i, X_j) = cov(X_i, X_j) / (σ_Xi·σ_Xj)

Based on that, a score can be calculated between a pair of real data columns (X_i^R, X_j^R) and synthetic data columns (X_i^S, X_j^S) as [42]:

S_ij = 1 − |ρ(X_i^R, X_j^R) − ρ(X_i^S, X_j^S)| / 2

This value is calculated for each of the column pairs in the real and synthetic datasets, with the final output being the average value of this score across all column pairs. The score defined in this way ranges between 0 and 1, with values closer to 1 indicating a higher-quality synthetic dataset.
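A minimal sketch of this scoring procedure (an illustration of the idea, not the exact library implementation; the test data is invented):

```python
import numpy as np

def correlation_score(real, synthetic):
    """Average pairwise-correlation similarity between two datasets."""
    n_cols = real.shape[1]
    corr_r = np.corrcoef(real, rowvar=False)       # Pearson's coefficients
    corr_s = np.corrcoef(synthetic, rowvar=False)
    scores = []
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # 1 - |rho_real - rho_synth| / 2 lies in [0, 1]; 1 means identical
            scores.append(1.0 - abs(corr_r[i, j] - corr_s[i, j]) / 2.0)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 500)
score_same = correlation_score(real, real)    # identical data -> exactly 1.0
noise = rng.standard_normal((500, 2))         # uncorrelated columns
score_noise = correlation_score(real, noise)  # lower: dependence not preserved
```

The division by 2 normalizes the worst case, where one correlation is +1 and the other −1, to a score of 0.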

Regression
Regression is performed based on the input values collected in the dataset to determine the defined output values: the exergy destruction (Ėx_DES) and exergy efficiency (η) for both cylinders and the whole turbine. This is conducted to test whether the synthetically generated data may be used for training similar models in the future. Two methods are used: MLP and XGB. MLP was selected due to its high performance and its previous application in the research performed on the same dataset [27]. XGB was selected due to it being a so-called explainable model: the models generated by it are given in the shape of decision trees and can be further analyzed if necessary, which is not the case for models originating from MLP. It has also shown successful application in previous research in a similar domain [43].
Both of the methods have so-called hyperparameters, which dictate the shape of the models generated and have a high impact on performance. These values can be tuned via a grid search (GS). This method takes an array of different values for a multitude of tuned hyperparameters, performs training for each combination of the hyperparameters, and records the scores achieved. All hyperparameter values used are given in the subsections below for each of the methods. Finally, both methods are trained on the synthetically generated dataset, with 70% of the data used for training and 30% used for testing. This means that 70% of the data is used to adjust the internal parameters of the models, with 30% of the data used to evaluate the performance during training. The models are finally evaluated on real, collected data points.
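The GS plus 70/30 split scheme can be sketched with scikit-learn on a stand-in regression problem. The grid below is a tiny illustrative subset, not the hyperparameter ranges actually used in this research:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor

# Stand-in "synthetic" dataset: 7 inputs, like the 7 turbine measurement points
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 7))
y = X @ rng.uniform(size=7) + 0.01 * rng.standard_normal(300)

# 70% of the synthetic data for training, 30% for testing during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Grid search trains one model per hyperparameter combination (with CV)
grid = {"hidden_layer_sizes": [(50,), (50, 50)], "alpha": [1e-4, 1e-3]}
gs = GridSearchCV(
    MLPRegressor(solver="lbfgs", max_iter=2000, random_state=0), grid, cv=3)
gs.fit(X_train, y_train)

r2_test = gs.score(X_test, y_test)   # R^2 on the 30% held-out split
# A final evaluation would then be run on real, collected measurements.
```

`gs.best_params_` exposes the winning hyperparameter combination, mirroring how Tables 1 and 2 report the searched values and Table 7 the best ones.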

Multilayer Perceptron
The multilayer perceptron (MLP) is a feed-forward artificial neural network. It consists of three main parts: an input layer consisting of n neurons, where n is equal to the number of dataset input variables, some hidden layers, and an output layer consisting of a single neuron. These layers consist of neurons, each of which acts as a summator of weighted inputs [44]. Each neuron is densely connected, which means that its inputs consist of the outputs of each neuron in the previous layer, passed through the activation function to obtain the output o as [45]:

o = F(Σ_k w_k·o_k)

In the above, F is the activation function, w_k is the weight of the connection connecting the previous layer's neuron k with the current neuron, and o_k is the output of neuron k. All neuron outputs are calculated according to this equation, except for the input neurons, for which the output value is the value of the corresponding dataset variable for the data point currently being used. The MLP is trained in the same manner as the previously mentioned GANs, with the weights w being equivalent to the filter values F and adjusted in the training process based on the loss value. The hyperparameters used for training the MLP within GS are given in Table 1. The upper limit of six hidden layers was selected for two reasons. The first is the vanishing gradient problem, in which deep neural networks can lose error information due to the gradient being repeatedly calculated in each of the layers [46]. The second reason is computational complexity, as deeper neural networks require longer training times. The latter reason was also the reasoning for not using more than 500 neurons per layer. The hyperparameters of the MLP were selected based on the previous research the authors have performed on the non-synthetic dataset [27].
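A single densely connected neuron, as defined above, can be sketched directly (the weights and previous-layer outputs are illustrative; ReLU is used as the activation):

```python
import numpy as np

def relu(x):
    # ReLU activation: passes positive values, zeroes out negative ones
    return np.maximum(x, 0.0)

def neuron_output(weights, prev_outputs, activation=relu):
    # o = F(sum_k w_k * o_k): weighted sum of previous-layer outputs,
    # passed through the activation function
    return activation(np.dot(weights, prev_outputs))

prev_layer = np.array([0.5, -1.0, 2.0])   # outputs o_k of the previous layer
weights = np.array([0.2, 0.4, 0.1])       # connection weights w_k

o = neuron_output(weights, prev_layer)    # relu(0.1 - 0.4 + 0.2) = relu(-0.1) -> 0.0
```

A full layer evaluates this for every neuron at once, which is why dense layers reduce to a matrix-vector product followed by an element-wise activation.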

XGBoost
XGB is an ensemble method based on creating decision trees. Decision trees form a structure that consists of nodes, each of which contains a decision choosing the direction (left/right) based on the value of a certain variable. For example, one such node could be "if the temperature of the output stream at point 7 is higher than 60". This decision then leads to other nodes. Such trees are constructed by selecting a variable and its value randomly and then testing the loss function of the decision. The constructed decision which causes the highest decrease in the loss is selected. The process is then repeated until no decisions cause a significant change in the prediction value. The final model consists of multiple decision trees, the outputs of which are averaged to determine the output value. This algorithm is commonly referred to as random forest, as it consists of multiple randomly generated decision trees [47]. XGB does not generate new trees randomly, but instead generates new trees with the particular goal of lowering the current loss. Instead of generating a new tree without any information, XGB will calculate the loss gradients (as was performed in the case of the GAN and MLP) to determine the particular error of the current ensemble model [48]. Then, it will generate trees which lower this particular error. This serves to obtain trees that may have poor performance by themselves but create well-performing models when placed together, with each tree addressing different particularities of the dataset. XGB is trained within the GS scheme using the hyperparameter values presented in Table 2. The values of the hyperparameters in the GS procedure were selected based on previous research, in which similar ranges of hyperparameters provided good results [43].
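The boosting principle, fitting each new tree to the current residuals rather than building trees at random, can be sketched by hand with scikit-learn trees. This illustrates the idea behind XGB, not its exact algorithm, and the sine-shaped dataset is invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(400)

prediction = np.zeros_like(y)
learning_rate = 0.3
errors = []
for _ in range(50):
    # For squared loss, the negative gradient of the current ensemble is
    # simply the residual y - prediction; each weak tree targets it.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))
```

Each shallow tree is a weak model on its own, but because every new tree is aimed at the remaining error of the ensemble, the training loss in `errors` decreases round after round.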

Computational Resources
The models are trained using the Z4 HPC Cluster [49], using CPU nodes for MLP and XGB training and a node with GPUs to generate the synthetic data. Each individual model (a single variation of the synthetic data, or a single target GS for MLP or XGB) is trained on a separate node of the aforementioned type. The real execution times, as recorded by the OpenPBS batching system used on the cluster, are given in Table 3, along with their deviation across the various variations. It can be seen that MLP has a slightly higher average training time compared to XGB. The training times for the synthetic data creation are significantly shorter than the regression model training times.

Regression Evaluation
Six different metrics are used to evaluate the performance of the models: the coefficient of determination (R²), Explained Variance Score (EVS), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Maximum Percentage Error (MPE). If we define the predicted numerical values as ŷ = (ŷ_1, ŷ_2, ..., ŷ_n) and the actual values as y = (y_1, y_2, ..., y_n), we can define these metrics as per below.
R² is a statistical measure representing the proportion of the variance in the dependent variable explained by the independent variables in the regression model, calculated as [50]:

R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where ȳ is the mean of the actual values. EVS defines the proportion of the variance in the set y explained in the predicted values ŷ, and it is calculated as [51]:

EVS = 1 − Var(y − ŷ) / Var(y)

Both of these values are expressed in the range [0, 1], with higher values indicating a better-performing model. The other metrics used are various error metrics. MAE was selected because it is commonly used in prediction models in data science as a loss function, and it has been used in this research as well, allowing for a direct evaluation of performance. It is calculated as [52]:

MAE = (1/n)·Σ_i |y_i − ŷ_i|

RMSE has been selected due to its common use in the field as an evaluation metric [52]:

RMSE = √((1/n)·Σ_i (y_i − ŷ_i)²)

Finally, the Maximum Percentage Error is a common metric that shows the poorest performance of the model. In this research, this value is expressed as a percentage of the total variable range, to allow for easier evaluation of performance [53]:

MPE = max_i |y_i − ŷ_i| / (y_max − y_min) · 100%

The same reasoning applies to MAPE, which is likewise expressed as a percentage of the entire variable range to make the error easier to evaluate [54]:

MAPE = (1/n)·Σ_i |y_i − ŷ_i| / (y_max − y_min) · 100%
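These metrics can be implemented in a few lines, with MPE and MAPE normalized by the target range as described in the text (the example values are illustrative):

```python
import numpy as np

def r2(y, y_hat):
    # Proportion of variance in y explained by the model
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def evs(y, y_hat):
    # Proportion of the variance in y explained in y_hat
    return 1.0 - np.var(y - y_hat) / np.var(y)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mpe(y, y_hat):
    # Worst-case error as a percentage of the target's range
    return np.max(np.abs(y - y_hat)) / (np.max(y) - np.min(y)) * 100.0

def mape_range(y, y_hat):
    # Mean error as a percentage of the target's range, per the text
    return np.mean(np.abs(y - y_hat)) / (np.max(y) - np.min(y)) * 100.0

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
```

Note that normalizing MAPE by the range (rather than by each y_i, as in the textbook definition) keeps the metric well-behaved when target values approach zero.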

Synthetic Data Generator Results
The scores of the different generated datasets, where each dataset corresponds to one of the three generation methods and a given number of original data points used for training, are given in Table 4. The best scores are generally achieved using TVAE, except when 50 original data points are used, in which case the copula method slightly outperforms it. CTGAN shows the poorest performance across all data point amounts, compared to the other methods. In addition, the scores show a relatively large decrease in similarity to the original data when fewer data points are used. Because of the presented results, to speed up and simplify further training, only the TVAE datasets have been used, except for the 50 original data point case, in which the copula-generated dataset has been used.

Regression Results
The scores achieved using the XGB modeling algorithm are presented in Table 5. The scores are given for each of the dataset targets and each amount of original data points used in creating the training dataset. As mentioned, validation is performed on a separate set of 50 data points, on which the presented results are calculated. The results demonstrate that XGB achieves the best results on the dataset created with the largest number of original data points (100) as the input, with the results ranging from an R² of 0.91 in the case of η_HPC to 0.99 for η_WT. The results show a sharp decrease in performance when datasets generated with a lower number of data points are used to create the training dataset, and this holds across all six regression targets.
In the same manner, the results for the MLP algorithm are given in Table 6. The MLP algorithm shows a behavior similar to XGB: a sharp decrease in performance for datasets developed using a lower number of data points. Some of the performance decreases are even larger, such as in the case of η_WT, where the performance decreases from 0.98 to 0.74 when observing R² as the main performance indicator, with the mean absolute percentage error (MAPE) almost doubling in the same case. The results are overall satisfactory when observing the models trained on the 100 original data point dataset, with all models achieving an R² above 0.90. Still, some of these models do show poorer performance compared to the models generated on the same datasets using XGB as the regression algorithm. For easier comparison of individual results, the achieved scores are illustrated in Figures 4-9. Each of the figures shows all of the scores used, with R², EVS, MPE, and MAPE given on the left y-scale and MAE and MSE given on the right scale, due to the difference in magnitudes. It should be noted that MPE and MAPE are expressed in the range [0, 1] instead of as percentages in the range [0, 100]% for easier visibility. Each of the figures contains two subfigures, with the left showing the best scores achieved with the MLP algorithm and the right showing the scores achieved with the XGB algorithm. Figure 4 shows that similarly high results are achieved on the 100 original data point dataset. The result is similar across all of the used metrics for both methods, without any of the metrics demonstrating a significant increase/decrease between the best-performing models of the two methods. Observing the results achieved for η_LPC in Figure 5, it is shown that both models achieved poorer results in comparison to the models trained on the original data. Still, XGB is clearly shown as the algorithm generating a model that outperforms MLP by a significant margin.
This is especially obvious when observing the maximum percentage error, which increases from 7.21% in the case of XGB to over 30% for the MLP model. The same behavior, with XGB outperforming MLP across all of the metrics used for model evaluation, is shown in Figure 6, which presents the performance of the η HPC models. The decrease here is even more significant than previously, with the coefficient of determination decreasing from 0.91 to 0.82; similar performance decreases are observable in the other metrics, with MAPE increasing by almost one percentage point, from 2.48% to 3.46%. The results in Figure 7 show that MLP-based models achieve slightly higher scores than the ones developed using XGB for Ex WT as a target. Still, the performance difference is not as large as it was in the case of η LPC and η HPC. The models regressing Ex LPC (Figure 8) show a similar performance when comparing the two regression algorithms. For example, for the best-performing dataset, the EVS decreases from 0.96 to 0.94, a change confirmed by the other metrics. While this decrease in performance is not as significant as it was in the case of η LPC and η HPC, it is present. Finally, the results for Ex HPC are provided in Figure 9, and they follow the trend set by the previous two exergy targets, with MLP-based models showing slightly improved performance compared to XGB-based models. Still, these models are the closest in performance: the difference in MAE is only 6.15 (81.23 for XGB and 75.08 for MLP-based models). The slightness of this difference is better demonstrated by MAPE, which decreases from 1.26% to 1.17%. The hyperparameters of the best-performing model for each target are given in Table 7. The hyperparameters of the XGB models show that they are similar, as all of them use a slow learning rate and a large number of estimators with a large maximum depth.
MLP models are also similar to each other, as all use the same architecture: six hidden layers with five hundred neurons each. All of the MLP models use the ReLU activation function with the LBFGS solver and a slow learning rate. The large values of these hyperparameters indicate a complex problem that was difficult for the algorithms to model. Comparing these values to the hyperparameters used to obtain high-performing models on real data in the previous research [27], the indication is that regression on synthetic data is harder to perform than regression on the original data.
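The best-performing MLP configuration reported above can be expressed directly in scikit-learn. This is only an illustration of the architecture, not a retraining of the study's models; note that scikit-learn applies its learning-rate setting only with the sgd/adam solvers, so the value shown is purely documentary.

```python
from sklearn.neural_network import MLPRegressor

# Six hidden layers of five hundred neurons each, ReLU activation,
# LBFGS solver, as reported for the best-performing MLP models.
# learning_rate_init is ignored by solver="lbfgs" in scikit-learn;
# the value below is an illustrative guess, not taken from the paper.
mlp = MLPRegressor(
    hidden_layer_sizes=(500,) * 6,
    activation="relu",
    solver="lbfgs",
    learning_rate_init=0.001,
    max_iter=1000,
)
print(mlp.hidden_layer_sizes, mlp.activation, mlp.solver)
```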

Discussion
The results demonstrated in the previous section point to several findings relevant to the research questions posed at the start of this study. The first important thing to note is the comparison with the results achieved on the original dataset in previous research. According to the previously published research, which modeled the targets using the original, non-synthetic data points [27], the models can achieve R 2 scores of 0.99, except for Ex LPC, which achieves at most 0.97. Comparing those results with the results achieved in this research, the models trained on synthetic data show either equal results (η WT using XGB) or a decrease in performance varying from slight (∆R 2 = 0.06 for η LPC using XGB, ∆R 2 = 0.02 for Ex WT using MLP, ∆R 2 = 0.01 for Ex LPC using MLP, and ∆R 2 = 0.01 for Ex HPC using MLP) to significant (∆R 2 = 0.08 for η HPC using XGB). This indicates that synthetic data may be used in similar research targeting marine power plants, but the expected model performance is lower compared to direct training on original data. Still, in cases where a large amount of original data is not available, such an approach can be valid for generating initial research results, especially in the proof-of-concept stage. Comparing the performance of the two regression algorithms, there is no clear winner across all targets: exergy efficiency regression performs better with the XGB algorithm, while exergy destruction regression performs better with MLP-based models. For this reason, multiple algorithms should be applied when such data modeling is attempted. On the topic of synthetic data generation, two things can be noted.
First, it is shown that the initial correlation evaluation of the synthetic datasets against the original datasets is a good indicator of the performance of future trained models, with the lower correlation scores of datasets created from fewer original data points corresponding to poorer results. This indicates that in similar future research, time can be saved by using such a metric for direct evaluation of the created datasets before training. This is important because ML training is highly computationally demanding, both resource- and time-wise, while data generation techniques require fewer resources. The difference in performance between the synthetic data generation techniques is also visible, with TVAE providing the best overall results compared to the other two techniques used, indicating that future research in this domain should focus on it as a data generation technique, at least initially. Finally, the amount of original data points used has a large impact on the quality of the synthetic datasets and the models trained on them. Any amount of data lower than 100 points has been shown to yield very poor results. This indicates that the amount of data needed for the models cannot be lowered significantly, but it must be noted that 100 points is still only two-thirds of the original dataset. Depending on the application and the specific case in which the data is being collected, lowering the necessary amount of data by one-third can simplify and speed up the process, indicating that this approach should be considered in cases where data collection is impractical.
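The correlation-based pre-screening described above can be sketched as a simple similarity score between the correlation matrices of the real and synthetic data. The helper `correlation_similarity` below is a hypothetical illustration, not the metric used in the study, and the data is randomly generated.

```python
import numpy as np

def correlation_similarity(real: np.ndarray, synth: np.ndarray) -> float:
    """Cheap pre-training check: how closely the synthetic data's
    feature-correlation matrix matches the real data's. Returns 1.0
    for identical correlation structure, lower values otherwise."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    # Mean absolute difference over off-diagonal entries; correlations
    # differ by at most 2, so dividing by 2 maps the score into [0, 1].
    mask = ~np.eye(c_real.shape[0], dtype=bool)
    return 1.0 - float(np.abs(c_real[mask] - c_synth[mask]).mean()) / 2.0

rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
good = rng.multivariate_normal([0, 0], [[1.0, 0.75], [0.75, 1.0]], size=200)
bad = rng.normal(size=(200, 2))   # independent features: wrong structure

sim_good = correlation_similarity(real, good)
sim_bad = correlation_similarity(real, bad)
print(sim_good, sim_bad)
```

A screening rule of this kind lets poorly structured synthetic datasets be discarded before any expensive model training is attempted.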

Conclusions
In this research, the authors have used a previously collected dataset containing information on the operation of a marine steam turbine, namely the measured temperature, mass flow rate, and pressure at seven measurement points. The calculation of exergy destruction and efficiency was then performed for three elements: the LPC, the HPC, and the turbine as a whole. This dataset was split into multiple smaller parts: 50 data points were reserved for validation, and the rest were used to construct training subsets consisting of 100, 50, 25, and 10 data points. Using these subsets, three methods were used to generate statistical synthetic data points: TVAE, CTGAN, and copula. The best-performing synthetic datasets were then used to generate ML-based models using two methods, MLP and XGB. After evaluation on the reserved 50 original data points, the performance of the models was concluded to be satisfactory, with models achieving an R 2 higher than 0.9. From the obtained results, the originally posed research questions can be addressed:
• RQ1-synthetic data can be used to generate data for steam turbine modeling.
• RQ2-in comparison to models trained on original data, the performance shows a slight decrease, depending on the method used.
• RQ3-synthetic datasets based on fewer than two-thirds of the original data points show poor performance, both when evaluated against the original data and when used for modeling.
• RQ4-the performance impact of using less original data for data generation is significant, with large decreases in performance visible in the obtained results.
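The data-generation step summarized above can be illustrated with the copula technique, one of the three methods used. The function below is a minimal Gaussian-copula synthesizer written from scratch as a simplified sketch; the study itself relied on library implementations of TVAE, CTGAN, and copula, not on this code.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(data: np.ndarray, n_samples: int, seed: int = 0):
    """Minimal Gaussian-copula synthesizer (illustrative sketch).
    1. Map each column to standard-normal scores via empirical ranks.
    2. Estimate the correlation of those scores.
    3. Sample correlated normals and map back through empirical quantiles."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Step 1: empirical ranks -> standard-normal scores.
    ranks = stats.rankdata(data, axis=0) / (n + 1)
    scores = stats.norm.ppf(ranks)
    # Step 2: dependence structure in the Gaussian space.
    corr = np.corrcoef(scores, rowvar=False)
    # Step 3: draw correlated normals, map back via empirical quantiles.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)
    synth = np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    )
    return synth

# Toy two-column "original" dataset (random, not turbine measurements).
rng = np.random.default_rng(1)
original = rng.multivariate_normal([10, 300], [[1.0, 0.9], [0.9, 4.0]], size=100)
synthetic = gaussian_copula_sample(original, n_samples=500)
print(synthetic.shape)
```

Because step 3 samples through empirical quantiles, the synthetic points stay within the range of the original data while approximately preserving its correlation structure, which is the property the correlation-based screening in the Discussion checks for.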
The main conclusion of the article can be summarized as follows: it should be possible to use synthetically generated data for modeling in maritime applications, especially in cases where extreme precision is not necessary and where the collection of larger datasets is expensive or highly impractical. The main limitation of this research lies in the use of a single dataset for analysis; in particular, external validation on a dataset collected from a different steam turbine is lacking. In the future, further model validation should be performed using a separate validation dataset. Another limitation is that no hyperparameter tuning based on sensitivity analysis was performed. While the achieved results were satisfactory within the context of the study (testing the performance of models on synthetic data), such analysis could further improve the models and should be performed, especially in the case of real-world applications. Additionally, the distribution of the collected data, which comes from measurements on a working turbine, is not uniform. This may cause poorer model performance in cases where the data given for prediction comes from a region that is sparsely represented in the dataset. Wider-reaching conclusions should be drawn from multiple datasets covering similar maritime applications to observe the data more clearly. This is planned to be addressed in future research.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because approval for data sharing must be obtained from the data provider.