Improvement of Machine Learning-Based Modelling of Container Ship’s Main Particulars with Synthetic Data

Majnarić, Darin; Baressi Šegota, Sandi; Anđelić, Nikola; Andrić, Jerolim

doi:10.3390/jmse12020273

¹ Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, Ul. Ivana Lučića 5, 10000 Zagreb, Croatia

² Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

J. Mar. Sci. Eng. 2024, 12(2), 273; https://doi.org/10.3390/jmse12020273

Submission received: 21 December 2023 / Revised: 18 January 2024 / Accepted: 31 January 2024 / Published: 2 February 2024

(This article belongs to the Special Issue Machine Learning and Modeling for Ship Design)

Abstract

One of the main problems in the application of machine learning techniques is the need for large amounts of data necessary to obtain a well-generalizing model. This is exacerbated for studies in which it is not possible to access large amounts of data, for example in ship main data modelling, where only a limited amount of real-world data (ship main data) is available for dataset creation. In this paper, a synthetic data generation technique has been applied to generate a large amount of synthetic data points regarding container ships’ main particulars. Models are trained using a multilayer perceptron (MLP) regressor on both original data and synthetic data mixed with original data points. Then, the authors validate the performance of the obtained models on the original data and conclude whether a synthetic-data-based approach can be used to develop models in instances where the amount of data on ship main particulars may be limited. The results demonstrate an improvement across almost all outputs, ranging between 0.01 and 0.21 when evaluated using the coefficient of determination (R^{2}) and between 0.27% and 3.43% when models are evaluated with mean absolute percentage error (MAPE). This indicates that synthetic data can indeed be used to improve ML-based model performance. The presented study demonstrates that the application of ML-based data synthesis techniques can provide significant improvements to the process of ML-based determination of a ship’s main particulars at the early design stage. This paper suggests that, in cases where only a small dataset is available, artificial neural networks (ANN) can still be effectively employed to derive early-stage design values for the main particulars through the use of synthetic data.

Keywords:

container ships; copulas; main particulars; concept ship design; synthetic data

1. Introduction

The accurate determination of a vessel’s main particulars holds paramount significance in the initial design phase of naval architecture, as these values serve as foundational parameters shaping the vessel’s overall characteristics. However, the complexity associated with deriving these particulars arises from a myriad of interrelated factors, rendering the task intricate and challenging.

Traditionally, the vessel’s main particulars were determined in the first iteration of the design spiral developed by J.H. Evans [1], also called the concept phase. As the design continued to develop through the design spiral, the main particulars and the rest of the design would go through the next phase, called preliminary design, as stated by Papanikolaou [2]. When starting the concept design phase, engineers usually begin with a comparison with similar already-existing vessels. For that purpose, statistical, rational, and empirical methods based on comparative data from similar ships were developed. Watson has developed design formulas based on which the main ship dimensions can be estimated [3]. While investigating ship hull performance and the effect of main ship dimensions on weight, Strohbusch [4] has explored and developed hull form coefficients and ratios of main dimensions for merchant ships. Watson later updated his formulas [5], while Strohbusch’s approach was further developed by Papanikolaou [2]. Specifically, the main dimensions of container ships, especially their length, are estimated based on cargo capacity parameters such as required deadweight, hold, or TEU capacity [2,6]. Similar to other merchant ships, in the phase of container ship concept design, the estimation of main dimensions is based on linear or nonlinear equations. These equations are derived from methodologies that utilize a database of previously constructed ships. Piko [7] used a database of container ships built before 1980 to develop equations through a nonlinear regression methodology, employing deadweight capacity as an input. Papanikolaou [2] also utilized nonlinear regression methods, using data from container ships built before 2005, with deadweight capacity as the input in his research. Kristensen [8] employed a second-degree polynomial and linear power regression method with a more recent database of container ships built before 2013. In his case, the input criterion was the TEU number.

With the advancement of artificial neural networks (ANN), scientists have increasingly utilized them to estimate the values that are crucial in ship design. Artificial neural networks derive their models from datasets; however, a limitation in the marine industry is the scarcity of available data. This limitation narrows the scope of areas that can be effectively analyzed using ANN, particularly when high-quality results are required, as identified in [9]; therefore, some of the primary research areas include main engine power, fuel consumption, resistance, and main dimensions, as explored in various specific research studies.

For instance, in [10], scientists introduced a model that utilizes a combination of artificial neural networks (ANN) and multi-regression (MR) techniques to estimate a ship’s power and fuel consumption. This model was employed to predict potential fuel savings, and the results indicate that such a model can also play a crucial role in developing decision support systems (DSS) to maximize a ship’s energy efficiency. In [11], a regression model utilizing an artificial neural network (ANN) approach was proposed to predict ship performance, specifically focusing on the fuel consumption of the vessel’s main engine. The authors drew several conclusions from their research. They found that sigmoid and tangent sigmoid functions exhibit high and stable regression values, even with a small number of hidden layers and neurons. In contrast, the ReLU function requires an adequate number of hidden layers and neurons to achieve high and stable regression values. Regression analysis using ANN becomes essential for predicting ship fuel consumption when dealing with nonlinear relationships between input and output variables. The conclusion is that regression analysis using ANN can effectively and accurately predict ship performance, serving as a complex and real-time model for the future of the shipping and marine industry.

A specific study conducted by Alkan et al. [12] involved the analysis of initial stability parameters in fishing vessels using neural networks. Parameters such as the vertical center of gravity (KG), height of the transverse metacenter above the keel (KM), and vertical center of buoyancy (KB) were calculated with high accuracy levels when compared to the actual ship data. The added resistance coefficient was also examined through the application of ANN [13]. A formula was developed and presented using ANN, showcasing practical applicability at the preliminary design stage. Similar to other ANN analyses, the limiting factor in this study was the dataset.

Analysis of the main dimensions has been conducted in multiple studies [14,15,16]. In [14], a method was proposed to determine the initial main particulars of chemical tankers. The study demonstrated that the LM algorithm achieved the best results, employing 13 hidden neurons. In [15], researchers analyzed the main particulars of 250 container ships using MLP and GBT ML algorithms. Machine-learning-based models, such as those developed in this research, could be utilized by engineers in the preliminary design stages. Artificial neural networks were used in [16] to analyze container ship length. Equations were developed to estimate the length between perpendiculars (Lpp) based on container number and ship velocity (v).

All the methods used so far rely on a dataset, and consequently, they are heavily limited by the size of that dataset. A large amount of data is available for typical types of ships, usually limited to specific values like main particulars. Other values, such as ship coefficients, resistance, fuel capacity, or information about crucial systems, are either unavailable or hard to obtain. Another influencing factor is the quality of the gathered data for the dataset, which significantly impacts the results of the methodology used. Therefore, the available data and their quality can be major constraints on the values that can be analyzed and used in ship design. This is especially problematic for specialized types of ships, such as those used for special operations, research vessels, submarines, and unmanned underwater vehicles (more specifically, the subcategory of underwater autonomous vehicles). Hence, this research aimed to establish an ML methodology that will overcome the lack of real-world data by using synthetic data to improve the quality of the ML-based model’s performance.

The main point of novelty of this research lies in answering the following research questions:

RQ1—Can quality synthetic data be generated from a relatively small dataset regarding container ships, using the limited probability density functions available to such modelling?
RQ2—Can such synthetic data be used to improve the performance of the model when regressed using the multilayer perceptron (MLP) method, in comparison to a model trained on just the original data?
RQ3—Is the improvement, observed across multiple metrics, equal across all targets in the dataset, or does it vary per target?

This study will first present the methodology used, focusing on the description of the dataset, followed by the techniques of data analysis and synthesis of that data. Finally, the regression methodology is briefly described. The results obtained with the described methodology are then presented and discussed, and the conclusions are drawn in the final section of the paper.

2. Methodology

The approach used in this research is to test whether synthetically generated data points can be used to improve the performance of models trained on limited datasets. This study serves as a continuation of and complement to the earlier research [15], whose database is used as the foundation for this article. The original dataset, consisting of 252 data points, is separated into two parts: the training (source) set of 152 points and the validation set of 100 points. The training set is used for two purposes: the first is training the models based on the original data, while the second is the generation of the synthetic data. The validation data are set aside, while the source data are used to fit the synthetic data model, which generates a total of 1000 additional data points. These synthetic data points are mixed with the original 152 source data points, and one group of models is trained on this mixture, while a second group is trained on just the source data without added synthetic data, for performance comparison. Both groups of models are finally evaluated on the original validation data. This process is presented in Figure 1.
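The data flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the arrays hold random placeholder values rather than real ship data, the split is a simple random permutation (the source does not state how the 100 validation points were chosen), and the 1000 synthetic rows are stand-ins for the output of the synthesizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the 252-ship dataset with 11 variables (random values, NOT real data).
original = rng.random((252, 11))

# Shuffle row indices and hold out 100 rows for validation (assumed random split).
indices = rng.permutation(len(original))
validation = original[indices[:100]]
source = original[indices[100:]]  # 152 rows: model training and synthesizer input

# Placeholder for the 1000 rows a synthesizer fitted on `source` would generate.
synthetic = rng.random((1000, 11))

# Training set for the synthetic-data-based models: source rows plus synthetic rows.
augmented = np.vstack([source, synthetic])

print(source.shape, validation.shape, augmented.shape)
```

Both model groups would then be scored on `validation` only, which neither the synthesizer nor the regressors have seen.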

2.1. Dataset Description

The dataset is split into 152 data points for training and 100 points for validation. The 152 points are the basis for generating the synthetic data and are also used to train the original-data models for comparison. The dataset consists of 11 variables, 2 of which are used as inputs and 9 of which are model outputs. The regression technique described in the following sections predicts only a single value as its output. Because of that, a separate model must be developed for each of the outputs. The nine elements of the dataset used as outputs are:

length overall (LOA),
length between perpendiculars (LPP),
molded breadth (B),
depth (D),
draught,
gross tonnage (GT),
net tonnage (NT),
deadweight (DWT), and
engine power (KW).

The two inputs used are number of TEU (TEU) and speed (V). The class databases used in the creation of the dataset are DNV, Lloyd’s, and Bureau Veritas [15].

2.2. Generation of Synthetic Data Points

The method used in this research for obtaining synthetic data points is the copula. The implementation of the copula method this research uses is GaussianCopula from the Synthetic Data Vault (SDV) [17]. The method works by generating a hypercube with the dimensions of (0, 1)^{d}, where d is equal to the number of variables present in the dataset. For the previously described dataset used in this research, the copula method will create a hypercube of dimensions (0, 1)^{11}. Then, the method determines copulas. These copulas are equations that map each of the variable vectors in the original dataset to the hypercube. This mapping is done in such a way that the statistical distribution of the original variable is transformed into a uniform distribution, in which each value has the same probability of being randomly selected. This equation is created using a Taylor series. Once these equations are determined, they can be inverted. This determines the main application of the copula method. Once the hypercube and the equations are created, random values can be generated uniformly within the hypercube space. When this data vector from the hypercube is transformed back to the original data space, due to the nature of the inverted copula equation, the transformed values should retain the probability density function of the original data [18].
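The map-to-hypercube, sample, and map-back idea can be illustrated with a hand-written Gaussian copula. This is a minimal sketch and not the SDV implementation: the function name `gaussian_copula_sample` is hypothetical, and it assumes empirical (rank-based) marginals instead of the fitted parametric distributions used in the paper.

```python
import numpy as np
from statistics import NormalDist


def gaussian_copula_sample(data, n_samples, seed=0):
    """Sample synthetic rows that mimic the marginals and correlations of `data`."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    std_normal = NormalDist()

    # 1) Map every column into the open unit interval using ranks / (n + 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniform = ranks / (n_rows + 1)

    # 2) Convert to standard-normal scores and estimate their correlation matrix.
    normal_scores = np.vectorize(std_normal.inv_cdf)(uniform)
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3) Draw correlated normal vectors and push them into the unit hypercube.
    draws = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = np.vectorize(std_normal.cdf)(draws)

    # 4) Invert each marginal with the empirical quantile function of that column.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(n_cols)]
    )


# Illustrative two-column dataset (random values, not ship data).
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=152)
real = np.column_stack([x, 3.0 * x + rng.normal(0.0, 0.5, size=152)])
fake = gaussian_copula_sample(real, n_samples=1000)
print(fake.shape)
```

Because step 4 interpolates between observed values, every synthetic value stays inside the range of the corresponding original column.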

As can be seen from the name of the applied method, GaussianCopula (GC), the method assumes that the original data are normally distributed. Not all data follow a normal distribution, which is why GC can also be used with the assumption that the data follow some other common distribution. These distributions are the beta, uniform, gamma, and truncated normal distributions [19]. Because of this, the first step in creating the synthetic data is to determine which of the possible distributions is the best fit to each of the original data variables. This is achieved using the Kolmogorov–Smirnov (KS) test. If one of the 11 observed variables is represented by a data vector of n elements X_{1}, …, X_{n}, then its empirical distribution function F_{n}(x) can be defined as the ratio of the number of elements not greater than x to the total number of elements in the data vector, or [20]:

F_{n}(x) = \frac{1}{n} \sum_{i = 1}^{n} \mathbf{1}_{(-\infty, x]}(X_{i}).

(1)

If the cumulative distribution function of the candidate distribution that is being tested (for example, the normal distribution) is defined as G(x), then the KS statistic can be defined as the supremum of the distances between the two functions [21]:

D_{n} = \sup_{x} | F_{n}(x) - G(x) |.

(2)

By testing against all candidate distributions, we can determine the one that has the lowest difference relative to the real data distribution.
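The selection step can be sketched by computing the KS statistic directly from its definition. This is an illustration only: it assumes just two candidate distributions (normal and uniform, with parameters estimated from the sample) because these need nothing beyond the standard library, whereas the study also tested beta, gamma, and truncated normal. The helper name `ks_statistic` and the sample values are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate `cdf`."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    g = np.array([cdf(v) for v in x])
    # The empirical CDF jumps at each sorted point: (i - 1)/n just before, i/n just after.
    d_plus = np.max(np.arange(1, n + 1) / n - g)
    d_minus = np.max(g - np.arange(0, n) / n)
    return max(d_plus, d_minus)


# Illustrative sample (random values, not ship data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=152)

lo, hi = sample.min(), sample.max()
candidates = {
    "normal": NormalDist(sample.mean(), sample.std()).cdf,
    "uniform": lambda v: min(max((v - lo) / (hi - lo), 0.0), 1.0),
}

scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the smallest KS statistic
print(scores, best)
```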

One of the key concerns that needs to be addressed when synthetic data are used is their learning limitations. The synthetic data are not completely new data introduced to the dataset but are instead derived from the descriptive statistics of the original data. Because of this, there are limits to the amount of data that can be generated before experiencing issues such as re-learning the same data points and mode collapse. The main issue is the bias towards the original dataset, which is used as the basis of the synthetic data. The models trained with synthetic data derived from the original data may have a poorer generalization performance on new data, compared to actually collecting and training the models with new data.

Another concern with synthetic data is the generation of infeasible designs. As main particulars are calculated using random selection, not all sets of values may give realistic main particulars—for example, a very large LPP combined with a shallow draught and a small B. To avoid extremely large discrepancies and impossible values, the synthetic data are limited to the range of values equal to the range of values contained in the original dataset. As that may still generate unrealistic main particulars, it is important to note that synthetic data in the given use-case is not meant to create realistic values. It is simply used to fill out the distributions of the original data and address possible gaps (visible in Figure 2). This process should allow for the creation of more robust and precise models for the points which were not necessarily contained in the dataset, as well as smooth out the probability density functions, which are utilized as a key factor in modelling using statistical ML-based methods such as the one used in this research. The result of this “filling out” procedure will be shown in the figures showing data pairs and the comparison of probability density functions for real and synthetic data.

To assist in the visualization of the created data points, the authors have randomly selected one of the synthetic data points. Using the synthetic values as the main particulars of a vessel design, the ship form given in Figure 3 was created, using pre-existing ship form lines adjusted to the main particulars obtained with the synthetic data method. Observing this form, it can be seen that the synthetic data values can be used for the creation of realistic vessels.

2.3. Regression Methodology

The regression is performed using a multilayer perceptron (MLP) artificial neural network (ANN). This ANN is constructed from neurons arranged in three types of layers: the input layer, the output layer, and at least one “hidden” layer in between them. Each neuron is connected to all the neurons of the following layer. The input layer consists of neurons whose number is equal to the number of input variables in the dataset, and the output layer consists of a single neuron. The number of neurons in the hidden layers is arbitrary. The network works by taking each row in the dataset, defined as X_{m}, and using it as the input. These values are then propagated through the network by calculating the values of neurons. The value x_{k}^{j} of neuron k in layer j is calculated by taking the value x_{i}^{j - 1} of each neuron i in the previous layer and multiplying it by the weight value w_{i} of the connection between the two neurons [22]:

x_{k}^{j} = \sum_{i = 0}^{N} x_{i}^{j - 1} \cdot w_{i},

(3)

with N being the number of neurons in the preceding layer. This process is repeated until the output neuron is reached. This output value can be defined as the predicted value \hat{y}_{m}. Comparing that to the real value of the output corresponding to the data point X_{m}, defined as y_{m}, we can obtain the current error of the neural network for that data point. If we define M as the number of data points in the training set, then the error of the network in the current iteration of training can be defined as the mean of these errors [23]:

\epsilon = \frac{1}{M} \sum_{m = 1}^{M} | \hat{y}_{m} - y_{m} |.

(4)

The model is developed by adjusting connection weights proportionally to the error gradient, starting from the initial randomly set values. By repeating this process multiple times, \epsilon will be minimized, theoretically obtaining a well-performing model [24].
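The forward propagation and the error averaged over the M training points can be sketched as follows. This is a minimal illustration, not the networks trained in the study: bias terms are omitted, the tanh activation and the layer sizes are assumptions, and the weights and data are random placeholders.

```python
import numpy as np


def forward(x, weights, activation=np.tanh):
    """Propagate input rows through fully connected layers (no bias, for brevity)."""
    values = x
    for w in weights[:-1]:
        # Weighted sum over the previous layer, then the activation function.
        values = activation(values @ w)
    # Single output neuron, kept linear because the task is regression.
    return (values @ weights[-1]).ravel()


def mean_absolute_error(y_pred, y_true):
    """Absolute prediction error averaged over all data points."""
    return np.mean(np.abs(y_pred - y_true))


rng = np.random.default_rng(0)
x = rng.random((5, 2))                    # 5 rows, 2 inputs (e.g. TEU and V, scaled)
y = rng.random(5)                         # placeholder targets
weights = [rng.normal(size=(2, 4)),       # input layer -> hidden layer 1
           rng.normal(size=(4, 4)),       # hidden layer 1 -> hidden layer 2
           rng.normal(size=(4, 1))]       # hidden layer 2 -> output neuron

predictions = forward(x, weights)
error = mean_absolute_error(predictions, y)
print(predictions.shape, error)
```

Training would repeatedly adjust `weights` in the direction that lowers `error`; that step is left out here.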

The values of the weights are the parameters of the network. In addition to those parameters, values exist within the network that define the architecture of the network, which are referred to as the hyperparameters of the network. These are the number of layers and the number of neurons in each layer, the activation function (function that adjusts the value of neurons to control the output range), the regularization parameter L2 (parameter that controls the influence of the better correlating values), the learning rate (factor controlling the speed at which the weights are adjusted, as well as the type of adjustment), and the solver (algorithm that calculates the weight adjustments) [25]. In the presented research, these values are adjusted using the grid search (GS) algorithm. GS is a simple algorithm that tests all the possible combinations of the hyperparameters given as inputs. The possible values of hyperparameters of the MLP that were used in this research are given in Table 1. The number of neurons per layer indicates the number of neurons in each of the layers of the neural network, with more neurons indicating a more complex network. More complex networks show better performance when complex problems are modeled but significantly raise the time necessary for model training, due to the larger number of weight adjustments that are necessary, as the number of connections between two layers of size n is equal to n^{2}. Each neuron performs a simple summation of n weights multiplied by the n outputs of the neurons in the previous layer, and the result is then processed with the activation function. The activation functions are simple, so their complexity can be assumed to be O(1), resulting in a total complexity of O(n^{2}) for a layer of n neurons. Still, for a trained network, n is a constant: since the architecture has already been decided, the complexity of a neuron can be simplified to O(1).
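How grid search enumerates configurations, and how the number of connections grows with layer size, can be sketched as below. The grid values are hypothetical placeholders chosen for illustration and are not the contents of Table 1; only the option names ReLU, Tanh, Adam, and LBFGS appear in the text.

```python
from itertools import product

# Hypothetical hyperparameter grid (NOT the values from Table 1).
grid = {
    "hidden_layers": [(8,), (16, 16), (32, 32, 32)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "lbfgs"],
    "l2": [1e-4, 1e-2],
}

# Grid search tests every combination: 3 * 2 * 2 * 2 = 24 configurations.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]


def connections(n_inputs, hidden, n_outputs=1):
    """Count the weights of a fully connected network (bias terms ignored)."""
    sizes = (n_inputs, *hidden, n_outputs)
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


print(len(combos))
print(connections(2, (16, 16)))  # 2*16 + 16*16 + 16*1
```

The 16 x 16 block between the two hidden layers is the n^{2} term and dominates the count as the layers grow.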

It can be seen that a relatively large number of layers was used in the network configurations. The training of the networks was performed using the Bura Supercomputer, available at the University of Rijeka, and because using larger networks did not present a significant time impact, the authors decided to explore the larger networks as well, in the hopes of obtaining better-performing models. Still, such large networks may not be necessary for the model regression.

Each of the models obtained from this procedure is evaluated on a separate test set (20% of the training set) using two metrics that determine the performance of the model on unseen data: the coefficient of determination (R^{2}) and the mean absolute percentage error (MAPE). R^{2} shows how well the variance is represented between the predicted and original data and is calculated as [26]:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}.

(5)

The value of R^{2} essentially expresses the amount of variance from the original data that is explained in the predicted data. The best value of R^{2}, equal to 1, is found when all of the variance in the original data is explained by the predictions. MAPE is the average absolute error, expressed as a percentage of the true value. It has been selected because the outputs have widely different ranges; using an error normalized by the true value allows for a simpler comparison between results. MAPE is calculated as [27]:

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|.

(6)
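Both metrics are short enough to write out directly. The sketch below follows Equations (5) and (6); the function names and the four example values are made up for illustration and are not results from the study.

```python
import numpy as np


def r2_score(y, y_hat):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mape(y, y_hat):
    """Mean absolute percentage error as a fraction (multiply by 100 for percent)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs((y - y_hat) / y))


# Made-up true and predicted values, e.g. lengths in metres.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 420.0]

print(r2_score(y_true, y_pred))  # 1 - 600 / 50000 = 0.988
print(mape(y_true, y_pred))      # (0.10 + 0.05 + 0.00 + 0.05) / 4 = 0.05
```

Because MAPE divides each error by its own true value, a 20 m error on a 400 m ship counts the same as a 10 m error on a 200 m ship, which is what makes it comparable across outputs with different ranges.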

3. Results

3.1. Data Distributions

The distributions determined to be best fitting amongst the tested distributions are visualized in Figure 2. It should be noted that these are not necessarily the overall best-fitting probability density functions but are instead the best fitting among the ones that were tested (normal, truncated normal, gamma, uniform, beta). The tests show that the best-fitting probability density functions are:

gamma for the B, D, DWT, GT, KW, LOA, LPP, NT, and TEU variables, and
normal for the DRAUGHT and V variables.

The figures generally show a good fit of the selected probability density functions, especially in the case of V, DWT, NT, TEU, and GT. KW, LPP, B, D, DRAUGHT, and LOA show some bins that do not fit the selected probability density functions, in the upper part of the variable range. This indicates that there may be other probability density functions besides the tested ones that would fit better. However, as they are not available within the GC framework at this time, the selected distributions are used, and they still provide a satisfactory fit.

3.2. Synthetic Data Results

The overall quality score of the synthetic data, calculated as the similarity between correlations in the original and synthetic data, shows an overlap of 94.71% in comparison to the original data. This is a very high score, indicating a well-synthesized dataset. Another tested metric regarding the indicative performance of the synthetic dataset is the comparison of distribution shapes. The comparison returns a 90.54% overlap between data types, which is shown in Figure 4.

The final test performed on the synthetic data to compare the results of the used method is the analysis of column pairs. This is performed by plotting two variables, each on one of the plot axes. Then, the dispersion between the original data and the synthetic data can be observed. While this is performed for all of the variable pairs, for brevity, only the visualizations of used inputs (V and TEU) with the rest of the outputs are shown below. The column pair trends for V are shown in Figure 5, and the column pair trends for TEU are shown in Figure 6. Synthetic data pairs are shown by light teal points, while the original data are shown by purple points. There are a few specifics in the data that can be seen.

The column pair trends show an overall rate of similarity equal to 98.87%. Comparing the original data with the synthetic data, it is noticeable that some of the variables in the original data take staggered (stepped) values, such as B and D (visible in both Figure 5 and Figure 6). This is caused by most of the observed vessels having a standard length and breadth. The synthetic data fill the gaps between such values, which may provide a more finely trained model.

3.3. Regression Results

As mentioned, the models are initially trained and tested on a training/test split of the 152 data points selected for that purpose. This is done to determine the best-performing models, which can later be used in the validation process. The results for the models that used synthetic data for training are given in Table 2, while the results and model architectures for those using original data are given in Table 3.

Comparing the two tables, it can be seen that some of the models were achieved with significantly smaller networks when the synthetic and original models are compared; this is most clearly seen in the case of engine power (KW). In the same manner, the best-performing configurations differ between the datasets: models trained on original data achieved higher scores mostly using the ReLU activation function, whereas the best-performing models trained on synthetic data used either Tanh or ReLU. The same can be noted for the solver algorithm: LBFGS was mostly preferred by synthetic-data-based models, while most of the best-performing models based on original data preferred Adam. Achieved scores on test data vary, but most of them are relatively close. A notable exception is engine power, which achieves a significantly better score when the synthetic dataset is used.

As shown in Table 2 and Table 3, most of the selected models could be considered large—having either a large number of layers (4+) or a large number of neurons per layer (16+). In the selection of the models, the smaller models were preferred in cases where two models have the same performance score to the second digit of significance. The reason why this process yielded mostly larger architectures is that the trained MLP ANNs had trouble converging to a solution during the training with a smaller amount of parameters (weights connecting layers of neurons, discussed in Section 2.3) and tended towards the selection of larger networks. This phenomenon could occur for two reasons: the complexity of the problem and/or the lack of information in data [28]. While the first is a possibility, the second is more likely in the presented case, because comparing the network sizes for original data in Table 3 and synthetic data in Table 2, it can be seen that the network sizes, on average, appear to be smaller for the larger dataset created with synthetic data.

Finally, the models can be directly compared on the validation data. When the previously obtained models are applied to create predictions for the validation dataset, synthetic-based models show an improvement in scores, as can be seen in Table 4. The results clearly show that the models that used synthetic data are either better in both metrics or equal in one and better in the other; for example, DWT shows an equal R^{2} but a lower MAPE in comparison to the original-data-based model. Table 4 shows scores as calculated according to Equations (5) and (6), respectively. As mentioned, these values are obtained on the validation data, using the previously trained models. The values of these evaluation metrics may change if a different set of data is used for validation, but due to the large amount of training data, it can be assumed that the values will be similar to the ones presented in this paper. If the value of MAPE decreases, this indicates a more precise model with a lower error, while an increase indicates a poorer model. As for R^{2}, the opposite is true: an increased value indicates a better model, and a lower value a model that performs worse.

For easier comparison of scores, they are provided graphically in Figure 7. The same observations can be made as when the table was presented. The clearest improvement is visible in the case of KW, which goes from a barely usable model to a model that achieves satisfactory scores.

To further analyze the improvements of the scores, some additional metrics were used, namely mean absolute error (MAE) [29], root mean square error (RMSE) [29], and Tweedie score [30]. Due to extremely different ranges, these values may be hard to illustrate together, so the general improvement is used. The general improvement is calculated as the difference between the score achieved on the validation data with the model trained on synthetic data and the model trained on solely original data. To avoid big range differences, this value is then divided by the larger of the two values, providing a relative improvement ranging from 0.0 to 1.0. Please note that, for the visibility of the illustration, the absolute value of the difference is used, as otherwise the improvement would be negative for MAE and RMSE (where the lower score is better), instead of positive, as in the case of Tweedie (where higher is better). To avoid confusion, and since all of the scores achieved with synthetic data are equal or better for all additional evaluation techniques, the absolute value is used, with M_{s} being the metric for synthetic data and M_{o} being the metric for original data:

I = \frac{| M_{s} - M_{o} |}{\max(M_{s}, M_{o})}.

(7)
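Equation (7) can be written as a small helper. As an illustration it is applied here to the R^{2} and MAPE values quoted in the Discussion for KW and DWT; the paper itself applies the measure to MAE, RMSE, and Tweedie scores, so these two numbers are only a worked example of the arithmetic, and the function name is hypothetical.

```python
def relative_improvement(m_synthetic, m_original):
    """Absolute difference of two scores divided by the larger one, as in Equation (7)."""
    largest = max(m_synthetic, m_original)
    if largest == 0:
        return 0.0  # both scores are zero, so there is nothing to improve
    return abs(m_synthetic - m_original) / largest


# KW validation R^2: 0.93 (synthetic) vs 0.72 (original).
kw_r2 = relative_improvement(0.93, 0.72)

# DWT validation MAPE in percent: 1.53 (synthetic) vs 4.02 (original).
dwt_mape = relative_improvement(1.53, 4.02)

print(round(kw_r2, 4), round(dwt_mape, 4))
```

The result is symmetric in its two arguments, so it reports the size of the change but not its direction; whether the change is an improvement has to be read from the metric itself.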

The results of this comparison are given in Figure 8, in which it can be seen that synthetic data demonstrate improvements across all additional scores used, except the Tweedie score in the case of DWT, for which the two models are equal (as was the case for R^{2}).

4. Discussion

There are multiple points to discuss regarding the achieved results. The first, and most important, is that the models that utilize synthetic data for regression modelling achieve higher scores on the targeted outputs than the models trained on fewer, original, data points. This is visible for all of the presented targets, with the sole exception of the DWT target, which achieved an equal score when evaluated with $R^{2}$ but still showed improvement in the MAPE metric, which dropped from 4.02% to 1.53%.

From the ML perspective, it is interesting to note how different architectures were found to perform best between the two datasets. This was the case even though there was high similarity between datasets, as shown in Figure 4, Figure 5 and Figure 6. This behavior indicates that despite the similarities between datasets shown in the current analysis of the data, there are underlying differences in the data hyperspace that are not captured with the analysis. It can be concluded that this means that performing the GS procedure from the start for different data models is necessary, and simply transferring model architectures between original and synthetic datasets is not applicable.

Performance on the test portion of the training/test split seems to be a good indication of the final model performance. This is best seen in the case of the KW output, where the test performance shows an $R^{2}$ of 0.78 for original data and an $R^{2}$ of 0.91 for synthetic data. These respectively translate to $R^{2}$ values of 0.72 and 0.93 for the original and synthetic models on validation data. Of course, validating the models against a third dataset (in the presented work referred to as the validation dataset), which is unseen by both the synthesizing and regression models, is crucial. Still, there are similarities between the performance shown on the test and validation sets, so the models that were selected based on their test performance also performed well in the validation step. This leads to potential time savings in this methodology, since it avoids the need to test multiple models for each of the targets.
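
The agreement between test and validation performance can be checked directly from the reported scores. The sketch below is only an illustration: the $R^{2}$ values are transcribed from Tables 2, 3 and 4, while the helper name `max_gap` and the choice of the largest absolute gap as a summary are assumptions made here.

```python
# R^2 per target as (test split, validation set).
# Original-data models: Table 3 (test) and Table 4 (validation).
r2_original = {
    "LOA": (0.96, 0.95), "LPP": (0.97, 0.96), "B": (0.96, 0.95),
    "D": (0.95, 0.92), "DRAUGHT": (0.93, 0.92), "GT": (0.99, 0.98),
    "NT": (0.96, 0.95), "DWT": (0.97, 0.97), "KW": (0.78, 0.72),
}
# Synthetic-data models: Table 2 (test) and Table 4 (validation).
r2_synthetic = {
    "LOA": (0.95, 0.97), "LPP": (0.96, 0.98), "B": (0.94, 0.97),
    "D": (0.94, 0.97), "DRAUGHT": (0.93, 0.95), "GT": (0.98, 0.99),
    "NT": (0.98, 0.98), "DWT": (0.97, 0.97), "KW": (0.91, 0.93),
}


def max_gap(scores):
    """Largest absolute test/validation difference, rounded to 2 decimals."""
    return round(max(abs(test - val) for test, val in scores.values()), 2)


print("original: ", max_gap(r2_original))
print("synthetic:", max_gap(r2_synthetic))
```

With the tabulated values, the largest gap is 0.06 for the original-data models (the KW target) and 0.03 for the synthetic-data models.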

Finally, most of the models obtained with synthetic data using the described methodology perform well enough to be considered for use in practical applications. The models that achieved an $R^{2}$ of 0.95 or higher are LOA, LPP, B, D, DRAUGHT, GT, NT, and DWT; in other words, all models except the KW model. If we evaluate the models using MAPE, all of the models trained on synthetic data satisfy the condition of having an error lower than 5%. It should be noted that most of the original data models (the exceptions being DRAUGHT and KW) also achieve scores that are satisfactory according to the given conditions. Still, due to the relative simplicity of synthetic data application, and the low computational cost of the procedure, the obtained improvements indicate that such a methodology could prove to be useful in gaining additional performance from data-driven ML-based models.
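
The two acceptance conditions mentioned above can be applied mechanically to the validation scores. The following sketch is an illustration only: the values are transcribed from Table 4, the helper names are hypothetical, and the $R^{2}$ comparison is taken as non-strict because DRAUGHT sits exactly at 0.95.

```python
# Validation scores transcribed from Table 4: target -> (R^2, MAPE in %).
original = {
    "LOA": (0.95, 4.83), "LPP": (0.96, 3.94), "B": (0.95, 4.16),
    "D": (0.92, 4.58), "DRAUGHT": (0.92, 5.64), "GT": (0.98, 2.12),
    "NT": (0.95, 4.10), "DWT": (0.97, 4.02), "KW": (0.72, 8.29),
}
synthetic = {
    "LOA": (0.97, 3.21), "LPP": (0.98, 3.08), "B": (0.97, 3.89),
    "D": (0.97, 3.94), "DRAUGHT": (0.95, 4.64), "GT": (0.99, 1.04),
    "NT": (0.98, 1.65), "DWT": (0.97, 1.53), "KW": (0.93, 4.86),
}

R2_MIN = 0.95    # non-strict bound: a model passes with R^2 >= 0.95
MAPE_MAX = 5.0   # strict bound: a model passes with MAPE < 5%


def below_r2(scores):
    """Targets whose R^2 falls short of R2_MIN."""
    return [t for t, (r2, _) in scores.items() if r2 < R2_MIN]


def above_mape(scores):
    """Targets whose MAPE is not lower than MAPE_MAX."""
    return [t for t, (_, mape) in scores.items() if mape >= MAPE_MAX]


print("synthetic, R^2 too low:  ", below_r2(synthetic))
print("synthetic, MAPE too high:", above_mape(synthetic))
print("original,  MAPE too high:", above_mape(original))
```

Run on the Table 4 values, only KW misses the $R^{2}$ bound among the synthetic-data models, none of them misses the MAPE bound, and DRAUGHT and KW miss the MAPE bound among the original-data models.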

5. Conclusions

The presented research attempted to demonstrate the ability to improve the performance of main ship particular modelling through the use of synthetic data generation techniques. It used a previously collected dataset of container ship particulars to generate models based on just the original data and models based on a synthetic data-enhanced dataset, comparing the two approaches. When the performance of the models based on original and synthetic data is compared on the separate validation part of the dataset, an improvement in regression quality is shown across almost all of the targets. Various ranges of improvement are present, from essentially the same performance (e.g., DWT evaluated with $R^{2}$) to significant improvement (e.g., KW improving the $R^{2}$ metric from 0.72 to 0.93 and dropping the MAPE from 8.29 to 4.86). None of the values have shown a drop in scores when synthetic data was introduced into the modelling process, indicating that the synthesizing process, at worst, does not hurt the performance of the models.

The obtained results indicate that it is possible to improve the modelling of main ship particulars using synthetic data. The main benefit of this lies in cases such as ship modelling, where large datasets cannot be collected from real-world data, because a large number of data points simply does not exist. The fact that synthetic data generation is computationally cheap, especially in comparison to the actual regression modelling, also has to be considered. This means that adding synthesis to the process of AI modelling may be a good idea, as it did not harm the performance in this study and may significantly improve it.

The results allow us to address the original research questions posed in the introduction of this research. (RQ1) Yes, the generation of high-similarity synthetic data is possible, based on only 152 randomly selected original data points. (RQ2) For models trained with the synthetic data combined with the original data, the improvement in scores is apparent across all targets when measured with MAPE, and across all targets except DWT (where the score is unchanged) when measured with $R^{2}$. (RQ3) The improvement is not uniform across all targets, with different targets showing different levels of improvement depending on the observed metric.

Based on these findings, it can be concluded that an extended dataset augmented with synthetic data and analyzed using artificial neural networks (ANN) can yield favorable results. These outcomes are valuable for early-stage ship design, facilitating the estimation of main particulars. In ship design, access to large quantities of data is often limited. This paper demonstrates that even in such cases, ANN can be effectively employed by incorporating synthetic data. This approach proves especially beneficial for designing nonstandard ship types, such as ships intended for special purposes.

The limitations of the work lie in the fact that the research was performed on a relatively small dataset focused on container vessels, so it is not possible to confidently claim that the approach generalizes to other vessel types. This should be addressed through further testing of the described approach on different vessel types, both on their own and with datasets combining the main particulars of different vessel types. Two more limitations arise from the usage of synthetic data. The first is the lack of real, new data in the dataset, meaning that the created models may not provide good results when evaluated on data points obtained from newly built vessels. This limitation cannot be directly addressed at this time, as it could take years to collect additional data points from newly built container vessels, especially ones that differ significantly from existing ones. The second limitation is that, as the models are not constrained to realistic proportions, the main particulars obtained in the synthesis process may not be realistic. While this is not an issue for the current study, it presents a limitation for the use of synthetic models such as GC, which are not able to generate realistic main particulars; more advanced techniques such as custom deep convolutional generative adversarial networks should be applied if that is the goal, which may be an element of future work in this field.

While it may be argued that the values obtained from ANN, despite fitting the original dataset well, may not be precise enough, it needs to be stressed that the developed work is mainly meant as an expert support system, which can provide starting values for further refinement by professionals with experience in the field. Future work may also include the use of the original and synthetic datasets presented in this study to create better models. Focus should be given to explainable AI techniques, in which models can be further analyzed and the logic behind them investigated. The importance of this lies in the simpler presentation and integration of the created AI models into the actual workflow of researchers and engineers.

Author Contributions

Conceptualization, D.M. and S.B.Š.; methodology, S.B.Š. and N.A.; software, S.B.Š.; validation, N.A. and J.A.; formal analysis, N.A. and J.A.; investigation, D.M.; resources, S.B.Š.; data curation, D.M.; writing—original draft preparation, D.M. and S.B.Š.; writing—review and editing, J.A.; visualization, N.A.; supervision, J.A.; project administration, J.A.; funding acquisition, J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been (partly) supported by the CEEPUS network CIII-HR-0108, European Regional Development Fund, under the grant KK.01.1.1.01.0009 (DATACROSS); project CEKOM, under the grant KK.01.2.2.03.0004; the Erasmus+ projects WICT, under the grant 2021-1-HR01-KA220-HED-000031177, and AISE, under the grant 2023-1-EL01-KA220-SCH-000157157; the University of Rijeka, under scientific grants uniri-tehnic-18-275-1447 and uniri-mladi-technic-22-61 and University of Zagreb scientific grant 2023: 0640-10-8-Multi-criteria design of ship structures with included technological and production features.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

B	Molded Breadth
D	Depth
DWT	Deadweight
GS	Grid Search
GT	Gross Tonnage
KW	Engine Power
LOA	Length Overall
LPP	Length Between Perpendiculars
MAPE	Mean Absolute Percentage Error
ML	Machine Learning
MLP	Multilayer Perceptron
NT	Net Tonnage
$R^{2}$	Coefficient of Determination
TEU	Number of TEU
V	Vessel Speed

References

1. Evans, J.H. Basic design concepts. J. Am. Soc. Nav. Eng. 1959, 71, 671–678.
2. Papanikolaou, A. Ship Design: Methodologies of Preliminary Design; Springer: Berlin/Heidelberg, Germany, 2014.
3. Watson, D.G. Estimating Preliminary Dimensions in Ship Design; Institution of Engineers and Shipbuilders in Scotland: Glasgow, UK, 1962.
4. Schneekluth, H.; Bertram, V. Ship Design for Efficiency and Economy; Butterworth-Heinemann: Oxford, UK, 1998; Volume 218.
5. Watson, D.G. Practical Ship Design; Elsevier: Amsterdam, The Netherlands, 2002; Volume 1.
6. Chądzyński, W. Elements of contemporary design methods of floating objects. In Scientific Reports of Szczecin University of Technology; Department of Ocean Engineering and Marine System Design: Szczecin, Poland, 2001.
7. Piko, G. Regression Analysis of Ship Characteristics; Australian Government Publishing Service: Canberra, ACT, Australia, 1980.
8. Kristensen, H.O. Determination of regression formulas for main dimensions of tankers and bulk carriers based on IHS fairplay data. In Project No. 2010-56, Emissionsbeslutningsstøttesystem Work Package 2; Report No. 02; Technical University of Denmark: Kongens Lyngby, Denmark, 2012.
9. Alwosheel, A.; van Cranenburgh, S.; Chorus, C.G. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. J. Choice Model. 2018, 28, 167–182.
10. Farag, Y.B.; Ölçer, A.I. The development of a ship performance model in varying operating conditions based on ANN and regression techniques. Ocean Eng. 2020, 198, 106972.
11. Jeon, M.; Noh, Y.; Shin, Y.; Lim, O.K.; Lee, I.; Cho, D.S. Prediction of ship fuel consumption by using an artificial neural network. J. Mech. Sci. Technol. 2018, 32, 5785–5796.
12. Alkan, A.; Gulez, K.; Yilmaz, H. Design of a robust neural network structure for determining initial stability particulars of fishing vessels. Ocean Eng. 2004, 31, 761–777.
13. Cepowski, T. The prediction of ship added resistance at the preliminary design stage by the use of an artificial neural network. Ocean Eng. 2020, 195, 106657.
14. Gurgen, S.; Altin, I.; Ozkok, M. Prediction of main particulars of a chemical tanker at preliminary ship design using artificial neural network. Ships Offshore Struct. 2018, 13, 459–465.
15. Majnarić, D.; Šegota, S.B.; Lorencin, I.; Car, Z. Prediction of main particulars of container ships using artificial intelligence algorithms. Ocean Eng. 2022, 265, 112571.
16. Cepowski, T.; Chorab, P.; Łozowicka, D. Application of an artificial neural network and multiple nonlinear regression to estimate container ship length between perpendiculars. Pol. Marit. Res. 2021, 28, 36–45.
17. Patki, N.; Wedge, R.; Veeramachaneni, K. The synthetic data vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 399–410.
18. van der Westhuizen, S.; Heuvelink, G.B.; Hofmeyr, D.P. Multivariate random forest for digital soil mapping. Geoderma 2023, 431, 116365.
19. Liu, Y.; Chen, F.; Liang, N.; Yuan, Z.; Yu, H.; Wu, P. The improved Amati correlations from Gaussian copula. Astrophys. J. 2022, 931, 50.
20. Baumgartner, D.; Kolassa, J. Power considerations for Kolmogorov–Smirnov and Anderson–Darling two-sample tests. Commun. Stat.-Simul. Comput. 2023, 52, 3137–3145.
21. Tan, Y.; Zhao, G. Multi-view representation learning with Kolmogorov-Smirnov to predict default based on imbalanced and complex dataset. Inf. Sci. 2022, 596, 380–394.
22. Sivasankari, S.; Surendiran, J.; Yuvaraj, N.; Ramkumar, M.; Ravi, C.; Vidhya, R. Classification of diabetes using multilayer perceptron. In Proceedings of the 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballari, India, 23–24 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
23. Lin, R.; Zhou, Z.; You, S.; Rao, R.; Kuo, C.C.J. Geometrical interpretation and design of multilayer perceptrons. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15.
24. Shi, S.; Wang, Y.; Dong, H.; Gui, G.; Ohtsuki, T. Smartphone-aided human activity recognition method using residual multi-layer perceptron. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Virtual, 2–5 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
25. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
26. Bonamente, M. The Linear Correlation Coefficient. In Statistics and Analysis of Scientific Data; Springer: Singapore, 2022; pp. 263–276.
27. Zhou, J.; Lin, H.; Jin, H.; Li, S.; Yan, Z.; Huang, S. Cooperative prediction method of gas emission from mining face based on feature selection and machine learning. Int. J. Coal Sci. Technol. 2022, 9, 51.
28. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; Volume 2.
29. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE). Geosci. Model Dev. Discuss. 2014, 7, 1525–1534.
30. Nofal, S. Forecasting next-hour electricity demand in small-scale territories: Evidence from Jordan. Heliyon 2023, 9, e19790.

Figure 1. Visualization of the methodology used in the presented research.

Figure 2. The distributions of the data, with the calculated best fitting probability density functions, with the number of data points given in blue, and the fitted probability density function given as a yellow line.

Figure 3. A visualization of a ship form fitting a synthetically generated set of main particulars.

Figure 4. Comparison of probability density functions for real and synthetic data.

Figure 5. Comparison between real and synthetic data pairs for ship speed (v).

Figure 6. Comparison between real and synthetic data pairs for TEU.

Figure 7. Visualization of achieved scores. (a) Comparison of $R^{2}$ scores across models (higher is better). (b) Comparison of MAPE scores across models (lower is better).

Figure 8. Score improvements across additional metrics used.

Table 1. The hyperparameters used in the training process.

Hyperparameter	Possible Values
Hidden layers	1, 2, 3, 4, 5, 6
Neurons per layer	1, 2, 4, 8, 16, 32, 64, 128, 256, 512
Activation function	ReLU, Identity, Logistic, Tanh
Solver	Adam, LBFGS
Learning rate type	Constant, Adaptive, Inverse Scaling
Learning rate	0.5, 0.1, 0.001, 0.0001, 0.00001
L2	0.1, 0.01, 0.001, 0.0001
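
To give a sense of the size of the search space in Table 1, the sketch below enumerates every combination with the standard library. Several things here are assumptions rather than details stated in the text: the keys are mapped onto the parameter names of scikit-learn's `MLPRegressor` (`alpha` for the L2 term, `learning_rate_init` for the learning rate), every hidden layer is given the same number of neurons, and the helper name `iter_configs` is hypothetical.

```python
from itertools import product

# Values transcribed from Table 1; key names are an assumed mapping.
search_space = {
    "n_hidden_layers": [1, 2, 3, 4, 5, 6],
    "neurons_per_layer": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    "activation": ["relu", "identity", "logistic", "tanh"],
    "solver": ["adam", "lbfgs"],
    "learning_rate": ["constant", "adaptive", "invscaling"],
    "learning_rate_init": [0.5, 0.1, 0.001, 0.0001, 0.00001],
    "alpha": [0.1, 0.01, 0.001, 0.0001],
}


def iter_configs(space):
    """Yield one keyword dictionary per point of the grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        # Assumption: all hidden layers share the same width.
        n_layers = cfg.pop("n_hidden_layers")
        width = cfg.pop("neurons_per_layer")
        cfg["hidden_layer_sizes"] = (width,) * n_layers
        yield cfg


configs = list(iter_configs(search_space))
print("number of configurations:", len(configs))
print("first configuration:", configs[0])
```

Under these assumptions the grid contains 28,800 configurations per target, each of which would have to be trained and scored once per dataset.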

Table 2. The results of the models trained and tested using the training/test data split for the models trained on synthetic data.

Target	$R^{2}$	MAPE	Activation	L2	Layers	Neurons	LR Type	LR	Solver
LOA	0.95	5.12	Tanh	0.01	1	32	Inv. Scaling	0.001	LBFGS
LPP	0.96	5.23	Tanh	0.01	1	256	Constant	0.5	LBFGS
B	0.94	4.91	Tanh	0.1	2	32	Inv. Scaling	0.001	LBFGS
D	0.94	5.92	Tanh	0.1	5	64	Constant	0.0001	LBFGS
DRAUGHT	0.93	4.24	Tanh	0.1	5	64	Adaptive	0.5	LBFGS
GT	0.98	1.21	ReLU	0.1	4	512	Inv. Scaling	0.1	LBFGS
NT	0.98	1.34	ReLU	0.001	1	64	Constant	0.5	LBFGS
DWT	0.97	1.56	ReLU	0.01	1	8	Inv. Scaling	0.0001	LBFGS
KW	0.91	3.45	ReLU	0.01	1	2	Constant	0.5	Adam

Table 3. The results of the models trained and tested using the training/test data split for the models trained on original data.

Target	$R^{2}$	MAPE	Activation	L2	Layers	Neurons	LR Type	LR	Solver
LOA	0.96	4.94	ReLU	0.1	3	128	Adaptive	0.1	Adam
LPP	0.97	4.24	Identity	0.1	4	128	Constant	0.1	LBFGS
B	0.96	4.28	ReLU	0.1	5	32	Adaptive	0.01	Adam
D	0.95	4.61	ReLU	0.1	4	64	Constant	0.01	Adam
DRAUGHT	0.93	5.29	ReLU	0.01	3	64	Constant	0.01	Adam
GT	0.99	3.16	ReLU	0.01	4	128	Constant	0.1	Adam
NT	0.96	4.17	ReLU	0.01	5	128	Constant	0.1	LBFGS
DWT	0.97	2.88	ReLU	0.1	4	32	Adaptive	0.1	Adam
KW	0.78	7.95	Logistic	0.01	4	64	Inv. Scaling	0.5	Adam

Table 4. Validation results comparison between the MLP model trained on original data and the one trained on synthetic data.

Target	Metric	Original Training Data	Synthetic Data
LOA	$R^{2}$	0.95	0.97
LOA	MAPE	4.83	3.21
LPP	$R^{2}$	0.96	0.98
LPP	MAPE	3.94	3.08
B	$R^{2}$	0.95	0.97
B	MAPE	4.16	3.89
D	$R^{2}$	0.92	0.97
D	MAPE	4.58	3.94
DRAUGHT	$R^{2}$	0.92	0.95
DRAUGHT	MAPE	5.64	4.64
GT	$R^{2}$	0.98	0.99
GT	MAPE	2.12	1.04
NT	$R^{2}$	0.95	0.98
NT	MAPE	4.10	1.65
DWT	$R^{2}$	0.97	0.97
DWT	MAPE	4.02	1.53
KW	$R^{2}$	0.72	0.93
KW	MAPE	8.29	4.86
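
The per-target change between the two columns of Table 4 can be tabulated with a few lines of Python. This is a sketch for illustration only: the values are transcribed from the table, and the names `r2_gain` and `mape_drop` are hypothetical.

```python
# Table 4 validation scores: target -> (original, synthetic).
r2 = {
    "LOA": (0.95, 0.97), "LPP": (0.96, 0.98), "B": (0.95, 0.97),
    "D": (0.92, 0.97), "DRAUGHT": (0.92, 0.95), "GT": (0.98, 0.99),
    "NT": (0.95, 0.98), "DWT": (0.97, 0.97), "KW": (0.72, 0.93),
}
mape = {
    "LOA": (4.83, 3.21), "LPP": (3.94, 3.08), "B": (4.16, 3.89),
    "D": (4.58, 3.94), "DRAUGHT": (5.64, 4.64), "GT": (2.12, 1.04),
    "NT": (4.10, 1.65), "DWT": (4.02, 1.53), "KW": (8.29, 4.86),
}

# Positive values mean the synthetic-data model is better in both dictionaries.
r2_gain = {t: round(syn - orig, 2) for t, (orig, syn) in r2.items()}
mape_drop = {t: round(orig - syn, 2) for t, (orig, syn) in mape.items()}

for target in r2:
    print(f"{target:8s} R2 gain {r2_gain[target]:.2f}  "
          f"MAPE drop {mape_drop[target]:.2f}")
```

The $R^{2}$ gains range from 0.00 (DWT) to 0.21 (KW), and the MAPE drops for every target, with the largest reduction (3.43 percentage points) also occurring for KW.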

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Majnarić, D.; Baressi Šegota, S.; Anđelić, N.; Andrić, J. Improvement of Machine Learning-Based Modelling of Container Ship’s Main Particulars with Synthetic Data. J. Mar. Sci. Eng. 2024, 12, 273. https://doi.org/10.3390/jmse12020273
