Modeling of Ship Fuel Consumption Based on Multisource and Heterogeneous Data: Case Study of Passenger Ship

Abstract: In the current shipping industry, quantitative measures of ship fuel consumption (SFC) have become one of the most important research topics in environmental protection and energy management related to shipping operations. In particular, the rapid development of sensor technologies enables multisource data collection to improve the modeling of the SFC problem. To address the features of such heterogeneous data, this paper proposes an integrated model for the estimation of SFC that includes three modules: a multisource data collection module, a heterogeneous data feature fusion module and a fuel consumption estimation module. First, in the data collection module, data related to SFC are collected by multiple sensors installed aboard the ship. Second, the feature fusion module employs a series of moving overlapped frames to merge different frequency data into small frames so that fusion features can be extracted from the heterogeneous data of multiple sources. Finally, in the fuel estimation module, the fusion features provide a novel way to consider the modeling and estimation of SFC as a classical time-series analysis using various machine learning techniques. Experimentally, linear regression (LR), support vector regression (SVR), and artificial neural network (ANN) were employed as the machine learning methods to train SFC models. Compared with the traditional feature extraction method, the accuracy of LR, SVR, and ANN was improved by 8.5%, 0.35% and 51.5%, respectively, using the proposed method. The main contribution of this work is to consider the multisource and heterogeneous problem of sensor-based SFC data and propose an integrated model to extract the information of SFC data. Moreover, the experimental results showed that the estimation accuracy can be greatly improved.


Introduction
The shipping industry is one of the pillars of the world economy, as more than 80% of world merchandise trade by volume is carried by sea [1]. However, shipping causes a great deal of environmental pollution compared with other modes of transportation, and the carbon dioxide emissions generated account for a large part of total global greenhouse gas emissions [2]. In recent years, the issue of carbon emissions from ship operations has become a focus of many organizations, including the International Maritime Organization (IMO) and most shipping operators. Global carbon emissions from the shipping industry need to be significantly reduced in the future. Meanwhile, fuel costs have become the largest expenditure item in shipping operations, which has also been a topic of many concerned parties and shipping enterprises [3]. Therefore, in the face of rising fuel costs and the need for environmental protection, the IMO, which includes major shipping countries, urgently needs an effective and quantifiable fuel consumption assessment and estimation method to improve ship energy management.
For modeling ship fuel consumption (SFC), many studies have used artificial neural networks (ANNs) to quantify the relationships between the SFC and its influencing factors [4][5][6][7][8][9][10][11]. Leifsson et al. combined physical knowledge with ANN to build a grey-box model [12]. Ioannis et al. combined a type of neural network named long short-term memory (LSTM) with an Elman neural network (ENN) to forecast fuel consumption of passenger ships [13]. Mou et al. conducted a theoretical analysis to ascertain the principal fuel consumption influencing factors and used random forest regression (RFR) to model inland water SFC [14]. Ran et al. also adopted RFR to establish a model of SFC prediction [15]. Gkerekos et al. used several machine learning methods, such as support vector regression (SVR), extra tree regression (ETR) and ANN to model SFC, and found that ANN showed better performance results than ETR and SVR [16]. Yun et al. adopted models such as gradient boosting regression (GBR), RFR, linear regression (LR) and k-nearest neighbor regression [17]. Several studies have considered domain knowledge of fuel consumption to build SFC models. Meng and Du used two experience formulas to model fuel consumption and estimated the formula coefficients using the trust region algorithm [18]. Igor et al. adopted the numeric fitting method for the recorded SFC data [19]. Some researchers used multiple linear regression analysis, ridge regression and Lasso regression to model SFC and obtained excellent experimental results [20,21]. Omer et al. compared Lasso with ridge regression and found that Lasso had a better performance [22]. Bocchetti et al. used maximum likelihood estimation to estimate the experience formula coefficients [23]. Bialystocki and Konovessis applied polynomial regression analysis to depict the relations between SFC and its influence factors, such as speed and wind [24]. 
With the recent development of sensor technologies, the kinds and amounts of collected SFC data are growing rapidly. However, the existing SFC models, especially those based on machine learning techniques, cannot easily parse such unstructured data. The multisource and heterogeneous characteristics of novel fuel consumption data bring challenges to the data tailoring process and feature extraction.
In order to improve the predictive ability of SFC models, this paper proposes an integrated model that includes three modules: a multisource data collection module, a heterogeneous data feature fusion module and a fuel consumption estimation module. First, in the multisource data collection module, data related to SFC estimation are collected by multiple sensors attached to ships. Second, the heterogeneous data feature fusion module employs a series of moving overlapped frames to merge the different sensor data into small frames, so that common features can be extracted from various sensors with different sampling frequencies in the time domain. Finally, in the fuel consumption estimation module, several machine learning methods, such as LR, SVR and ANN, are adopted to train the SFC models based on the fusion features with an increased accuracy rate of 8.5%, 0.35%, and 51.5% respectively. The main contribution of this paper is to consider the multi-source and heterogeneous problem of sensor-based SFC data and propose an integrated model. This model merged the time domain of various sensors, performed feature extraction to exploit information of SFC data and greatly improved the prediction accuracy. Moreover, the integrated model could enable more sensor-based SFC data to be used in fuel consumption estimation.
The remainder of this paper is structured as follows. Section 2 provides a literature review of SFC data processing methods. Section 3 introduces the proposed model including multisource data collection, heterogeneous feature fusion and modeling of fuel consumption. This is followed by comparative experiments and result discussions in Section 4. Section 5 presents the conclusions.

Advances of SFC Estimation
In recent years, scholars have focused on innovative fuel consumption models but have rarely paid attention to the issue of data processing. In the practice of marine navigation, SFC-related data can be divided into two categories: log-based data and sensor-based data.

Log-Based SFC Data Collection and Modeling
For log-based data, Luan et al. performed outlier elimination with SFC data by considering three outlier types, namely univariate, multivariate and statistical model noises [9]. After data preprocessing, the various influencing factors were combined with different machine learning methods, such as multiple linear regression and multilayer perceptron artificial neural network, to model SFC estimation. Tayfun et al. removed the abnormal SFC data related to human error [21]. SVR, a tree-based algorithm, a boosting algorithm, multiple linear regression and ridge regression were used. Ioannis et al. conducted a correlation analysis among SFC influence factors and combined LSTM with ENN to perform SFC predictions [13]. Ran et al. removed the SFC data for speed of less than five knots or no cargo loaded [15]. The RFR was then applied to build the models. With the built model, the navigation speed was optimized subject to minimum fuel consumption and punctual arrival. The experiments showed fuel consumption could be reduced by 2-7% with the proposed methods. Gkerekos et al. removed the recorded anomalies of engine transient data [16]. Domain knowledge was then used to generate new features. For example, forward and aft draught could be transformed into draught amidships and trim. Next, feature standardization was conducted. Processed data were put into various machine learning models to validate the effectiveness of the models. Several other scholars attempted to find a mathematical relationship between fuel consumption and its influencing factors, including draught and displacement. Bialystocki and Konovessis performed three initial corrections to correct recorded SFC data, including draught, weather and hull roughness [24]. Polynomial regression analysis was then adopted to depict the relationships between fuel consumption and speed under different weather conditions. The built SFC model could offer decision support for ship owners and crew members in voyage planning.
Igor et al. also conducted noise removal [19]. In order to tackle the problem of nonuniform time SFC data, a moving average was adopted. After data preprocessing, numeric fitting of the recorded data was carried out. Various fitting methods were compared. Lu et al. used an empirical theory to model SFC [25]. Fuel consumption could be obtained from engine load at varying speeds and sea states. Several other investigators provided formulas for SFC modeling. Meng and Du proposed a procedure for outlier removal based on prior domain knowledge [18]. When the changing rate of SFC data did not match the sailing speed and wind force scale, the data points were classified as outliers. In the experiments, the trust region algorithm was used to estimate the parameters of the SFC experience formula. For data processing, Wang et al. conducted Z-scores for all feature vectors [20], then Lasso regression was carried out to estimate formula parameters and implement feature selection to eliminate the high correlations among feature variables. Compared with SVR and ANN, it was found to have better prediction accuracy. Yang et al. used a genetic algorithm (GA) to estimate formula parameters [26]. First, due to shipping companies having specific recording requirements, some information needed for SFC modeling was not recorded and was calculated from known items. GA was then applied to determine the formula parameters and the estimation accuracy was good at the frequent operating conditions of ships.

Sensor-Based SFC Data Collection and Modeling
For sensor-based data, many researchers used ANN to predict SFC. The data of [4] were obtained from an automatic identification system (AIS). First, data normalization was performed to accelerate the convergence of the ANN. The built model was then used to minimize the fuel consumption of a voyage. Petersen took the multisource problem of SFC data into account [5]. The data beyond the frequent operation conditions were regarded as outliers and removed. Windows and feature extraction were then applied to handle asynchronous SFC data. The mean, variance and mean difference were taken as features. Next, the processed data were put into the ANN model to predict fuel consumption. Moreover, Petersen et al. considered several influencing factors such as pitch, sailing speed, trim, draught, and wind [6]. They changed some of the factors, such as pitch and sailing speed, to survey the changing trends of SFC. Ye et al. and Yin et al. also used an ANN model to predict SFC [7,8]. Ye et al. adopted global error and batch gradient descent to train the ANN [7]. Considering that the speed changes sharply during the arrival and departure of ships, Yin and Xu rejected the data for speed less than 15 knots, and normalization was carried out for feature variables [8]. The processed data were then set as inputs to the ANN model. Yin and Xu adopted a dynamic programming (DP) algorithm to optimize the navigation speed with punctual arrival. The experiment results showed that the ship could save 0.71% fuel by following the planned navigation speed. Yasser removed the missing information and noise in raw SFC data with Z-score and Mahalanobis distance for univariate and multivariate noises, respectively [10]. The RFR algorithm was adopted to rank the importance of influencing factors. ANN and multiple regression analysis were performed for SFC modeling. Yun et al. used the GBR, RFR and LR to build a prediction model and discussed two SFC reduction strategies [17]. 
Moreira used ANN to establish relations between the ship speed and the respective propulsion configuration [11]. Leifsson et al. combined physical knowledge with ANN to generate a grey-box model of SFC estimation, combining the methods either in series or in parallel [12]. For data processing, missing data were removed. The data were then resampled and resolved with a period of 15 s. The experimental results revealed that the prediction accuracy was significantly improved compared with the pure white-box model. Mou et al. applied RFR to SFC prediction [14]. The singular value and noise of the raw SFC data were removed, and the denoised data were numbered and subjected to equidistant sampling and normalization. The processed data were used as input for the RFR. A partial correlation analysis was carried out to survey the importance of different influencing factors. Several researchers have used formulas to depict fuel consumption and estimating formula parameters with some algorithms. Omer et al. applied Lasso and ridge regression and discussed the influence of the penalty factor on the prediction accuracy [22]. Bocchetti et al. used the maximum likelihood estimation algorithm [23]. They carried out variable redefinitions, such as wind direction being transferred into head wind and cross wind. Feature selection was then adopted to ensure appropriate regressors were used for the SFC estimation, and maximum likelihood estimation was used to estimate the regression coefficients. Lokukaluge et al. presented a Gaussian mixture model (GMM) to divide fuel consumption into three clusters, and principal component analysis (PCA) was applied to investigate the impacts of each variable, such as speed, trim and wind [27]. In contrast with SFC prediction, Troden et al. proposed a method to associate fuel consumption with ship operation activities [28]. They used Kalman filters to clear dirty data, such as data when the ship was not underway. 
According to the changing rate of speed and fuel consumption, the ship operation was divided into different states. Considering the storage and transfer of massive SFC data, Perera and Mo proposed a data compression and recovery system [29]. First, the outliers were removed and normalization was performed. Next, an autoencoder system consisting of PCA was used for data compression. Experiments showed that the fuel data information was well maintained after compression, which offered a way to process SFC data online.

Limitation of SFC Modeling
It can be noted from the literature that although the structure of log-based data is simple, log-based data cannot describe the fuel consumption situation accurately because of its low sampling frequency compared with sensor-based data. Machine learning methods can greatly improve the estimation accuracy but have a high requirement for the integration of multisensor cross-module heterogeneous data. Therefore, the sensor-based SFC data usually need to be processed by complicated feature extraction methods before use as data for training machine learning models.
Therefore, this study developed an integrated model for SFC estimation:
• Multisource data collection module;
• Heterogeneous data feature fusion module;
• Fuel consumption estimation module.

Overview of SFC Estimation
The process of the integrated model is shown in Figure 1. As described, the integrated model consisted of three modules. First, data were collected by multiple sensors. Next, features of heterogeneous data were extracted and fused. Finally, several machine learning methods were adopted to train the SFC model.

Multisource Data Collection Module
In accordance with the previous literature review, it can be noted that SFC data are mainly divided into log-based data and sensor-based data. Log-based data are filled in by crew members with low sampling frequency. Because they are filled in manually, errors or subjective factors are inevitable, such as the judgment of wind and wave levels. The SFC data and multisource data are collected by multiple sensors installed aboard ships as shown in Figure 2a.
With the rapid development of sensor technologies, multiple types of sensors are installed onboard ships. Much navigation-related information can be precisely measured and obtained. These sensor-based data are closely correlated with the SFC estimation. For instance, the propeller pitch can affect the thrust efficiency and the speed through water. The speed through water is closely related to SFC. The draught and trim angle indicate the loading conditions of the vessel, which can affect the ship's water resistance. Therefore, these sensor-based data are important and useful for SFC estimation. However, owing to the different sampling frequencies of various sensors, the collected data are unstructured and heterogeneous, which impedes data processing and utilization. In order to deal with such heterogeneous data, feature extraction and fusion were carried out as shown in Figure 2b.
Because the sampling frequency of various sensors is different, the data are not unified in the time domain. To unify the time domain of different sensors, the method of framing was adopted. After framing, the frame was set as a new time unit. For framing methods, the traditional way is a nonoverlapped frame [5], which does not consider the continuity of time series data, as shown in Figure 3a. This study adopted a moving overlapped frame to maintain the continuity of time series data, as shown in Figure 3b. Using a moving overlapped frame, the processed data have an overlap section between two adjacent frames, which maintains the coherence of the time series data.
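The framing step can be illustrated with a minimal Python sketch. It assumes each sensor stream is a list of (time in seconds, value) pairs and that the overlap is expressed as a fraction of the frame size; the function name and its arguments are illustrative and are not taken from the original implementation.

```python
def overlapped_frames(samples, frame_size, overlap_ratio):
    """Split (time, value) samples into frames of `frame_size` seconds.

    overlap_ratio is the fraction of a frame shared with the next frame;
    overlap_ratio = 0 gives the traditional nonoverlapped framing.
    """
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    if not samples:
        return []
    step = frame_size * (1.0 - overlap_ratio)  # shift between frame starts
    start, end = samples[0][0], samples[-1][0]
    frames = []
    # Keep only frames that fit entirely inside the recorded time span.
    while start + frame_size <= end + 1e-9:
        frames.append([(t, x) for t, x in samples
                       if start <= t < start + frame_size])
        start += step
    return frames
```

With the same frame size and overlap, a sensor sampled every second and a sensor sampled every ten seconds yield the same number of frames, although each frame holds a different number of samples. This is why a fixed-length feature vector per frame is needed afterwards.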

Feature Extraction
The input dimension of machine learning methods is usually set as a constant. However, the amount of data for different types of sensors was not constant in a frame. To solve this problem, common features were extracted for different types of sensor data in a frame. The well-extracted features provide more information for SFC estimation. Previous research has mainly extracted the statistical features such as mean, variance and mean difference (MVD). The mean value can indicate the average intensity of data, variance reflects changing magnitude, and mean difference gives the variation tendencies [5]. In this paper, two feature extraction methods are proposed.
(i) Statistics features
Two types of statistical features are used in this paper: statistics feature A (SF. A), comprising mean, variance, mean difference, mode and median; and statistics feature B (SF. B), comprising lower margin (Min), lower quartile (Q1), median, upper quartile (Q3) and upper margin (Max).
The feature extraction adopted in [5] is the mean, variance and mean difference of data in a frame as shown in Figure 4a. The calculation formulas are shown in Equations (1)-(3). The variables x and m are data value and data size, respectively, and t is the time step of the sampled points. I is the interval of frames. However, when the data in the frame do not strictly satisfy the normal distribution, these three values are not sufficient to describe the characteristics of data in a frame. For instance, when the data in the frame were left-skewed or right-skewed, even though the mean was the same, the distributions were completely different.
Therefore, mode and median were introduced. The median and mode could be used to reflect the skewed distribution of data in the frame. The mean, variance, mean difference, mode and median of frame data were extracted as SF. A, as shown in Figure 4b.
The extraction steps of SF. A are listed as follows:
(1) Divide the sensor data into frames. Data from different sensors were divided into different sensor frames.
(2) Calculate mean, variance, mean difference, mode and median of data in the respective sensor frame.
(3) Use the mean, variance, mean difference, mode and median as the SF. A of the data in the sensor frame.
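The SF. A steps for a single sensor frame can be sketched in Python as follows. Several details are assumptions made for illustration only: the variance is taken as the population variance, the mean difference is taken as the average of successive differences between neighbouring samples, and the mode is computed on the raw values (which is only meaningful when the sensor readings are quantized or rounded).

```python
import statistics


def sf_a(values):
    """Return [mean, variance, mean difference, mode, median] of one frame."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)            # assumption: population variance
    diffs = [b - a for a, b in zip(values, values[1:])]
    mean_diff = statistics.fmean(diffs) if diffs else 0.0  # assumption: mean of successive differences
    mode = statistics.mode(values)                     # first most common value
    median = statistics.median(values)
    return [mean, variance, mean_diff, mode, median]
```

For a right-skewed frame such as [1, 2, 2, 3, 7], the mean (3) lies above the median and mode (both 2), which is the kind of asymmetry that the mean and variance alone do not reveal.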
For feature extraction, the principal characteristics of the data were extracted in a frame. The mean and variance of SF. A could be easily affected by outliers in the frame. Therefore, SF. B was introduced as shown in Figure 4c. Min, Q1, median, Q3 and Max were adopted as SF. B [30]. Data larger than Max or smaller than Min were treated as outliers in each frame.
The extraction steps of SF. B are listed as follows:
(1) Divide the sensor data into frames. Data from different sensors were divided into different sensor frames.
(2) Sort the data in the respective sensor frame according to the data value. Find the median, Q3 and Q1 of the data.
(3) Calculate the interquartile range of the data in the sensor frame: IQR = Q3 − Q1.
(4) Calculate Max and Min of the data in the sensor frame: Min = Q1 − 1.5 IQR; Max = Q3 + 1.5 IQR.
(5) Use Min, Q1, median, Q3 and Max calculated above as the SF. B of the data in the sensor frame.
(ii) Time sequence feature (TSF)
The aforementioned statistics features only consider the distribution of the data. However, the collected data were time-series data. Therefore, a method for extracting TSF was proposed based on hierarchical clustering.
The data value and time step were adopted as clustering attributes to extract the TSF of data in a frame using the following steps. First, the number of TSF points k needs to be set up before hierarchical clustering. Then, the Euclidean distance of every two adjacent sampled points is calculated in the time domain. The two adjacent data points with minimum distance are combined by taking the mean value of those two points. The prementioned process is repeated until the predefined number of feature points k is obtained. The pseudo code for extracting TSF is shown in Algorithm 1.
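Algorithm 1 is not reproduced in this text, so the following Python sketch is only one reading of the procedure described above. It assumes that a frame is a time-ordered list of (time step, value) pairs, that the distance is the plain Euclidean distance on those two attributes without rescaling, and that a merged point is the unweighted mean of the two points it replaces. The function name is illustrative.

```python
import math


def extract_tsf(points, k):
    """Merge adjacent (time, value) points until only k feature points remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pts = [(float(t), float(x)) for t, x in points]
    while len(pts) > k:
        # Euclidean distance between every pair of adjacent points.
        dists = [math.hypot(pts[i + 1][0] - pts[i][0],
                            pts[i + 1][1] - pts[i][1])
                 for i in range(len(pts) - 1)]
        i = dists.index(min(dists))          # closest adjacent pair
        merged = ((pts[i][0] + pts[i + 1][0]) / 2.0,
                  (pts[i][1] + pts[i + 1][1]) / 2.0)
        pts[i:i + 2] = [merged]              # replace the pair by its mean
    return pts
```

Because only adjacent points are merged, the k returned points stay in time order, so they can be read as a coarse polyline of the original frame.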

Data Structure and Feature Fusion
In the previous section, SFC-related data were collected. Considering that the time domain of these sensors was not unified, framing and feature extraction were applied. Two types of feature extraction methods were proposed, statistical features and TSF. In this section, the data structure and fusion features are presented.
(i) Data structure
In this section, the input and output matrices of the SFC models are introduced. As shown in Equation (4), this integrated model tries to find a relationship f between the input matrix X and the output matrix Y. The influence factors of the ith frame were adopted to predict the SFC of the (i + 1)th frame.
For the ith frame, m types of SFC-related information were collected by m sensors installed on board ships as shown in Equation (5).
For the ith frame of every sensor, the feature extraction algorithm was applied to obtain the data feature of that frame. The features of m sensors were adopted to represent the ship status in the ith frame. Then the influence factors X_i were used to estimate the SFC Y_{i+1}.
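The one-frame-ahead pairing of inputs and outputs can be shown with a short Python sketch. It assumes the per-frame feature vectors and the per-frame fuel values are already available as two lists of equal length; the function name is hypothetical.

```python
def make_pairs(frame_features, frame_fuel):
    """Pair the features of frame i with the fuel consumption of frame i + 1."""
    if len(frame_features) != len(frame_fuel):
        raise ValueError("one fuel value is expected per frame")
    inputs = frame_features[:-1]   # X_1 ... X_{n-1}
    targets = frame_fuel[1:]       # Y_2 ... Y_n
    return inputs, targets
```

The last frame has no successor and the first fuel value has no predecessor, so n frames give n − 1 training pairs.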
As mentioned previously, two feature extraction methods were proposed. For SF. A, the Mean, Var, Dif, Med and Mode of the data were extracted for the ith frame of every kind of sensor as shown in Equation (6).
Considering the effect of outliers in the sensor frame, the SF. B was proposed. Min, Q1, Med, Q3 and Max were extracted as SF. B as shown in Equation (7).
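A minimal Python sketch of the SF. B extraction for one sensor frame is given below. The quartile convention is an assumption (the inclusive method of the standard library is used here, and other conventions give slightly different Q1 and Q3 for small frames); the function names are illustrative.

```python
import statistics


def sf_b(values):
    """Return [Min, Q1, median, Q3, Max] with box-plot style margins."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # Min: lower margin
    upper = q3 + 1.5 * iqr   # Max: upper margin
    return [lower, q1, median, q3, upper]


def frame_outliers(values):
    """Values outside [Min, Max] are treated as outliers of the frame."""
    lower, _, _, _, upper = sf_b(values)
    return [v for v in values if v < lower or v > upper]
```

Unlike the mean and variance, these five values barely move when a single extreme reading enters the frame, which is the motivation given for SF. B.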
To extract the time sequence characteristics of the data in a frame, the TSF feature was proposed based on a hierarchical clustering algorithm. The cluster centers c_1, c_2, …, c_k were adopted as the TSF as shown in Equation (8).
(ii) Feature fusion
The statistics features only consider the data distribution in the frame, whereas the TSF also reflects the time sequence characteristics of the data in the frame. Therefore, in this part, different types of features are fused together. For sensor j = 1 to m, the SF. A and SF. B were fused as shown in Equations (9) and (10).
The statistics features were combined with TSF for the purpose of considering both the distribution and time sequence characteristics of sensor-based data in the frame. For sensor j = 1 to m, the SF. A and TSF were fused as shown in Equations (11) and (12). The SF. B and TSF were fused as shown in Equations (13) and (14). Then SF. A, SF. B and TSF were fused together as shown in Equations (15) and (16).
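Since the fusion equations are not reproduced in this text, the following Python sketch assumes that fusion means concatenating, for every sensor, the outputs of the selected feature extractors into one flat input vector per frame. The function name and the two toy extractors in the usage note are hypothetical stand-ins for SF. A, SF. B and TSF.

```python
def fuse_frame(sensor_frames, extractors):
    """Concatenate the features of all sensors for one frame.

    sensor_frames: list of m lists, the values of each sensor in this frame.
    extractors: functions that map a list of values to a list of features.
    """
    fused = []
    for values in sensor_frames:          # sensor j = 1 ... m
        for extract in extractors:        # e.g. SF. A, then TSF
            fused.extend(extract(values))
    return fused
```

For example, with two sensors and the toy extractors `lambda v: [sum(v) / len(v)]` and `lambda v: [min(v), max(v)]`, every frame is mapped to a vector of 2 × 3 = 6 values regardless of how many raw samples each sensor contributed.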

Fuel Consumption Estimation Module
The fuel consumption estimation module was used to model the relationships between the influencing factors x = {s(1), …, s(j), …, s(m)} and the SFC y. In this module, three machine learning methods were applied to SFC estimation based on the influencing factors x.

LR-Based SFC Estimation
LR assumes that the relation between the influencing factors x and the SFC y is linear. The hypothesis and cost function of LR can be written as Equations (17) and (18).
The optimization objective was to find the parameters θ_0, θ_1, θ_2, …, θ_m that minimize the error between the predicted SFC and the real SFC.
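As an illustration of this objective, the sketch below fits the parameters by batch gradient descent on a mean squared error cost. The optimizer, the learning rate and the number of epochs are assumptions chosen for a small toy problem; the source does not state how the LR parameters were actually estimated.

```python
def predict_linear(theta, x):
    """Hypothesis: theta_0 + theta_1 * x_1 + ... + theta_m * x_m."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def fit_linear_regression(X, y, lr=0.05, epochs=5000):
    """Minimize the mean squared error by batch gradient descent."""
    n, m = len(X), len(X[0])
    theta = [0.0] * (m + 1)
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for xi, yi in zip(X, y):
            err = predict_linear(theta, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta
```

In practice the fused features would need to be scaled to comparable ranges before gradient descent is used, otherwise a single learning rate may not suit all parameters.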

SVR-Based SFC Estimation
SVR is an extension of the support vector machine (SVM). In SVR, the influencing factors x = {s(1), …, s(j), …, s(m)} were mapped into a higher dimensional feature space by a kernel function, ϕ, as shown in Equation (19). The widely adopted kernel functions are the linear kernel, the polynomial kernel and the radial basis function (RBF) kernel.
In the higher dimensional feature space, a hyperplane was estimated that minimized the largest distance between the mapped points and the hyperplane, subject to the distance from all mapped points to the hyperplane being less than ε. The optimization objective of SVR is shown in Equation (20).
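The three kernels and the role of ε can be written down in a few lines of Python. The parameter names `gamma` and `coef0` follow a common convention and are assumptions, not values or symbols from the paper; the ε-insensitive loss shown is the standard SVR formulation and is included only to illustrate how deviations smaller than ε are ignored.

```python
import math


def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))


def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * linear_kernel(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors inside the epsilon tube cost nothing; larger ones cost linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)
```

A higher polynomial degree lets the model bend more sharply around the training frames, which is consistent with the degree being treated as a tuning parameter in the experiments.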

ANN-Based SFC Estimation
In this paper, a deep neural network was applied to SFC estimation. The deep neural network consisted of three layers, namely an input layer, a hidden layer and an output layer. For the input layer, the influencing factor x was transferred into the hidden layer as shown in Equation (21).
In the hidden layer, the hidden layer activation function σ was adopted to increase the nonlinear characteristics of network as shown in Equation (22).
Finally, in the output layer, the output layer activation function δ was applied resulting in the network output as shown in Equation (23).
The optimization objective is shown in Equation (24), which was to find weights W and bias b that minimized the error between the predicted and real SFC.
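A forward pass through such a three-layer network can be sketched as follows. The sketch assumes a Tanh hidden activation, an identity output activation (a common choice for regression) and a mean squared error objective; the weights are passed in as plain lists, and training by backpropagation is omitted. All names are illustrative.

```python
import math


def forward(x, W1, b1, W2, b2):
    """Input layer -> Tanh hidden layer -> linear output neuron."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2


def mse_objective(X, y, W1, b1, W2, b2):
    """Mean squared error between predicted and real SFC."""
    errors = [(forward(x, W1, b1, W2, b2) - t) ** 2 for x, t in zip(X, y)]
    return sum(errors) / len(errors)
```

Training then amounts to adjusting W1, b1, W2 and b2 so that `mse_objective` becomes as small as possible on the training frames.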

Data Description
The data used in this article were provided by the Danish University of Technology and come from a passenger roll-on roll-off (ro-ro) ship operating from the port of Thorshavn, capital of the Faroe Islands, to Suduroy [5]. A single voyage takes approximately two hours, with two to three round trips per day. Its main routes are shown via the Elane route data in Figure 6. Route 1 (R1) is the main route, and Route 2 (R2) is the backup route when R1 is experiencing heavy weather and sea conditions. The experimental data contained 52 voyages, including 40 voyages using the R1 route and 12 voyages using the R2 route. The main particulars of the case ship are listed in Table 1.

Multisource Data Analysis
The experimental data were collected by the nine sensors installed aboard the ship, namely speed (V) by the Doppler log stern, headwind (H) and crosswind (C) by a wind sensor on the mast, port and starboard rudder angle (Rpor, Rsta) by a rudder angle indicator astern, port and starboard propeller pitch (Ppor, Psta) by a propeller sensor astern, fuel consumption (Fuel) by a fuel flow meter astern, port and starboard draught (Dpor, Dsta) by a radar level meter amidships, and trim (T) by pitching adjustors at the stem and stern. The sampling periods are shown in Table 2. Statistical analysis was carried out for these sensor-based data, as shown in Figure 7. The notations µ and σ in Figure 7 denote the mean value and variance of the normal distribution. The headwind satisfied the normal distribution almost exactly. In total, 83.8% of the data were distributed between 0 and 25 m/s. For crosswind, the distribution was not normal and had two peaks, at −10 m/s and 0 m/s. A total of 77.73% of the data were between −10 and 10 m/s, showing that the crosswind from the port and starboard sides was nearly equal (see Figure 7a).
The draught also satisfied the normal distribution. The draught had little change, from 5 to 6 m. This reflected that the ship was a passenger ship and did not need to load cargo, so loading conditions did not change significantly. The draught of the port and starboard were nearly equal, indicating that the ship did not list (see Figure 7b).

Optimization of Frame Size and Overlap Size
As previously mentioned, the moving-overlapped frame can better maintain the continuity of the data in comparison with the nonoverlapped frame. To verify this, an indicator termed the mean interval error (MIE) was defined as the mean value of the data interval error.
The formula for calculating MIE is shown in Equation (25). In this section, the influence of frame size and overlap size on MIE is discussed. This paper compares two different framing methods, namely nonoverlapped frames and moving overlapped frames. The pseudo code of the data processing is shown in Algorithm 2. The frame size was set to 30, 50, 70, 90 and 110 s and the overlap size was set to 20, 40 and 60% of the frame size, respectively. The results of the MIE are shown in Table 3 and Figure 8.
As can be seen from Figure 8, when the frame size changed from 30 to 110 s, the MIE of the moving overlapped frame was always lower than that of the nonoverlapped frame. As the frame size increased, the MIE of both the moving overlapped frame and the nonoverlapped frame increased. For the overlap size, the larger the overlap size, the smaller the MIE. It can be seen that the moving overlapped framing method better maintained the continuity characteristics of the original sensor data.

Optimization of TSF Clusters
In this section, the cluster number of the TSF is discussed. An indicator called area error (AE) is given in Equation (26). The AE is the deviation between the area under the original data curve and the extracted TSF feature. The smaller the AE, the better the extracted TSF feature preserves the original data information.
Experimentally, the cluster number was set to 2, 3, 4 and 5. The results are shown in Table 4. The experimental results showed that with the increase in cluster number, the AE decreased, which means that the larger the cluster number, the better the TSF maintained the information of the original data.
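Equation (26) is not reproduced in this text, so the Python sketch below shows only one plausible reading of the AE: the absolute difference between the trapezoidal area under the original frame data and the trapezoidal area under the polyline through the TSF points. Both the integration rule and the absence of any normalization are assumptions, and the function names are hypothetical.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve through time-ordered (time, value) points."""
    return sum((t1 - t0) * (x0 + x1) / 2.0
               for (t0, x0), (t1, x1) in zip(points, points[1:]))


def area_error(original, tsf_points):
    """Assumed AE: absolute difference between the two areas."""
    return abs(trapezoid_area(original) - trapezoid_area(tsf_points))
```

Under this reading, a TSF with more points follows the original curve more closely, so its area differs less from the original one, which matches the reported trend that the AE decreases as the cluster number grows.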

Setting of Feature Fusion
In this section, the AE of features after fusion is discussed. From Section 4.3.2, it is concluded that the AE can indicate the ability of features to preserve original data information. The ability of the fused features (SF. A, TSF), (SF. B, TSF) and (SF. A, SF. B, TSF) was evaluated, and the results are shown in Table 5. The results show that the AE decreased when statistics features were fused with TSF. The (SF. A, TSF) fusion decreased the AE by 11.9%, while the (SF. A, SF. B, TSF) fusion decreased the AE by 35.6%. The best performance was obtained by (SF. B, TSF) with a decrease of 41.4% in the AE. In general, with fused features, the AE decreased and the data information was better preserved.

Evaluation Indicator
The root mean squared error (RMSE) is the square root of the mean squared deviation of the estimated SFC from the real SFC over m observations. A smaller RMSE means that the estimation results obtained by the integrated SFC estimation model better approximate the real values. The RMSE is calculated as shown in Equation (27), where $\hat{y}_{i+1} = f(x)$ is the estimated value and $y_{i+1}$ is the real value.
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{i+1} - y_{i+1}\right)^{2}} \qquad (27)$$
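As a minimal sketch of this indicator, the RMSE can be computed in a few lines of Python. The function name `rmse` and the sample values are illustrative assumptions and are not taken from the experiments.

```python
import math


def rmse(estimated, real):
    """Root mean squared error between estimated and real values."""
    if len(estimated) != len(real) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(real)  # number of observations
    squared = sum((e - r) ** 2 for e, r in zip(estimated, real))
    return math.sqrt(squared / m)


# Illustrative values only, not measurements from the case ship.
example = rmse([0.0, 0.0], [3.0, 4.0])  # sqrt((9 + 16) / 2)
```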

Comparison of SFC Models
To investigate how different feature extraction methods perform with various machine learning methods, three machine learning methods were adopted to train SFC models, namely LR, SVR and ANN. Different feature extraction methods were compared, namely SF. A, SF. B and TSF. Moreover, statistical features were combined with TSF, giving the combinations (SF. A, TSF), (SF. B, TSF) and (SF. A, SF. B, TSF). The SF. A and SF. B were also combined. The parameters of the machine learning methods are discussed below.
The penalty factor in SVR had little effect on the estimation accuracy and was set to 1.0 in the experiments. Three kernel functions were adopted, namely radial basis function (RBF), linear and polynomial. For the polynomial kernel function, the polynomial degree was varied between 3 and 11. The results are illustrated in Figure 9a, which shows that the polynomial kernel function of degree 9 obtained the best estimation accuracy.
In the ANN, two types of activation functions were adopted, namely Tanh and Sigmoid. The best estimation results were obtained by the Tanh activation function with two hidden layers, the first with 50 neurons and the second with 30 neurons, as shown in Figure 9b.
With the optimized parameters, the proposed feature extraction methods were compared with the method adopted in [5]. The results are shown in Table 6 and Figure 10. As can be seen from Table 6 and Figure 10, the accuracy of ANN was the best, followed by LR and SVR (polynomial).
In summary, with the proposed model, the estimation accuracy could be improved for some frequently adopted machine learning methods. The best estimation result was obtained using (SF. B, TSF) with ANN. For LR, better performance was achieved by (SF. A, SF. B, TSF). The SF. A was more useful for SVR (polynomial).
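For reference, the three SVR kernel functions compared in this section can be written out in their common textbook forms. This is a sketch, not the experimental code: the parameters `gamma` and `coef0` and their default values are assumptions, since only the penalty factor and the polynomial degree are reported here.

```python
import math


def dot(x, z):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=9, gamma=1.0, coef0=1.0):
    # degree=9 is the best-performing degree reported in the text;
    # gamma and coef0 are assumed defaults.
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    # gamma is an assumed default controlling the kernel width.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Sweeping the polynomial degree over the range used in the experiments.
x, z = [1.0, 0.5], [0.5, 1.0]
by_degree = {d: polynomial_kernel(x, z, degree=d) for d in range(3, 12)}
```

A higher polynomial degree lets the kernel represent more strongly nonlinear relations between the fused features and the fuel consumption, which is one plausible reason why the degree has to be tuned.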

Fuel Consumption Estimation of Real Voyages
Once the model had been established, the SFC of a real ship on voyages could be estimated. As mentioned in Section 4.1, there were two routes for the case ship, with 40 voyages for R1 and 12 voyages for R2. Based on the comparison of machine learning methods and features in Section 4.4, the ANN combined with the (SF. B, TSF) features was adopted.
For both R1 and R2, a boxplot of the estimation error is shown in Figure 11. The RMSE values of R1 and R2 were extremely close to each other, varying from 0.002 to 0.009. The model performed well for both R1 and R2, which indicates that it can perform favorably on different routes.
The estimation accuracy for different MIE values was also examined, and the R-squared for different MIE values on both routes is shown in Figure 12. For R2, the fuel consumption was mainly distributed between 0.60 and 0.70. For R1, it was also mainly distributed between 0.60 and 0.70, but a large part still lay between 0.50 and 0.60. The experimental results showed that as the MIE increased, the R-squared of both routes decreased. A larger MIE made it more difficult to estimate the fuel consumption rate exactly.
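The R-squared used in this comparison can be sketched with its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions, and the exact variant used for Figure 12 is not stated here.

```python
def r_squared(estimated, real):
    """Coefficient of determination of estimates against real values."""
    if len(estimated) != len(real) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_real = sum(real) / len(real)
    ss_res = sum((r - e) ** 2 for e, r in zip(estimated, real))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    if ss_tot == 0.0:
        raise ValueError("real values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not data from routes R1 or R2.
perfect = r_squared([0.61, 0.65, 0.70], [0.61, 0.65, 0.70])
baseline = r_squared([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])  # always the mean
```

An estimate that matches the real values exactly gives 1, while always predicting the mean of the real values gives 0, so a decreasing R-squared with increasing MIE means the estimates drift toward the quality of that simple baseline.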

Conclusions
This study proposed an integrated SFC model consisting of three parts: a multisource data collection module, a heterogeneous data feature fusion module and a fuel consumption estimation module. In the multisource data collection module, the data types and collection methods were introduced. In the heterogeneous data feature fusion module, feature extraction and fusion were adopted to fuse the heterogeneous data. In the fuel consumption estimation module, three machine learning methods were used to train the integrated SFC model.
The academic contributions of this paper can be summarized as follows. (a) To unify the time domain of multisource data, the framing method was adopted. Two framing methods were compared, and the moving overlapped frame was found to be more effective. (b) The TSF feature was proposed to capture the time sequence characteristics of the data. Statistical features and TSF were fused to account for both data distribution and time sequence. (c) The fused (SF. B, TSF) feature achieved better estimation accuracy with ANN, especially for R1.
In further studies, we will continue to improve the framing and feature fusion methods to obtain a smaller MIE and better accuracy of fuel consumption estimation. We will also examine the application of the proposed model to other real-world cases of SFC analysis and prediction.