Big Data Analytics and Machine Learning of Harbour Craft Vessels to Achieve Fuel Efficiency: A Review

Abstract: Greenhouse gas emissions from shipping activities are one of the factors contributing to global warming; thus, there is an urgent need to mitigate the adverse effects of climate change. One of the key strategies is to build a vibrant maritime industry through the use of innovation, digital technologies and intelligent systems. The digitalization of the shipping industry not only provides a competitive edge to the shipping business model but also enhances ship operational and energy efficiency. This review paper focuses on the big data analytics and machine learning applied to harbour craft vessels with the aim of achieving fuel efficiency. The paper reviews the telemetry system required for the digitalization of harbour craft vessels, the challenges in its installation, and the vessel monitoring and data transmission system. The commonly used methods for data cleaning are also presented. Finally, the paper considers two types of machine learning systems, i.e., supervised and unsupervised. The multi-linear regression and hidden Markov model for supervised machine learning, and the artificial neural network, grey box model and long short-term memory model for unsupervised machine learning, are discussed, and their pros and cons are presented.


Impact of Industrial Revolution on Shipping Industry
The shipping industry has been transformed by each industrial revolution: steam engines were adopted in the First Industrial Revolution, followed by the widespread use of electrical and combustion engine-powered vessels in the Second. The industry then experienced a digital transformation in the Third Industrial Revolution, and is now at the dawn of the Fourth Industrial Revolution, which focuses on smart shipping based upon the integration of the internet of things (IoT), intelligent systems and innovative solutions. The transformation of the shipping industry from the First to the Third Industrial Revolutions has given maritime and shipping businesses a competitive edge. One of the digitalization technologies is the use of big data analytics (BDA) and machine learning (ML) to achieve fuel efficiency. The improvement in energy efficiency plays a role in reducing emissions intensity, which is among the main goals of the Paris Climate Accord [1]. In addition, the United Nations shipping agency, the International Maritime Organization (IMO), has reached an agreement to cut carbon emissions by at least 50 per cent by 2050, compared with 2008 levels [2]. Given the significant benefits that the First to Third Industrial Revolutions brought to the shipping industry, the Fourth Industrial Revolution, which focuses on harnessing infocomm technologies, networks and big data to create tech-enabled solutions, has encouraged shipping companies to embrace innovative digital solutions.
Several measures are in place to mitigate carbon emissions from shipping activities:
• Energy Efficiency Measurement Index: The IMO has implemented ship energy efficiency measures in Annex VI of MARPOL [7], which include the Energy Efficiency Design Index (EEDI) at the ship design stage, the Ship Energy Efficiency Management Plan (SEEMP) at the ship operational planning stage and the Energy Efficiency Operational Indicator (EEOI) for monitoring energy efficiency and collecting data for continuous improvement in terms of carbon emissions. The proposed ship energy efficiency concept aims to minimize GHG emissions by developing ways to lower fuel usage, designing more efficient ships and switching to alternative fuels that emit less GHG [8].
• Alternative Marine Fuels: To meet increasingly strict emission regulations, alternative fuels such as liquefied natural gas (LNG), ammonia, methanol and liquid hydrogen have become a more important part of the energy mix. LNG trade has expanded dramatically from 100 million tonnes in 2000 to approximately 300 million tonnes in 2017. However, LNG still generates carbon emissions, although significantly less than diesel. Ammonia and hydrogen can be produced from hydrocarbons, whereas green ammonia and green hydrogen, which are produced by electrolysis powered by renewables or nuclear energy, are excellent sources of zero-emission fuel [9]. Methanol, on the other hand, is easier to store and handle than LNG. However, ammonia, hydrogen and methanol have a lower energy content than conventional fuel.
• Electrification: Electrification of marine vessels is becoming more commercially viable due to steadily declining battery costs driven by the growth of electric cars. Several commercial electric vessels have also been built. For example, the 4.3 MWh all-electric ferry, Ellen (Figure 1a), was built in the framework of the EU's Horizon 2020 program and is estimated to save 2000 tons of CO2 per year in its operation. A small electric cruise ship, Brime Explorer (Figure 1b), and the Grimaldi GGSG ro-ro freighter (Figure 1c) were built to operate in Norway's fjords and the Mediterranean, respectively [10]. Nevertheless, there are several challenges in marine electrification, especially in the charging infrastructure, voyage distance and weight issues [11].
There are also other means of mitigating carbon emissions, such as wind-assisted ship propulsion [12], the use of Flettner rotors [13], slow steaming [14] and many more. This review paper focuses on the digitalization of marine vessels, in particular harbour craft vessels (HCV), utilizing BDA and ML to achieve fuel efficiency. The paper is arranged as follows: Section 2 describes the various digitalization technologies applied in the maritime industry; Section 3 describes the general framework for BDA and ML systems applied to ships to achieve fuel efficiency; Section 4 covers the types of data acquisition systems; Section 5 discusses data filtering and preparation; Section 6 covers the BDA and ML models commonly used for HCV. Finally, the Conclusion is provided in Section 7.

Digitalization in Maritime Industry
Digitalization using BDA and ML has been utilized in the shipping industry to improve operational efficiency and productivity and to enhance fuel efficiency. BDA and ML are used to reorganize huge amounts of unstructured data and analyze these data to establish correlations between diverse aspects that are difficult for human analysts to identify. ML helps to accelerate the process of BDA and is used to uncover trends and patterns in the data. Research conducted across industries has shown that those pursuing digitalization have achieved significant improvements, which in turn have led to better economic performance [15]. Although digitalization in the shipping industry has been relatively slow compared with other industries, big players such as Rolls-Royce and Wärtsilä have set up research and development centres to explore remote and autonomous shipping [16,17].
Machinery built for ships often does not reach its expected lifespan. This might be due to a lack of maintenance or the inability to detect faults early enough to prevent catastrophic damage. To reduce such risks, preventive maintenance that combines the evaluation of equipment condition via periodic checks (i.e., BDA) with continuous equipment condition monitoring (i.e., IoT) makes the maintenance process much more efficient, as remote diagnostics of ships' machinery become available [18]. When faults are discovered immediately, further engine damage can be prevented, thereby reducing the amount of fuel consumed. This in turn reduces GHG emissions and also results in 10-35 per cent more cost-effective operations [19]. According to a report in [20], predictive maintenance can reduce unexpected failures by 55%, and maintenance costs are expected to fall by an estimated 25% to 30%. Companies are also utilizing advanced technology such as weather routing, which gives ships adequate time to avoid bad weather [21,22]. In addition, such technology allows the gas emissions and cargo temperatures of a company's seagoing assets to be monitored from shore, thereby reducing maintenance costs and the risk of failure due to negligence [23]. Another technology that helps to reduce fuel consumption is the Marine Growth Prevention System (MGPS). The MGPS combats marine organism growth and prevents organisms from depositing on the ship's systems, thus helping to eliminate corrosion. The MGPS aids the efficient operation of the seawater-supplied systems and machinery [24], and reduces the ship's resistance. This in turn increases the energy savings and reduces the fuel consumption of a ship [24].
To stay ahead in an ever-evolving environment, there is a growing emphasis on the adoption of digital technologies employing BDA and ML in the maritime industry [25-27]. Digitalization enables a large amount of data to be collected, stored and processed, so that many aspects of marine operations can be conducted efficiently and effectively via digital platforms. BDA and ML for fuel efficiency are widely implemented in commercial vessels such as container ships, oil tankers and cruise liners. However, little work has been carried out on HCV such as tugboats, patrol vessels and ferries. The carbon emissions from HCV should not be overlooked; therefore, countries such as Singapore have allocated a substantial amount of funding to encourage the harbour craft sector to invest in digital solutions [28]. In the next section, the BDA and ML frameworks applied to HCV are presented.

State-of-the-Art
Presented herein is the state-of-the-art literature from the past two decades on BDA and ML used for achieving fuel and energy efficiency in ships. The latest developments of BDA and ML in maritime-related research are drawn from sources published in scientific journals and conferences, and the practical application of BDA and ML in the industry is also reviewed from information found in the public domain.

Big Data Analytics
BDA can be used to identify the patterns and correlations between fuel consumption and the recorded environmental and ship data, in order to improve fuel efficiency through optimal vessel speed and voyage route. Research on vessel fuel consumption using BDA has been conducted in [29,30], where it is reported that the power, and thus the fuel, required to propel the ship through water depends on the trim of the vessel. The optimum speed for fuel consumption established at the time of vessel delivery changes over time due to a variety of factors such as engine wear and the condition of the hull coating. The use of BDA can help shipowners determine the optimum speed for fuel consumption, taking into consideration factors such as bunker cost, freight rates and schedules [31]. Using BDA to calculate potential fuel savings can give shipowners detailed insight into all aspects of vessel operations that affect fuel efficiency, and thus inform return on investment (ROI) decisions. Additionally, in 2016 Fujitsu Laboratories developed technology that uses BDA to estimate the fuel efficiency, speed and other performance measures of large ships in actual sea conditions [32]. Other research on suggesting optimal vessel speed decisions in maritime logistics using weather big data has also been conducted [33,34]. A report in [35] also indicates that BDA has helped shipping lines such as Maersk cut fuel consumption by 13%. BDA has also been used to investigate the speed optimization process for large container ships [36].

Machine Learning
Machine learning can be defined as the application of artificial intelligence that allows computer systems to learn from their environment, improve from experience without explicit programming and take action without human intervention. Machine learning focuses on enabling algorithms to learn from the data provided, gather insights and make predictions on previously unanalyzed data using the information gathered [37].
The shipping industry has been keeping an eye on the development of ML that can customize container freight and overcome tough problems encountered in everyday operations. For instance, ML can be used to forecast the estimated time of arrival (ETA), even when there is congestion at transshipment points, weather-related difficulties, overbooking issues or equipment shortages. The algorithm learns from past data, thus providing a more reliable forecast [38]. To improve fuel efficiency and vessel performance at sea, an algorithm was developed for an intelligent fuel oil consumption monitoring system that can propose an optimal trim condition to minimize the ship resistance during a voyage [39]. Fujitsu Laboratories has developed a new artificial intelligence technology and teamed up with Mitsui O.S.K. Lines, Ltd. (MOL) to improve fuel efficiency and reduce the CO2 emissions of ships through the use of operational big data [40]. ML is also used for fouling analysis to predict more accurate dry docking, cleaning and coating schedules, as vessel fuel consumption is affected by the vessel speed and the fouling of the ship [41]. An ML approach was also developed by the Technical University of Denmark to predict the main engine energy consumption under realistic operational conditions [42]. An artificial neural network (ANN)-based decision support system has also been developed for cargo vessel operation by employing a combination of traditional statistical analysis and ANNs [27].

Ship Energy Efficiency
The ship energy efficiency depends significantly on the ship resistance and propulsion. Theoretically, for a newly built ship, the total ship hull resistance is obtained from the bare hull resistance test, where the ship without a propeller is run in calm water in the towing tank at a constant speed $V$ and the ship hull resistance is measured by the computer on the towing carriage. The ship hull resistance $R_T$ is then used to calculate the effective power, $P_E$, of the marine engine as follows [43]:
$$P_E = R_T \, V \qquad (1)$$
As the bare hull resistance test does not take into consideration the effects of the propeller and waves, or mechanical losses such as those in the shaft and gears, the final ship engine power, measured as the Total Engine Brake Power, $P_{TEB}$, has to take these loss coefficients, $\eta_{Losses}$, into account, as shown in Equation (2) [44]. The fuel consumption of the ship depends significantly on $P_{TEB}$: the greater the $P_{TEB}$ required, the higher the fuel consumption, and vice versa. However, in practice, when the ship is operating at sea, the amount of fuel required to achieve the same $V$ might differ from ship to ship depending on the performance of the engine and on changes in the ship resistance caused by factors such as hull fouling, weather conditions and water depth. Thus, a more reliable way to assess fuel consumption efficiency is by direct measurement on vessels at sea and by the use of the Energy Efficiency Index (EEI), given in Equation (3) [45], where $\Delta$ is the ship displacement, $FC$ the fuel consumption and $V$ the vessel speed. If Equations (1)-(3) are arranged into a single equation, the relationship between the EEI and $P_{TEB}$, $R_T$, $V$, $\Delta$ and $FC$ is obtained as Equation (4). Equation (4) shows that the vessel speed, mean draft and trim are among the few parameters that, if properly adjusted, may reduce fuel consumption and carbon emissions, thereby increasing the energy efficiency.
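The relationship between hull resistance, speed and engine power can be illustrated numerically. The sketch below computes the effective power as $P_E = R_T V$ and then scales it up to a brake power. It assumes, purely for illustration, that the loss coefficient acts as a single overall efficiency between 0 and 1, so that the brake power is the effective power divided by that efficiency; the source's Equation (2) may be formulated differently. The function names and the input values are hypothetical.

```python
KNOT_TO_MS = 1852.0 / 3600.0  # 1 knot = 1852 m per hour


def effective_power_kw(resistance_kn: float, speed_knots: float) -> float:
    """Effective power P_E = R_T * V, in kW (kN * m/s = kW)."""
    if resistance_kn < 0 or speed_knots < 0:
        raise ValueError("resistance and speed must be non-negative")
    return resistance_kn * speed_knots * KNOT_TO_MS


def brake_power_kw(p_effective_kw: float, eta_losses: float) -> float:
    """Brake power from effective power.

    ASSUMPTION: eta_losses is an overall efficiency in (0, 1] lumping
    together propeller, wave and mechanical losses; this is an
    illustrative form, not the source's Equation (2).
    """
    if not 0.0 < eta_losses <= 1.0:
        raise ValueError("eta_losses must be in (0, 1]")
    return p_effective_kw / eta_losses


# Hypothetical tugboat figures, for illustration only.
p_e = effective_power_kw(resistance_kn=50.0, speed_knots=10.0)
p_teb = brake_power_kw(p_e, eta_losses=0.6)
print(f"P_E = {p_e:.1f} kW, P_TEB = {p_teb:.1f} kW")
```

Under these assumptions the brake power always exceeds the effective power, which is consistent with the statement that fuel consumption rises with the required $P_{TEB}$.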
For HCV, the draft and trim do not significantly change due to their scale, thus one of the methods to improve the EEI is via the adjustment to the vessel speed. In addition, the speed of the vessel also depends on the shipping route and the environmental conditions such as the wind and current velocity. Thus, the influence of the vessel speed, wind and current velocity on energy efficiency has to be taken into account.

Digitalisation Framework
The BDA and ML framework for a tugboat is described here. Although the framework is targeted at discovering the knowledge domain of fuel consumption in tugboats to achieve fuel efficiency, it is also applicable to other applications such as preventive maintenance and route optimization, as described in Section 2. Figure 2 shows the schematic diagram of the digitalization process, in which data are collected on the ship and transmitted to the land-based system via the network middleware. The information processed via BDA is then transmitted back to the control bridge for ship route decision making. The data are continually fed to the ML system to improve its decision-making capability and thus the energy efficiency. Machine learning with the BDA process involves several components, i.e.:
• Telemetry: sensors and data acquisition;
• Vessel monitoring system;
• Network middleware;
• ML with BDA system.
The ML with BDA involves two stages, i.e., STAGE 1, descriptive analytics, and STAGE 2, predictive and prescriptive analytics (see Figure 3). Descriptive analytics finds the patterns or correlations between the EEI and the various factors that have been identified, such as the vessel speed V, vessel displacement ∆, vessel route, fuel consumption FC, wind velocity and current flow. Once the patterns and correlations have been identified, predictive analytics are used to predict the behavior of the vessel from the metocean data when the vessel is travelling under different scenarios, such as at a specified vessel route and vessel speed. These possible scenarios or solutions are then fed to the situation room for decision making.
As presented in Figure 3, STAGE 1 involves only descriptive analytics, where data are collected from the ship and fed into the BDA software to establish the patterns and correlations, as specified in Equations (1)-(3). In STAGE 2, the information is transferred to the situation room, which advises the vessel on the optimized speed and route that could improve the EEI. This is a continuous loop: data are collected continuously from the ship and used to train the ML system with BDA, so that the decision-making process is continuously improved.
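The descriptive analytics stage can be pictured as a search for correlations between a target quantity and the recorded factors. The following is a minimal sketch that uses the Pearson correlation coefficient as one possible measure; the source does not state which statistic the BDA software uses. The record names and the sample values are hypothetical and serve only to show the mechanics.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def correlation_with_target(records, target):
    """Correlate every recorded factor with the target series."""
    return {
        name: pearson(values, records[target])
        for name, values in records.items()
        if name != target
    }


# Hypothetical logged data: one value per sampling interval.
records = {
    "speed_kn": [6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
    "wind_ms": [3.0, 8.0, 2.0, 6.0, 4.0, 7.0],
    "fuel_lph": [40.0, 55.0, 70.0, 95.0, 120.0, 155.0],
}

corr = correlation_with_target(records, target="fuel_lph")
for factor, r in sorted(corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{factor}: r = {r:+.3f}")
```

A linear coefficient of this kind only captures linear dependence; factors with a strongly nonlinear influence on fuel consumption would need a different measure or a transformed variable.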

Telemetry
Depending on the application of ML, various types of sensors must be installed on the ship to collect the necessary data.

Mass Flowmeter
For ship energy efficiency, one of the most important data items is the fuel consumption, which can be collected by a flowmeter. There are several types of flowmeter on the market, such as the volumetric flowmeter, which measures the volume of fuel consumed, and the more accurate Coriolis mass flowmeter, which measures the mass of fuel consumed. The installation of Coriolis mass flowmeters was made mandatory by the Maritime and Port Authority of Singapore (MPA) from 1 July 2019 [46] to increase fuel quality and reliability, and also to prepare the sector for the rise in distillate bunker fuel deliveries after the IMO implemented a 0.5 per cent worldwide sulphur cap on 1 January 2020 [47]. A Coriolis meter is based on the principles of motion mechanics, as shown in Figure 4. As fluid enters the device, a driving coil induces the flow tubes to vibrate in opposition at their natural resonant frequency, thereby creating sine waves. The flowing fluid induces a Coriolis force, which causes the flow tubes to twist. The density of the fluid is obtained from the frequency of the sine waves, while the mass flow is obtained from the phase shift between them, and the readings are highly accurate, with typical measurement errors of ±0.2 per cent. The Coriolis mass flowmeter is equipped with built-in sensors for temperature, pressure and density measurements and with a display system [48]; therefore, the results can be stored and transmitted readily. Moreover, the Coriolis mass flowmeter requires little maintenance as it has no moving parts. The volumetric flowmeter, on the other hand, has moving parts that can degrade over time, resulting in inaccurate readings. Additionally, volumetric meters rely on separate temperature and pressure gauges, which may be inaccurate or readily tampered with.
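To turn a stream of flowmeter readings into a fuel consumption figure, the logged mass flow rate has to be integrated over time. The sketch below does this with the trapezoidal rule, and also shows the density conversion that a volumetric reading would need before it becomes comparable with a mass reading. The function names, the sampling layout (timestamps in seconds, flow in kg/h) and the sample values are assumptions made for illustration, not part of any flowmeter's interface.

```python
def fuel_consumed_kg(timestamps_s, mass_flow_kg_per_h):
    """Integrate mass flow rate samples over time (trapezoidal rule)."""
    if len(timestamps_s) != len(mass_flow_kg_per_h):
        raise ValueError("timestamps and flow samples must align")
    if len(timestamps_s) < 2:
        raise ValueError("at least two samples are required")
    total_kg = 0.0
    for i in range(1, len(timestamps_s)):
        dt_h = (timestamps_s[i] - timestamps_s[i - 1]) / 3600.0
        if dt_h <= 0:
            raise ValueError("timestamps must be strictly increasing")
        mean_flow = 0.5 * (mass_flow_kg_per_h[i] + mass_flow_kg_per_h[i - 1])
        total_kg += mean_flow * dt_h
    return total_kg


def volume_to_mass_kg(volume_l, density_kg_per_l):
    """Convert a volumetric reading to mass; density varies with temperature."""
    if volume_l < 0 or density_kg_per_l <= 0:
        raise ValueError("volume must be >= 0 and density must be > 0")
    return volume_l * density_kg_per_l


# Hypothetical one-hour log sampled every 30 minutes.
timestamps = [0, 1800, 3600]
flow = [100.0, 120.0, 100.0]
print(f"Fuel consumed: {fuel_consumed_kg(timestamps, flow):.1f} kg")
```

The need for an external density value in the second function reflects why the mass-based meter is the more direct measurement: a volumetric reading is only as good as the temperature and density data paired with it.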

Wind Sensor
The ship energy efficiency depends on the ship resistance (Equation (4)), which in turn is affected by environmental conditions such as wind, waves and current. Wind sensors, as shown in Figure 5, are used to collect data on the wind speed and direction experienced by the ship. The mechanical wind sensor shown in Figure 5a measures wind data using moving parts, namely a rotating cup for the wind speed and a vane for the wind direction. The time it takes a mechanical sensor to physically start up or register a change in wind direction causes variances in the recorded wind speed [49]. For example, if a storm passes through a region and the wind abruptly changes direction, the sensor must slow down, stop and then resume to keep up with the shift. This inaccuracy can be overcome by the ultrasonic wind sensor shown in Figure 5b. The ultrasonic wind sensor [49], also known as a sonic anemometer, uses a microcontroller to measure the travel time of ultrasonic pulses to compute the wind speed. It has no moving parts and thus requires less maintenance and has a longer lifespan. Inertia does not affect the ultrasonic sensor, so it is capable of measuring changes in wind direction or strong gusts immediately and in real time.
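The travel-time principle of the ultrasonic sensor can be sketched as follows. For a pair of transducers separated by a path length L, a pulse travelling with the wind arrives sooner than one travelling against it, and the wind component along the path follows from the difference of the reciprocal travel times. This is the commonly used time-of-flight relation; the source does not give the formula, and the path length, function names and numbers below are hypothetical.

```python
import math


def wind_along_path_ms(path_m, t_forward_s, t_backward_s):
    """Wind component along the path from transducer A to transducer B.

    t_forward_s is the pulse travel time A -> B and t_backward_s is B -> A.
    With t_f = L / (c + v) and t_b = L / (c - v), the sound speed c cancels:
    v = (L / 2) * (1 / t_f - 1 / t_b).
    """
    if path_m <= 0 or t_forward_s <= 0 or t_backward_s <= 0:
        raise ValueError("path length and travel times must be positive")
    return 0.5 * path_m * (1.0 / t_forward_s - 1.0 / t_backward_s)


def sound_speed_ms(path_m, t_forward_s, t_backward_s):
    """Speed of sound recovered from the same pair of travel times."""
    if path_m <= 0 or t_forward_s <= 0 or t_backward_s <= 0:
        raise ValueError("path length and travel times must be positive")
    return 0.5 * path_m * (1.0 / t_forward_s + 1.0 / t_backward_s)


def horizontal_wind_speed_ms(u_ms, v_ms):
    """Magnitude of the wind from two perpendicular path components."""
    return math.hypot(u_ms, v_ms)


# Hypothetical 0.15 m path with a 5 m/s wind and 340 m/s sound speed.
path = 0.15
t_f = path / (340.0 + 5.0)
t_b = path / (340.0 - 5.0)
print(f"Wind along path: {wind_along_path_ms(path, t_f, t_b):.2f} m/s")
```

Because the speed of sound cancels out of the wind expression, the result does not depend on air temperature, and since nothing has to spin up or slow down, a change in wind shows up in the very next pair of pulses.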

Other Sensors
Other sensors can be installed onboard the ship to improve the accuracy of fuel consumption prediction. One example is the acoustic Doppler profiler (ADP), which measures the current flow, as the resistance of the ship is significantly affected by the current. However, installing an ADP is expensive and requires modification of the ship's structural configuration. Tidal tables are thus used to predict the average current flow when the real-time current profile is not available. A driveshaft RPM sensor is also used to record the rotational speed of the propeller shaft. The RPM data can be used to categorize the operational activities of the HCV by comparison with the fuel consumption and environmental data in the ML process.
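As a rough illustration of how shaft RPM can help categorize operational activities, the sketch below labels each sample with a simple rule that combines RPM with vessel speed. This is a hand-written rule set standing in for the ML-based categorization mentioned above, not the method used in the source; the activity names and threshold values are hypothetical and would have to be tuned for a particular vessel.

```python
def classify_activity(rpm, speed_kn, idle_rpm=600.0, transit_speed_kn=5.0):
    """Label one sample with a coarse operational activity.

    ASSUMPTION: idle_rpm and transit_speed_kn are illustrative thresholds.
    High RPM at low speed is read as the vessel working against a load
    (e.g. towing, pushing or manoeuvring).
    """
    if rpm <= 0:
        return "engine_off"
    if rpm < idle_rpm:
        return "idling"
    if speed_kn < transit_speed_kn:
        return "towing_or_manoeuvring"
    return "transit"


def activity_share(samples):
    """Fraction of samples spent in each activity."""
    counts = {}
    for rpm, speed_kn in samples:
        label = classify_activity(rpm, speed_kn)
        counts[label] = counts.get(label, 0) + 1
    total = len(samples)
    return {label: n / total for label, n in counts.items()}


# Hypothetical (rpm, speed in knots) samples from a tugboat log.
log = [(0, 0.0), (500, 0.2), (1500, 2.0), (1500, 9.0), (1600, 10.0)]
print(activity_share(log))
```

Labels of this kind could then be compared against the fuel consumption and environmental data, for instance to separate fuel burned in transit from fuel burned while working.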

Simulated/Online Data
Other than data accumulated from sensors, data could also be obtained from online platforms to aid the analysis. Examples include weather forecast data and vessel route data, which could be used to plan voyages and avoid routes with bad weather forecasts. Sea traffic data, such as that shown in Figure 6, are available from providers such as MarineTraffic [50] and VesselFinder [51]. Data collected through simulations can also be used to optimize the voyage and the efficiency of the ship, where multiple simulations can be run to determine the safest and most fuel-efficient route. Additionally, simulations can be run to determine the optimal speed of the ship to reduce fuel consumption and carbon emissions. An example of ship resistance estimation is given in Figure 7.

Challenges in Sensors Installation
There are several challenges in the installation of sensors, such as the capital, operation, maintenance and troubleshooting costs. The installation and troubleshooting of the sensors must also accommodate the busy operational schedules of the HCV to minimize disruption. In some cases, the control system or sensor may not issue a warning when there is an intermittent malfunction. There may also be cases where the system is unaware of a program failure and is therefore unable to offer feedback to the user. Troubleshooting is then required to identify the faults and bring the sensors back into operation. Inclement weather makes it difficult to work in the marine environment; therefore, troubleshooting can often be carried out only onshore. Additionally, conducting research experiments on board a ship may interfere with the schedule of the ship operations and so become prohibitively expensive [52]. Therefore, it is recommended that the ship crew be educated and trained to troubleshoot and rectify faults to minimize the downtime and potential inconvenience caused.
The conventional approach to ship cyber security is based on the premise of keeping ship systems isolated from the Internet and from ship/company intranets, which is still the standard for many safety- and security-sensitive ship owners. Even when the proprietary interface is accessible, data transmission is often serial and unidirectional [52]. With so many technologies onboard a ship, cyber-attacks pose a significant threat. Such attacks may trigger ecological disasters such as oil spills by activating remotely controlled or automated discharge valves, or cause groundings or accidents by maliciously manipulating GPS signals and receivers [23]. For prevention, companies have to invest heavily in cyber security, as incidents of such magnitude could undermine the efforts to reduce the effects of climate change.

Vessel Monitoring System and Data Transmission
Data collected from the ship or the environment are recorded and stored in the data acquisition system, and monitoring systems are usually installed onboard the ship for continuous monitoring of the operational conditions. Network middleware is used for the communication and management of data, which are transmitted to shore for further analysis. There are several challenges in transmitting data to shore, as vessels often operate in regions outside the coverage of the shore station and thus have to deal with unstable connections. Synchronizing big files between shore and the vessels can also become a significant issue due to the possibility of duplicate data and time lapse between machines. Some technologies can assist in the replication/synchronization of data between the vessel and the land by establishing a private cloud between the firm and its vessels. The most commonly used method to transmit data is electromagnetic wave transmission in the form of very high frequency (VHF) radio. Satellite communications (such as INMARSAT and COSPAS-SARSAT) are also commonly used among ships, as they are reliable for offshore operations and emergency communications. However, they tend to be costly; therefore, 4G is used as an alternative for transmitting information that is not time sensitive.

Data Preparation and Filtering
Raw data collected from sensors tend to contain errors and distortions, and have to be pre-processed/filtered before they can be used for further analysis. The raw data have to be denoised and cleaned to transform them into useful information. Different types of errors may accumulate throughout the process of data collection. This section aims to bring awareness to the most common errors found in data from the data acquisition system and to the data filtering methodology.

Measurement Error
The measurement error is the error that occurs in the data gathering chain. This could be due to flaws in the measuring instruments, which result in a difference between the real value and the recorded value. An example of different measurement errors is shown in Figure 8, which plots RPM values against time for a propeller shaft. The significance of the measurement error could be quantified by computing the mean squared error (MSE) given as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \qquad (1)$$

where $n$ is the number of data points, $Y_i$ the observed values and $\hat{Y}_i$ the predicted/recorded values. The MSE indicates how close a regression line (black lines in Figure 8) is to a set of points (observed values). The bracketed terms in Equation (1) are squared to eliminate any negative signs and to give more weight to larger differences. Therefore, data with a smaller MSE imply a higher accuracy in the recorded/predicted data. Note that the lower the MSE, the closer the regression line is to the line of best fit.
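The MSE calculation can be sketched in a few lines of Python. The function name and the sample RPM values below are illustrative assumptions, not data from the paper.

```python
def mean_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


# Hypothetical shaft RPM readings versus values read off a fitted line.
observed_rpm = [100.0, 102.0, 98.0]
fitted_rpm = [101.0, 100.0, 98.0]
print(mean_squared_error(observed_rpm, fitted_rpm))
```

Squaring means the single difference of 2 RPM contributes four times as much as the difference of 1 RPM, which is the weighting of larger errors described above.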

Inconsistent Data
In some cases, the data recorded by the sensors may be inconsistent, containing duplicate data, contradictory data and outliers.

Duplicate Data
Duplicate records may arise from errors in handling data while moving it between systems. Such duplicates are easy to spot, as they appear as clones of existing records, as shown in Figure 8a. The two duplicated data points from the sensor are denoted by the open and closed square symbols in the red circle. Duplicate data in the data analytics may indicate inaccurate or stale data. Fortunately, duplicates could be removed by simply deleting the clones, but care must be taken not to delete the original record as well.

Contradictory Data
Contradictory data occur when multiple different values are recorded for the same point in time. An example is shown in Figure 8b, where two RPM values are recorded at unit time 5. Several methodologies have been proposed to remove contradictory data [53,54], but removing such data leaves the dataset incomplete, thereby reducing the soundness of any information derived from it [55]. Nwagwu et al. [55] proposed a novel approach for visually identifying contradictory data in a large and noisy dataset by applying a mutual exclusion rule.
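The distinction between duplicate and contradictory records can be made concrete with a small sketch. It assumes each record is a hypothetical `(time, value)` pair: exact clones are dropped while keeping the first occurrence, and time stamps that still carry more than one value are only flagged, not deleted, since removal reduces completeness. This is a simple illustration and not the mutual exclusion method of [55].

```python
from collections import defaultdict


def split_records(records):
    """Drop exact clones and report time stamps with conflicting values.

    records: list of (time, value) tuples.
    Returns (unique_records, contradictory_times).
    """
    # dict.fromkeys keeps the first occurrence of each record, in order.
    unique = list(dict.fromkeys(records))
    values_by_time = defaultdict(set)
    for t, value in unique:
        values_by_time[t].add(value)
    contradictory = sorted(t for t, vals in values_by_time.items() if len(vals) > 1)
    return unique, contradictory


# Hypothetical RPM log: time 2 is cloned, time 5 has two different values.
log = [(1, 900), (2, 905), (2, 905), (3, 910), (5, 920), (5, 870)]
unique, conflicts = split_records(log)
print(unique, conflicts)
```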

Outliers
Outliers are values in a random population sample that deviate abnormally from other values. An example is shown in Figure 9, where a spike in the temperature appears between days 50 and 51. Outliers could be due to anomalies resulting from a faulty instrument. If left in the dataset, an extremely rare value might cause the mean or standard deviation to drift drastically. As a result, removing such values is a crucial component of the data filtering process [56]. To some extent, the definition of an outlier is left to the analyst, who decides which data points constitute an abnormality. It is important to characterize normal observations before aberrant observations can be identified. One way to do that is to examine the overall shape of the graphed data for significant features such as symmetry and deviations from assumptions. Another way is to look for observations that lie far from the rest of the data. Scatter plots and box plots are two graphical approaches for finding outliers, and analytical procedures are also available for detecting outliers when the distribution is normal.
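The box plot approach mentioned above has a simple numerical counterpart. The sketch below assumes the common convention of placing the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles; the multiplier, the function name and the temperature values are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical temperature readings with one spike.
temps = [20, 21, 22, 21, 20, 23, 22, 21, 60]
print(iqr_outliers(temps))
```

Unlike a rule based on the mean and standard deviation, the quartiles are barely moved by the spike itself, so the fence remains tight around the normal observations.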

Filtering of Raw Operational Data Techniques
Several methods could be utilized to remove the errors as presented in the previous sections. Some of the commonly used filtering techniques are given in the following sections. The pros and cons for each technique are summarized at the end of this section in Table 2.

Control Chart Technique
The control chart technique (CCT) is a statistical algorithm applied to time-series data to detect irregularities by sliding a predefined window along a stream of data points. This technique acts as a conditional statement, i.e., it accepts values within a threshold boundary and treats any data point beyond this boundary as an outlier. Through this condition, the outliers could be eliminated as presented in Equation (6). The threshold boundary is taken as 3σ, which is valid only for data that follow a Gaussian distribution.

$$\left|\chi_j - \mu\right| \le 3\sigma \qquad (6)$$
where $\chi_j$ represents the data point for the $j$th observation, $\mu$ refers to the average of the observed data and $\sigma$ is the standard deviation. This technique is applied in [57,58] to filter out the outliers in the fuel oil consumption (FOC) time series. The detected outliers are either replaced with the previous data value or removed, depending on whether the resolution of the data is to be retained. This approach is intuitive and straightforward; however, its downside is that it is only applicable to normally distributed data and does not address the noise present in the raw data, as shown in Figures 10 and 11.
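A minimal sketch of the 3σ control chart rule follows. It assumes a trailing window of previously accepted points as the reference for the mean and standard deviation, and replaces a flagged point with the previous accepted value, which is one of the two options described above. The window length, the function name and the FOC values are hypothetical and are not taken from [57,58].

```python
import statistics


def control_chart_filter(series, window=5, k=3.0):
    """Flag points more than k standard deviations from the window mean.

    The first `window` points are accepted as-is to seed the statistics.
    Returns (cleaned_series, outlier_indices).
    """
    cleaned = list(series[:window])
    outlier_idx = []
    for j in range(window, len(series)):
        reference = cleaned[j - window:j]      # previous accepted points
        mu = statistics.mean(reference)
        sigma = statistics.stdev(reference)
        x = series[j]
        if sigma > 0 and abs(x - mu) > k * sigma:
            outlier_idx.append(j)
            cleaned.append(cleaned[-1])        # hold the previous value
        else:
            cleaned.append(x)
    return cleaned, outlier_idx


# Hypothetical FOC readings with one spike at index 5.
foc = [10.0, 10.2, 9.9, 10.1, 10.0, 30.0, 10.1, 9.9]
print(control_chart_filter(foc))
```

Computing the statistics from points other than the one under test matters for short windows: if the spike were included in its own window, it would inflate σ enough to hide itself.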

Haar Wavelet Transformation
The Haar wavelet transformation (HWT) is a discrete wavelet transform used to denoise signals and detect outliers. Subasi [60] suggested the use of HWT to decompose signals into several levels of signal components, which yields a better result in training the ANN for accurate classification and diagnosis. The HWT follows a straightforward technique of breaking down the signal into coefficients by sliding a fixed-duration window along the signal, represented as

$$\psi(t) = \begin{cases} 1, & 0 \le t < \tfrac{1}{2} \\ -1, & \tfrac{1}{2} \le t < 1 \\ 0, & \text{otherwise} \end{cases}$$

where $\psi(t)$ is the Haar wavelet and $t$ the time. Sliding the window through the signal decomposes it into multiple levels, producing detail and approximation coefficients, as shown in Figure 12. The number of decomposition levels depends on the degree of signal refinement required. The higher the level, the more aggressively fluctuations in the data points are filtered; this yields a smoother signal overall but increases the likelihood of relevant data points being filtered out. Therefore, the selection of the level is an important factor when modelling the HWT algorithm. Figures 12 and 13 show the decomposition of the signal data points into subsets at increasing levels. The newly formed collection of wavelet coefficients functions as a threshold for accepting or rejecting the signal data points. Tay et al. [61] applied the HWT wavelet decomposition to the fuel consumption data collected from an HCV. An example of the filtered data is given in Figure 14.
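One level of the Haar decomposition can be sketched as pairwise scaled sums (approximation coefficients) and differences (detail coefficients). The sketch below assumes an even-length signal and a hard threshold on the detail coefficients; the function names and threshold value are hypothetical, and this is not the implementation used in [61].

```python
import math


def haar_step(signal):
    """One decomposition level: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the signal from one level of coefficients."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def haar_denoise(signal, threshold):
    """Zero the detail coefficients smaller than the threshold, then rebuild."""
    approx, detail = haar_step(signal)
    detail = [d if abs(d) >= threshold else 0.0 for d in detail]
    return inverse_haar_step(approx, detail)


print(haar_denoise([4.0, 6.0, 10.0, 12.0], threshold=2.0))
```

Applying `haar_step` again to the approximation coefficients gives the next level, which is why higher levels smooth the signal over progressively longer windows.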

Fast Fourier Transform
The Fast Fourier Transform (FFT) technique is commonly used in a wide variety of applications, including audio and image compression formats. Most signals are highly compressible in the FFT domain, which represents the signal as a function of frequency so that the noise to be removed can be detected. Furthermore, FFTs can accelerate the detection and filtering processes, making them useful in digital signal processing.
In FFT, the signal is transformed from the time domain to the frequency domain and represented as a power spectral density (PSD) [62]. The PSD represents the intensity of the signal at each frequency, where the PSD's peak represents the noise to be filtered out, as depicted in Figure 14. To denoise, a threshold is set based on the optimal response to be retained. The PSD approach could be applied to ship operational data such as the fuel consumption recorded by mass flowmeters, where the sensor measurement noise may vary irregularly and simple filtering techniques do not perform well.
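PSD thresholding can be sketched as follows. To stay within the standard library, a naive discrete Fourier transform stands in for the FFT; it produces the same coefficients, only more slowly. The sketch assumes that the components to retain are those whose power meets the threshold; if the noise is instead concentrated in the spectral peaks, the comparison is simply reversed. All names and values are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Inverse transform, returning the real part of each sample."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def psd_denoise(x, threshold):
    """Keep only frequency components whose PSD is at least `threshold`."""
    coeffs = dft(x)
    n = len(x)
    psd = [abs(c) ** 2 / n for c in coeffs]
    kept = [c if p >= threshold else 0 for c, p in zip(coeffs, psd)]
    return idft(kept), psd


# Hypothetical signal: a strong 2-cycle component plus a weak 9-cycle one.
n = 32
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [c + 0.1 * math.sin(2 * math.pi * 9 * t / n) for t, c in enumerate(clean)]
denoised, psd = psd_denoise(noisy, threshold=1.0)
```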

Kalman Filter
The Kalman filter (KF) is a low-pass filter that acts as an optimal estimator, minimizing the MSE based on the measured and estimated data. The filtering behavior depends on a hyperparameter known as the covariance matrix. If the covariance is set to a low value, the filter produces a smooth function by removing sudden spikes in values; if it is set to a high value, the filter trusts the measurement data input. This approach was used in [63] to denoise all the ship operational data, provided that the trend remains close to the speed over ground (SOG) data shown in Figure 15.
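The effect of the covariance setting can be illustrated with a scalar Kalman filter. The sketch assumes a random-walk state model and interprets the tunable covariance as the process noise variance `q`, with a fixed measurement noise variance `r`; this interpretation and all names and values are assumptions, not the configuration used in [63].

```python
def kalman_1d(measurements, q, r, p0=1.0):
    """Scalar Kalman filter for a random-walk state.

    q: process noise variance (low -> smoother output)
    r: measurement noise variance
    p0: initial estimate variance
    """
    x = measurements[0]          # initialise the state with the first reading
    p = p0
    estimates = [x]
    for z in measurements[1:]:
        p = p + q                # predict: uncertainty grows by q
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update with the measurement residual
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical readings with a spike at index 4.
z = [10.0, 10.0, 10.0, 10.0, 50.0, 10.0, 10.0]
smooth = kalman_1d(z, q=1e-4, r=1.0)     # low covariance: spike is damped
trusting = kalman_1d(z, q=100.0, r=1.0)  # high covariance: follows the data
print(smooth[4], trusting[4])
```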

Supervised and Unsupervised Machine Learning Model
Substantial research effort has gone into statistical models for forecasting fuel consumption in the past few years. Most of the statistical models widely used in the recent literature fall into two major categories: regression models (RM) and ML models.
In developing RM for predicting fuel consumption, various operational and environmental factors of the ship have been taken into account to improve the accuracy of polynomial regression in predicting fuel consumption under varying speed conditions [64]. Kee et al. [57] proposed a multilinear regression model to analyze the service performance of tugboats to ensure optimum fuel efficiency is met. However, RM has some disadvantages: a fair amount of inference is required because of ambiguity in the collected data, and the models are exposed to the effect of sudden spikes and noisy data signals. Moreover, noon reports of operational data are based on the operator's findings, which can lead to large errors in developing the RM.
As data acquisition (DAQ) systems became well established for rapidly collecting real-time operational data, extensive research began to explore the capability of ML models for fuel consumption prediction. The distinct benefit of an ML model is that it generalizes the relationship between the multi-dimensional operational data obtained from DAQ, allowing more accurate predictions than the RM. The literature on ML models [65] showed that the ANN ML model outperforms RM. However, achieving a stable ML model takes a substantial amount of time, much of it spent pre-processing the raw data to ensure that no noise or outliers are captured during model training. Therefore, the following sections study various machine learning methods for improving prediction. The flow chart describing the methodology for supervised and unsupervised ML is shown in Figure 16. The major procedures involving the filtering of raw data, creation of the score dataset, K-means clustering and activity labels are described in detail in [61]. The details for unsupervised and supervised ML are given in the following sections.
The comparison of the different machine learning techniques is given in Table 6 at the end of this section.

Multilinear Regression
Multilinear regression (MLR) models are simple to understand and to incorporate in applications. They require knowledge of the input variables that are associated with the target variable, which is FOC. MLR employs the least-squares method to construct a regression line for predictions based on the relationship between the input and output variables. To achieve quality predictions, multicollinearity among the input variables and autocorrelation must be avoided when building the MLR model. Kee et al. [57] suggested using MLR to estimate the fuel consumption of towing tugboats operating between laden and ballast conditions along the Malacca Straits.
The primary goal was to build an MLR model capable of predicting the recommended vessel speed, thereby enabling the operator to maximize fuel performance. The input factors that influence the fuel consumption, i.e., travelled distance, travelled hours, vessel speed, vessel deadweight and wind speed, are well established; therefore, the MLR model was able to achieve an R2 score as high as 0.91, indicating that the input variables explain most of the variation in the output. This model was validated using the fuel consumption analysis method, which offers a ground truth to support the model's prediction capability.
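The least-squares fit behind an MLR model, together with the R2 score, can be sketched without external libraries by solving the normal equations. The two-input dataset below is synthetic and noise-free, chosen only so that the recovered coefficients are easy to verify; it is not the tugboat data of [57], and the function names are hypothetical.

```python
def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("inputs are collinear; X^T X is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= f * M[col][k]
    # Back substitution
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (M[i][p] - sum(M[i][k] * beta[k] for k in range(i + 1, p))) / M[i][i]
    return beta


def predict(beta, row):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def r_squared(beta, X, y):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(beta, r)) ** 2 for r, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data generated from y = 2 + 3*x1 - x2 (illustrative only).
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [3, 7, 7, 11, 12]
beta = fit_mlr(X, y)
print(beta, r_squared(beta, X, y))
```

The singular-matrix check is where multicollinearity shows up numerically: when one input is a linear combination of the others, the normal equations have no unique solution.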

Hidden Markov Model
The hidden Markov model (HMM) uses a Markov Chain in which a certain set of states may be hidden (only partially observable) or observable. An example of a Markov Chain with five hidden states (HS), denoted by $s_1$, $s_2$, $s_3$, $s_4$ and $s_5$, is shown in Figure 17 [58]. The transition information is quantified in terms of the likelihood, or transition probability, denoted as $a_{ij}$, where $i$ is the original state and $j$ the subsequent state. A transition may remain within the same state, i.e., $a_{ii}$, or occur between two different states, i.e., $a_{ij}$. The transition probabilities may be arranged in a Stochastic Matrix known as the Transition Matrix (TM). Similarly, the probabilities of the Markov Chain for the Observable States (OS) could also be obtained, where the Stochastic Matrix is known as the Emission Matrix (EM). The HMM is applicable to time-series datasets, where it utilizes the Markov Chain to create a probabilistic correlation between states. Using the HMM, the ML system can predict the fuel consumption for a given environmental condition.
A comparison of the predicted fuel consumption (PR) of a tugboat with the ground truth (FR) is shown in Figure 18, where VS represents the vessel speed. Figure 18 indicates the correlation between the vessel speed and the fuel consumption and could therefore be used to assist decision making on the optimal VS to achieve fuel efficiency. The fuel consumption profile of a tugboat differs from that of other harbour craft vessels such as ferries or patrol vessels, whose fuel consumption is influenced significantly by the vessel speed, i.e., the fuel consumption increases with the vessel speed. The working operations of a tugboat, on the other hand, involve both tugging and cruising, and both could lead to high fuel consumption. During the tugging operation, there is a significant drop in the vessel speed, as shown in Figure 18, but the fuel consumption (FR) remains high. One way to deduce the correlation between the FR and VS is to take the vessel shaft RPM into consideration. However, if the vessel shaft RPM is not available, a classification model based on the K-means clustering method [61] could be used to classify the different operational activities of the tugboat based on their FR and VS. This classification model can then be utilized to train the machine learning model, as described in Section 6.2.2.
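To make the roles of the Transition Matrix and the Emission Matrix concrete, the following minimal sketch decodes the most likely sequence of hidden states with the Viterbi algorithm. It is not the model of [58]: the two hidden states, the two speed buckets and every probability value are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state HMM; all probabilities below are made up for illustration.
STATES = ["cruising", "tugging"]   # hidden states (HS)
SYMBOLS = ["slow", "fast"]         # observable states (OS): vessel speed bucket
START = [0.6, 0.4]                 # initial state probabilities
TM = [[0.8, 0.2],                  # Transition Matrix: TM[i][j] = a_ij
      [0.3, 0.7]]
EM = [[0.2, 0.8],                  # Emission Matrix: EM[state][symbol]
      [0.9, 0.1]]


def viterbi(observations):
    """Return the most likely hidden-state sequence for a non-empty observation list."""
    obs = [SYMBOLS.index(o) for o in observations]
    n = len(STATES)
    # Probability of the best path ending in each state after the first observation.
    prob = [START[s] * EM[s][obs[0]] for s in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        step_prob, step_back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prob[i] * TM[i][j])
            step_prob.append(prob[best_i] * TM[best_i][j] * EM[j][o])
            step_back.append(best_i)
        prob = step_prob
        back.append(step_back)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n), key=lambda s: prob[s])
    path = [state]
    for step_back in reversed(back):
        state = step_back[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


decoded = viterbi(["fast", "fast", "slow", "slow", "fast"])
print(decoded)
```

Each row of the TM and EM sums to one, as required of a stochastic matrix. For long sequences, log-probabilities are usually preferred over raw products to avoid numerical underflow.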

Artificial Neural Network Model (Black Box Model)
The ANN is an ML model based on the black box model (BBM) that works similarly to human neurons in a neural network architecture to recognize and decode the dataset's underlying relationship. Furthermore, the ANN can learn nonlinear functions, which makes it appropriate for highly fluctuating datasets such as onboard ship measurements. As a result, it is commonly used to improve ship powering performance by forecasting the FOC from a variety of operational data factors. Jeon et al. [25] proposed using the ANN model to analyze and forecast ship fuel consumption based on the tabulated dataset in [66].
For improving prediction accuracy, pre-processing the raw operational data is more important than tuning the hyperparameters of the neural network; hyperparameter tuning likewise does not outperform data pre-processing in enhancing the robustness of the model's learning capability. Much of the work in [25] focused on pre-processing the raw input data. Such pre-processing involves a smoothing spline filtering algorithm applied to the entire signal to remove noise and outliers, as shown in Figure 19, and Gaussian mixture model (GMM) clustering techniques shown in Figure 17, with the data parameters listed in Table 3.
Figure 19. Denoised data for SOG using spline filtering [25].
The clustering technique groups operations with a high-frequency signal into a single cluster, indicating that the data in the cluster share broadly similar operation states; this may aid the neural network's learning by reducing computation time and focusing on the cluster's effect on the FOC. The analysis also emphasized the significance of clustering the data parameters that significantly influence the FOC, as reported in [63]. A clustering methodology applied to the engine power of a vessel is shown in Figure 20.
Table 3. Data information used for ship prediction [25]. Columns: Data (e.g., Input), Parameter, Remarks.
The ANN model was trained on the post-processed datasets by varying hyperparameters such as the number of hidden layers and neurons, within a predefined range, for the selected activation function. The post-processed data are divided into three sets: a training set for learning, a validation set to prevent overfitting, and a testing set to validate model functionality. Through this method, the model arrives at a converged solution in which increasing the number of hidden layers and neurons (the configuration) has no significant impact on model accuracy, as shown in Table 4, where the R values do not vary much as nodes are added.
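The three-way division of the post-processed data described above can be sketched as follows. The 70/15/15 proportions, the function name and the fixed seed are assumptions made for illustration; the source does not state the proportions used in [25].

```python
import random


def split_dataset(samples, train_pct=70, val_pct=15, shuffle=True, seed=0):
    """Split samples into training, validation and testing sets.

    The percentages are illustrative assumptions, not values from [25].
    Whatever remains after the training and validation sets forms the testing set.
    """
    if train_pct <= 0 or val_pct < 0 or train_pct + val_pct >= 100:
        raise ValueError("percentages must leave room for a testing set")
    items = list(samples)
    if shuffle:
        random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


records = list(range(100))  # stand-in for 100 post-processed samples
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))
```

For time-series data, passing `shuffle=False` keeps the samples in chronological order, so that the validation and testing sets contain only data recorded after the training period.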
Table 5 compares the R values and computation times of three different ML techniques, i.e., the ANN, the regression method (RM) and the support vector (SV) method, as summarized in [25]. In general, the ANN outperformed the RM and SV, with the exponential sigmoid function giving the highest prediction accuracy (R value). The SV achieves relatively higher accuracy than the RM, with the quadratic SV method being the more accurate. However, the R values for the RM and SV are very small, i.e., less than 0.5, and the computational time of the quadratic SV method is significantly higher than that of the other methods compared in the table.
In some cases, where the data collected from the ships and the environment are lacking because of difficulty in data collection or an incomplete dataset, the grey box model (GBM), which is a hybrid of a white-box model (WBM) and a BBM, could be utilized for the prediction. To predict the fuel consumption of a tugboat, for instance, the BBM is trained on historical operational data to forecast the ship FOC. One possibility for improving the prediction accuracy is to include the ship resistance in the training data. However, as the ship resistance is not easily recorded by sensors, it could instead be supplied by a WBM, where the ship resistance under various sea conditions is obtained via numerical simulation or experimental tests. Thus, combining the operational data with the simulated ship resistance in the ML produces the GBM, which may increase the accuracy of the predicted data.
Coraddu et al. [29] conducted a study to examine the capability of the GBM for a Handymax product tanker. The research used a parametric calculation to forecast the ship resistance in calm water for the WBM. Although the calculated outcome is marginally off compared with its counterpart predicted by the computational fluid dynamic (CFD) model, the slightly inaccurate resistance data complement the historical operational data and improve the model performance to a greater extent. Owing to the domain knowledge of the vessel characteristics preserved by the WBM, the GBM can achieve a precision as high as that of the BBM in predicting the FOC even when limited historical operational data are available.
Figure 21 compares the lowest mean absolute percentage error (MAPE) achieved by the GBM and the BBM. It can be seen that the MAPE reduces significantly for the Naïve (N)-GBM and Advanced (A)-GBM, thus implying that the WBM helps in increasing the accuracy of the predicted data.
Figure 21. Comparison of performance between GBM and BBM [29]. Note: (1) N-GBM: the output of the WBM is used as a new feature that the BBM can use for training the model; (2) A-GBM: the regularization process is changed to include some a priori information.
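The N-GBM idea in the note to Figure 21, i.e., feeding the WBM output to the BBM as an extra feature, can be illustrated with a toy example. This is a minimal sketch and not the method of [29]: the cubic speed law in `wbm_power`, its deliberately inaccurate coefficient, the synthetic fuel data and the use of a linear least-squares model as the black box are all assumptions made here for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares fit with an intercept, solved via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in per cent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def wbm_power(speed):
    # Hypothetical white-box estimate: power grows with speed cubed.
    # The coefficient is deliberately a little off from the synthetic truth below.
    return 0.045 * speed ** 3


speeds = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
fuel = [2.0 + 0.05 * v ** 3 for v in speeds]  # synthetic "measured" fuel rate

# BBM: learns from the operational feature (speed) only.
bbm = fit_ols([[v] for v in speeds], fuel)
# N-GBM: the same learner, with the WBM output added as a new feature.
gbm = fit_ols([[v, wbm_power(v)] for v in speeds], fuel)

mape_bbm = mape(fuel, [predict(bbm, [v]) for v in speeds])
mape_gbm = mape(fuel, [predict(gbm, [v, wbm_power(v)]) for v in speeds])
print(f"BBM MAPE: {mape_bbm:.2f}%  N-GBM MAPE: {mape_gbm:.6f}%")
```

Even though the white-box coefficient is slightly wrong, the learner rescales it, so the extra feature still carries the physical shape of the relationship. The errors here are computed on the training samples only; a real comparison would use held-out data.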

Long Short-Term Memory Model
The long short-term memory (LSTM) networks (see Figure 22) are well-suited to classifying, processing and making predictions based on time-series data in which lags of unknown duration separate important events [67]. The LSTM architecture comprises three distinct gates, namely the forget gate (A), input gate (B) and output gate (C). Unlike the ANN model, the LSTM model has its own embedded activation neural network layers, and these layers can be added repeatedly. The LSTM model is effective when knowledge of the previous values has a substantial effect on the present values [68].
Figure 22. LSTM network architecture model [69].
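The interplay of the three gates can be sketched with a single scalar LSTM cell in its standard textbook form. The function names, the parameter layout and the toy speed values are assumptions made here; practical models use vector-valued states and trained weights from a deep learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a scalar LSTM cell.

    params maps each of "f", "i", "o", "g" to a (w_x, w_h, bias) tuple.
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("f"))    # forget gate: how much old memory to keep
    i = sigmoid(pre_activation("i"))    # input gate: how much new information to admit
    o = sigmoid(pre_activation("o"))    # output gate: how much memory to expose
    g = math.tanh(pre_activation("g"))  # candidate value to write into memory
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_sequence(xs, params):
    """Feed a time series through the cell, carrying the state between steps."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h, c


# With all weights zero, every gate equals 0.5 and the candidate value is 0,
# so the cell simply halves its previous memory.
zero_params = {name: (0.0, 0.0, 0.0) for name in "fiog"}
h1, c1 = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h1, c1)
```

Because the cell state `c` is carried from one step to the next and is only rescaled by the forget gate, information from earlier samples can persist across many time steps, which is what makes the architecture suited to data where previous values influence present ones.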
For datasets with missing or near-zero values, such as those shown in Figure 18, an ensemble method based on the LSTM is useful for regenerating the input data. The authors have considered an LSTM model that combines the time-series data collected from a tugboat with classification data obtained from a K-means clustering process. The combined model was found to forecast fuel consumption better, using less historical data and converging faster (Table 6). An example of using the LSTM model to predict the fuel consumption is given in Figure 23, which reports the accuracy of the LSTM method in terms of the R2 score and shows that an R2 of up to 0.94 could be achieved when four months of fuel data are used to train the ensemble system. Note that the prediction accuracy drops as the model is used to predict the fuel consumption further into the future. The accuracy could be further improved when a combined LSTM model that takes input from the time-series operational data, i.e., fuel rate, vessel speed and wind effect, is used together with the classification model (see Figure 24). The classification model is obtained from the K-means clustering method [60], where the operational activity is clustered based on its various states. Using the time-series operational data together with the classification model in the LSTM model allows a more accurate prediction of the fuel consumption, with an R2 close to 1.0, as shown in Figure 24. The LSTM model, therefore, outperforms the HMM and ANN in fuel prediction when data are missing from the dataset collected from the sensors.
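The clustering of operational activity from fuel rate and vessel speed can be illustrated with a plain K-means loop. This is a minimal sketch: the nine (fuel rate, vessel speed) samples are invented, the three activities named in the comments are assumed, and the centroids are naively initialized from the first k samples.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means on tuples of floats; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k samples
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable, so the loop has converged
            break
        labels = new_labels
    return labels, centroids


# Invented samples: (fuel rate, vessel speed).
samples = [
    (5.0, 0.5), (80.0, 2.0), (60.0, 10.0),   # idling, tugging, cruising
    (6.0, 1.0), (85.0, 1.5), (65.0, 11.0),
    (4.0, 0.2), (90.0, 2.5), (55.0, 9.5),
]
labels, centroids = kmeans(samples, k=3)
print(labels)
```

The resulting labels can then be supplied to a forecasting model as an additional categorical input. In practice the features would normally be scaled first, since the fuel rate otherwise dominates the distance computation.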

Conclusions
This paper reviewed the big data analytics and machine learning techniques applied to harbour craft vessels with the aim of achieving ship energy efficiency. The filtering techniques used on data collected from HCV, i.e., CCT, HWT, FFT and KF, were presented: a simple filtering technique such as the HWT is suitable for less complex data, whereas data with varying signal frequencies can be filtered effectively by the FFT. The machine learning techniques, classified into supervised and unsupervised techniques, were also presented and their pros and cons compared. These machine learning techniques could be used to predict fuel consumption given the environmental loadings. The supervised techniques considered are the MLR model and the HMM, where the former is simple to use for data with fewer variables, whereas the latter uses the probabilistic correlation between different states to predict the fuel consumption. In unsupervised machine learning, the ANN, GBM and LSTM models are considered. The GBM utilizes simulated data to achieve higher prediction accuracy than the ANN. The LSTM is well-suited to classifying, processing and making predictions based on time-series data with lags of unknown duration between important events, and the deep learning LSTM combined with autoencoder ensemble learning outperforms conventional machine learning methods such as the ANN. The autoencoder ANN is useful for analyzing fuel consumption data obtained from tugboats, which are prone to missing data and to drops in vessel speed during tugging operation, because the autoencoder is able to regenerate the input data by encoding the sets of data collected.