Electronics
  • Editor’s Choice
  • Article
  • Open Access

20 April 2023

Design of Vessel Data Lakehouse with Big Data and AI Analysis Technology for Vessel Monitoring System

1 AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
2 Marine Security and Safety Research Center, Korea Institute of Ocean Science & Technology, Busan 49111, Republic of Korea
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence and Future Implications of an ICT Convergence System and Network

Abstract

The amount of data in the maritime domain is rapidly increasing as more devices, such as sensors, buoys, ships, and satellites, collect marine information. Terabytes of marine data are collected every month, and petabytes of data have already been made public. Heterogeneous marine data collected through these devices can be used in fields such as environmental protection, defect prediction, transportation route optimization, and energy efficiency. However, the high heterogeneity of marine big data makes vessel-related data difficult to manage, and because of this heterogeneity and other challenges associated with big data, such applications are still underdeveloped and fragmented. In this paper, we propose the Vessel Data Lakehouse architecture, which consists of the Vessel Data Lake layer that can handle marine big data, the Vessel Data Warehouse layer that supports marine big data processing and AI, and the Vessel Application Services layer that supports marine application services. The proposed Vessel Data Lakehouse can efficiently manage heterogeneous vessel-related data. It structures various types of heterogeneous data using an open-source big data framework, so the data can be integrated and managed at low cost. In addition, the vessel big data stored in the Data Lakehouse can be used directly by various vessel analysis services. We present an actual use case of a vessel analysis service in the Vessel Data Lakehouse using AIS data from the Busan area.

1. Introduction

Big data is an enormous amount of data that is difficult to collect, store, analyze, and process using legacy application software. Big data technology processes such data into a form that users can understand and utilize. The concept of a data lake has emerged to efficiently store, process, and protect big data. Data lakes have the advantage of being cheaper than legacy databases. They provide a view of raw data that can be used by analytics technologies independently of traditional data storage or systems of record. However, data lakes require ongoing maintenance and a plan for how data is used and accessed. Without such maintenance, data management becomes difficult and expensive, and there is a risk of accumulating inaccessible junk data. A data lake that has become inaccessible in this way is called a data swamp. To solve this problem, the concept of a data lakehouse has emerged. A data lakehouse implements data structures and data management functions similar to those of a data warehouse on the low-cost storage used in a data lake [1].
Data Lakehouse is a new approach to analytics architecture that aims to combine traditional Data Lakes and Data Warehouses to serve different analytics needs. It allows structured queries and enhanced analytics to run on the data structured best for a given purpose while hiding the system complexity from users. Data Lakehouse alleviates common issues with the Data Warehouse and the Data Lake while retaining the benefits of both. The architecture can handle structured, semi-structured, and unstructured data and supports streaming workloads, machine learning, and business intelligence. Data Lakehouse can be used as a foundation for constructing entirely new systems and for fusing Data Lakes and Data Warehouses by using platforms and frameworks [2].
A VMS (Vessel Monitoring System) is a generic term for systems used in commercial fishing to help fisheries regulators track and monitor the fishing activity of ships. It plays a core role in MCS (Monitoring, Control and Surveillance) at the national level. VMSs are used to enhance the operation and sustainability of the vessel environment by ensuring good fishing practices and preventing illegal fishing, thereby protecting and improving fishermen’s livelihoods [3].
Seas, oceans, and other marine regions cover 72% of the planet, and 95% of this area is hardly explored. The marine area is one of the areas most used economically by mankind: it supports economic activities such as fishing, tourism, transportation, and logistics, and it is used as a source of renewable energy such as wind and tidal power. For this reason, the maritime industry is an important and strategic industry that is continuously growing. In addition, the marine domain has recently begun to provide large, diverse, and heterogeneous data, and this data is growing rapidly. Terabytes of marine data are collected every month, and petabytes of marine data are already publicly available. Big data from heterogeneous sources such as satellites, buoys, ships, and sensors can be used in applications for environmental protection, security, error prediction, transportation route optimization, and energy production. However, because of the high heterogeneity of marine big data sources, marine solutions are still fragmented and underdeveloped [4,5].
In this paper, we propose a Vessel Data Lakehouse with Big Data and AI analysis technology for Vessel Monitoring Systems that can efficiently manage various heterogeneous vessel-related data. The advantages of the proposed method are as follows. First, by using the Vessel Data Lake, vessel-related big data in various formats can be stored at low cost. Second, by placing the Vessel Data Lakehouse layer on top of the Data Lake, heterogeneous vessel-related big data can be managed and controlled. Third, the various types of vessel-related big data stored in the Data Lake can be used directly by various types of analysis services.
The remainder of the paper is organized as follows. Section 2 describes related studies on Data Lakehouse and on VMS with vessel applications, and Section 3 presents the proposed Vessel Data Lakehouse for Vessel Monitoring System. Section 4 shows the experimental results, and Section 5 concludes the paper.

3. Vessel Data Lakehouse for Vessel Monitoring System

As shown in Figure 1, the Vessel Data Lakehouse consists of the Extraction and Ingestion layer, the Vessel Data Lake layer, the Vessel Data Warehouse Model, and Vessel Application Services. This paper focuses on the implementation of the Vessel Data Lakehouse for VMS and an application example of the implemented system. The Extraction and Ingestion layer in Figure 1a extracts vessel-related data from data sources using a push or pull approach based on message queuing technology and stores it in the Vessel Data Lake layer in Figure 1b. The Vessel Big Data layer of the Vessel Data Warehouse Model in Figure 1c manages the data in the Vessel Data Lake layer, which is loaded and transformed directly from the source system, analyzes vessel big data, and prepares the data for AI analysis. The Vessel AI layer supports AI analysis and prediction for vessel application services. The Vessel Application Service in Figure 1d supports the Visualization service, Big Data Analysis service, and AI Analysis service for vessels.
Figure 1. Conceptual Diagram of Vessel Data Lakehouse. (a) Extraction & Ingestion Layer; (b) Vessel Data Lake layer; (c) Vessel Data Warehouse Model; (d) Vessel Application Service.

3.1. Extraction and Ingestion Layer

In the Extraction and Ingestion layer, data is imported from an external source in the format specified by the Vessel Big Data Management module of the Vessel Big Data layer in Figure 1c and stored in the Vessel Data Lake layer in Figure 1b. In this paper, vessel-related data received from the Korea Institute of Ocean Science and Technology Maritime Safety Research Center are stored in the Vessel Data Lake through the Extraction and Ingestion layer. The ETL (Extract, Transform, Load) function was implemented in Python to load the vessel-related CSV files into the Data Lake. Table 1 shows the size and number of rows for each type of vessel-related data used in this paper.
Table 1. Kinds of Vessel-related Data Imported into the Marine Data Lake.
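The CSV loading step described above can be sketched as a small extract, transform, load pipeline. This is a minimal illustration and not the authors' implementation: the column names (`mmsi`, `lat`, `lon`, `sog`, `cog`), the sample rows, and the function `etl_csv` are hypothetical, and an in-memory buffer stands in for the Data Lake target.

```python
import csv
import io

# Hypothetical AIS-like numeric columns; the real schema is the one in Figure 4.
NUMERIC_FIELDS = ("lat", "lon", "sog", "cog")


def extract(source):
    """Extract: yield one dict per CSV row from a file-like object."""
    yield from csv.DictReader(source)


def transform(rows):
    """Transform: cast numeric fields and drop rows that fail to parse."""
    for row in rows:
        try:
            clean = {"mmsi": row["mmsi"].strip()}
            for name in NUMERIC_FIELDS:
                clean[name] = float(row[name])
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # skip malformed records instead of aborting the load
        if clean["mmsi"]:
            yield clean


def load(rows, sink):
    """Load: write cleaned rows to a file-like sink and return the row count."""
    writer = csv.DictWriter(sink, fieldnames=("mmsi",) + NUMERIC_FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def etl_csv(source, sink):
    """Run extract -> transform -> load and return the number of rows loaded."""
    return load(transform(extract(source)), sink)


# Made-up sample input: the second row has an unparsable latitude.
raw = io.StringIO(
    "mmsi,lat,lon,sog,cog\n"
    "440000001,35.10,129.04,7.5,182.0\n"
    "440000002,not-a-number,129.10,3.2,90.0\n"
)
out = io.StringIO()
loaded = etl_csv(raw, out)
```

Because each stage is a generator, rows stream through one at a time, which matters for inputs as large as the multi-gigabyte files mentioned in the text.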

3.2. Vessel Data Lake Layer

In this subsection, a Hadoop [28] cluster-based Vessel Data Lake is implemented to enable the storage and processing of large-volume vessel data in the Vessel Data Lake layer. Apache Hadoop is open-source software for reliable and scalable distributed computing. The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model. Figure 2 shows a conceptual diagram of the software stack of the Hadoop-based Vessel Data Lake. Hadoop HDFS (Hadoop Distributed File System) is a file system that stores large files, from tens of terabytes to petabytes or more, across distributed servers and enables fast processing of the stored data. Hadoop YARN (Yet Another Resource Negotiator) schedules the many tasks running in clusters of dozens or more nodes and manages the distributed resources (i.e., CPU, RAM) allocated to each task. Hadoop MapReduce is a data processing model designed to process large amounts of data in a distributed, parallel computing environment. When large data is received, it divides the data into blocks of a specific size and executes a Map task and a Reduce task for each block. The hardware of the Vessel Data Lake consists of the Vessel Big Data cluster and the Vessel AI cluster. The Vessel Big Data cluster is a Hadoop-based Data Lake cluster consisting of 1 master node and 3 slave nodes; detailed specifications are shown in Table 2. The Vessel AI cluster supports big data analytics and AI analytics with GPUs. Each node has a 1G network card and a 10G network card. The 10G network card is used for internal data movement, and the 1G network card is used for external control of the nodes.
Figure 2. Conceptual Diagram of Hadoop-based Vessel Data Lake (Software stack).
Table 2. Hardware Specifications of Vessel Data Lake.
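The block-wise Map and Reduce processing described above can be illustrated in plain Python. This is a single-process sketch of the programming model only; it does not use Hadoop, and the record layout, the sample data, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Divide the input into fixed-size blocks, mimicking how large data is split."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_task(block):
    """Map: produce a partial count of records per ship type for one block."""
    return Counter(ship_type for _mmsi, ship_type in block)


def reduce_task(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right


# Made-up (mmsi, ship type) records.
records = [
    ("440000001", "fishing"),
    ("440000002", "cargo"),
    ("440000003", "fishing"),
    ("440000004", "ferry"),
    ("440000005", "fishing"),
]

blocks = split_into_blocks(records, block_size=2)
totals = reduce(reduce_task, map(map_task, blocks), Counter())
```

In a real cluster the map tasks would run in parallel on the nodes holding each block, and only the small partial counts would travel over the network to the reducers.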

3.3. Vessel Data Warehouse Model

Vessel Data Warehouse Model consists of Vessel Big Data layer and Vessel AI layer. The Vessel Big Data layer transforms the source data and loads it into the Data Lake to enable vessel big data analysis or vessel AI analysis. Vessel AI layer supports basic models and analysis tools for vessel AI analysis.

3.3.1. Vessel Big Data Layer

The Vessel Big Data layer consists of the Vessel Big Data Management module, the Supporting AI Models module, and the Vessel Big Data Analysis module. The Vessel Big Data Management module performs original data collection, data purification/modeling, data import/export, data insertion/deletion, and data saving/loading for big data analysis. It also transforms source data to enable direct big data processing. The Supporting AI Models module performs data preprocessing, missing data processing, categorical data processing, and feature scaling for AI analysis. It can also support data cleaning, labeling, storage, and convergence analysis for all types of structured and unstructured vessel data. The Vessel Big Data Analysis module, as shown in Figure 1c, provides Ship Detection based on Marine Related Information, Classification of Vessel Type based on Marine Association Information, Unidentified Ship Detection, Ship Detection and Type Classification based on Marine Geographic Information, Marine Information Convergence, and other analysis functions. Figure 3b shows the functions of the Vessel Big Data layer and the open-source software stack used to implement them.
Figure 3. Open Source Software stack of Vessel Big Data layer.
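The preprocessing duties listed for the Supporting AI Models module (missing data processing, categorical data processing, and feature scaling) can be sketched as follows. The source does not say which techniques the module uses, so mean imputation, one-hot encoding, and min-max scaling are assumptions chosen for illustration, and all names and sample values are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Replace None with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(labels):
    """Encode categorical labels as one-hot vectors in sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up speed-over-ground values with one gap, and made-up ship types.
sog = [4.0, None, 12.0, 8.0]
ship_type = ["fishing", "cargo", "fishing", "ferry"]

sog_filled = fill_missing(sog)          # the gap becomes the mean of 4, 12, 8
sog_scaled = min_max_scale(sog_filled)  # values now lie between 0 and 1
type_encoded = one_hot(ship_type)       # columns: cargo, ferry, fishing
```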
The software stack of the Vessel Big Data layer is composed as follows. Hue [29] in Figure 3b is used to run the functions of the Vessel Big Data Management module or the Supporting AI Models module with SQL on a dashboard. Hue is an open-source SQL assistant for databases and data warehouses that is provided in dashboard form. In this paper, we use Impala [30] and Kudu [31] in Figure 3b to build a Data Warehouse on the Hadoop file system (e.g., Data Lake); together they handle the functions of the Vessel Big Data Management module or the Supporting AI Models module. Apache Impala is a query processing engine. Apache Kudu is an open-source distributed data storage engine that makes fast analysis on fast-changing data easy. Unlike many other columnar stores, Kudu provides a primary key, which enables millisecond-level random access. Since Kudu supports both OLAP (online analytical processing) and OLTP (online transaction processing) queries, the structure of the big data analysis system can be simplified. The Analytic Application module in Figure 3b provides tools to program and implement the functions of each module that cannot be processed with SQL. It supports programming languages such as Java, Python, Scala, and R, with which each function can be programmed on top of the Spark [32] module and the TEZ [33] module. Apache Spark is a unified open-source computing engine and set of libraries for processing data in parallel in a clustered environment. Spark supports Python, Java, Scala, and R, and provides a wide range of libraries from SQL to streaming and machine learning. Apache TEZ is a data processing framework that serves as an alternative to MapReduce and runs on top of Hadoop YARN. TEZ keeps the results of the Map phase in memory and transfers them directly to the Reduce phase, which improves speed by reducing I/O overhead.
  • Building a Vessel Data Lakehouse
This subsection shows how to build a Data Lakehouse in the Data Lake with the vessel-related data from Table 1 using the Vessel Big Data layer. Figure 4 shows the schema for building the AIS, V-Pass, VBD, and Observation data in Table 1 into a Data Lakehouse. Figure 5 describes the meaning of the field names in the schema of Figure 4. Tables are created in the schema format defined in Figure 4 by using Impala SQL in the Vessel Big Data layer. There are two ways to import the original CSV file data into a Data Lakehouse table: using Impala SQL and using the Import Python module of the Extraction and Ingestion layer. In this paper, Impala SQL is used as the import method. Figure 6 shows the tables of the Data Lakehouse as listed by the Impala SQL command “show tables;”. In Figure 6, tables whose names differ from those in Table 1 are intermediate tables created for analysis. Figure 7 shows the contents of the AIS Static table as returned by the “SELECT * FROM ais.staticais” command in Impala SQL.
Figure 4. Data Lakehouse schema for AIS, V-Pass, VBD, and Observation data.
Figure 5. The meaning description of field names in the schema of Figure 4.
Figure 6. Data Lakehouse tables for AIS, V-Pass, VBD, and Observation data.
Figure 7. Contents of the AIS Static table.
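Table creation with Impala SQL can be sketched by generating the DDL in Python. The sketch assumes the table is Kudu-backed, which the text suggests but does not state for each table, and the column names and types are hypothetical because the actual schema is given only in Figure 4. The statement shape (`PRIMARY KEY`, `PARTITION BY HASH`, `STORED AS KUDU`) follows Impala's syntax for Kudu tables; the helper name and partition count are illustrative.

```python
def kudu_create_table_sql(table, columns, primary_key, partitions=4):
    """Build an Impala CREATE TABLE statement for a Kudu-backed table."""
    names = [name for name, _ in columns]
    if primary_key not in names:
        raise ValueError(f"primary key {primary_key!r} is not a declared column")
    # Kudu expects primary-key columns to be listed first in the schema;
    # the stable sort keeps the remaining columns in their given order.
    ordered = sorted(columns, key=lambda col: col[0] != primary_key)
    column_sql = ",\n".join(f"  {name} {sql_type}" for name, sql_type in ordered)
    return (
        f"CREATE TABLE {table} (\n"
        f"{column_sql},\n"
        f"  PRIMARY KEY ({primary_key})\n"
        f")\n"
        f"PARTITION BY HASH ({primary_key}) PARTITIONS {partitions}\n"
        f"STORED AS KUDU;"
    )


# Hypothetical columns for the AIS Static table named in the text.
ddl = kudu_create_table_sql(
    "ais.staticais",
    [("ship_name", "STRING"), ("mmsi", "BIGINT"), ("ship_type", "INT")],
    primary_key="mmsi",
)
```

The generated string could then be submitted through Hue or any Impala client; the sketch deliberately stops short of opening a connection.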

3.3.2. Vessel AI (Artificial Intelligence) Layer

The Vessel AI layer in Figure 8 consists of the Vessel AI Model module, the ML (Machine Learning)/DL (Deep Learning) Algorithm module, and the AI Framework module. The Vessel AI Model module in Figure 8a consists of 5 basic models for ship AI analysis and other extended models. In this module, an analysis model suited to the purpose of each AI analysis is created, trained, and tested on the vessel-related data stored in the data layer, and then AI analysis or prediction is performed. The module supports basic models such as the Vessel Track Prediction Model, Abnormal Ship Detection Model, Ship Activity Analysis Model, Ship Distribution Prediction Model, and Fish Habitat Suitability Analysis Model. If a special analysis beyond the basic models is required, an extended model can be created by selecting an appropriate algorithm from the ML/DL Algorithm module. The ML/DL Algorithm module in Figure 8b supports the machine learning or deep learning algorithms used in the Vessel AI Model module. It provides HMM (Hidden Markov Model), Association Rule, K-means, Decision Tree, Random Forest, CNN (Convolutional Neural Network), and RNN (Recurrent Neural Network) as basic algorithms. If an extended algorithm is required, it is supported by the AI Framework module in Figure 8c. The AI Framework module provides Anaconda, a Python distribution with package and environment management, as well as the ML development frameworks TensorFlow, PyTorch, and Keras.
Figure 8. Conceptual Diagram of Vessel AI layer. (a) Vessel AI Model; (b) ML/DL Algorithm; (c) AI Framework.

3.4. Vessel Application Services

Vessel Application Services in Figure 1d consists of Visualization Services and Operation Services. Operation Services provides the Vessel Big Data Analysis results of the Vessel Big Data layer and the AI Analysis results of the Vessel AI layer. Visualization Services provides visualization of the Big Data Analysis results and AI Analysis results. Since this paper focuses on building a Vessel Data Lakehouse, only a few Vessel Application Services were implemented as use cases. The following subsections present a method of identifying vessel distribution and activity intensity using the Vessel Big Data layer and a method of predicting fishing activity using the Vessel AI layer.

3.4.1. Identification of Distribution of Ship Types

This subsection shows how to identify the distribution by ship type based on AIS data in Busan and visualize the activity intensity of each ship. Calculating the distribution by ship type consists of three steps. In the first step, only the data for the Busan area is extracted from the AIS data, and the extracted data is preprocessed. The second step calculates the activity intensity of each ship. The last step visualizes the activity intensity of each ship on a map. Table 3 shows the characteristics of the AIS data built in the Vessel Data Lakehouse.
Table 3. Characteristics of AIS Data.
In the preprocessing step, Impala SQL is used to extract AIS data within the range of Busan shown in Figure 9. Table 4a shows the number of records for each ship type in the Busan area. Table 4b shows the number of records extracted from the data in Table 4a at 2-min intervals. From the data in Table 4a, records are grouped with the MMSI (Maritime Mobile Service Identity) in AIS as the key and clustered at daily intervals, 10 SOG (Speed Over Ground) values are extracted from each day's data, and the COG (Course Over Ground) in AIS is normalized by using Equation (1). The category labels are assigned as follows: 0 is a fishing ship, 1 is a non-fishing ship, 2 is a ferry, and 3 is a cargo ship. Figure 10 shows the preprocessing results of AIS data for fishing ships.
$$\mathit{ncog} = \frac{\sum_{k=1}^{n} \left( \mathit{cog}_{k+1} - \mathit{cog}_{k} \right)}{n} \qquad (1)$$
where ncog (i.e., scog) is the normalized cog, and cog is the course over ground in AIS.
Figure 9. AIS Data Extraction Range.
Table 4. Preprocessing of AIS Data.
Figure 10. Preprocessing results of AIS data for fishing ships.
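The preprocessing step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code, which uses Impala SQL: the latitude and longitude bounds are the ones given in Algorithm 1, the record fields (`mmsi`, `t` in seconds, `lat`, `lon`, `cog`) and sample values are hypothetical, plain grouping by MMSI stands in for the clustering step, and Equation (1) is read as the mean difference between consecutive COG values. Daily clustering and the extraction of 10 SOG values are omitted for brevity.

```python
from collections import defaultdict

# Extraction range as given in Algorithm 1 (degrees).
LAT_RANGE = (33.0, 38.0)
LON_RANGE = (124.0, 132.0)


def in_range(record):
    """Return True if the record lies inside the extraction range."""
    return (LAT_RANGE[0] <= record["lat"] <= LAT_RANGE[1]
            and LON_RANGE[0] <= record["lon"] <= LON_RANGE[1])


def group_by_mmsi(records):
    """Group in-range records by MMSI, each track ordered by timestamp."""
    tracks = defaultdict(list)
    for record in records:
        if in_range(record):
            tracks[record["mmsi"]].append(record)
    for track in tracks.values():
        track.sort(key=lambda r: r["t"])
    return dict(tracks)


def sample_every(track, interval_s=120):
    """Keep the first record, then one record per 2-minute interval."""
    sampled = []
    for record in track:
        if not sampled or record["t"] - sampled[-1]["t"] >= interval_s:
            sampled.append(record)
    return sampled


def ncog(cogs):
    """Mean difference between consecutive COG values (assumed reading of Eq. 1)."""
    diffs = [b - a for a, b in zip(cogs, cogs[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Made-up records: vessel "A" is in range, vessel "B" is not.
records = [
    {"mmsi": "A", "t": 0, "lat": 35.1, "lon": 129.0, "cog": 10.0},
    {"mmsi": "A", "t": 60, "lat": 35.1, "lon": 129.0, "cog": 15.0},
    {"mmsi": "A", "t": 120, "lat": 35.1, "lon": 129.0, "cog": 30.0},
    {"mmsi": "A", "t": 240, "lat": 35.1, "lon": 129.0, "cog": 20.0},
    {"mmsi": "B", "t": 0, "lat": 40.0, "lon": 129.0, "cog": 0.0},
]

tracks = group_by_mmsi(records)
sampled = sample_every(tracks["A"])
value = ncog([r["cog"] for r in sampled])
```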
In the second step, Equation (2) is used to calculate the cluster strength for each MMSI. The higher the cluster strength, the higher the vessel’s activity. Figure 11 shows the calculation result of cluster strength by MMSI.
$$\mathit{SC} = \mathit{ns} \times \mathit{ascog} \times 0.2 + \mathit{scog} \times 0.5 + \mathit{asog} \times 0.3 \qquad (2)$$
where SC is the cluster strength, ns is the number of samples for the MMSI, ascog is the average of the sum of the absolute values of cog, scog is the normalized cog, and asog is the average sog.
Figure 11. Calculation result of cluster strength by MMSI.
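The cluster strength calculation can be sketched as a small function. It evaluates Equation (2) exactly as written, with standard operator precedence, so ns multiplies only the first term; if the intended grouping is ns multiplied by the whole weighted sum, parentheses would have to be added. The feature values below are made up for illustration.

```python
def cluster_strength(ns, ascog, scog, asog):
    """Evaluate Equation (2) as written, using standard operator precedence."""
    return ns * ascog * 0.2 + scog * 0.5 + asog * 0.3


# Made-up per-MMSI features: (ns, ascog, scog, asog).
features = {
    "440000001": (10, 2.0, 4.0, 6.0),
    "440000002": (3, 1.0, 2.0, 5.0),
}

strengths = {mmsi: cluster_strength(*values) for mmsi, values in features.items()}

# Rank vessels from most to least active according to the text's interpretation.
ranked = sorted(strengths, key=strengths.get, reverse=True)
```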
In the last step, the data in Figure 11 are visualized on Google Earth, as shown in Figure 12.
Figure 12. Visualization result of ship’s activity intensity on Google Earth.
Algorithm 1 for identifying the distribution of ship types is as follows.
Algorithm 1. ShipActivity(A)
Input: the AIS data set A, cog in AIS is a course over ground, ns is a number of sampling of mmsi, SC is a strength of clusters, ascog is an average of sum of abs of cog, scog is a normalization of cog, and asog is an average of sog
Output: the preprocessing data set T1, the calculated cluster strength data set T2
Method:
01: T1 ← Preprocessing(A);
02: T2 ← Clusterstrength(T1);
03: Visualization(T2);

Preprocessing(A)
04: for i ← 1 to n do
05:   if 33 <= A_{i,latitude} <= 38 and 124 <= A_{i,longitude} <= 132
06:   then Temp1_i ← A_i
07: end
08: Temp2_j ← kmeans(Temp1_i, mmsi)
09: extract Temp2_j ← 2-minute intervals of Temp2_j
10: extract Temp2_j ← 10 sogs of Temp2_j
11: for j ← 1 to m do
12:   Temp3_j ← $\frac{\sum_{k=1}^{n} \left( \mathit{cog}_{k+1} - \mathit{cog}_{k} \right)}{n}$
13: end
14: for l ← 1 to p do
15:   Temp4_l ← classify(Temp3_j, type)
16: end
17: Return Temp4_l

Clusterstrength(T1)
18: for i ← 1 to n do
19:   Temp5_i ← $\mathit{ns} \times \mathit{ascog} \times 0.2 + \mathit{scog} \times 0.5 + \mathit{asog} \times 0.3$
20: end
21: Return Temp5_i

Visualization(T2)
22: for i ← 1 to n do
23: display(T2i,mmsi);
24: display(T2i,SC)
25: end
Lines 4 to 16 are the preprocessing phase, in which 10 SOG (speed over ground) values are extracted from the daily AIS data and the normalized cog is calculated. Lines 18 to 21 are the cluster strength phase, in which the cluster strength is computed for each MMSI. Lines 22 to 25 are the visualization phase, in which the cluster strength, that is, the activity intensity for the ship’s MMSI, is displayed on Google Earth.

3.4.2. Predicting Fishing Activity

This subsection shows how to predict fishing activity using LSTM (Long Short-Term Memory) and visualize the fishing activity of each ship in Google Earth. Fishing activity was predicted using the data in Figure 10, which is the result of the preprocessing algorithm in Section 3.4.1. The inputs and outputs of the LSTM were designed around the ship speed at 10 intervals and the ncog (normalized course over ground) in the data in Figure 10. Figure 13 shows the predicted fishing activities. The red line represents fishing activity and the yellow line represents sailing. A green triangle indicates the ship’s current position.
Figure 13. Visualization result of Fishing Activity Prediction on Google Earth.
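One way to arrange the inputs described above for an LSTM is shown below. This is only a sketch under stated assumptions, not the authors' design: the source says the model uses 10 speed values and ncog but not how they are laid out, so repeating ncog at every time step is an assumption, and the helper names `make_sample` and `batch_shape` and the sample values are hypothetical. The resulting (samples, time steps, features) layout is the one that recurrent layers in Keras expect.

```python
def make_sample(sogs, ncog, steps=10):
    """Build one sequence of `steps` time steps, each holding [sog, ncog]."""
    if len(sogs) != steps:
        raise ValueError(f"expected {steps} speed values, got {len(sogs)}")
    return [[float(sog), float(ncog)] for sog in sogs]


def batch_shape(batch):
    """Return (samples, time steps, features) for a nested-list batch."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


# Two made-up tracks: slow, turning movement versus fast, straight movement.
batch = [
    make_sample([1.2, 0.8, 1.0, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0, 0.9], ncog=35.0),
    make_sample([11.5, 11.8, 12.0, 11.9, 12.1, 12.0, 11.7, 11.9, 12.2, 12.0], ncog=1.5),
]

shape = batch_shape(batch)
```

A nested list of this shape can be converted to an array and passed to a recurrent layer, with one output label per sample (for example, fishing or sailing).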

4. Experimental Results

This section presents the performance evaluation of the Data Lakehouse and the analysis results of the implemented system.

4.1. Data Lakehouse Performance Evaluation

To measure the performance of the Data Lakehouse, this subsection compares the query processing results of PostgreSQL, a relational database, and Impala, the query engine of the Data Lake. The Marine Data Lake data in Table 1 were used to evaluate the performance of the Data Lakehouse. Figure 14 shows the comparison of query processing between PostgreSQL and Impala SQL. The measure is the time it takes to process a count query, which counts the number of rows. The shorter the query processing time, the better the performance. Query processing for the same data was evaluated on a Data Lake cluster with 4 nodes and on PostgreSQL with 1 node. The hardware specifications of the Data Lake node and the PostgreSQL node are configured. The evaluation query is “SELECT count(*) FROM table-name”.
Figure 14. Query Response Time (in seconds) Comparison Results between PostgreSQL and Impala.
Figure 14a compares the results for the AIS Static data, which is 1.3 MB in size and has 13,856 rows. The PostgreSQL query processing time for the AIS Static data is 0.081 s and the Impala query processing time is 0.0054 s, so Impala is about 15 times faster than PostgreSQL. Impala’s query response time for the AIS Static data is about 93.33% shorter than PostgreSQL’s. Figure 14b compares the results for the AIS Dynamic data, which is 31 GB in size and has 354,857,410 rows. The PostgreSQL query processing time for the AIS Dynamic data is 87 s and the Impala query processing time is 2.28 s, so Impala is about 38.2 times faster than PostgreSQL. Impala’s query response time for the AIS Dynamic data is about 97.38% shorter than PostgreSQL’s. Figure 14c compares the results for the V-Pass data, which is 2.8 GB in size and has 35,053,969 rows. The PostgreSQL query processing time for the V-Pass data is 8.759 s and the Impala query processing time is 1.039 s, so Impala is 8.43 times faster than PostgreSQL. Impala’s query response time for the V-Pass data is about 88.14% shorter than PostgreSQL’s.
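The speed-up factors and percentage reductions can be recomputed from the reported query times with a few lines of Python. The times are the ones quoted in the text; the function and variable names are illustrative.

```python
# Query times in seconds as reported in the text: (PostgreSQL, Impala).
TIMES = {
    "AIS Static": (0.081, 0.0054),
    "AIS Dynamic": (87.0, 2.28),
    "V-Pass": (8.759, 1.039),
}


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def reduction_pct(baseline_s, candidate_s):
    """Percentage by which the candidate shortens the baseline time."""
    return (1.0 - candidate_s / baseline_s) * 100.0


summary = {
    name: (round(speedup(pg, impala), 2), round(reduction_pct(pg, impala), 2))
    for name, (pg, impala) in TIMES.items()
}

average_reduction = round(
    sum(reduction_pct(pg, impala) for pg, impala in TIMES.values()) / len(TIMES), 2
)
```

The mean of the three percentage reductions comes to about 92.95%, which is the average figure quoted in the Conclusions.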

4.2. Marine Analysis Performance Evaluation

This subsection evaluates the marine analysis performance of the Vessel AI layer for fishing activity prediction and vessel type prediction. The data in Table 4c, preprocessed in Section 3.4.1, is used for training and testing the prediction models. The 9872 rows of data are split into training and test sets in a 70:30 ratio. Decision Tree (DT), Random Forest (RF), LSTM (Long Short-Term Memory), and HMM (Hidden Markov Model) algorithms are used as predictive models. Since this paper focuses on the data side of building a Lakehouse from actual maritime observation data, the prediction models use the basic models provided by TensorFlow [34] and Keras [35] without tuning. Data processing within the predictive models used Pandas [36]. The input and output parts of each predictive model were modified to fit the preprocessed data. The ship type is predicted using the speed as the input value for each prediction model. Fishing activity prediction is performed only on fishing ship data, with speed and ncog as input values to each prediction model.
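The 70:30 split of the 9872 rows can be sketched as follows. The source does not say whether rows are shuffled or how the fractional boundary is rounded, so the seeded shuffle and the truncation to a whole number of training rows are assumptions, and integer row indices stand in for the actual records.

```python
import random


def train_test_split(rows, train_ratio=0.7, seed=0):
    """Shuffle a copy of the rows and cut it at the given training ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(9872))  # stand-in for the preprocessed records
train, test = train_test_split(rows)
```

Under these assumptions the split yields 6910 training rows and 2962 test rows.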
Figure 15 shows the procedure of Fishing Activity and Ship Type Prediction. Figure 16 compares the accuracy of fishing activity prediction and vessel type prediction. Figure 16a shows the vessel type prediction results for data with mixed ship types such as fishing ship, cargo, and ferry. In Figure 16a, we compared the prediction accuracy of vessel classification for four prediction models: DT, RF, LSTM, and HMM. The prediction accuracy of vessel classification of RF is approximately 1.10% higher than that of LSTM, 10.97% higher than that of DT, and 18.11% higher than that of HMM. Figure 16b shows the prediction results of fishing activities from only fishing ship data. In Figure 16b, we compared the prediction accuracy of fishing activity for four prediction models: DT, RF, LSTM, and HMM. The prediction accuracy of fishing activity of LSTM is approximately equal to that of RF, 2.8% higher than that of DT, and 16.8% higher than that of HMM. Figure 16 shows that prediction from data that has already been partly classified gives better results than prediction from a mixture of different types of data.
Figure 15. Procedures of Fishing Activity and Ship Type Prediction. (a) Selection models phase; (b) Training model phase; (c) Test models phase; (d) Inferences.
Figure 16. Comparison of Accuracy in Predicting Fishing Activities and Vessel Types.

5. Conclusions

Various challenges are currently affecting the development of large-scale marine data services, limiting users’ ability to use the full potential of this data ecosystem. From a technical point of view, these challenges are mainly related to the big data nature and high level of heterogeneity of marine data sources. In this paper, we designed and implemented the architecture of the Vessel Data Lakehouse, which can efficiently manage various types (i.e., heterogeneous) of vessel-related data. The proposed Vessel Data Lakehouse consists of the Extraction and Ingestion layer that collects and stores data, the Vessel Data Lake layer that can handle marine big data, the Vessel Data Warehouse Model that supports marine big data processing and AI, and Vessel Application Services that supports marine application services. The Extraction and Ingestion layer extracts vessel-related data from data sources and stores it in the Vessel Data Lake layer. In the Vessel Data Lake layer, a Data Lake for AIS, V-Pass, VBD, and Observation data was constructed based on Apache Hadoop. The Vessel AI layer of the Data Warehouse Model supports AI analysis and prediction for vessel application services. The Vessel Application Service supports the Visualization service, Big Data Analysis service, and AI Analysis service for vessels. We showed a use case of constructing a Vessel Data Lakehouse from actual vessel-related data and a use case of analyzing vessel distribution and fishing activities with the Vessel Application Service. In the experiment on about 34 GB of AIS and V-Pass data, the Data Lakehouse showed an average query response time 92.95% shorter than that of the relational database, demonstrating the efficiency of the proposed Data Lakehouse. Since the Data Lakehouse in this paper focuses on structured AIS data and observational time series data, it is still insufficient for processing large-scale ocean image data.
We plan to expand our current Data Lakehouse using Delta Lake [37] and satellite imagery (i.e., satellite AIS data, satellite SAR data, satellite EO/IR data) to handle large amounts of image data in future work.

Author Contributions

Conceptualization, S.P. and C.-S.Y.; methodology, S.P.; software, S.P.; validation, S.P., C.-S.Y. and J.K.; formal analysis, S.P.; investigation, S.P.; resources, C.-S.Y.; data curation, S.P.; writing—original draft preparation, S.P.; writing—review and editing, C.-S.Y. and J.K.; visualization, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01842, Artificial Intelligence Graduate School Program (GIST)). This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub). This work was supported by the project “Monitoring System of Spilled Oils Using Multiple Remote Sensing Techniques” funded by the Korea Coast Guard.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Data Lakehouse. Available online: https://databricks.com/glossary/data-lakehouse (accessed on 11 January 2023).
  2. Orescanin, D.; Hlupic, T. Data Lakehouse—A Novel Step in Analytics Architecture. In Proceedings of the 44th International Convention on Information, Communication and Electronic Technology, Opatija, Croatia, 27 September 2021. [Google Scholar]
  3. Vessel Monitoring System. Available online: https://en.wikipedia.org/wiki/Vessel_monitoring_system (accessed on 11 January 2023).
  4. Lytra, I.; Vidal, M.E.; Orlandi, F.; Attard, J. A Big Data Architecture for Managing Oceans of Data and Maritime Applications. In Proceedings of the International Conference on Engineering, Technology and Innovation, Madeira, Portugal, 27 June 2017.
  5. Lin, B. Overview of High Performance Computing Power Building for the Big Data of Marine Forecasting. In Proceedings of the 2020 International Conference on Big Data and Informatization Education (ICBDIE), Zhangjiajie, China, 23 April 2020.
  6. Armbrust, M.; Ghodsi, A.; Xin, R.; Zaharia, M. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Proceedings of the 11th Annual Conference on Innovative Data System Research, Online, 11 January 2021.
  7. Begoli, E.; Goethert, I.; Knight, K. A Lakehouse Architecture for the Management and Analysis of Heterogeneous Data for Biomedical Research and Mega-biobanks. In Proceedings of the 2021 IEEE International Conference on Big Data, Online, 15 December 2021.
  8. Park, S.; Cha, B.R.; Kim, J.W. Designing Marine Data Lakehouse Architecture for Managing Maritime Analytics Application. In Proceedings of the 9th International Conference on Advanced Engineering and ICT-Convergence, Jeju Island, Republic of Korea, 13 July 2022.
  9. Harby, A.A.; Zulkernine, F. From Data Warehouse to Lakehouse: A Comparative Review. In Proceedings of the 2022 IEEE International Conference on Big Data, Osaka, Japan, 17 January 2022.
  10. Kumar, D.; Li, S. Separating Storage and Compute with the Databricks Lakehouse Platform. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 12 October 2022.
  11. Hery, H.; Lukas, S.; Yugopuspito, P.; Murwantara, I.M.; Krisnadi, D. Website Design for Locating Tuna Fishing Spot Using Naïve Bayes and SVM Based on VMS Data on Indonesian Sea. In Proceedings of the 3rd International Seminar on Research of Information Technology and Intelligent System, Yogyakarta, Indonesia, 10 December 2020.
  12. Zhao, Z.; Tian, Y.; Hong, F.; Huang, H.; Zhou, S. Trawler Fishing Track Interpolation using LSTM for Satellite-based VMS Traces. In Proceedings of the Global Oceans, U.S. Gulf Coast, Singapore, 5 October 2020.
  13. Ahmed, I.; Jun, M.; Ding, Y. A Spatio-Temporal Track Association Algorithm Based on Marine Vessel Automatic Identification System Data. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20783–20797.
  14. Beek, R.V.; Gaol, J.L.; Agus, S.B. Analysis of Fishing with Led Lights in and around MPA and No Take Zones at Natuna Indonesia through VMS and VIIRS Data. In Proceedings of the IEEE Asia-Pacific Conference on Geoscience, Electronics and Remote Sensing Technology, Jakarta, Indonesia, 7 December 2020.
  15. Huang, J.; Wan, J.; Yu, J.; Zhu, F.; Ren, Y. Edge Computing-Based Adaptable Trajectory Transmission Policy for Vessels Monitoring Systems of Marine Fishery. IEEE Access 2020, 7, 50684–50695.
  16. Li, X.; Xia, Y.; Su, F.; Wu, W.; Zhou, L. AIS and VBD Data Fusion for Marine Fishing Intensity Mapping and Analysis in the Northern Part of the South China Sea. Int. J. Geo-Inf. 2021, 10, 277.
  17. Souza, E.N.; Boerder, K.; Matwin, S.; Worm, B. Improving Fishing Pattern Detection from Satellite AIS Using Data Mining and Machine Learning. PLoS ONE 2016, 11, e0163760.
  18. Alba, J.M.M.; Dy, G.C.; Virina, N.I.M.; Samonte, M.J.C. Localized Monitoring Mobile Application for Automatic Identification System (AIS) for Sea Vessels. In Proceedings of the IEEE 7th International Conference on Industrial Engineering and Applications, Paris, France, 4 January 2020.
  19. Prasad, P.; Vatsal, V.; Chowdhury, R.R. Maritime Vessel Route Extraction and Automatic Information System (AIS) Spoofing Detection. In Proceedings of the 2021 International Conference on Advances in Electrical Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 19 February 2021.
  20. Evmides, N.; Odysseos, L.; Michaelides, M.P. An Intelligent Framework for Vessel Traffic Monitoring using AIS Data. In Proceedings of the 23rd IEEE International Conference on Mobile Data Management, Online, 6 June 2022.
  21. Liu, R.W.; Liang, M.; Nie, J.; Garg, S.; Zhang, Y.; Xiong, Z. Extraction of Hottest Shipping Routes: From Positioning Data to Intelligent Surveillance. In Proceedings of the IEEE 22nd International Conference on Information Reuse and Integration for Data Science, Las Vegas, NV, USA, 10 August 2021.
  22. Huang, H.; Cui, X.; Bi, X.; Liu, C.; Hong, F.; Guo, S. FVRD: Fishing Vessels Relationships Discovery System Through Vessel Trajectory. IEEE Access 2020, 8, 112530–112538.
  23. Xiao, Z.; Fu, X.; Zhao, L.; Zhang, L.; Teo, T.K.; Li, N.; Zhang, W.; Qin, Z. Next-Generation Vessel Traffic Services Systems—From “Passive” to “Proactive”. IEEE Intell. Transp. Syst. Mag. 2022, 15, 363–377.
  24. Tampakis, P.; Chondrodima, E.; Pikrakis, A.; Theodoridis, Y.; Pristouris, K.; Nakos, H.; Petra, E.; Dalamagas, T.; Kandiros, A.; Markakis, G. Sea Area Monitoring and Analysis of Fishing Vessels Activity: The i4sea Big Data Platform. In Proceedings of the 21st IEEE International Conference on Mobile Data Management, Versailles, France, 30 June 2020.
  25. Han, J.R.; Kim, T.H.; Choi, E.Y.; Choi, H.W. A Study on the Mapping of Fishing Activity using V-Pass Data—Focusing on the Southeast Sea of Korea. J. Korean Assoc. Geogr. Inf. Stud. 2021, 24, 112–125.
  26. Weather Data Open Portal. Available online: https://data.kma.go.kr/cmmn/main.do (accessed on 6 January 2023).
  27. Ocean Data in Grid Framework. Available online: http://www.khoa.go.kr/oceangrid/khoa/intro.do (accessed on 6 January 2023).
  28. Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 9 January 2023).
  29. Hue. Available online: https://gethue.com/ (accessed on 9 January 2023).
  30. Apache Impala. Available online: https://impala.apache.org/ (accessed on 9 January 2023).
  31. Apache Kudu. Available online: https://kudu.apache.org/ (accessed on 9 January 2023).
  32. Apache Spark. Available online: https://spark.apache.org/ (accessed on 9 January 2023).
  33. Apache TEZ. Available online: https://tez.apache.org/ (accessed on 9 January 2023).
  34. TensorFlow. Available online: https://tensorflow.org/ (accessed on 9 April 2023).
  35. Keras. Available online: https://keras.io/ (accessed on 9 April 2023).
  36. Pandas. Available online: https://pandas.pydata.org/ (accessed on 9 April 2023).
  37. Delta Lake. Available online: https://delta.io/ (accessed on 29 March 2023).
