In the realm of big data, where new knowledge is extracted through data analysis and mining techniques such as machine learning, correlation, and cluster analysis, data heterogeneity and interoperability are common challenges. Ontologies and Findable, Accessible, Interoperable, and Reusable (FAIR) systems are presently able to handle these challenges effectively [1]. However, another challenge is rising, concerning data volume and storage lifespan. In the past few decades, due to the vast amount of data being generated each second, data storage systems and analytical tools have come to play a vital role in the big data ecosystem. They facilitate the processes of storing, manipulating, analyzing, and accessing structured and unstructured data (J. Dixon. Pentaho, Hadoop, and data lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/).
Among modern data storage systems and repositories, we are primarily interested here in data lakes, designed to store a large volume of data in any format and structure. The data lake is a recent generation of storage systems, conceived as a data repository that offers a flexible platform for data storage, access, exploration, and analysis [2]. Because data lakes can handle data heterogeneity, they provide means to generate new knowledge and identify data patterns from large amounts of data, independently of their format and structure. According to Fang, a data lake is a cost-efficient data storage system enabled by the new generation of data management technologies to master big data problems and improve the data analysis process, from ingestion, storage, exploration, and exploitation of data in their native format to mining information and extracting new knowledge from massive unstructured data [4]. A data lake uses a flat architecture to collect and store data, on a platform initially based on Apache Hadoop, an open-source framework for distributed storage and processing of big data [5
As mentioned above, a data lake operates as a central repository that loads data with a no-schema approach. This means that the data is ingested into the data lake without a predefined structure, and the schema is defined only when the data is used and queried. This approach is known as schema-on-read or “late bindings” and is the opposite of schema-on-write, which is common in data warehouses [4]. In data lakes, the “extreme volume” of raw data is stored and processed at the lowest possible cost, unlike data warehouses, which load large volumes of “cleansed” data in a more costly manner [4]. According to Sawadogo and Darmont, a data lake can be viewed either as a form of data warehouse that collects multiple structured data at minimum operational cost before the extract-transform-load (ETL) process, or as a global storage system that contains a data warehouse for enhancing data life cycle monitoring with “cross-reference analyses
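The schema-on-read principle described above can be illustrated with a minimal Python sketch. It is not tied to any particular data lake product: the `raw_zone` list, the `read_with_schema` helper, and the sample records are hypothetical names and values introduced only for illustration. Records are stored exactly as they arrive, and a schema is applied only when a reader queries them.

```python
import json

# Raw records are ingested as-is: no schema is enforced at write time.
# (Hypothetical sample data, for illustration only.)
raw_zone = [
    '{"sensor": "t1", "temp": "21.5", "unit": "C"}',
    '{"sensor": "t2", "temp": "19.0"}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at query time and skip records that do not fit it."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        try:
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            continue  # this record does not match the reader's schema
    return rows


# The schema belongs to the reader, not to the stored data.
temperature_schema = {"sensor": str, "temp": float}
readings = read_with_schema(raw_zone, temperature_schema)
```

A different reader could apply another schema (for example, one selecting `user` and `action`) to the same raw records, which is the flexibility that schema-on-write gives up.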
A successful data lake must satisfy properties concerning data handling and management such as: cost-effective and flexible ingestion, storage, processing, data access, and “applicable data governance policies” [8
In general, data lakes contain heterogeneous and multi-modal data, which makes their analysis complex and calls for rigorous processes to maintain and ensure data integrity from storage to exploitation. Such processes improve data quality for data scientists and decrease the cost and risk of data storage. Hence the concept of data governance has resurfaced to support the mastering of data management, to control data quality, and to improve business intelligence in insurable manners [9]. Nevertheless, the life cycle of data that enters a data lake is seldom accounted for. There is a strong need to conceive, define, and implement data governance mechanisms to handle proper data retention and minimize the risk of data swamps.
According to Madera and Laurent, “data governance is concerned with the data life cycle, quality and security of data” [2] in any storage system. Hence it is a fundamental issue for data lake authenticity. Data governance disciplines and strategies aim to prevent data lakes from becoming data swamps or retaining poor-quality data [10]. These disciplines control or fix data quality dimensions such as “accuracy”, “completeness”, “consistency”, “currency”, and “uniqueness” to guarantee the validity of data [11] and complement data management [12]. In the big data ecosystem, many governance mechanisms have been proposed to guarantee the veracity and accuracy of the data value. In particular, Abraham, Schneider, and Vom Brocke [14] distinguish three categories of data governance mechanisms that are frequently implemented for data management:
Structural mechanisms, which refer to the governance structures
Procedural mechanisms, which relate to the policies for data management
Relational mechanisms, which concern stakeholder communications
With reference to this classification, some researchers define standards or guidelines to manage data in data repositories [15]. Others, like Yebenes and Zorrilla [16], propose frameworks for big data management. Some researchers put the emphasis on communication agreements to deploy feasible data governance [17]. Since data access can be a strong competitive advantage for any organization and data is shared to exchange information, Van den Broek and Van Veenstra [18] presented some regulations to govern and balance data contributions. Data governance plays an important role in improving self-service business intelligence in the big data era [19]. Consequently, beneficial data governance guidelines could minimize the risk of poor data quality in data mining processes and improve their accuracy [10]. Data governance assessment improves the strategy frameworks for deploying successful data governance with respect to relevant focus areas [20]. A practical data governance framework, with a focus on data quality, increases confidence in the exploration and exploitation of the data [13] and monitors data quality efforts in a sustainable manner [21
All proposed mechanisms for data governance concentrate on setting principles, roles, and structures to improve data quality and data lake security. Data life cycle management is one of the most important reasons for applying data governance in a data repository. However, the influence of data lifespan on strategies for deleting or preserving data in the data lake has received little attention, even though the mortality and life expectancy of data is a serious issue when it comes to increasing the productivity of the data lake.
In this paper, we start from the assumption that data governance concerned with the data life cycle could help purge data repositories of useless data.
A data lake is defined as a system with multiple components, by analogy with the definition of a natural lake. Hence, some data governance policies or regulation methodologies can be drawn from systematic approaches or natural mechanisms to preserve and destroy data throughout their life cycle. This viewpoint offers considerable scope for governing a data lake effectively. In this article, we propose two solutions that are respectively derived from drawing analogies with (1) natural ecosystems and (2) the concept of the supply chain to address data lakes and their governance issues. Our approach is based on a comparison of the dynamics, life cycles, and operations within those two systems with those needed for data lakes. We show that such perspectives provide paradigms for optimizing data lake performance, and we describe some methods for sustainable data governance.
Nature ecosystem analogy. Let us consider living organisms, and DNA in particular. The information is determined by the activity of the “reader”. The principle is that data which is not read is not used. The data is not systematically destroyed after a “not-being-read” period, but if such a period becomes long, then the data is weakened and may disappear. Conversely, if the data is frequently read, then it is consolidated and solidified, even if this can have a penalizing effect later on.
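The read-driven principle stated above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `strength` score, the `decay`, `boost`, and `threshold` parameters, and their values are hypothetical and are one possible way to instantiate the notion of a "long" unread period.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    name: str
    strength: float = 1.0  # hypothetical vitality score in [0, 1]


def step(items, read_names, decay=0.5, boost=0.25, threshold=0.2):
    """Advance one period: reads consolidate an item, idleness weakens it."""
    survivors = []
    for item in items:
        if item.name in read_names:
            item.strength = min(1.0, item.strength + boost)  # consolidated by reading
        else:
            item.strength *= decay  # weakened, but not destroyed at once
        if item.strength >= threshold:
            survivors.append(item)  # items below the threshold disappear
    return survivors


lake = [DataItem("logs"), DataItem("sensor")]
for _ in range(3):
    lake = step(lake, read_names={"sensor"})
```

With these illustrative values, an item survives two unread periods and disappears after the third, whereas a regularly read item stays at full strength.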
What can happen in such situations is that the individual or the species disappears. At the same time, however, chance can create new data and multiply it. These are characteristics of living things, which can generate new data automatically. This “natural” mechanism can be applied to data governance in the data lake. Note that the notions of “long” or “chance” would then clearly need to be instantiated and specified.
Systematic approach analogy. The goal of a systematic approach is to identify the most efficient means to generate consistent and optimum results [22
]. Such approaches, implemented in the supply chain domain, are another analogy we draw to address our objectives. For instance, Chen and Huang [22] use a systematic approach to recognize the interactions between the supply chain’s members as system elements. To do so, they decompose the supply chain participants into sub-groups and sub-system elements and enhance the supply chain structure to represent a complex system that improves coordination and integration among supply chain elements.
Indeed, the strategies and methodologies frequently used in supply chain management bring practical paradigms for promoting service quality and resolving customer issues. If we consider a data lake as a supply chain and, consequently, data as a product, we could define a set of hybrid policies for improving data quality and thus reach an optimal data lake state. For example, lean management strategies provide approaches to minimize additional costs and eliminate waste in a data lake by identifying costly activities or non-valuable data [23]. Similarly, a strategy frequently used in the supply chain such as “agile management” will improve the responsiveness and flexibility of the data lake with regard to user requirements, with high quality of service even in critical situations [23
Those two frameworks can be viewed as effective paradigms for managing the data life cycle and its governance to ensure its viability. In the following, we postulate that, if data lakes are comparable to natural lakes and to supply chains, the processes derived from nature and supply chain management can be extrapolated to data management in repositories such as data lakes. Based on this positioning, we present a general analogy and comparison between supply chain management, natural lakes, and data lakes and identify similar aspects and components. Then, based on those similarities, we propose new methodologies to improve the validity of the data lake.
2. Our Approach and Contribution
Based on the definition of a “system”, the ecosystem and the supply chain are both considered intelligent systems that contain several components and are governed by specific rules and disciplines. A data lake is conceptually inspired by a natural lake. Consequently, all concepts that are frequently used for the data lake originate from the natural lake ecosystem. From another perspective, supply chain management provides appropriate concepts and processes that are also applicable to data lake management and data governance.
Our study is based on the position that a data lake, as a system, has many common and comparable elements with supply chains and natural ecosystems. Dealing with diverse and heterogeneous data in data lakes (like products in supply chains or species in ecosystems) requires hybrid solutions and methods for data management which can be accurately determined. In line with our focus on the data life cycle, we put the emphasis on designing practical methods to preserve valid data in the data lake and remove invalid or obsolete data from it. For instance, it is logical that some data will be separated from the data lake, like a defective product in a supply chain. The data can also be brought back to the data lake or kept after its usage, like a reusable or recyclable product in a closed-loop or reverse supply chain [24] or like the information in backward flow across the chain. This addresses which data is concerned.
In addition, a key challenge lies in the evaluation of data usage during their lifespan, because a data lake stores data that may be retrieved or queried in the future, rather than serving an immediate need [25
We assume that data acts like products in the supply chain or water in a natural lake. Hence, data has a probabilistic lifespan and may be valid and useful (i.e., have high value for exploration and exploitation) or invalid and obsolete (i.e., have no value and increase the risk of a data swamp). Therefore, to avoid storing invalid data and to manage the data life cycle, we tackle the challenge of data life expectancy by drawing analogies with processes used in the supply chain and the ecosystem to govern data lakes.
The questions are then:
Which aspects of the data lake are comparable with a natural lake and a supply chain?
Which strategies should be derived from nature and the supply chain for data governance?
How should these strategies be generalized to a data lake?
In this article, we contribute some first research positions, by:
Providing comparisons between a data lake, an ecosystem, and a supply chain (Element by Element);
Relying on supply chain management strategies for data governance (Systematic Manner);
Imitating nature's principles to manage the data life cycle (Natural Manner).
3. Comparing Data Lake, Ecosystem and Supply Chain
Each system consists of different components that work effectively together to achieve certain goals under deterministic or probabilistic restrictions and conditions. Furthermore, each system applies strategies to optimize different objective functions and improve overall performance. The performance of each system is evaluated against several criteria to examine how far these objectives have been met.
As previously mentioned, an ecosystem and a supply chain each inherently act as a system and are comparable in many respects. Similarly, a data lake, as a centralized storage system, behaves in accordance with analogous systematic paradigms. We have therefore elaborated tables that compare the supply chain, the data lake, and the ecosystem with each other, thus explaining the relationships we identified between a set of concepts. Table 1 and Table 2 present a general comparison of the three systems.
Following the structure of Table 1 and Table 2, we develop the different analogies further.
3.1. Supply Chain and Data Lake
Formally, a supply chain is a cooperation of different entities such as manufacturers, suppliers, distributors, and retailers that work together to provide specific products or services for consumers [26]. To create a profitable supply chain, all members of the chain should be vertically integrated, with all parties coordinated around the optimal goal of the chain [27]. One of the major considerations in supply chain management is the integration of all members towards a global goal, and the improvement of the flow of products and information across the chain. According to Simchi-Levi, Kaminsky & Simchi-Levi, Delfmann & Albers and Harland [28], supply chain management includes managerial techniques and processes to integrate all members of the chain, from suppliers to retailers, to minimize whole-system expenditures, improve chain profit, and increase service-level satisfaction. The first step in supply chain management is to define the objective functions of the chain, which are optimized over the decision variables characterized by the supply chain manager.
Typically, supply chain objective functions aim to minimize expenditures [31], waste, maintenance and storage costs, inventory cost, lead time, and customer service time [33], and to maximize profit, demand coverage, and service levels [32]. The fundamental goal of supply chain management is to add value and provide a clear competitive advantage to enhance chain productivity and efficiency. Meanwhile, to design, manage, and evaluate an integrated supply chain, some major modules need to be accurately defined [34
As Table 1 and Table 2 show, all the modules that have been characterized for the supply chain could also be defined for a data lake if we consider it as an integrated system with certain components and stages.
With respect to the first module, each member (level) of the supply chain is responsible for a specific task that enhances the value of the whole chain. Suppliers must provide the best raw material, manufacturers must produce high-quality products, distributors are responsible for logistics management, and retailers improve the service levels for the final customers. The result of the members’ collaboration is an optimal and integrated supply chain with high customer satisfaction. Likewise, according to LaPlante & Sharma [35], four major functions are described for data lakes, from data entry to its preparation for the final user (typically data scientists). These functions are divided into four principal stages, ingestion, storage, processing, and access, which organize data in levels. Ingestion management controls the data sources (where data come from), data storage (where data are stored), and the data arrival time (when data arrive). Ravat & Zhao [36] also proposed a “data lake functional architecture”, which is structured in four main zones: raw data zone, process zone, access zone, and governance zone. Given these proposed architectures, we can consider a data lake as a supply chain that collects, generates, transfers, and delivers data from several sources to the final users.
The second module is the product. The major products in the supply chain are commodities in forward flow and information in backward flow. In the data lake, the products are data, which play the role of both the commodities and the information of the supply chain. Accordingly, the main products of the data lake are the data, with an appropriate management plan from their ingestion to the extraction of information.
The third module of Table 1, and the main purpose of this comparison, is management strategies, which are defined as a set of improvement plans and patterns used for enhancing system performance and providing the specific principles and objectives to reach the goals [37]. Consequently, all other modules in the supply chain, such as parameters, objective function, decision variables, and constraints of the chain, are determined based on the relevant strategy. For example, green strategies are applied to the supply chain to minimize environmental cost and maximize the satisfaction of green-conscious customers [39]. Similarly, for data lake management, strategies such as data governance and metadata management are frequently used to accomplish definitive goals and increase data quality.
As mentioned in Table 1, the objective function is an important module that affects subsequent decisions in supply chain management [34]. Accordingly, cost minimization and profit maximization are two important objectives that the whole supply chain seeks to reach. Similar objectives are also common in data lakes. The goal of maximizing or minimizing the objective function is to obtain the optimal value for the decision variables with respect to the constraints of the problem. The type of these decision variables differs between the supply chain and the data lake, but they play the same role.
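To make the roles of objective function, decision variables, and constraints concrete for a data lake, the following Python sketch solves a toy retention problem by exhaustive search. Everything in it is an illustrative assumption rather than a model proposed here: the dataset names, sizes, usage values, and the storage budget are hypothetical. The decision variable is which datasets to keep, the objective is the total expected usage value, and the constraint is the storage capacity.

```python
from itertools import combinations

# Hypothetical datasets: (name, size in GB, expected usage value).
datasets = [("logs", 40, 3.0), ("sensor", 30, 5.0), ("finance", 50, 6.0), ("tmp", 25, 0.5)]
capacity_gb = 100  # constraint: hypothetical storage budget


def best_retention(datasets, capacity):
    """Search all retention subsets; return the feasible one with the highest value."""
    best_value, best_subset = 0.0, ()
    for r in range(1, len(datasets) + 1):
        for subset in combinations(datasets, r):
            size = sum(d[1] for d in subset)   # constraint side
            value = sum(d[2] for d in subset)  # objective function
            if size <= capacity and value > best_value:
                best_value, best_subset = value, subset
    return best_value, [d[0] for d in best_subset]


value, kept = best_retention(datasets, capacity_gb)
```

Exhaustive search is only workable for a handful of datasets; it is used here because it keeps the three modelling elements visible, not because it would scale to a real data lake.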
As we can see from Table 2, the number of facilities or warehouses is a critical decision variable in each supply chain, and deciding on it is a strategic, long-term decision [34]. Similarly, in data lakes, the optimal number of repositories or sets must be estimated accurately. Risk management plays a vital role in system management and is determined according to the internal and external conditions of each system [41]. In general, the risk of machine failure or defective products in the supply chain, and the risk of a data swamp and unreliable data in the data lake, are the most prominent risks. Finally, performance evaluation is essential for system development. Therefore, some evaluation standards are specified according to the characteristics of the systems [27
From both tables showcasing our comparison between the supply chain management and data lake, it is obvious that both systems have been generated for similar purposes which are:
Therefore, the supply chain and the data lake have many points in common. Thus, it seems logical that supply chain tools and strategies can be efficient in enhancing data lake performance and productivity. In this article, we propose to use one of the most successful assessment methods used to monitor the environmental performance of the supply chain, presented in Section 4. We intend to use it to implement data governance according to the life and death of data in data lakes.
3.2. Ecosystem and Data Lake
In this second analogy, we are considering the lake as an ecosystem filled with numerous living species. These species are the members of our system. They have different functions. For example, some species eat others. All species have a common feature: they reproduce and survive. However, the system is more than the sum of its members, and that is what we will detail.
The ecosystem is seen as an autonomous system whose regulations are not necessarily aimed at the survival of all species, but at guaranteeing the homeostasis and resilience of the system. Homeostasis is made possible by sets of regulations [45]. Biologists consider that resilience is linked to the complexity of the system, the number of species, and the number of internal regulations [46]. Thus, biologists consider that the more complex the system is, the healthier it is.
In our comparison, the essential point is homeostasis (decision variable in the table), and we will consider resilience as an underlying property of the system. On the scale of a living organism, homeostasis operates through a complex set of regulations according to a simple principle of three functions: a receiver, a control center, and an effector. In the case of an ecosystem, the mechanisms are more complex [48], and ecologists are only just able to analyze precisely the relations between homeostasis and resilience and their role in the stability of the system. For our study, we will retain that the ecosystem has internal regulatory functions that maximize its survival and good health. These functions are not determined by a system supervisor, but by the system itself.
The results of the comparison sections demonstrate that a data lake could be defined as an integrated system based on supply chain terms and ecosystem regulations in which all related members act coherently. In the next sections, concerning the table interpretation, we distinguish the methods of data governance in the data lake in two manners: supply chain-based method or systematic manner and ecosystem-based method or natural manner, to suggest two multidisciplinary solutions for managing the data life cycle in data repositories such as data lakes.
We provide here some detailed examples to illustrate our contribution and to point to further research that we will carry out, relying on the resulting tables.
3.3.1. Members
The supply chain is a network of multiple dependent or independent members or levels that contribute to each other with the common goal of adding value to a product or service from sources to destinations [27]. For example, in a three-level supply chain, the three principal members are the manufacturer, the distributor, and the retailer [28]. In data lakes, each stage acts as a member in a supply chain to provide (APIs, data and service endpoints), transport (IP addressing, …), store (HDFS, …), and make data accessible for the final users [35]. In biological systems, the levels are those of life, from DNA sequences to cells, bacteria, species, …, which are called ecosystem components. Therefore, these members, whether they belong to a supply chain, a data lake, or an ecosystem, are responsible for product quality and service levels.
3.3.2. Products
A broad range of products exists in the supply chain network, for instance, seasonal products like clothing, alimentary products like canned food or industrial products like machines, which are logistically managed with specific standards and fixed lifespan [28
In biological systems and natural lakes, products can be DNA sequences, species, or biomass which are reproduced and preserved by certain mechanisms in nature.
Similarly, in data lakes, data is the targeted product, which could be sensor data, web log data, financial data, or human- or machine-generated data, and must be stored and managed with suitable logistics.
In all these three systems, products can be considered at different levels of granularity, as components or complex systems.
3.3.3. Management Strategies
Each supply chain is managed with a specific strategy that depends on its objective, type of product, structure, and market demand [28]. For example, seasonal or perishable products, such as clothing or fresh food respectively, have a very short life cycle and need concrete planning to increase sales during their lifespan; hence an agile strategy could be an effective solution [23
On the other hand, for ecological products in the environmental supply chain, specifications such as whether a product is recyclable, along with other considerations in the logistical process, lead the green supply chain to derive numerous solution strategies [39
In the ecosystem, the main strategies for species evolution are mutation, recombination, selection, and drift.
By analogy, data in a data lake must follow certain regulations, depending on their structure and utility, regarding their entry into the data lake and their possible usages. The goal is to ensure the quality of the data mining process by deriving suitable data governance as a management strategy responsible for guaranteeing data quality.
3.3.4. Objective Functions
Objective functions are defined in line with the management strategies chosen for designing supply chain networks. For example, in a supply chain with seasonal products, the objective function could be service level maximization or response time minimization; in a green supply chain, we would define emission minimization or total cost minimization [50
On the other hand, the maximization of species reproduction and resilience of the ecosystem are major considerations of the ecological system.
Thus, in a data lake, the main objective function is related to minimizing poor data quality and maximizing the customer’s usage rate.
3.3.5. Decision Variables
A set of decision variables is commonly considered in supply chain optimization models. For instance, in the seasonal product supply chain, a decision variable can be the amount ordered, and in the green supply chain, the degree of environmental protection [50
Similarly, in the ecosystem, homeostasis is a key decision variable, for which we seek the optimal value.
In analogy, important decision variables in data lake management are defined as the total amount of satisfied demands or the number of users that are permitted to access the data lake.
3.3.6. Constraints
Constraints distinguish the scope of the optimization model. For example, the lead time is a critical constraint in the supply chain with seasonal products, and environmental level constraints are essential to reason about green supply chains [50
In the ecosystem, critical constraints such as global changes induced by drivers like enrichment and biotic invasions could restrain optimal interactions between species [51
In a data lake, the laws of gravity and data governance principles are the most important limitations that describe the problem boundaries.
3.3.7. Risk Management
For seasonal products, the risk of losing the customer has a high impact due to the short lifespan of products. In green supply chains, the risk of products with destructive effects is significant [31
Some remarkable risks like hydrologic perturbations which are derived from climate change could have a serious impact on ecological systems [52
Thus, in a data lake, storing unreliable data or data failure are major risks.
3.3.8. Qualitative Performance Measurement
Qualitative performance measurement is essential for evaluating any system efficiency, and to examine actual gaps between the existing and the desired system [27
]. For example, customer satisfaction or the rate of flexibility are qualitative performance measurements for the seasonal product supply chain, and the degree of adaptation of the chain to environmental standards is one for the green supply chain [40
Resilience and optimal ecosystem functionality are important quantitative qualifications which are determined by diversity measures like response diversity
Similarly, agility, data quality, and data lake flexibility could be determined as fundamental qualitative measurements to evaluate data lake performance.
4. Data Governance in Supply Chain
Supply chain management is related to strategies and rules that integrate all upstream and downstream relationships across the chain to generate high levels of value for direct and indirect participants [54]. Recently, environmental responsibility has received increasing attention as an inseparable element of every supply chain, in order to remove or reduce the non-biological products that have a dangerous impact on the environment and the natural cycle. Based on these requirements, several strategies and disciplines, like green supply chain management [39] and environmental supply chain management [55], are defined.
Environmental supply chain management (ESCM) is related to the sustainable strategies that use life cycle assessments (LCA) from raw materials to final customers and the reverse flow of products (recycling or disposal) [55]. The LCA is an instrument based on environmental considerations that monitors and restricts the destructive environmental effects of a product’s entire life cycle in the supply chain with specific standards [54]. Based on such instruments, other assessment codes and procedures, such as PLCA (product life cycle assessment) [58], SLCA (social life cycle assessment) [59], and LCSA (life cycle sustainability assessment) [58], are proposed by different organizations [61
The purpose of all proposed assessment codes is to regulate the whole procedure throughout the supply chain, in order to eliminate or minimize the harmful impacts on the environment. Each one of these standards assesses a specific aspect of the product life cycle, such as social or cost aspects, for instance. The monitoring of a product’s life cycle with such protocols improves the internal performance and productivity of the supply chain and consequently expands ecological and social care with cost-effective products.
Because data in the data lake also has a life cycle, such assessment codes could serve as a basis for data governance rules. On this basis, data assessment is carried out from data collection to data interpretation, and all poor-quality or useless data, which have no value for the data lake or for data mining, are limited or prevented from entering the data lake. From our point of view, by defining specific codes for data life cycle assessment (DLCA), the data lake will be kept clean throughout the life of the data, from birth to death, under strict disciplines.
The International Organization for Standardization (https://www.iso.org) defines ISO 14040 as a “Code of Practice” for life cycle assessment, which includes four major phases in an LCA study [62
The goal and scope definition phase
The inventory analysis phase
The impact assessment phase
The interpretation phase
These phases could be extended to data in the data lake in order to implement data life cycle assessments (DLCA). In the goal and scope definition phase, we should determine which data, with which qualifications, are targeted, in order to address target users, system boundary, data category, and targets for data quality [63]. In the inventory analysis phase, all information about the quality of input and output data is collected and validated under the life cycle assessment study of data. Then, in the impact assessment phase, all information about the effects of various levels of data quality on the data lake is evaluated, based on impact categories and life cycle inventory results. Finally, in the interpretation phase, the impacts of different levels of data quality on the data lake are interpreted with respect to features such as “validity”, “sensitivity” and “consistency” of data, and the final results are concluded or reported. Consequently, this approach ensures data quality over the data’s lifespan with an accurate assessment protocol [64].
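The four phases described above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not an implementation prescribed by ISO 14040 or by this article: the names `Scope`, `inventory`, `impact`, and `interpret`, the record fields, and the use of field completeness as the single quality measure are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    """Phase 1 (goal and scope): which fields are required and what quality target applies."""
    required_fields: tuple
    min_completeness: float  # hypothetical data-quality target between 0 and 1


def inventory(records, scope):
    """Phase 2 (inventory analysis): measure the completeness of each incoming record."""
    scores = []
    for rec in records:
        present = sum(1 for f in scope.required_fields if rec.get(f) not in (None, ""))
        scores.append(present / len(scope.required_fields))
    return scores


def impact(scores, scope):
    """Phase 3 (impact assessment): decide whether each record helps or harms the lake."""
    return ["accept" if s >= scope.min_completeness else "reject" for s in scores]


def interpret(decisions):
    """Phase 4 (interpretation): summarize the assessment in a report."""
    accepted = decisions.count("accept")
    total = len(decisions)
    return {
        "accepted": accepted,
        "rejected": total - accepted,
        "acceptance_rate": accepted / total if total else 0.0,
    }


# Illustrative records; the field names are assumptions for this sketch.
scope = Scope(required_fields=("id", "source", "timestamp"), min_completeness=0.66)
records = [
    {"id": 1, "source": "sensor-a", "timestamp": "2021-01-01T00:00:00"},
    {"id": 2, "source": "", "timestamp": None},
]
scores = inventory(records, scope)
decisions = impact(scores, scope)
report = interpret(decisions)
```

In a real data lake, the quality measure would likely combine several criteria (for instance the validity, sensitivity, and consistency mentioned above), and the reject decision could take the form of quarantine rather than outright exclusion.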
5. Data Governance in Natural Ecosystem
For most biologists, the basic building block is the gene, the unit that contains living information. Richard Dawkins [66] explains that living things are made up of genes that reproduce through envelopes, the organisms, which are simple avatars of genes. One may wonder why there are so many different life forms. We share identical genes with many species (97% homology with great apes, like chimpanzees or gorillas). Certain fundamental genes, for example the one that codes for hemoglobin, are almost identical in very many species. However, ecosystems are very diverse, and they appear to us to be relatively stable structures where information seems to be constantly organized, distributed, and redistributed.
Even before the appearance of the first cells, self-replicating molecules existed, and living things reproduce with a profusion far beyond what the system can sustain. Regulation is therefore carried out by the mechanism of natural selection (only the ablest survive). Natural selection is the constraint on living things. During reproduction, sexuality allows the mixing of genes and introduces a factor of chance (in addition to other phenomena such as mutations). Thus, the two forces that frame living beings are chance and necessity (constraint).
Chance does not produce information. It only produces complexity. Necessity is what produces information [67]. Take moving animals, for example: elephants cross a forest to seek a resource. This action will be repeated over the generations. The first animal makes its way “at random”, the second also, and so on. Soon enough, paths will exist and will be taken by the following elephants, because it is less expensive in terms of energy to follow a path than to create a new one. Then there will be a selection of the most practical paths (to bypass natural obstacles, for example). In the end, there remains a reduced path network that forms an optimum choice for the shortest and least costly path. This network results from the effect of necessity (taking the least costly route). The combined action of chance and necessity conditions not only information as it is observed, but also its evolution [67]. If, once again, we take the paths created by the elephants, we can consider that at any moment chance can steer evolution in a new direction, while necessity will force the new direction to remain functional.
A data lake, as a complex storage system, needs a variety of methods to govern heterogeneous data accurately and in a timely manner. In this article, we have proposed two multidisciplinary approaches to data governance in the data lake, a natural one and a systematic one, and argued that supply chain strategies and natural principles could be effective sources of inspiration for data governance in order to assess the life cycle of data from the moment they enter the data lake until they are destroyed.
First, we provided a comparison table to indicate that the data lake acts as a system and has some aspects similar to the supply chain and ecosystem, both referred to as a complex system. Therefore, we considered data in data lakes, like products in the supply chain or species in nature, to draw similarities and identify proper strategies for data governance.
Then, we proposed two different methods based on systematic methods and natural behaviors to suggest a new perspective of data governance in the big data environment. Our methodology and comparative analysis showed that life cycle assessment codes as a systematic approach and revival of the laws of nature were ideal multidisciplinary approaches to implement sustainable data management with respect to life and death of data.
The proposed methods are derived from different disciplines, and our contribution of comparing and aligning concepts affects all data lake components and processes, from data collection to data exploitation. For these reasons, their concrete application to data lakes cannot be fully examined within a single work. We rather consider that this work opens many research avenues to consider every comparison and every data lake component one by one.
Therefore, with regard to our conclusion, we propose some future case studies for implementing our work in the real world and evaluating the obtained results.
One study will consider, for instance, data lake performance optimization. To this end, we will use supply chain network design strategies to define a mathematical model that maximizes the service level of the data lake, just as supply chain management optimizes profit. A proper strategy, such as the agile strategy, will be chosen, and the objective function(s) that maximize the service, the decision variable(s) such as the amount of satisfied demand, and the constraint(s) such as capacity or budget will be determined accordingly. Choosing a suitable strategy and designing the components of the mathematical model is an important challenge that must be carefully considered.
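To make the shape of such a model concrete, the following is a minimal sketch under simplifying assumptions, not the model the future study will define. It assumes a single budget constraint, a linear unit cost per request class, and a service level defined as satisfied demand divided by total demand. Under these assumptions the problem is a fractional knapsack, so serving the cheapest demand first is optimal. The names `Request` and `maximize_service_level` and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    demand: float     # units of demand for this request class
    unit_cost: float  # assumed cost of satisfying one unit (must be > 0)


def maximize_service_level(requests, budget):
    """Choose satisfied demand x_i (decision variables) to maximize
    sum(x_i) / sum(demand_i), subject to sum(unit_cost_i * x_i) <= budget
    and 0 <= x_i <= demand_i. Greedy by lowest unit cost is optimal here."""
    satisfied = {}
    remaining = budget
    for req in sorted(requests, key=lambda r: r.unit_cost):
        affordable = remaining / req.unit_cost
        x = min(req.demand, affordable)
        satisfied[req.name] = x
        remaining -= x * req.unit_cost
    total_demand = sum(r.demand for r in requests)
    level = sum(satisfied.values()) / total_demand if total_demand else 1.0
    return satisfied, level


# Illustrative figures only.
requests = [Request("batch-queries", 10, 1.0), Request("interactive-queries", 10, 2.0)]
satisfied, level = maximize_service_level(requests, budget=20)
```

With these figures, all ten units of the cheaper class and five units of the more expensive class are served, giving a service level of 0.75. A realistic model would add capacity constraints and possibly integer variables, at which point a linear or mixed-integer programming solver would replace the greedy rule.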
We will implement our proposed framework using real data lake software. We will consider and evaluate several aspects of data collection, data storage, and data processing in data lakes. As mentioned in Section 4, implementing this approach for data life cycle assessment requires four major phases to be determined. In future work, we will develop these four phases in a data lake to provide a practical perspective on data governance. However, it is essential to define qualitative or quantitative measurements that distinguish valuable from destructive data, in order to monitor good- or poor-quality data in the data lake.
Another work, in line with the principal objectives of this article, could be inspired by the lean strategy in the supply chain to minimize the total cost of poor-quality data in data lakes. For this purpose, we aim to define the cost in the objective function, reducing the impact of all data that have no value for the data lake or that increase the risk of a data swamp. The decision variable(s) and constraint(s) for this mathematical model will be determined accordingly.
Finally, based on our analogy table, we will use biological models to recommend and manage relevant and promising data localization in the file systems, data crossings, etc., as DNA and biological materials do in nature.