Article

Automating Data Product Discovery with Large Language Models and Metadata Reasoning

Department of Electrical Engineering, Computer Engineering and Informatics, Faculty of Engineering and Technology, Cyprus University of Technology, Limassol 3036, Cyprus
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(3), 72; https://doi.org/10.3390/bdcc10030072
Submission received: 2 January 2026 / Revised: 23 February 2026 / Accepted: 25 February 2026 / Published: 28 February 2026

Abstract

The exponential growth of data over the past decade has created new challenges in transforming raw information into actionable knowledge, particularly through the development of data products. A data product is essentially the result of querying and retrieving specific portions of data from a data storage architecture at various levels of granularity. Traditionally, this transformation depends on domain experts manually analyzing datasets and providing feedback to describe or annotate data in ways that facilitate retrieval. This is, however, a very time-consuming process, highlighting the need for automation. To address this challenge, the present paper proposes a framework that utilizes Large Language Models to support data product discovery through semantic metadata reasoning and executable query prototyping. The framework is evaluated across two domains and three levels of concept complexity to assess the LLM’s ability to identify relevant datasets and generate executable data product queries under varying analytical demands. The findings indicate that LLMs perform effectively in simpler scenarios, but their performance declines as conceptual complexity and dataset volume increase.

1. Introduction

Recent years have witnessed rapid technological advancements which have created significant challenges in how information is stored, managed, and analyzed [1]. This phenomenon, commonly referred to as Big Data, is defined by its increasing volume, variety, and velocity, as well as other recently introduced characteristics, such as value and veracity [2]. As organizations collect ever more complex and diverse datasets, traditional data management systems often prove insufficient to handle the scale and complexity of modern data environments [3]. In response, researchers and industry practitioners have explored innovative solutions to efficiently organize and manage this data.
One of the most promising solutions is the data lake, a flexible and scalable storage architecture capable of accommodating vast amounts of structured, semi-structured, and unstructured data from heterogeneous sources [4]. However, despite their versatility, data lakes face persistent issues related to data organization. Without proper management, they risk devolving into so-called data swamps, where stored data becomes difficult to discover, retrieve, interpret, and use effectively. To mitigate this, research in this area has focused on metadata enrichment, the process of adding descriptive information to datasets to transform raw data into discoverable and understandable assets [5]. By enriching semantic descriptions through metadata, organizations can enhance data discoverability, retrieval efficiency, and overall usability, while preventing data lakes from losing their intended value. Several frameworks have emerged to support metadata enrichment, including Apache Jena, which enables semantic metadata management and querying through RDF and SPARQL [6]. However, as data generation continues to accelerate, the demand for scalable and efficient metadata management solutions is increasing, leading to the question of what the most effective system for implementing metadata enrichment in modern data lakes might be [7].
In parallel, data lakes face another key challenge: transforming organized data into valuable knowledge. Traditional centralized data lakes often struggle with slow data delivery and poor adaptability, contributing to the risk of data swamp formation [8]. The data mesh paradigm addresses this issue by decentralizing ownership and offering domain teams control over their data products. While this organizational approach empowers subject matter experts, it also introduces a bottleneck: the reliance on human expertise for manual data product creation, a process that is time-consuming and dependent on scarce specialist skills.
This paper proposes an approach for supporting data product discovery via a framework that leverages Large Language Models (LLMs). The model utilizes semantic reasoning over metadata and dataset samples to suggest and generate prototype data products. The objective is not to replace domain experts, but to assist them by reducing manual dataset inspection. This research evaluates the capability of LLMs to identify relevant datasets and generate executable data product queries in terms of accuracy, relevance, and adaptability under varying levels of dataset heterogeneity and conceptual complexity.
Accordingly, the research is guided by two research questions: (i) Are LLMs able to effectively automate aspects of data product suggestion and creation within a data lake environment? (ii) How robust is LLM-driven data product discovery under increasing dataset heterogeneity and conceptual complexity? By addressing these questions, this paper presents a proof-of-concept framework using LLMs to support data product discovery and prototyping, accompanied by an evaluation of their performance, and a discussion on limitations and potential benefits. Through this contribution, the present work advances the understanding of both technical and practical approaches to transforming data within modern data lakes, ultimately bridging the gap between raw data storage and actionable knowledge creation.
In this paper, data product discovery refers to identifying which combinations of available datasets and transformations are relevant for a given analytical intent, while data product creation refers to the actual materialization of the selected data product through executable queries. Discovery is often the more challenging step, as it requires semantic understanding of metadata, data content, and user intent. The proposed framework primarily targets the partial automation of data product discovery, while also supporting data product creation once a suitable candidate has been identified. Accordingly, the terms are related but not interchangeable, and their distinction explains the use of both concepts throughout the paper. Traditionally, data product discovery and creation rely on domain experts who manually inspect metadata, explore datasets, design queries, and iteratively refine results through trial and error. This manual process is time-consuming, difficult to scale, and highly dependent on expert availability. In this context, automation refers to reducing the need for manual dataset inspection and query formulation by leveraging LLM-based semantic reasoning over metadata and sample data, while preserving expert oversight for validation and governance.
The remainder of this paper is structured as follows: Section 2 provides the technical background, introducing the core concepts used throughout this paper. Section 3 presents the literature review, identifying research gaps related to the automation of data product suggestion. Section 4 describes the experimental design and methodology, along with a demonstration of the LLM-based system. It details the design of the proof-of-concept framework for automated data product generation using LLMs. Section 5 presents and analyzes the experimental results, discussing key findings, system performance factors, and the effectiveness of the proposed approaches, including results of the LLM-based data product generation across different complexities and dataset domains. Finally, Section 6 summarizes the key findings and provides recommendations for future work.

2. Technical and Scientific Background

The exponential growth of data in volume, variety, and velocity has established data lakes as a dominant architecture for storing heterogeneous datasets in their native form [9]. Unlike data warehouses, which require schema-on-write, data lakes apply schema-on-read, allowing ingestion from diverse sources, such as relational databases, IoT devices, and streaming platforms [3]. Built on scalable distributed file systems, such as the Hadoop Distributed File System (HDFS) [10], and integrated with distributed processing frameworks like Apache Spark—Version 3.5.0 [11], they offer flexibility for large-scale analytics. However, without robust governance, data lakes can degrade into data swamps, where unorganized data becomes difficult to discover and use effectively.
Metadata enrichment addresses this problem by adding structured descriptive attributes, both static (e.g., source name, type) and dynamic (e.g., volume, timestamps), to raw data, improving discoverability and enabling semantic querying. Andreou and Pingos [12] introduced a semantic enrichment mechanism based on the “5Vs” of Big Data (Volume, Velocity, Variety, Veracity, and Value), supported by semantic blueprints, to classify and describe data sources prior to ingestion. This mechanism represents metadata in the Resource Description Framework (RDF), enabling advanced search and retrieval through SPARQL queries.
Apache Hadoop provides the foundational infrastructure for many data lakes, offering distributed storage via HDFS [13], large-scale computation through MapReduce [14], and resource coordination with YARN [15]. On the other hand, Apache Spark enhances this ecosystem by overcoming MapReduce’s disk I/O limitations with in-memory computation, Resilient Distributed Datasets (RDDs), and Directed Acyclic Graph (DAG) execution for optimized, fault-tolerant workflows [11,16]. On top of Hadoop, Apache Hive [17,18] simplifies Big Data analytics by translating SQL-like HiveQL queries into distributed execution plans for engines such as Tez or MapReduce [18]. Hive maintains a relational metastore for schema management and uses the Optimized Row Columnar (ORC) format to improve storage efficiency and query performance. Apache Jena offers a semantic web framework for RDF storage, SPARQL querying, ontology modelling via the Web Ontology Language (OWL), and reasoning capabilities to infer new facts from existing relationships. While effective for semantic metadata management, Jena’s RDF graph store can incur substantial indexing overhead at large scale, impacting performance.
A second key challenge faced by data lakes arises as companies grow and generate more data: it becomes difficult for a central team to handle the growing complexity and meet evolving business needs [10]. To address this issue, data mesh was introduced as a modern data architecture that promotes a decentralized, domain-driven approach to data management. This means that data mesh shifts data ownership from a single central team to multiple teams across the organization, with each team being responsible for managing its own data. By distributing data responsibilities across domain teams, data mesh enables organizations to scale their data capabilities more efficiently. The core concept in data meshes is the creation of data products, which represent high-quality datasets designed to meet specific business needs [19]. This concept arises from creating well-defined data products that can be easily discovered and accessed, rather than simply storing raw or unprocessed data, which is often difficult to find and use effectively. The idea is therefore domain-driven but relies heavily on domain experts, who have deep knowledge of their specific business area and play a critical role in transforming raw data into meaningful data products. Specifically, in a data mesh architecture, a data product is a reusable dataset that is owned and managed by a domain team to serve specific business or analytical use cases. Each data product includes not only the data itself, but also standardized metadata, quality guarantees, documentation, and access interfaces to ensure interoperability and self-service consumption. By treating data as a product within data mesh, organizations enable decentralized teams to scale data usage while maintaining consistency, reliability, and governance across the enterprise.
A clearer understanding of the role of experts in guiding data product creation can be achieved by examining how a previously proposed framework transforms data lake metadata into data mesh data products [20]. The workflow begins with domain experts identifying the relevant data sources within their area of responsibility, as these sources form the basis of the semantic metadata captured in the Turtle (.ttl) descriptions. Once selected, each source is enriched through the metadata mechanism, which semantically annotates it using RDF triples to create a consistent and machine-readable representation. Building on this enriched metadata, domain experts then create Semantic Data Blueprints, a standardized descriptive model that organizes source characteristics into a reusable structure, to define the attributes that shape each data product and its granularity level. Finally, the resulting data products are generated and published within the data mesh, where they can be efficiently accessed and queried through semantic technologies such as SPARQL and Apache Jena, enabling advanced retrieval based on the blueprint-defined criteria. The automation of the process described above is the target of this paper, which investigates LLMs as the means to achieve it.
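To make the blueprint idea concrete, the following is a small, purely illustrative Turtle fragment of the kind such .ttl descriptions might contain. The prefix, property names, and values are invented for illustration; they are not taken from the actual framework.

```turtle
# Hypothetical Semantic Data Blueprint fragment (all names illustrative).
@prefix sdb: <http://example.org/sdb#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

sdb:poultry_sales_2025 a sdb:DataSource ;
    sdb:sourceName   "poultry_sales_2025" ;            # static blueprint attribute
    sdb:sourceType   "structured/CSV" ;                # static blueprint attribute
    sdb:volumeMB     "412.7"^^xsd:decimal ;            # dynamic blueprint attribute
    sdb:lastIngested "2025-11-03T08:15:00Z"^^xsd:dateTime .
```

A SPARQL query over such triples could then retrieve, for instance, all structured sources above a given volume, which is the kind of blueprint-based retrieval the workflow above relies on.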
LLMs are AI systems that enable machines to understand and generate human language. These capabilities are built on transformers, a neural network architecture designed to handle sequential data such as text, though it can also process other data types, such as images and audio [21]. To understand how LLMs work, it is essential to understand the transformer architecture. Transformers use an encoder-decoder design, in which the encoder processes the input text and the decoder generates the output. The feature that makes transformers powerful is the attention mechanism, which allows the model to attend to all parts of a sentence simultaneously rather than processing words one by one. However, their effectiveness depends on the quality of the data they are trained on, as well as the level of fine-tuning for specific tasks. The training process of LLMs involves two stages: pre-training and fine-tuning. During pre-training, which takes place first, the model is trained on large general-purpose datasets with the goal of learning broad linguistic patterns. During fine-tuning, the model is trained on more specific, task-relevant datasets to further improve performance on the problem at hand; human feedback is often also involved in refining the model’s responses and ensuring high accuracy [22].
Taking all this into account, LLMs are able to enhance the management of Big Data in data lakes where raw data is stored by automatically generating metadata during the ingestion process. By doing so, the organization and retrieval of the data become easier, thus minimizing or even eliminating the risk of the lake becoming a data swamp. Another advantage of LLMs is their ability to process data using natural language queries. Instead of relying on SQL queries, users can pose specific questions in plain, natural language and retrieve the information they need.

3. Related Work

This section reviews existing research related to this work, focusing specifically on metadata enrichment in data lakes and LLM-assisted knowledge extraction. The aim is to provide an overview of the current state of research by highlighting existing solutions and methodologies and establishing the context to which this paper contributes. This paper extends previous work [12,19], which is briefly reviewed in this section along with other related studies, by developing a proof-of-concept framework that uses LLMs to support the suggestion and prototyping of data products, accompanied by an evaluation of their performance and potential benefits. Furthermore, this section reviews related work in light of the definitions of data products, data product discovery, and automation introduced in Section 1. In particular, it examines how existing approaches support metadata enrichment and manual data product creation and highlights the absence of automated mechanisms for discovering and generating data products based on semantic reasoning.
Metadata enrichment has been widely studied as a means to improve the discoverability, organization, and usability of data in data lakes. A notable contribution is offered in [12], whose authors proposed a semantic enrichment mechanism that embeds metadata into the data lake structure using the 5V characteristics of Big Data, that is, Volume, Velocity, Variety, Veracity, and Value, to describe and categorize data sources prior to ingestion. To achieve this, the authors introduced the concept of the blueprint ontology, which consists of a static and a dynamic blueprint. The static blueprint defines attributes that are constant, such as the data source name and type. Dynamic blueprints, by contrast, maintain attributes that change over time, such as data volume and timestamps, and are automatically updated when new data arrives or metadata changes. Furthermore, the suggested framework utilizes the Resource Description Framework (RDF) to model the metadata in a machine-readable format, enabling advanced querying through SPARQL. Additionally, the authors proposed the pond architecture to further organize the data lake into structured, semi-structured, and unstructured parts (ponds). This architecture groups similar data types together and therefore not only simplifies storage and retrieval, but also improves query performance.
Extending the abovementioned work, the authors in [22] adapted their semantic data blueprints framework to the emerging data mesh paradigm, implementing metadata management and querying using Apache Jena. This real-world evaluation demonstrated that metadata enrichment supports the creation of high-quality data products in decentralized environments. In addition, the authors introduced a standardized framework for transforming metadata into data domains and products within a data mesh using the RDF Turtle format. Their framework organizes metadata sources into a hierarchical model of pillar domains and subdomains, supporting up to six levels of granularity. In evaluations using real-world case studies involving up to 100,000 semantically described data sources, they reported significant query execution speed-ups of up to 26.5 times over direct data lake querying.
Unlike previous approaches that focus mainly on data discovery and organization, the framework presented in [20] emphasizes data ownership and access control, using blockchain technology. The data is stored in the data lake, which follows the same idea as discussed earlier, using semantic data blueprints. The framework sends a SPARQL query to request access to a specific data product, and the system uses the semantic data blueprints to find matching data sources. If the query matches relevant sources and the data owner approves the request, the data product is generated based on the selected sources. After that, the system creates an NFT on the blockchain, and users can access data products through a token-gated web portal that checks whether the user’s wallet holds the NFT, the NFT has not expired, and the access level allows viewing or transferring. Overall, this framework addresses a significant gap in data meshes by offering an automated way to manage data access and ownership without relying on human intervention.
While metadata enrichment offers a powerful semantic solution to prevent data lakes from turning into data swamps, other researchers have proposed structural solutions from an architectural perspective. In particular, the survey by Azzabi Li et al. [2] introduces an approach known as the Zone-based architecture, which divides the data lake into distinct zones, each serving a specific role in data processing and management. As presented, the process begins with incoming data passing through the transient landing zone, where basic compliance and business rule validations are applied to ensure that only acceptable data enters the lake. The data is then moved to the raw zone, where it is stored permanently in its original format. From there, the data that meets quality and regulatory requirements is promoted to the trusted zone, where only validated and cleansed datasets are retained. Furthermore, the refined zone holds data originating from the trusted zone that undergoes further transformations such as aggregation and filtering. The purpose of this zone is to store business-ready datasets that are tailored to support specific analytical needs. Finally, the sandbox zone is designed to provide a flexible and isolated environment for data scientists to conduct exploratory analysis by allowing temporary access to data without affecting production workflows. The above architecture helps minimize the risk of a data lake turning into a data swamp due to its structured and layered design.
The zone-based architecture described above offers a well-structured approach to managing data. However, Sawadogo and Darmont [23] observe that the application of such architectures across the literature is sometimes inconsistent, meaning that the functions and definitions of zones vary between models. For example, some implementations of this architecture remove raw data after processing, while others retain it permanently. The authors introduce a more systematic classification of data lake architectures based on two dimensions. The first dimension, functionality-based architectures, usually covers basic functions, such as data ingestion to connect with data sources, data storage to persist raw as well as refined data, data processing, and data access to allow raw and refined data querying. The second dimension, data maturity-based architectures, classifies zones according to the level of data refinement. These zones typically range from raw data zones to curated or consumption zones. The authors suggest a hybrid architecture that combines both functionality-based and data maturity-based perspectives. In addition to their suggested architecture, the authors also emphasize the importance of supporting metadata systems with six specific features in order for a data lake to be considered comprehensive. They evaluated several existing systems and found that MEDAL, their proposed metadata model, was the only one that supported all six features. However, MEDAL is yet to be implemented in practice and remains a conceptual model.
Recent work has explored the integration of LLM reasoning with algorithmic search procedures to identify meta-structures in complex graph data, demonstrating the potential of LLMs as explainable reasoning agents over structured representations [24]. While such approaches focus on structural discovery within graph-based domains, our work applies LLM-based semantic reasoning to data lake and data mesh environments. Instead of reasoning over graph topology, the proposed framework reasons over metadata, schemas, and representative data samples to identify relevant datasets and generate executable data products. In this sense, our contribution complements existing research on LLM-augmented structural discovery by extending these ideas to the domain of data management and data product automation.
Overall, the existing literature provides strong foundations for metadata enrichment, semantic querying, and data product management in data lake and data mesh environments. However, current approaches either focus on improving data organization or assume the availability of domain experts for data product discovery and creation. To the best of our knowledge, no prior work has evaluated the use of LLMs to support data product discovery through semantic reasoning over metadata and representative data samples. This gap motivates the present work, which proposes and empirically evaluates an LLM-assisted framework for data product discovery and query prototyping under varying levels of dataset heterogeneity and conceptual complexity.

4. Methodology

Addressing the scalability and complexity challenges that domain experts face, this section proposes an experimental proof of concept to assess the feasibility of using an LLM as a semantic reasoning assistant for suggesting and prototyping data products within a data lake environment. Domain experts possess specialized knowledge that enables them to analyze and interpret data and guide the production of meaningful insights, in our case referred to as data products. Given the rapid developments in LLMs over the past four years and their broad knowledge across a wide range of domains, this study explores their potential to support semantic reasoning and reduce the manual workload associated with data product discovery.
The proposed approach builds on the work of [12,22] described earlier, and most specifically on the notions of semantic metadata and blueprints. The overall workflow of the proposed system is illustrated in Figure 1. Real-world data is used in this approach to support experimentation and evaluation. This data comes from two sources: (a) Paradisiotis Group, a prominent industrial player in Cyprus specializing in poultry farming and meat production, and (b) Europeana, a digital heritage repository that provides public access to cultural artifacts from museums and archives across Europe.
To clarify the practical role of LLMs within the data lake and data mesh ecosystems, this work utilizes LLMs as a semantic reasoning and orchestration layer. Specifically, LLMs operate on top of structured metadata retrieved from Apache Hive and semantically enriched descriptions expressed in RDF and managed through Apache Jena. By reasoning over this metadata together with representative data samples retrieved via Apache Spark, the LLM can interpret user intent, identify relevant datasets, and generate executable Spark SQL queries that materialize data products, as presented in Figure 1.
LLMs lack explicit task awareness; therefore, the task intent must be encoded in the prompt, either implicitly or explicitly, through instructions, examples, or contextual definitions. This study considered it important to include a generic concept that reflects the user’s intent. Alongside the user concept, the system constructs a dataset context that includes structured metadata and sample data records. This context helps the LLM reason over what type of information exists in the data lake and how it might relate to the user’s concept. As regards metadata retrieval, the system uses Apache Hive as a centralized metastore for all datasets. This metadata is stored in a centralized Hive table, which contains schema-level information about each data source, including attributes such as dataset identifiers, file paths, data volume, accuracy, and other domain-specific fields. The metadata is retrieved by executing a simple SQL query in the form “SELECT * FROM …”. The structure of this metadata table is shown in Table 1.
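The metadata-to-context step can be sketched as follows. In the deployed system the records would come from the Hive metastore via Spark SQL; here, hypothetical records are used, and the field names (`dataset_id`, `source_path`, `volume_mb`) are illustrative assumptions rather than the actual table schema.

```python
# Sketch of the metadata-retrieval step. In the real system the rows would
# come from Hive, e.g. rows = spark.sql("SELECT * FROM <metadata table>").
# Field names below are illustrative assumptions, not the paper's schema.

def format_metadata_context(records: list[dict]) -> str:
    """Render metadata rows as a compact text block for the LLM prompt."""
    lines = []
    for r in records:
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(r.items()))
        lines.append(f"- {attrs}")
    return "Available datasets:\n" + "\n".join(lines)

# Hypothetical metadata records standing in for Hive query results.
example = [
    {"dataset_id": "sales_2025", "source_path": "/lake/sales", "volume_mb": 412},
    {"dataset_id": "artifacts", "source_path": "/lake/europeana", "volume_mb": 87},
]
context = format_metadata_context(example)
```

The resulting text block becomes one component of the prompt described below, keeping the schema-level view compact enough to fit within the model’s context window.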
Following this, the system proceeds to retrieve a small sample of rows from each dataset stored in HDFS to provide the LLM with a clearer understanding of the actual data content, as metadata alone does not fully capture the structure or semantics of the data. To achieve this, the system uses Apache Spark, an environment that supports a variety of formats, such as CSV and JSON. A module scans each directory path listed in the metadata table. Using Spark SQL, the system loads the file and randomly samples a small number of records to avoid introducing bias that could result from selecting only the first rows. Sampling is important due to the token and context length limits of LLMs, which cannot process complete datasets.
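Why random sampling matters can be illustrated with a minimal, self-contained sketch: taking only the first k rows of a file biases the sample toward whatever happens to be stored first, whereas reservoir sampling yields a uniform fixed-size sample from a stream of unknown length. In practice the framework relies on Spark’s built-in sampling; the function below is only a standalone illustration of the principle.

```python
import random

def reservoir_sample(rows, k: int, seed: int = 42) -> list:
    """Pick k uniformly random rows from a stream of unknown length,
    avoiding the bias of simply taking the first k records."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability
            if j < k:
                sample[j] = row
    return sample

# Illustrative stream of records; in the framework these would be dataset rows.
rows = ({"id": i, "value": i * i} for i in range(10_000))
sample = reservoir_sample(rows, k=5)
```

Whatever sampling mechanism is used, the key constraint is the same one the paper notes: the sample must stay small enough to respect the LLM’s token and context length limits.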
After receiving the user’s concept, along with the metadata and sample data from each dataset, the system composes the prompt and sends it to the LLM. The prompt consists of the user concept, the metadata for each dataset, a small sample from each dataset, and strict formatting instructions for the required output format. The proposed framework integrates the OpenAI API, and, specifically, it employs the GPT-4.1 (web-search-preview) model, which is capable of processing complex prompts and supports real-time web search.
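The prompt-assembly step can be sketched as a pure function that combines the four components listed above. The exact wording of the instructions and the field list are assumptions for illustration; the resulting string would then be submitted to the model through the OpenAI API.

```python
import json

def build_prompt(concept: str, metadata: list[dict], samples: dict) -> str:
    """Combine user concept, metadata, samples, and output-format rules
    into a single prompt string (instruction wording is illustrative)."""
    return "\n\n".join([
        f"User concept: {concept}",
        "Dataset metadata:\n" + json.dumps(metadata, indent=2),
        "Sample records per dataset:\n" + json.dumps(samples, indent=2),
        ("Return ONLY a JSON array of data products, each with the fields: "
         "title, description, reasoning, sql_query, annotation, citations."),
    ])

# Hypothetical inputs; real values come from Hive metadata and Spark samples.
prompt = build_prompt(
    "seasonal demand for poultry products",
    [{"dataset_id": "sales_2025", "source_path": "/lake/sales"}],
    {"sales_2025": [{"week": 1, "units": 1300}]},
)
```

Keeping the format instructions strict, as the paper describes, is what makes the response machine-parseable in the subsequent execution step.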
Generating the suggested data products requires the model to perform a semantic analysis to understand the intent behind the concept. It then reasons like a domain expert by evaluating what types of data products would be relevant given the user’s goal and the available data structure. In addition, the LLM performs a background web search using keywords derived from the prompt, and the retrieved sources are used to generate annotations and citations for additional justification. Table 2 shows the structure of a suggested data product. The valid suggestions are received in JSON format and saved to HDFS.
Finally, to turn the LLM’s suggestions into actual data products, the system parses the JSON and executes the associated SQL queries using Apache Spark. Before executing the queries, however, it identifies all the HDFS file paths referenced in each query. If a query returns results, the resulting data product, in other words, the actual output of the query, is saved to HDFS in a structured JSON format. After each successful execution, the system maintains a registry that stores metadata about each output, including the title, description, SQL query, associated concept, output path in HDFS, and generation timestamp. This registry is implemented using a lightweight SQLite database to maintain a fast and easily manageable metadata catalog of the generated data products. The overall workflow described above is summarized in Algorithm 1, which provides a step-by-step pseudocode representation of the LLM-based data product suggestion system.
Algorithm 1 LLM-Based Data Product Suggestion Workflow
Input: User concept C, dataset pool D
Output: Suggested data products P
metadataCollection ← ∅
sampleCollection ← ∅
metadataCollection ← fetchMetadata(D)
foreach record r in metadataCollection do
    samples ← getSampleFromHDFS(r.source_path)
    append samples to sampleCollection
context ← constructContext(metadataCollection, sampleCollection)
prompt ← formatPrompt(C, context, constraints)
response ← callLLM(prompt, webSearch = true)
products ← parseJSONResponse(response)
P ← ∅
foreach product p in products do
    if isSQLExecutable(p.query) then
        result ← executeSQL(p.query, Spark)
        if result ≠ ∅ then
            saveToHDFS(result, p)
            registerProduct(p, C)
            add p to P
P ← rankProducts(P)
return P
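The final registration step of Algorithm 1 (registerProduct) can be sketched with an in-memory SQLite catalog. Table and column names are illustrative assumptions, and the Spark execution and HDFS write are replaced by a placeholder output path.

```python
import json
import sqlite3
from datetime import datetime, timezone

def register_products(conn, llm_response: str, concept: str) -> int:
    """Parse the LLM's JSON response and record each product in the catalog.
    Schema is an illustrative assumption, not the paper's actual registry."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data_products (
               title TEXT, description TEXT, sql_query TEXT,
               concept TEXT, output_path TEXT, created_at TEXT)"""
    )
    products = json.loads(llm_response)
    for p in products:
        conn.execute(
            "INSERT INTO data_products VALUES (?, ?, ?, ?, ?, ?)",
            (p["title"], p["description"], p["sql_query"], concept,
             f"/lake/products/{p['title'].replace(' ', '_').lower()}.json",
             datetime.now(timezone.utc).isoformat()),
        )
    conn.commit()
    return len(products)

conn = sqlite3.connect(":memory:")
# Hypothetical LLM response containing one suggested data product.
response = json.dumps([{
    "title": "Weekly Sales Summary",
    "description": "Units sold per week",
    "sql_query": "SELECT week, SUM(units) FROM sales GROUP BY week",
}])
n = register_products(conn, response, "seasonal demand")
```

Using parameterized inserts, as above, keeps the catalog robust against quoting issues in LLM-generated titles and queries.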
To illustrate how the proposed methodology works in practice, a step-by-step demonstration example using data from Europeana is presented, along with a visual representation. Figure 2 shows the initial system interface, where users can enter a concept in natural language reflecting their analytical intent. In this example, the input represents a low-complexity concept involving 24 available datasets. Once the user submits a data request, the system starts processing it using the LLM. This step involves parsing the user’s concept, along with the available metadata and sample data, in order to generate relevant data product suggestions. After the LLM processes the user’s request, the system displays a list of suggested data products. Each data product includes a title, a description, the reasoning, a SQL query, annotation, and relevant citations. An example of the suggested data products is shown in Figure 3, where users can also download each data product in JSON format.

5. Experimental Evaluation

Given the objectives of this study, a systematic evaluation was conducted to examine the effectiveness of the proposed LLM-based approach and determine whether an LLM can support data product discovery and prototyping. Specifically, the experimental design focuses on how domain characteristics, dataset availability, and user concept complexity shape the LLM’s reasoning and data product generation ability. The following subsections describe the experimental process and results in detail.

5.1. Design of Experiments

As already mentioned in Section 4, Europeana and Paradisiotis Group are used as the domains of the evaluation experiments. These domains were intentionally selected to examine the LLM’s abilities in a realistic data lake environment containing semi-structured and structured data, respectively, and thus to provide a more accurate assessment of the model’s performance under real-world conditions. The second critical aspect of the evaluation is the user concept, as it represents a real-world analytical question that a domain expert would typically address. Designing appropriate user concepts for each domain is essential for evaluating the LLM’s semantic reasoning capabilities under varying levels of cognitive demand. Therefore, three escalating levels of complexity were designed in the experiments: low, medium, and high. At the lowest complexity level, the objective is to evaluate the LLM’s ability to identify and retrieve relevant datasets with minimal reasoning; that is, the focus lies on simple filtering and selection operations. At the medium complexity level, the LLM is prompted not only to reason more, but also to combine information from multiple datasets and apply more complex logic in the SQL queries, such as WHERE clauses and JOIN operations. Finally, at the highest level of complexity, the user concept requires more advanced analytical reasoning across multiple datasets, which can trigger more complex conditional logic. Additionally, the high-complexity level can trigger the LLM to perform external web search through a more open-ended phrase (e.g., “based on current trends”) that can enrich its reasoning and thereby allow for a better understanding of its capabilities. The diversity of data sources included in the experimental process does not affect the validity of the results, as the proposed approach is not tied to a specific data format. On the contrary, it demonstrates the generalizability of the approach in handling data from different application domains with diverse formats.
For each domain, the experiments were conducted across three data pool sizes: 6, 12, and 24 datasets. Data pool size refers to the number of distinct datasets, each with a unique schema and corresponding metadata entries, made available to the LLM during a single experimental run. Each increase in pool size introduces more heterogeneity and potential noise into the semantic search space, since the number of schemas, metadata records, and sampled rows grows accordingly. Additionally, all three complexity levels were applied across all dataset pool sizes for each domain, with each domain having its own distinct set of concepts tailored to its context. This design is intended to examine (i) the scalability of the LLM across different pool sizes when reasoning over larger and more diverse data sources, and (ii) how both the volume of available datasets and the complexity of user-defined concepts affect the LLM’s reasoning capabilities and its ability to generate meaningful data products.
Throughout the experiments, the GPT-4.1 model was used, which supports a 30,000-token limit for Tier 1 users. This limitation introduced an additional challenge, as it restricted the amount of metadata and sample data that could be provided in a single prompt. As a result, 24 datasets were established as the upper bound, not only to allow the model to receive the contextual information without exceeding its token limit, but also to investigate the LLM’s ability to reason over a larger number of datasets. This upper bound was determined using the LLM’s tokenizer tool.
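The prompt-budget check behind this bound can be sketched as follows. The actual limit was verified with the provider’s tokenizer tool; the common four-characters-per-token heuristic used here is only an approximation for illustration, and all names and sizes are assumptions.

```python
TOKEN_LIMIT = 30_000  # Tier 1 token budget reported above

def approx_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token); a stand-in for the
    provider's tokenizer tool used in the study."""
    return max(1, len(text) // 4)

def fits_budget(metadata_blocks, sample_blocks, instructions, limit=TOKEN_LIMIT):
    """Check whether metadata, samples, and instructions fit in one prompt."""
    total = approx_tokens(instructions)
    total += sum(approx_tokens(b) for b in metadata_blocks)
    total += sum(approx_tokens(b) for b in sample_blocks)
    return total <= limit

# e.g., 24 datasets, each contributing ~1000 characters of metadata and samples
metadata = ["m" * 1000] * 24
samples = ["s" * 1000] * 24
ok = fits_budget(metadata, samples, "Suggest data products for the concept.")
```

Under this heuristic, 24 datasets of this size stay comfortably inside the budget, which is consistent with 24 being a workable upper bound.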
Having established the main evaluation conditions, the next step was to define a baseline for measuring the LLM’s performance. Both the correctness and relevance of the LLM’s suggestions were evaluated against a ground truth: the expected outputs that a domain expert would produce when given the same task as the LLM. Each ground-truth entry follows the structure of the LLM’s response, as shown earlier in Algorithm 1. A significant challenge in such an experiment is creating the reference data products that serve as the ground-truth baseline. Ideally, a domain expert would create the ground-truth datasets for each domain; however, this poses a major constraint, as domain experts are difficult to find and the process is time-consuming, requiring substantial effort and dedication. Instead, domain analysis was performed systematically by inspecting datasets, analyzing metadata, and consulting publicly available domain documentation. Given the practical constraints of the experimental setup, an LLM was additionally used as an auxiliary drafting tool to propose candidate data products. It is important to note that the LLM did not act autonomously, and its outputs were not accepted verbatim. Instead, each data product was reviewed, edited, extended, or rejected based on domain relevance and the available datasets. In several cases, data products were manually authored without any LLM involvement.
Evaluating the performance of the proposed LLM in generating data products required a combination of quantitative and qualitative metrics to provide a comprehensive assessment. To operationalize these evaluation goals, a set of complementary metrics was defined to capture different aspects of the LLM’s performance, as follows: The first metric was reasoning similarity, which measures how similar the LLM’s reasoning explanations are to those in the ground truth. The following logic was used to measure this metric: to decide which suggested data product best matches a data product from the ground truth, the process first grouped all products sharing the same FROM clause in their SQL queries. Then, for each data product in that group, it calculated a similarity score using sentence embeddings generated by the all-MiniLM-L6-v2 model from Sentence Transformers. The data product with the highest similarity score was considered a match; however, if the highest score fell below the threshold value of 0.6, no match was recorded.
One may argue that using sentence-embedding similarity with a fixed threshold to decide product matches is a pragmatic choice, but that it risks rewarding textual overlap rather than substantive equivalence. However, in our evaluation framework, embedding similarity is intended not to capture textual resemblance but to assess semantic alignment between the generated and reference data products. A partial schema-level criterion is implemented by comparing only products that refer to the same underlying dataset, grouping candidates according to the FROM clause. We chose not to enforce additional matching criteria, such as strict SQL structural equivalence, because the objective of this study was to evaluate whether the LLM correctly identifies the relevant dataset and captures the overall intent of the data product, rather than whether it reproduces the exact syntactic formulation of the ground-truth query. Enforcing a more rigid matching scheme could therefore penalize semantically valid but structurally alternative implementations.
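The grouping-and-matching procedure described above can be sketched as follows. For self-containment, a toy bag-of-words cosine stands in for the all-MiniLM-L6-v2 sentence embeddings used in the study, and the FROM-clause extraction is deliberately simplified; the function names and sample products are illustrative assumptions.

```python
import re
from math import sqrt

THRESHOLD = 0.6  # similarity cut-off used in the evaluation

def from_table(query: str) -> str:
    """Extract the FROM table name (simplified; the system groups by FROM clause)."""
    m = re.search(r"\bFROM\s+([\w.]+)", query, re.IGNORECASE)
    return m.group(1).lower() if m else ""

def bow_cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine; the study uses all-MiniLM-L6-v2 embeddings."""
    ta, tb = a.lower().split(), b.lower().split()
    vocab = set(ta) | set(tb)
    va = [ta.count(w) for w in vocab]
    vb = [tb.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    na, nb = sqrt(sum(x * x for x in va)), sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

def best_match(gt_product, candidates, sim=bow_cosine, threshold=THRESHOLD):
    """Among candidates sharing the ground truth's FROM table, pick the one with
    the most similar reasoning; return None if the best score is below threshold."""
    table = from_table(gt_product["query"])
    group = [c for c in candidates if from_table(c["query"]) == table]
    scored = [(sim(gt_product["reasoning"], c["reasoning"]), c) for c in group]
    if not scored:
        return None
    score, cand = max(scored, key=lambda t: t[0])
    return cand if score >= threshold else None

gt = {"query": "SELECT title FROM artworks",
      "reasoning": "count artworks per century in the collection"}
candidates = [
    {"query": "SELECT title, year FROM artworks",
     "reasoning": "count artworks per century in the collection"},
    {"query": "SELECT name FROM artists",
     "reasoning": "count artworks per century in the collection"},
]
match = best_match(gt, candidates)  # only the first shares the FROM table
```

Swapping `bow_cosine` for a real sentence-embedding similarity leaves the grouping and thresholding logic unchanged, which is the point of the sketch.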
SQL structural accuracy was also used to evaluate whether the LLM-generated SQL queries matched the corresponding logic of the ground-truth queries. The comparison did not rely on simple string matching but focused on semantic similarity, ensuring that the underlying logic of the queries was aligned. It required first parsing each SQL statement and breaking it down into its logical clauses (e.g., SELECT, FROM, WHERE, JOIN). Each clause was assigned a specific weight and compared individually using the sqlglot parser, which produces a parsed syntax tree; each clause then contributed to the per-query accuracy according to its weight. The final accuracy was computed by averaging the per-query SQL structural accuracy scores across all queries.
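A simplified sketch of the weighted clause comparison follows. The study parses queries into syntax trees with sqlglot; here a naive regex-based clause split stands in, and the clause weights are assumed values, since the exact weighting is not reported.

```python
import re

# Illustrative clause weights (assumptions; the study's exact values are not given).
WEIGHTS = {"SELECT": 0.3, "FROM": 0.3, "WHERE": 0.2, "JOIN": 0.1, "GROUP BY": 0.1}

CLAUSE_RE = re.compile(
    r"\b(SELECT|FROM|WHERE|JOIN|GROUP BY|ORDER BY)\b", re.IGNORECASE)

def split_clauses(sql: str):
    """Naive clause split (the real system walks a sqlglot syntax tree)."""
    parts = CLAUSE_RE.split(sql)
    clauses = {}
    for kw, body in zip(parts[1::2], parts[2::2]):
        clauses[kw.upper()] = " ".join(body.split()).strip().rstrip(";")
    return clauses

def structural_accuracy(generated: str, reference: str, weights=WEIGHTS):
    """Weighted fraction of reference clauses the generated query reproduces."""
    gen, ref = split_clauses(generated), split_clauses(reference)
    total = score = 0.0
    for kw, body in ref.items():
        w = weights.get(kw, 0.0)
        total += w
        if gen.get(kw, "").lower() == body.lower():
            score += w
    return score / total if total else 0.0

# Matching SELECT and FROM, differing WHERE -> 0.6 of 0.8 total weight = 0.75
acc = structural_accuracy(
    "SELECT title FROM artworks WHERE year > 1900",
    "SELECT title FROM artworks WHERE year > 1800")
```

A tree-based comparison would additionally tolerate reordering and aliasing; the weighted-per-clause aggregation shown here is the part the text describes.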
A third metric employed in the evaluation was execution accuracy, which examines whether the generated SQL queries are not only correct in structure but also executable on real datasets. To measure this, the suggested queries were executed using Apache Spark directly on HDFS datasets, and the execution accuracy was calculated as the ratio of successfully executed queries to the total number of generated queries.
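Execution accuracy reduces to a success ratio over attempted executions. In the sketch below, an in-memory SQLite connection stands in for the Spark-on-HDFS executor used in the study; the table and queries are illustrative.

```python
import sqlite3

def execution_accuracy(queries, run_query):
    """Ratio of queries that execute without error (Spark in the study;
    any executor callable can stand in here)."""
    ok = 0
    for q in queries:
        try:
            run_query(q)
            ok += 1
        except Exception:
            pass
    return ok / len(queries) if queries else 0.0

# Stand-in executor: an in-memory SQLite table instead of Spark on HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artworks (title TEXT, year INTEGER)")
acc = execution_accuracy(
    ["SELECT title FROM artworks",
     "SELECT nope FROM missing_table"],  # the second query fails to execute
    lambda q: conn.execute(q).fetchall(),
)
```

One of the two queries succeeds, so the accuracy is 0.5; with Spark, the `run_query` callable would wrap `spark.sql(...)` instead.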
Another metric considered in the experiments was ranking accuracy. In each run, the LLM ranks the data products based on their importance. These ranks were then matched against the data products identified through the reasoning-similarity procedure described above. The matching idea was simple: if the matched LLM data product had the same rank value as the ground truth, the metric returned 1; otherwise, 0. The final metric was calculated by averaging across all matched predictions.
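This exact-rank agreement averages to a single score; a minimal sketch with hypothetical (predicted, ground-truth) rank pairs:

```python
def ranking_accuracy(matches):
    """Average exact-rank agreement over matched predictions:
    1 if the matched product's rank equals the ground-truth rank, else 0."""
    if not matches:
        return 0.0
    hits = sum(1 for pred_rank, gt_rank in matches if pred_rank == gt_rank)
    return hits / len(matches)

# e.g., three matched products; two received the same rank as the ground truth
score = ranking_accuracy([(1, 1), (2, 3), (3, 3)])
```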
Precision and recall were also calculated to determine how relevant and accurate the LLM’s suggestions were, as well as how comprehensively they covered the ground truth. In more detail, precision measures the proportion of correctly matched data products out of all the data products suggested by the LLM, while recall measures the proportion of correctly matched data products out of the total number of ground-truth data products. Based on these metrics, the following conclusions can be drawn: high precision but low recall means that the LLM suggests only a few, but mostly correct, products; high recall but low precision means that the LLM suggests many products, but many of them are incorrect. The F1 score is also reported to summarize the trade-off between correctness and ground-truth coverage in the generated data products.
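These quantities follow the standard definitions; a minimal sketch with illustrative counts:

```python
def precision_recall_f1(num_matched, num_suggested, num_ground_truth):
    """Precision: matched / suggested; recall: matched / ground truth;
    F1: harmonic mean of the two."""
    precision = num_matched / num_suggested if num_suggested else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g., the LLM suggests 8 products, 6 match, against 10 ground-truth products
p, r, f1 = precision_recall_f1(6, 8, 10)
```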
In addition to reporting precision and recall at a fixed matching threshold, the performance is also evaluated across varying similarity thresholds using precision-recall (PR) curves. For each run, reasoning similarity scores are treated as confidence values, and precision and recall are computed by sweeping the decision threshold. The area under the resulting curves demonstrates the model’s ability to consistently assign higher similarity scores to correct data products than to incorrect ones across all thresholds.
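The threshold sweep can be implemented in a few lines of pure Python. The step-wise area convention below is an assumption, since the interpolation method is not specified in the text; library routines such as scikit-learn’s average_precision_score follow the same idea.

```python
def pr_curve(scored):
    """scored: list of (similarity_score, is_correct). Sweep the decision
    threshold from highest to lowest score, emitting (recall, precision)."""
    total_pos = sum(1 for _, ok in scored if ok)  # assumes at least one positive
    points, tp, fp = [], 0, 0
    for _, ok in sorted(scored, key=lambda t: -t[0]):
        tp, fp = tp + ok, fp + (not ok)
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

def pr_auc(points):
    """Step-wise area under the PR curve."""
    auc, prev_recall = 0.0, 0.0
    for recall, precision in points:
        auc += (recall - prev_recall) * precision
        prev_recall = recall
    return auc

# Toy run: correct products mostly receive higher similarity scores
scores = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
auc = pr_auc(pr_curve(scores))
```

A model that consistently scores correct products above incorrect ones drives this area toward 1.0, which is exactly the property the PR-curve analysis probes.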
Finally, the generation time of the LLM was also recorded. This metric measures the time required for the LLM to generate the suggested data products for each run. The results were averaged across multiple runs to provide a reliable estimate of the model’s response time.
Within the proposed framework, the role of the LLM is confined to specific stages of the data product creation process, in alignment with the data mesh paradigm. The LLM does not perform data ingestion, storage, or execution, which remain the responsibility of the underlying Big Data infrastructure (Hive, Spark, HDFS). Instead, it operates as a semantic reasoning layer that interprets the user-defined concept, reasons over structured metadata and representative data samples, selects relevant datasets, and generates candidate data products in the form of executable Spark SQL queries accompanied by semantic justifications. These outputs correspond directly to data mesh data products, which are defined as dedicated and reusable datasets rather than raw data assets.
The selected evaluation metrics were chosen to align with the nature of data products in data mesh environments and the specific role of the LLM within the proposed framework. Traditional machine learning metrics such as classification accuracy, BLEU, ROUGE, or perplexity were not considered appropriate, as the task does not involve label prediction or text generation quality in isolation, but rather the generation of executable, semantically meaningful data transformations over real datasets. Reasoning similarity was therefore selected to capture conceptual alignment between the LLM’s explanations and expert intent, which is central to assessing whether a suggested data product satisfies a business or analytical goal. SQL structural accuracy was preferred over result-based similarity metrics because different SQL queries can return identical outputs while expressing different transformation logic. Evaluating structural correctness, therefore, better reflects the reusability, readability, and maintainability requirements of data products. Execution accuracy was included to ensure operational validity, which is essential for any data product deployed in a production data lake. Finally, ranking accuracy was selected instead of relevance-only metrics to evaluate the LLM’s ability to prioritize outputs, which is critical for self-service discovery and consumption in data mesh architectures. Collectively, these metrics were selected to reflect semantic validity, technical correctness, and practical usability, which cannot be adequately captured by generic Natural Language Processing (NLP) or predictive performance metrics.
Finally, to mitigate subjectivity in the evaluation, the ground truth was not generated autonomously by the LLM but constructed under expert supervision, as mentioned above. The LLM was used as an assistive tool to propose candidate data products, while domain experts reviewed and validated them before inclusion. Only outputs that satisfied the experts’ correctness, relevance, and executability criteria were retained. This hybrid approach reduces bias, ensures consistency across experimental conditions, and provides a scalable and objective reference baseline for evaluating LLM performance.

5.2. Experimental Results

This section presents the evaluation results of the LLM-based data product suggestion framework across the Europeana (Domain 1) and Paradisiotis Group (Domain 2) domains. First, the focus is on evaluating the model’s performance based on the defined metrics, emphasizing both system accuracy and quality of the generated data products for Domain 1. Then, a comparison between the two domains is provided to examine how the LLM handles reasoning over structured versus semi-structured data.
As shown in Figure 4, all metric curves decline as both the dataset size and complexity increase, with the decrease becoming particularly steep when moving from medium to high complexity, especially for the larger pools of 12 and 24 datasets. Notably, an increase in complexity produces a larger drop in the metrics than an increase in the number of datasets, which causes a comparatively smaller decline. The decrease reflects the LLM facing a larger search space of potential data product combinations. Although the input remains within the token limit, the model struggles to filter information effectively, generating more irrelevant suggestions and missing some correct data products, thus reducing recall and F1.
Figure 5 shows generation time trends for Domain 1. As can be observed from the graph, for small dataset pools generation time remains relatively stable across all concept complexities. As the dataset size and complexity increase, generation time rises, indicating that increased reasoning is required over a larger search space. For the largest dataset pool, a slight decrease in generation time is observed at high complexity, which may be due to domain characteristics or the model approaching its token limit, potentially limiting its reasoning scope to manage cognitive load efficiently. Overall, generation time requirements are quite low, remaining below 1.5 min even in the worst-case scenario.
Figure 6 and Figure 7 illustrate the evaluation metrics for Domain 1 across different dataset sizes and concept complexities. All predictions include every LLM-generated data product grouped by the FROM table regardless of correctness, while matched predictions are those that successfully align with a ground-truth data product and exceed the reasoning similarity threshold. For all predictions, SQL accuracy and reasoning similarity decline as dataset size and complexity increase, reflecting the challenges the LLM faces in navigating a larger search space. In contrast, matched predictions maintain relatively stable SQL accuracy and reasoning similarity, indicating that when the LLM correctly identifies relevant datasets, it generates structurally correct queries and reasoning closely aligned with the ground truth. Execution accuracy remains consistently high across both all predictions and matched predictions, demonstrating that generated queries are syntactically valid and executable. Finally, ranking accuracy decreases with increasing complexity and dataset size for both prediction types, suggesting that the LLM struggles to prioritize data products under more demanding analytical scenarios.
Having initially examined Domain 1, it is important to compare performance across both domains to assess whether the LLM generates more accurate data products when operating on structured datasets. Figure 8 presents the performance evaluation results for Domain 2, based on which we can infer the following comparative outcomes: Across dataset sizes and complexity levels, the F1 scores indicate that the LLM generally performs better on Domain 1, with values ranging from 0.49 to 0.96, compared to Domain 2, which ranges from 0.48 to 0.76. While performance declines in both domains as dataset size and complexity increase, Domain 2 remains relatively stable at low and medium complexity levels, whereas Domain 1 exhibits greater variability. These observations do not necessarily imply that the LLM handles semi-structured data better. Instead, they suggest that the model evaluated is more familiar with Domain 1, likely due to greater prior training or exposure to this domain’s data.
The structure of the datasets appears to have a direct impact on generation time. As shown in Figure 9, for Domain 2, which consists of well-defined structured data, the generation time remains relatively low and stable, with only slight increases as the number of datasets grows. In contrast, Domain 1 exhibits consistently higher generation times across all dataset sizes and complexity levels, probably because the LLM must devote more effort to navigating semi-structured data.
Comparing the performance of the LLM across all predictions for Domain 1 and Domain 2 reveals distinct patterns. First, from Figure 10, Domain 2 generally maintains higher SQL accuracy across most dataset sizes and complexity levels, particularly at higher complexities, whereas Domain 1 shows stronger reasoning similarity in most scenarios, indicating that the LLM produces outputs that are more aligned with the expected reasoning for this domain. Execution accuracy is consistently higher in Domain 1 at lower dataset sizes but declines more sharply with increasing dataset size and complexity, while Domain 2 exhibits more stable execution performance. Ranking accuracy remains low in both domains across all scenarios, though Domain 1 occasionally achieves slightly higher values at lower complexities.
To further analyze the model’s classification performance, confusion matrices were used to examine the relationship between the predicted and actual data product suggestions. It is worth noting that false positives correspond to generated data products that do not match any ground-truth product, whereas false negatives correspond to ground-truth products that no generated candidate matches. True negatives are not present in our evaluation because the design is candidate-restricted rather than a closed-set binary classification: potential data products that are neither generated by the model nor included in the ground truth are not enumerated, so true negatives are undefined under this evaluation and are set to zero. As shown in Figure 11 and Figure 12, both domains show a consistent pattern, confirming that concept complexity has a stronger impact on performance than dataset size. Under low complexity, the model achieves relatively balanced predictions, with high true-positive rates. However, as complexity increases, the LLM tends to overpredict relevant data products, which leads to a rise in false positives and a decline in true positives.
Precision-recall curves, as shown in Figure 13 and Figure 14, respectively, were employed to examine the tradeoff between precision and recall for both domains. This analysis helps to better understand how well the model distinguishes relevant from irrelevant data products across varying confidence thresholds. For Domain 1, the AUC improves from 0.82 to 0.90 as the dataset size increases from 6 to 24, while in Domain 2 it remains relatively consistent, between 0.88 and 0.90. Overall, the results indicate that the LLM maintains a stable ability to distinguish relevant from irrelevant data products, even as dataset size increases.
The use of external web search theoretically can introduce temporal variability, as retrieved web content may change over time, leading to non-deterministic model behavior and reduced reproducibility. To investigate this situation, an ablation study was conducted comparing the framework’s performance with web search enabled and disabled. Specifically, this study isolates the effect of external web access on the generated data products while keeping all other components of the pipeline unchanged, including prompts, datasets, domain, configurations, and evaluation pipelines.
As shown in Table 3 and Table 4, across both domains, this study suggests that enabling web search improves precision across all complexity levels, indicating that increased reliance on external background knowledge helps the model capture domain-specific semantics more effectively. This increase in precision is particularly noticeable in Domain 2, where the values are substantially larger. In contrast, execution accuracy, reasoning similarity, and recall decrease when web search is enabled, regardless of complexity, with the degradation being more pronounced in Domain 2. A possible explanation for this decrease is that external information can shift the model’s focus away from the available datasets. In particular, web-derived content may encourage the model to introduce additional assumptions, attributes, or relationships that are semantically valid in a general sense but not supported by the underlying data.
Overall, the experimental results show that both dataset pool size and concept complexity influence system performance, with their combined effect reducing performance under high-complexity, large-pool conditions. The semantic search space expands as the number of datasets increases; however, the most significant declines in precision, recall, and ranking accuracy are observed when moving from medium to high complexity. At the same time, increasing concept complexity results in greater performance degradation within fixed dataset pools, particularly from medium to high complexity. The confusion matrix analysis further confirms that higher complexity leads to overprediction, indicating that LLMs struggle more with dataset prioritization. However, once the relevant datasets are identified, the LLM can generate valid and meaningful data products. Finally, the web-search ablation study revealed that external search improves precision by leveraging broader domain knowledge but reduces recall and execution accuracy, likely because external information influences the model’s reasoning and its alignment with the underlying data sources.

6. Conclusions

This paper introduced a framework for supporting the process of suggesting and prototyping data products by leveraging the semantic reasoning capabilities of LLMs over data lake metadata and file contents. The effectiveness of the LLM’s suggestions depends on a well-designed prompt that incorporates a user-defined concept, metadata retrieved from Apache Hive, sample records from HDFS, and clear instructions regarding the scope of the task.
A set of experiments was designed and executed to assess the efficiency and accuracy of the system while varying query complexity, dataset size, and data format. Two domains were selected for the experiments, one related to a poultry meat factory and the other to the well-known cultural heritage digital library Europeana. The diversity in the type, content, and structure of the data was the key driver for the selection of these domains, along with the availability of the data.
Based on the evaluation conducted, the following findings were recorded:
  • The LLM framework demonstrated that it can effectively support data product suggestions in simple scenarios and small dataset pools. However, as dataset size and concept complexity increase, effectiveness declines. Specifically, both precision and recall decrease significantly under higher complexity. This outcome is explained by the exponentially larger search space the LLM must explore, which makes it hard to filter and reason effectively.
  • The evaluation also suggested that the quality of the suggested data products depends heavily on the model’s ability to correctly identify relevant datasets. When the LLM selected the appropriate datasets, it demonstrated a strong capability to generate high-quality results in terms of both query accuracy and reasoning. However, in complex scenarios, the LLM struggled to select the correct datasets, often leading to incorrect or less relevant data products. Additionally, the evaluation results indicated that the LLM struggled to prioritize and rank the data products, not only in complex scenarios but even in simpler cases.
  • The evaluation process also revealed that the LLM performs more consistently and accurately when working with structured datasets in terms of SQL accuracy, as observed in the evaluation of Domain 2. This is likely because structured data reduces schema complexity, allowing the model to translate intent into precise SQL queries more easily. In contrast, semi-structured data resulted in stronger reasoning performance. However, this difference does not necessarily stem from the structure of the data itself, but rather from the LLM’s pre-existing knowledge and familiarity with the specific domain.
In summary, while the LLM demonstrated potential in low-complexity scenarios and smaller dataset pools, real-world business requirements remain demanding. The results show that LLM performance declines as dataset heterogeneity and conceptual complexity increase. Therefore, the model should be viewed as a support tool that assists domain experts during the data product discovery and prototyping process by generating candidate data products grounded in metadata and sample data. As complexity and dataset heterogeneity grow, expert judgment remains essential to ensure correctness, relevance, and alignment with domain-specific requirements.
Furthermore, data products can be understood as epistemic intermediaries that bridge raw data and decision-making by encoding assumptions, interpretations, and notions of relevance. From this perspective, as argued in recent philosophical analyses of data-centric systems, the significance of a data product lies not only in its correctness or efficiency, but also in its capacity to operationalize meaning and support explanation within a specific domain context [25]. This epistemic framing provides a useful lens for interpreting the role of LLMs in the proposed framework. Rather than treating LLM-generated data products as authoritative substitutes for domain expertise, our results suggest that LLMs function as scalable reasoning tools and methods that help surface candidate interpretations grounded in metadata, schema structure, and representative data samples. These candidates act as provisional epistemic artifacts that can support expert exploration and validation. Consequently, relevance is not determined only by algorithmic output, but emerges through the interaction between LLM-assisted generation and expert judgment. Framing data products in this way clarifies why human expertise remains essential, while also explaining how LLMs can meaningfully augment data-driven knowledge creation in data mesh environments by expanding the space of interpretable and actionable data products.
Several threats to validity should be considered when interpreting the results of this study. Regarding internal and construct validity, the experimental design does not isolate individual factors such as data structure, domain semantics, and prior model familiarity, which may jointly influence the observed performance differences. Consequently, performance metrics should be interpreted as empirical indicators of system behavior rather than as evidence of causal relationships. In terms of external validity, the use of heterogeneous domains and mixed structured and semi-structured datasets reflects realistic data lake and data mesh environments, supporting the applicability of the findings to real-world settings, although generalization to other domains or organizational contexts may require further validation. Finally, concerning reliability, the experiments were conducted using a fixed evaluation protocol, consistent dataset pools, and predefined metrics across multiple complexity levels, which supports the repeatability of the reported results. Nonetheless, variations in model versions or prompt formulations may affect outcomes, and future studies should assess robustness across alternative configurations.
Future research will include extending the system by fine-tuning an LLM on datasets from additional specific domains. The goal is to investigate whether a domain-specialized model can generate more accurate and contextually relevant data product suggestions compared to a general-purpose model. In doing so, this study will evaluate the trade-off between broad generalization and domain-focused specialization, and whether targeted fine-tuning reduces irrelevant outputs or improves alignment with the actual needs of domain experts. Finally, future work will investigate the incorporation of governance and security controls into the LLM-driven data product lifecycle, including automated policy-aware dataset selection, access control enforcement, and compliance validation, ensuring that generated data products conform to organizational, regulatory, and privacy constraints within data mesh environments.

Author Contributions

Conceptualization, M.P.; methodology, M.P., A.P. and A.S.A.; software, A.P.; validation, M.P. and A.S.A.; formal analysis, M.P. and A.P.; investigation, M.P. and A.P.; resources, M.P. and A.P.; data curation, M.P. and A.P.; writing—original draft preparation, M.P. and A.P.; writing—review and editing, M.P., A.P. and A.S.A.; visualization, A.P.; supervision, M.P. and A.S.A.; project administration, M.P. and A.S.A.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Workflow of the system.
Figure 2. User interface for entering data product suggestion.
Figure 3. Sample data product generated by the LLM (only one example shown out of all suggested data products).
Figure 4. Performance metrics variations across dataset sizes and complexities (Domain 1).
Figure 5. LLM’s generation time across different dataset sizes and complexities (Domain 1).
Figure 6. LLM performance trends for correctly matched predictions (Domain 1).
Figure 7. LLM performance trends across dataset sizes and complexities for all suggested data products (Domain 1).
Figure 8. Performance metrics variations across dataset sizes and complexities (Domain 2).
Figure 9. LLM’s generation time across different dataset sizes and complexities (Domain 2).
Figure 10. LLM performance trends across dataset sizes and complexities for all suggested data products (Domain 2).
Figure 11. Confusion matrices by dataset size and complexity for Domain 2.
Figure 12. Confusion matrices by dataset size and complexity for Domain 1.
Figure 13. Precision-recall curve for Domain 1.
Figure 14. Precision-recall curve for Domain 2.
Table 1. Structure of the metadata table used for dataset descriptions.

Field Name   | Description
id           | Unique identifier for each dataset
source_name  | Name of the dataset
source_path  | Full HDFS path to the dataset file
owner        | Entity or organization that owns the dataset
volume       | Approximate data volume (e.g., 50 MB, 1 GB)
year         | Year the data was collected
location     | Geographic location related to the dataset
velocity     | Frequency or speed of data collection (e.g., daily, real-time)
variety      | Type of data format or content variation
accuracy     | Estimated accuracy of data category
type         | Type of dataset
keywords     | Descriptive keywords associated with the dataset (stored as an array)
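A minimal sketch of how one entry in this metadata table might be represented in code. The field names follow Table 1; all example values (paths, owner, keywords) are hypothetical.

```python
# Sketch of a metadata record matching Table 1 (example values are invented).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetMetadata:
    id: int                 # unique identifier for each dataset
    source_name: str        # name of the dataset
    source_path: str        # full HDFS path to the dataset file
    owner: str              # owning entity or organization
    volume: str             # approximate data volume, e.g., "50 MB"
    year: int               # year the data was collected
    location: str           # geographic location related to the dataset
    velocity: str           # collection frequency, e.g., "daily"
    variety: str            # data format or content variation
    accuracy: str           # estimated accuracy category
    type: str               # type of dataset
    keywords: List[str] = field(default_factory=list)  # descriptive keywords

entry = DatasetMetadata(
    id=1,
    source_name="air_quality_2023",
    source_path="hdfs://datalake/raw/air_quality_2023.csv",
    owner="env_agency",
    volume="120 MB",
    year=2023,
    location="Limassol",
    velocity="daily",
    variety="CSV",
    accuracy="high",
    type="sensor",
    keywords=["air quality", "pm2.5"],
)
print(entry.source_name)
# prints air_quality_2023
```

Serializing such records (e.g., with keywords as an array column) is what allows the LLM to reason over dataset descriptions without touching the raw data.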
Table 2. Structure of a suggested data product returned by the LLM.

Field       | Description
Title       | A concise name summarizing the purpose or focus of the data product
Description | A short explanation of what the data product represents and how it can be used
Fields      | The specific dataset columns used to generate the data product
Spark SQL   | A query that extracts or transforms the data using Spark SQL
Annotation  | A brief summary of why the data product is relevant or insightful
Citations   | A list of at least three real, reputable URLs that support the reasoning behind the product
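The sketch below shows a hypothetical suggestion in this structure together with a simple validity check. The query text, table name, and URLs are illustrative placeholders, not actual model output from the experiments.

```python
# Hypothetical data product suggestion following the Table 2 structure.
suggestion = {
    "Title": "Monthly Average PM2.5 by Location",
    "Description": "Aggregates daily air-quality readings into monthly "
                   "averages per location for trend analysis.",
    "Fields": ["location", "date", "pm25"],
    "Spark SQL": (
        "SELECT location, date_trunc('month', date) AS month, "
        "AVG(pm25) AS avg_pm25 "
        "FROM air_quality GROUP BY location, date_trunc('month', date)"
    ),
    "Annotation": "Highlights seasonal pollution patterns across regions.",
    "Citations": [
        "https://example.org/air-quality-guidelines",
        "https://example.org/pm25-trends",
        "https://example.org/open-environmental-data",
    ],
}

REQUIRED = {"Title", "Description", "Fields", "Spark SQL", "Annotation", "Citations"}

def is_valid_suggestion(obj):
    """Check that a suggestion carries every field required by Table 2
    and at least the three supporting citations the schema demands."""
    return REQUIRED.issubset(obj) and len(obj["Citations"]) >= 3

print(is_valid_suggestion(suggestion))
# prints True
```

Enforcing such a schema check on each LLM response is a cheap way to reject malformed suggestions before attempting query execution.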
Table 3. Mean difference for Domain 1 between web search enabled and disabled.

Metric    | Low Complexity | Medium Complexity | High Complexity
Execution | −0.15          | −0.10             | −0.05
Precision | +0.13          | +0.08             | +0.14
Reasoning | −0.06          | −0.01             | 0.00
Recall    | −0.12          | −0.07             | −0.02
Table 4. Mean difference for Domain 2 between web search enabled and disabled.

Metric    | Low Complexity | Medium Complexity | High Complexity
Execution | −0.09          | −0.12             | −0.11
Precision | +0.13          | +0.22             | +0.06
Reasoning | −0.02          | −0.03             | 0.05
Recall    | −0.01          | −0.01             | −0.09
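The values in Tables 3 and 4 are, in essence, per-run paired differences (web search enabled minus disabled) averaged over repeated runs for each metric and complexity level. A minimal sketch of that computation, using invented per-run scores:

```python
def mean_difference(enabled_scores, disabled_scores):
    """Average per-run difference (enabled minus disabled) for one
    metric at one complexity level, rounded to two decimals as in
    Tables 3 and 4."""
    assert len(enabled_scores) == len(disabled_scores)
    diffs = [e - d for e, d in zip(enabled_scores, disabled_scores)]
    return round(sum(diffs) / len(diffs), 2)

# Hypothetical precision scores over three paired runs at low complexity.
enabled = [0.80, 0.85, 0.84]
disabled = [0.68, 0.70, 0.72]
print(mean_difference(enabled, disabled))
# prints 0.13
```

A positive value thus means the metric improved with web search enabled, and a negative value means it degraded.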