Given the objectives of this study, a systematic evaluation was conducted to examine the effectiveness of the proposed LLM-based approach and determine whether an LLM can support data product discovery and prototyping. Specifically, the experimental design focuses on how domain characteristics, dataset availability, and user concept complexity shape the LLM’s reasoning and data product generation ability. The following subsection analyzes the experimental process in detail.
5.1. Design of Experiments
As already mentioned in
Section 4, Europeana and Paradisiotis Group are used as the domains of the evaluation experiments. These domains were intentionally selected to examine the LLM’s ability in a realistic data lake environment containing semi-structured and structured data, respectively, providing a more accurate assessment of the model’s performance under real-world conditions. The second critical aspect of the evaluation is the user concept, as it represents a real-world analytical question that a domain expert would typically address. Designing appropriate user concepts for each domain is essential for evaluating the LLM’s semantic reasoning capabilities under varying levels of cognitive demand. Therefore, three escalating levels of complexity were designed for the experiments: low, medium, and high. At the low complexity level, the objective is to evaluate the LLM’s ability to identify and retrieve relevant datasets with minimal reasoning; the focus is on simple filtering and selection operations. At the medium complexity level, the LLM is prompted not only to reason further but also to combine information from multiple datasets and apply more complex logic in the SQL queries, such as WHERE clauses and JOIN operations. Finally, at the high complexity level, the user concept requires advanced analytical reasoning across multiple datasets, which can trigger more complex conditional logic. Additionally, a high-complexity concept can prompt the LLM to perform an external web search through a more open-ended phrase (e.g., “based on current trends”) that can enrich its reasoning and thereby allow for a better understanding of its capabilities. The diversity of data sources included in the experimental process does not affect the validity of the results, as the proposed approach is not tied to a specific form of data.
On the contrary, it demonstrates the generalizability of the approach, which can handle data from different application domains in diverse formats.
For each domain, the experiments were conducted across three data pool sizes: 6, 12, and 24 datasets. The data pool size denotes the number of distinct datasets, each with a unique schema and corresponding metadata entries, made available to the LLM during a single experimental run. Each increase in pool size expands the semantic search space and introduces more heterogeneity and potential noise, since a larger pool means more schemas, metadata records, and sampled rows. Additionally, all three complexity levels were applied to every dataset pool size in each domain, with each domain having its own distinct set of concepts tailored to its context. This design is intended to examine (i) the scalability of the LLM across different pool sizes when reasoning over larger and more diverse data sources, and (ii) how both the volume of available datasets and the complexity of user-defined concepts affect the LLM’s reasoning capabilities and its ability to generate meaningful data products.
Throughout the experiments, the GPT-4.1 model was used, which supports 30,000 tokens for Tier 1 users. However, this limitation introduced a new challenge in the experimentation, as it restricted the amount of metadata and sample data that could be provided in a single prompt. As a result, 24 datasets were established as the upper bound, not only to allow the model to receive contextual information without exceeding its token limit, but also to investigate the LLM’s ability to reason over a larger number of datasets. This upper bound was determined using the LLM’s tokenizer tool.
Having established the main evaluation conditions, the next step was to provide the baseline for measuring the LLM’s performance, evaluating both the correctness and relevance of the LLM’s suggestions by preparing a ground truth (the expected outputs that a domain expert would suggest when given the same task as the LLM). Each entry of the ground truth has the structure of the LLM’s response, as shown earlier in Algorithm 1. A significant challenge in such an experiment is creating the reference data products that serve as a ground-truth baseline for the evaluation. Ideally, a domain expert would be responsible for creating ground-truth datasets for each domain. However, this poses a major constraint, as domain experts are difficult to find and the process is time-consuming, requiring substantial effort and dedication. Instead, domain analysis was performed systematically by inspecting datasets, analyzing metadata, and consulting publicly available domain documentation. Given the practical constraints of the experimental setup, an LLM was used as an auxiliary drafting tool to propose candidate data products. It is important to note that the LLM did not act autonomously, and its outputs were not accepted verbatim. Instead, each data product was reviewed, edited, extended, or rejected based on domain relevance and the available datasets. In several cases, data products were manually authored without any LLM involvement.
Evaluating the performance of the proposed LLM in generating data products required a combination of quantitative and qualitative metrics to provide a comprehensive assessment. To operationalize these evaluation goals, a set of complementary metrics was defined to capture different aspects of the LLM’s performance, as follows: The first metric was reasoning similarity, which measures how similar the LLM’s reasoning explanations are to those in the ground truth. The following logic was used to measure this metric: to decide which suggested data product best matches a data product from the ground truth, the process first grouped all products sharing the same FROM clause in the SQL query. Then, for each data product in that group, it calculated a similarity score using sentence embeddings generated by the all-MiniLM-L6-v2 model from Sentence Transformers. The data product with the highest similarity score was considered a match; however, if the highest similarity score was below the threshold value of 0.6, no match was recorded.
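To make this matching procedure concrete, the following minimal sketch illustrates the grouping-and-threshold logic. The `embed` function is a stand-in for the all-MiniLM-L6-v2 sentence encoder, and the dictionary field names (`from`, `reasoning`, `id`) are illustrative assumptions rather than the actual implementation’s data model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_products(generated, ground_truth, embed, threshold=0.6):
    """For each ground-truth product, consider only generated products that
    share the same FROM table, pick the one whose reasoning embedding is most
    similar, and accept the match only if the score reaches the threshold."""
    matches = {}
    for gt in ground_truth:
        candidates = [g for g in generated if g["from"] == gt["from"]]
        best, best_score = None, -1.0
        for cand in candidates:
            score = cosine(embed(cand["reasoning"]), embed(gt["reasoning"]))
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            matches[gt["id"]] = (best["id"], best_score)
    return matches
```

In the actual pipeline, `embed` would be `SentenceTransformer("all-MiniLM-L6-v2").encode`; the sketch only fixes the control flow of grouping, argmax selection, and thresholding.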
One may argue that using sentence-embedding similarity with a fixed threshold to decide product matches is a pragmatic choice that risks rewarding textual overlap rather than substantive equivalence. However, in our evaluation framework, embedding similarity is not intended to capture textual resemblance but to assess semantic alignment between the generated and reference data products. A partial schema-level criterion is implemented by comparing only products that refer to the same underlying dataset, grouping candidates according to the FROM clause. We chose not to enforce additional matching criteria, such as strict SQL structural equivalence, because the objective of this study was to evaluate whether the LLM correctly identifies the relevant dataset and captures the overall intent of the data product, rather than whether it reproduces the exact syntactic formulation of the ground truth query. Enforcing a rigid matching scheme could therefore penalize semantically valid but structurally alternative implementations.
SQL structural accuracy was also used to evaluate whether the LLM-generated SQL queries matched the logic of the corresponding ground truth queries. The comparison did not perform simple string matching but focused on semantic similarity, ensuring that the underlying logic of the queries was aligned. It required first parsing each SQL statement with the sqlglot parser, which produces a syntax tree, and breaking it down into its logical clauses (e.g., SELECT, FROM, WHERE, JOIN). Each clause was assigned a specific weight and compared individually, contributing to the final per-query score according to that weight. The final accuracy was computed by averaging the per-query SQL structural accuracy scores across all queries.
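The weighted per-clause aggregation can be sketched as follows. The clause weights here are illustrative assumptions (the study does not report the exact values), and a token-level Jaccard overlap stands in for comparing sqlglot syntax subtrees:

```python
# Illustrative clause weights (assumed, not the study's actual values).
CLAUSE_WEIGHTS = {"select": 0.3, "from": 0.3, "where": 0.2, "join": 0.2}

def clause_score(generated_clause, reference_clause):
    """Jaccard overlap of clause tokens; a simplified stand-in for comparing
    the corresponding sqlglot syntax subtrees."""
    g, r = set(generated_clause.split()), set(reference_clause.split())
    if not g and not r:
        return 1.0  # both queries omit this clause: trivially aligned
    return len(g & r) / len(g | r)

def structural_accuracy(generated, reference):
    """Weighted average of per-clause scores for one query pair; queries are
    given as dicts mapping clause name -> clause text."""
    return sum(weight * clause_score(generated.get(clause, ""),
                                     reference.get(clause, ""))
               for clause, weight in CLAUSE_WEIGHTS.items())
```

Averaging `structural_accuracy` over all query pairs yields the final metric; in the actual pipeline, the clause dictionaries would be extracted from sqlglot’s parsed syntax trees rather than from raw strings.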
A third metric employed in the evaluation was execution accuracy, which examines whether the generated SQL queries are not only correct in structure but also executable on real datasets. To measure this, the suggested queries were executed using Apache Spark directly on HDFS datasets, and the execution accuracy was calculated as the ratio of successfully executed queries to the total number of generated queries.
Another metric considered in the experiments was ranking accuracy. Each time, the LLM ranks the suggested data products by importance. These ranks were then matched against the data products identified through the reasoning-similarity matching described above. The matching idea was simple: if a matched LLM data product had the same rank value as its ground truth counterpart, the metric returned 1; otherwise, 0. The final value was calculated by averaging across all matched predictions.
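The rank-agreement computation reduces to a few lines; this sketch assumes matched pairs are given as a mapping from ground-truth IDs to generated-product IDs (a hypothetical representation, not the implementation’s):

```python
def ranking_accuracy(matches, generated_ranks, truth_ranks):
    """matches: dict mapping ground-truth id -> matched generated id.
    Returns the fraction of matched pairs whose rank values agree;
    0.0 when nothing matched."""
    if not matches:
        return 0.0
    hits = sum(1 for gt_id, gen_id in matches.items()
               if generated_ranks[gen_id] == truth_ranks[gt_id])
    return hits / len(matches)
```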
Precision and recall were also calculated to determine how relevant and accurate the LLM’s suggestions were, as well as how comprehensively they covered the ground truth. In more detail, precision measures the proportion of correctly matched data products out of all the data products suggested by the LLM. Recall, on the other hand, measures the proportion of correctly matched data products out of the total number of ground truth data products. Based on these metrics, the following conclusions can be drawn: high precision but low recall means that the LLM suggests only a few but mostly correct products, whereas high recall but low precision means that the LLM suggests many products, but many of them are incorrect. The F1 score is also reported to summarize the trade-off between correctness and ground-truth coverage in the generated data products.
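Under this matching scheme, the three scores follow directly from the match counts; a minimal sketch, assuming matched products are the true positives:

```python
def precision_recall_f1(num_matched, num_generated, num_ground_truth):
    """Matched products are true positives; unmatched generated products are
    false positives; unmatched ground-truth products are false negatives."""
    precision = num_matched / num_generated if num_generated else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, 3 matched products out of 4 generated against 6 ground-truth entries gives precision 0.75, recall 0.5, and F1 0.6.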
In addition to reporting precision and recall at a fixed matching threshold, the performance is also evaluated across varying similarity thresholds using precision-recall (PR) curves. For each run, reasoning similarity scores are treated as confidence values, and precision and recall are computed by sweeping the decision threshold. The area under the resulting curves demonstrates the model’s ability to consistently assign higher similarity scores to correct data products than to incorrect ones across all thresholds.
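The threshold sweep can be sketched as follows, with the similarity score of each generated product used as its confidence value; the trapezoidal integration is one common way to compute the area under the resulting curve (the study does not specify which estimator it uses):

```python
def pr_curve_auc(scores_and_labels):
    """scores_and_labels: (similarity_score, is_correct) pairs for every
    generated product (is_correct is 1 if it matched a ground-truth product).
    Sweeps the decision threshold down the ranked scores and integrates
    precision over recall with the trapezoidal rule."""
    ranked = sorted(scores_and_labels, key=lambda x: -x[0])
    total_pos = sum(1 for _, y in ranked if y)
    points, tp = [(0.0, 1.0)], 0  # (recall, precision) at each threshold
    for i, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / total_pos, tp / i))
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(points, points[1:]))
```

A model that always scores correct products above incorrect ones attains an AUC of 1.0 under this sweep, which is what the reported values of 0.82 to 0.90 approximate.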
Finally, the generation time of the LLM was also recorded. This metric measures the time required for the LLM to generate the suggested data products for each run. The results were averaged across multiple runs to provide a reliable estimate of the model’s response time.
Within the proposed framework, the role of the LLM is confined to specific stages of the data product creation process, in alignment with the data mesh paradigm. The LLM does not perform data ingestion, storage, or execution, which remain the responsibility of the underlying Big Data infrastructure (Hive, Spark, HDFS). Instead, it operates as a semantic reasoning layer that interprets the user-defined concept, reasons over structured metadata and representative data samples, selects relevant datasets, and generates candidate data products in the form of executable Spark SQL queries accompanied by semantic justifications. These outputs correspond directly to data mesh data products, which are defined as dedicated and reusable datasets rather than raw data assets.
The selected evaluation metrics were chosen to align with the nature of data products in data mesh environments and the specific role of the LLM within the proposed framework. Traditional machine learning metrics such as classification accuracy, BLEU, ROUGE, or perplexity were not considered appropriate, as the task does not involve label prediction or text generation quality in isolation, but rather the generation of executable, semantically meaningful data transformations over real datasets. Reasoning similarity was therefore selected to capture conceptual alignment between the LLM’s explanations and expert intent, which is central to assessing whether a suggested data product satisfies a business or analytical goal. SQL structural accuracy was preferred over result-based similarity metrics because different SQL queries can return identical outputs while expressing different transformation logic. Evaluating structural correctness, therefore, better reflects the reusability, readability, and maintainability requirements of data products. Execution accuracy was included to ensure operational validity which is essential for any data product deployed in a production data lake. Finally, ranking accuracy was selected instead of relevance-only metrics to evaluate the LLM’s ability to prioritize outputs, which is critical for self-service discovery and consumption in data mesh architectures. Collectively, these metrics were selected to reflect semantic validity, technical correctness, and practical usability, which cannot be adequately captured by generic Natural Language Processing (NLP) or predictive performance metrics.
Finally, to mitigate subjectivity in the evaluation, the ground truth was not generated autonomously by the LLM but constructed under expert supervision, as mentioned above. The LLM was used as an assistive tool to propose candidate data products, while domain experts reviewed and validated them before inclusion. Only outputs that satisfied the experts’ correctness, relevance, and executability criteria were retained. This hybrid approach reduces bias, ensures consistency across experimental conditions, and provides a scalable and objective reference baseline for evaluating LLM performance.
5.2. Experimental Results
This section presents the evaluation results of the LLM-based data product suggestion framework across the Europeana (Domain 1) and Paradisiotis Group (Domain 2) domains. First, the focus is on evaluating the model’s performance based on the defined metrics, emphasizing both system accuracy and quality of the generated data products for Domain 1. Then, a comparison between the two domains is provided to examine how the LLM handles reasoning over structured versus semi-structured data.
As shown in
Figure 4, all metric curves decline as both the dataset pool size and concept complexity increase, with the decrease becoming particularly steep when moving from medium to high complexity, especially for the 12- and 24-dataset pools. Moreover, increasing complexity produces a larger drop in the metrics than increasing the number of datasets, which causes a comparatively smaller decline. The decrease reflects the LLM facing a larger search space of potential data product combinations. Although the input remains within the token limit, the model struggles to filter information effectively, generating more irrelevant suggestions and missing some correct data products, thus reducing recall and F1.
Figure 5 shows generation time trends for Domain 1. As can be observed from the graph, for small dataset pools generation time remains relatively stable across all concept complexities. As the dataset size and complexity increase, generation time rises, indicating that increased reasoning is required over a larger search space. For the largest dataset pool, a slight decrease in generation time is observed at high complexity, which may be due to domain characteristics or the model approaching its token limit, potentially narrowing its reasoning scope to manage cognitive load. Overall, generation time requirements are quite low, below 1.5 min in the worst-case scenario.
Figure 6 and
Figure 7 illustrate the evaluation metrics for Domain 1 across different dataset sizes and concept complexities. All predictions include every LLM-generated data product, grouped by the FROM table, regardless of correctness, while matched predictions are those that successfully align with a ground truth data product and exceed the reasoning similarity threshold. For all predictions, SQL accuracy and reasoning similarity decline as dataset size and complexity increase, reflecting the challenges the LLM faces in navigating a larger search space. In contrast, matched predictions maintain relatively stable SQL accuracy and reasoning similarity, indicating that when the LLM correctly identifies relevant datasets, it generates structurally correct queries and reasoning closely aligned with the ground truth. Execution accuracy remains consistently high across both all predictions and matched predictions, demonstrating that generated queries are syntactically valid and executable. Finally, ranking accuracy decreases with increasing complexity and dataset size for both prediction types, suggesting that the LLM struggles to prioritize data products under more demanding analytical scenarios.
Having initially examined Domain 1, it is important to compare performance across both domains to assess whether the LLM generates more accurate data products when operating on structured datasets.
Figure 8 presents the performance evaluation results for Domain 2, based on which we can infer the following comparative outcomes: Across dataset sizes and complexity levels, the F1 scores indicate that the LLM generally performs better on Domain 1, with values ranging from 0.49 to 0.96, compared to Domain 2, which ranges from 0.48 to 0.76. While performance declines in both domains as dataset size and complexity increase, Domain 2 remains relatively stable at low and medium complexity levels, whereas Domain 1 exhibits greater variability. These observations do not necessarily imply that the LLM handles semi-structured data better. Instead, they suggest that the model evaluated is more familiar with Domain 1, likely due to greater prior training or exposure to this domain’s data.
The structure of the datasets appears to have a direct impact on generation time. As shown in
Figure 9, for Domain 2, which consists of well-defined structured data, the generation time remains relatively low and stable, with only slight increases as the number of datasets grows. In contrast, Domain 1 exhibits consistently higher generation times across all dataset sizes and complexity levels, probably because the LLM is required to devote more effort to navigating semi-structured data.
Comparing the performance of the LLM across all predictions for Domain 1 and Domain 2 reveals distinct patterns. First, from
Figure 10, Domain 2 generally maintains higher SQL accuracy across most dataset sizes and complexity levels, particularly at higher complexities, whereas Domain 1 shows stronger reasoning similarity in most scenarios, indicating that the LLM produces outputs that are more aligned with the expected reasoning for this domain. Execution accuracy is consistently higher in Domain 1 at lower dataset sizes but declines more sharply with increasing dataset size and complexity, while Domain 2 exhibits more stable execution performance. Ranking accuracy remains low in both domains across all scenarios, though Domain 1 occasionally achieves slightly higher values at lower complexities.
To further analyze the model’s classification performance, confusion matrices were used to examine the relationship between the predicted and actual data product suggestions. It is worth noting that false positives correspond to generated data products that do not match any ground-truth product, whereas false negatives correspond to ground-truth products that no generated candidate matches. True negatives are absent from our evaluation because the design is candidate-restricted rather than a closed-set binary classification: potential data products that are neither generated by the model nor included in the ground truth are never enumerated and thus never evaluated. As a result, true negatives are undefined under this evaluation and are set to zero. As shown in
Figure 11 and
Figure 12, both domains show a consistent pattern, confirming that concept complexity has a stronger impact on performance than dataset size. Under low complexity, the model achieves relatively balanced predictions, with high true-positive rates. However, as complexity increases, the LLM tends to overpredict relevant data products which leads to a rise of false positives and a decline in true positives.
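This candidate-restricted construction of the confusion matrix can be summarized in a short sketch, derived from the same match counts used for precision and recall:

```python
def confusion_counts(num_matched, num_generated, num_ground_truth):
    """Candidate-restricted confusion matrix: true negatives are undefined
    (unproposed products are never enumerated) and are reported as zero."""
    return {
        "tp": num_matched,                     # generated and in ground truth
        "fp": num_generated - num_matched,     # generated but unmatched
        "fn": num_ground_truth - num_matched,  # ground truth never matched
        "tn": 0,                               # undefined by design
    }
```

The overprediction pattern at high complexity then shows up directly as a growing `fp` cell relative to `tp`.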
Precision-recall curves, as shown in
Figure 13 and
Figure 14, respectively, were employed to examine the tradeoff between precision and recall for both domains. This analysis helps to better understand how well the model distinguishes relevant from irrelevant data products across varying confidence thresholds. For Domain 1, the AUC improves from 0.82 to 0.90 as the dataset size increases from 6 to 24, while in Domain 2 it remains relatively consistent, between 0.88 and 0.90. Overall, the results indicate that the LLM maintains a stable ability to distinguish relevant from irrelevant data products, even as dataset size increases.
The use of external web search can, in theory, introduce temporal variability, as retrieved web content may change over time, leading to non-deterministic model behavior and reduced reproducibility. To investigate this possibility, an ablation study was conducted comparing the framework’s performance with web search enabled and disabled. Specifically, this study isolates the effect of external web access on the generated data products while keeping all other components of the pipeline unchanged, including prompts, datasets, domains, configurations, and the evaluation pipeline.
As shown in
Table 3 and
Table 4, across both domains, this study suggests that when web search is enabled, precision improves across all complexity levels, indicating that the reliance on external background knowledge helps the model capture domain-specific semantics more accurately. This increase in precision is particularly noticeable in Domain 2, where the values are substantially larger. In contrast, execution accuracy, reasoning similarity, and recall decrease when web search is enabled, regardless of complexity, with the degradation being more noticeable in Domain 2. A possible explanation for this decrease is that external information can shift the model’s focus away from the available datasets. In particular, web-derived content may encourage the model to introduce additional assumptions, attributes, or relationships that are semantically valid in a general sense but not supported by the underlying data.
Overall, the experimental results show that both dataset pool size and concept complexity influence system performance, with their combined effect reducing performance under high complexity and large pool conditions. The semantic search space expands as the number of datasets increases; however, the most significant declines in precision, recall, and ranking accuracy are observed when moving from medium to high complexity. At the same time, increasing concept complexity results in greater performance degradation within fixed dataset pools, particularly from medium to high complexity. The confusion matrix analysis further confirms that higher complexity leads to overprediction and that the LLM struggles more with dataset prioritization. However, once the relevant datasets are identified, the LLM can generate valid and meaningful data products. Finally, the web-search ablation study revealed that external search improves precision by leveraging broader domain knowledge but reduces recall and execution accuracy, likely because external information influences the model’s reasoning and its alignment with the underlying data sources.