Review

Data Integration and Storage Strategies in Heterogeneous Analytical Systems: Architectures, Methods, and Interoperability Challenges

by
Paraskevas Koukaras
School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece
Information 2025, 16(11), 932; https://doi.org/10.3390/info16110932
Submission received: 27 August 2025 / Revised: 28 September 2025 / Accepted: 22 October 2025 / Published: 26 October 2025

Abstract

In the current landscape of ubiquitous data availability, organisations face highly complex challenges in integrating and processing diverse datasets to meet their analytical needs. This review analyses traditional and emerging methods for data storage and integration, with particular focus on their implications for scalability, consistency, and interoperability within analytical ecosystems. In particular, it contributes a cross-layer taxonomy linking integration mechanisms (schema matching, entity resolution, and semantic enrichment) to storage/query substrates (row/column stores, NoSQL, lakehouse, and federation), together with comparative tables and figures that synthesise trade-offs and performance/governance levers. Covering schema mapping solutions that address structural heterogeneity, storage architectures ranging from traditional systems to cloud storage solutions, and ETL pipeline integration using federated query processors, the review pays specific attention to metadata management, with a focus on semantic enrichment using ontologies and lineage management to enable end-to-end traceability and governance. It also covers performance hotspots and caching techniques, along with consistency trade-offs arising in distributed systems. Empirical case studies from real applications in enterprise lakehouses, scientific exploration, and public governance ground this review. The work concludes with future directions in convergent analytical platforms supporting multiple workloads, along with metadata-centric orchestration and provisions for AI-based integration. By combining technological advances with practical considerations, it serves as an enabling resource for researchers and practitioners seeking to create fault-tolerant, reliable, and future-ready data infrastructure. This review is primarily aimed at researchers, system architects, and advanced practitioners who design and evaluate heterogeneous analytical platforms. It also offers value to graduate students as a structured overview of contemporary methods, thereby bridging academic knowledge with industrial practice.

Graphical Abstract

1. Introduction

Modern-day analytics operate over heterogeneous collections of formats, systems, and semantics, introducing integration challenges poorly served by purely schema- or storage-centric approaches. This section introduces the contextual background of this research topic. It highlights the importance of defining consistent methodologies that bridge heterogeneity-aware integration with scalable storage and clarifies relevant parameters and research contributions.

1.1. The Growing Heterogeneity in Analytical Ecosystems

In today’s data-driven environment, in which the value of data usage has become ever more salient, organisations increasingly rely on a heterogeneous set of data sources to support decision-making, knowledge gathering, and machine learning activities. These sources include traditional relational databases, key-value stores, document-oriented stores, semi-structured formats like JavaScript Object Notation (JSON) and Extensible Markup Language (XML), and public cloud storage solutions like Google Cloud Storage and Amazon Simple Storage Service (S3), along with columnar file formats like Apache Parquet and ORC [1,2,3,4]. The diversity of these sources stems not only from their storage structures and supported query methods but also from heterogeneities in schemata, data representation, access protocols, and update characteristics. This diversity becomes a significant obstacle for data management and integration activities, mainly because end users aim to produce uniform, high-quality outputs from environments that are both highly heterogeneous and widely distributed [4,5,6].
Researchers, engineers, and data scientists regularly deal with issues in heterogeneous data pipelines, more specifically, the problem of integrating heterogeneous data from multiple storage systems, models, and schemas. Although there have been significant advances in scalable storage solutions and high-query-throughput technologies, many systems still suffer from schema mismatches, redundancy, misleading metadata, and varying data semantics across sources. These issues are widely noted within enterprise data lakes, federated information infrastructures, scientific research campuses, and analytical platforms across multiple operational areas. The increased occurrence of these problems within hybrid cloud infrastructure clearly highlights the need for more advanced integration methods, along with adaptive storage solutions offering flexibility with regard to changing schemas, reduced redundancy, and semi-structured data [5,7,8].

1.2. Motivation for Unified Integration and Storage Strategies

Holistic knowledge of integration approaches, comprising schema matching, instance alignment, and data fusion, is important to meet the challenges posed by diverse data types. The need is especially accentuated when such approaches are combined with modern storage platforms like distributed file systems, Not Only SQL (NoSQL) stores, and lakehouse implementations [3,9,10,11]. Traditional integration approaches focused on static relational schemas. This vision is now insufficient for workloads that require more flexible means of processing dynamic, semi-structured, and rapidly changing datasets.
Storage innovation has brought about new trade-offs between performance, flexibility, and consistency. Schema-on-read querying practices, late-binding policies, and federated access protocols allow data to be queried without prior data modelling but generally weaken semantic expressiveness or operational efficiency [5,6]. Therefore, data integration solutions require alignment with the storage infrastructure to achieve scalability, interoperability, and reliable analytics [7,8,12].
This study addresses a question valuable to researchers and practitioners alike, as it enables effective decision-making in system development, helps identify integration challenges arising from discipline-specific differences in infrastructure, and promotes the application of best practices in managing analytical processes across varied infrastructural environments.

1.3. Scope and Contributions of the Review

This research undertakes a broad review of integration and storage methods utilised within various analytical architectures, with a balanced focus on both structured and semi-structured data. The review includes schema-based approaches such as schema matching, Entity Resolution (ER), and semantic enrichment, as well as architecture-centric strategies involving columnar database systems, key-value stores, cloud storage solutions, and federated query processing systems [1,2,3,7,9].
Modern methodologies and frameworks are categorised according to their roles in the data management continuum, including features that range from data acquisition to storage setup, metadata management, and cross-source query optimisation. This analysis is centred on the solutions offered by these methodologies with respect to heterogeneity, scalability, and interoperability concerns, as well as regarding the trade-offs that arise in real-world deployments [5,6].
This work addresses three main aspects: (i) a synthesis of data integration and storage paradigms in heterogeneous analytical contexts, drawing connections between schema-level and infrastructure-level heterogeneity; (ii) a comparative analysis of representative systems and tools, highlighting their functionalities, limitations, and integration strategies; and (iii) the identification of outstanding challenges and future directions, including opportunities for metadata-driven integration, schema evolution management, and AI-enabled interoperability. Beyond these three aspects, this work offers four substantial contributions.
  • It introduces an end-to-end cross-layer taxonomy that maps integration mechanisms like schema matching, entity resolution, and semantic enrichment onto a heterogeneous collection of storage methods and access models, including row/column stores, many families of NoSQL stores, cloud-native columnar stores, and lakehouse systems. This taxonomy clearly illustrates the interplay of workloads and architectures and their effects on governance and performance. By considering Section 2 and Section 3 together, practitioners can immediately map data unification tasks to the corresponding storage substrate and query model.
  • Operational planning of ETL, ELT, and data virtualisation/federation is laid out by considering popular process models and tools, followed by an analysis of the impact of optimisers and pushdown methods in federated SQL engines. This yields a pragmatic architecture for balancing latency, cost, and data freshness across varying backends.
  • Metadata, lineage, and schema evolution are highlighted as central constructs in ensuring reproducibility and dependability, in addition to formalising performance enhancements such as caching/materialisation and freshness/consistency control, in a checklist to support architectural decision-making.
  • The findings are then consolidated with cross-domain examples such as enterprise lake/lakehouse, scientific integration, and public analytics domains, which leverage pragmatic patterns for managed, heterogeneous schema-on-read/write pipelines. These developments integrate embedded methodologies and infrastructure into a unified, decision-focused view for data teams and architects.

Positioning vs. Prior Surveys

Relative to recent surveys on data lakes [5], data federation [6], lakehouses [13], unified analytics [14], and data integration [9], Table 1 contrasts scope and adds a side-by-side view of what this synthesis contributes beyond these works.
This review contributes (i) a cross-layer taxonomy that explicitly couples integration mechanisms (schema matching, entity resolution, and semantic enrichment) with storage and access models (row/column stores, NoSQL, lakehouses, and federation), operationalising this via an application cross-walk (Section 7); (ii) working, reproducible resolution workflows for canonical heterogeneity problems, namely schema name mismatch and instance-level date ambiguity, with concrete enforcement/lineage policies (Section 2.4); (iii) a governance-aware treatment that binds performance levers (pushdown and materialisation/caching) to data-quality SLAs, freshness, and consistency (Section 6); and (iv) to the best of the author’s knowledge, the first synthesis that integrates peer-reviewed 2024–2025 evidence on AI-assisted integration, namely LLM-based schema/ontology matching and cost-aware/active ER, into an architectural decision frame [15,16,17,18,19,20].
This positioning clarifies how this survey complements system-focused reviews by providing design-level guidance that traverses schema, semantics, storage, and operations, translating them into domain patterns (Section 7).
The goal of this work is to act as a reference point for data engineers, system architects, and researchers involved in the integration of data analytics and data management in terms of connecting theoretical integration models with relevant technical frameworks for the storage and retrieval of data. To clarify the intended audience, consider the following: (i) researchers benefit from a synthesis of taxonomies, comparative evaluations, and identification of open challenges; (ii) practitioners and system architects gain guidance on trade-offs, architectural patterns, and integration strategies applicable to real deployments; and (iii) students obtain an accessible yet rigorous entry point into heterogeneous data integration and storage, which complements more specialised literature. This multi-tiered orientation ensures that the review delivers concrete utility across scholarly, professional, and pedagogical contexts.
Finally, to make such a taxonomy directly actionable, this work explicitly connects its layers to the application domains analysed in Section 7. Concretely, (i) integration mechanisms (schema matching and entity resolution for semantic enrichment from Section 2.2 and Section 2.3) are mapped to enterprise lakehouse harmonisation of customer and product identifiers; (ii) metadata, lineage, and ontology scaffolding (Section 5) underpin scientific workflows for reproducibility and semantic alignment; and (iii) hybrid schema strategies and federated access (Section 4.1, Section 4.2 and Section 4.3) address governance and inter-agency interoperability in public-sector pipelines.

1.4. Review Methodology

This work is conceived as a structured survey rather than a systematic review. Accordingly, it did not aim at exhaustive coverage of all publications but, instead, sought to capture representative methods, architectures, and trends that define the state of practice in heterogeneous analytical systems.
The literature selection followed three guiding principles. First, it focused on peer-reviewed venues where novel data integration and storage strategies are usually published, including ACM SIGMOD; VLDB; IEEE ICDE; and journals such as TKDE, VLDBJ, and Information Fusion. Second, the time span of 2015–2025 was targeted to reflect the transition from early data-lake and NoSQL adoption to recent lakehouse and AI-assisted integration methods while retaining seminal pre-2015 contributions for context (e.g., schema matching [21], data fusion [22], and provenance [23]). Third, an inclusion criterion was applied that required explicit linkage between integration mechanisms (schema matching, ER, and semantics) and storage/query substrates (row/column, NoSQL, lakehouse, and federation), as well as relevance to metadata, lineage, governance, or performance trade-offs.
Searches were performed in ACM Digital Library, IEEE Xplore, DBLP, and Google Scholar using combinations of terms such as “data lake”, “lakehouse”, “federated query”, “schema matching”, “entity resolution”, and “metadata catalog”. Reference snowballing from key surveys (e.g., [5,6,13,14]) ensured coverage of influential works.
Finally, to structure the review, the selected literature was organised into a cross-layer taxonomy spanning schema/ER/semantic integration (Section 2), storage models and architectures (Section 3), integration strategies (Section 4), metadata/lineage/semantics (Section 5), performance levers (Section 6), and application domains (Section 7). This taxonomy provides a balanced view that prioritises clarity, representativeness, and analytical depth over completeness.

2. Foundations of Data Integration

In this part, schema, semantic, and instance levels are pinpointed as sources of heterogeneity, followed by methodologies suggested to overcome these challenges, such as schema mapping or matching, entity resolution, and other forms of data fusion. These are the basic terms and central notions applied throughout the research.

2.1. Types of Heterogeneity: Schema, Semantic, Instance Levels

Data integration is the process of combining information from independent sources into a unified scheme. This stage involves various forms of heterogeneity [5,21]. Schema heterogeneity occurs when different datasets use differing structures or schemas to represent the same or similar data. For instance, two databases can refer to customer details using different attribute names (like customer_id and client_no) or differing types (like string versus integer for identification data). Harmonising these structural variations often forms a considerable part of the first phase of the integration process [21].
Besides schema mismatches, semantic heterogeneity arises when data elements carry different meanings even though their structure or names are similar. A classic example is an application using the word salary for gross annual earnings while another uses it for net take-home monthly earnings. Addressing semantic heterogeneity calls for an understanding of context, often aided by metadata, external ontologies, or domain-specific vocabularies [23,24].
Instance-level heterogeneity refers to value differences arising across sources. These can occur as format differences (1/12/2025 vs. 2025-12-01), unit mismatches (kilometres vs. miles), or identifier disagreements (customer IDs varying across systems). Such differences pose barriers to dataset integration and, hence, necessitate instance-level reconciliation methods in the form of normalisations, transformations, or record linkage [25,26]. The co-occurrence of multiple types of heterogeneity poses significant challenges, particularly when integrating semi-structured and even unstructured data with traditional structured databases [5]. Next-generation data integration platforms need to support dynamic schema alignment, semantic mediation, and recording of record reconciliations, along with maintenance of data integrity and provenance [23,27].
For pedagogical clarity, the resolution steps are also made explicit. Typical schema name mismatches (e.g., customer_id vs. client_no) are first detected via semi-automated schema matching (lexical/structural/instance evidence), then compiled into executable mappings (SQL/XQuery) and validated against gold standards (ground truth curated by domain experts) or ontology-backed correspondences. Instance-level ambiguities (e.g., the date “01/12/2025”) are resolved through schema-on-write or hybrid enforcement, which standardises canonical formats (e.g., ISO 8601 [28]) at ingestion and validates them via metadata-driven checks. Where legacy sources cannot be conformed at write time, a compensating read-time normaliser is applied, with lineage being recorded.
Table 2 summarises the main forms of heterogeneity encountered in data integration, illustrating schema, semantic, and instance-level challenges with examples.

2.2. Schema Matching and Map Generation

Schema matching is the activity of identifying correspondences between the elements of different data schemas. This is a fundamental activity for data integration, allowing heterogeneous datasets to be combined into an integrated frame of reference [21,29]. The algorithms used for schema matching can broadly be classified into three main categories: name-based methods, structure-based methods, and instance-based methods (Table 3 summarises these). Recent AI-driven approaches demonstrate that LLM prompting over schema documentation can bootstrap high-quality correspondences in instance-restricted domains and reduce verification effort, exceeding purely lexical baselines [30]. Beyond instance-free discovery, fine-tuned LLMs have been used to materialise executable mappings against standard information models in industrial settings, thereby strengthening semantic interoperability [31].
Lexical matchers rely upon known lexical relationships present at both the attribute and table levels. String distance measuring techniques, including edit-distance dynamic programming and set-overlap measures, may be used in combination with methods employing synonyms from linguistic resources to determine potential matches [32].
Structure-based matchers exploit the inherent relationships between the elements of a schema, such as type constraints, hierarchies, and foreign keys, to boost the effectiveness of a match. For example, tables that share similar parent–child relationships or similar referential patterns could be considered structurally equivalent [29].
Instance-based matchers evaluate the compatibility of actual data values stored within schemas via comparative analysis. Signs of semantic similarity between schema elements may depend on measures like the intersection of value ranges, shared value distributions, or statistical interdependencies. Once correspondences are identified, mapping generation produces transformation rules, which specify how source schemas are converted into the target schema. These mappings can then be specified declaratively using languages like the Structured Query Language (SQL), XQuery, or Datalog or, alternatively, built visually using ETL software [29]. For scenarios requiring higher sophistication, mapping generation is performed using semi-automated or interactive software, which still requires human intervention to validate or fine-tune the proposed recommendations, especially in environments where precision is paramount.
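To make the matcher families above concrete, the following is a minimal, illustrative sketch (not drawn from any cited system) of a name-based matcher that blends normalised edit distance with synonym-aware token overlap. The synonym table, column names, weights, and threshold are all assumptions for illustration, and proposed correspondences would still require human validation before mapping generation.

```python
from difflib import SequenceMatcher

# Tiny illustrative synonym table standing in for a linguistic resource.
SYNONYMS = {"customer": "client", "cust": "client", "id": "no", "identifier": "no"}

def tokens(name):
    """Split a column name into tokens and map them to canonical synonyms."""
    parts = name.lower().replace("-", "_").split("_")
    return {SYNONYMS.get(p, p) for p in parts if p}

def name_similarity(a, b):
    """Blend normalised edit-distance similarity with synonym-aware token overlap."""
    edit_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ta, tb = tokens(a), tokens(b)
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return 0.4 * edit_sim + 0.6 * jaccard

def match_schemas(source_cols, target_cols, threshold=0.5):
    """Return candidate correspondences above a similarity threshold, best first."""
    pairs = [(s, t, round(name_similarity(s, t), 2))
             for s in source_cols for t in target_cols]
    return sorted([p for p in pairs if p[2] >= threshold], key=lambda p: -p[2])

# Illustrative schemas; a human reviewer would still validate the proposals.
print(match_schemas(["customer_id", "cust_name", "signup_date"],
                    ["client_no", "client_name", "registration_date"]))
```

In practice, such lexical evidence would be combined with structure- and instance-based signals before the resulting correspondences are compiled into executable mappings.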
Schema matching solutions using AI methods like embeddings and neural models have been reported in recent literature [33]. Studies indicate an encouraging possibility of providing high-quality mappings with limited manual intervention to facilitate comprehensive integration efforts accommodating a heterogeneous set of sources, inconsistent naming conventions, and limited documentation assistance. Foundational mapping systems such as Clio demonstrate map creation and data exchange at scale [34].

2.3. Entity Resolution and Data Fusion

Even when schemas are aligned, a major challenge remains in the form of ER, more commonly termed record linkage, duplicate detection, or reference reconciliation. It refers to the correlation and comparison of records describing a single real-world entity across multiple platforms. Customer data gathered from more than one Customer Relationship Management system can employ varying identifiers, spellings, or address formats for a single individual [25]. Concurrently, cost-aware LLM prompting has been systematised. The authors of [15] designed batch prompting for ER that preserves accuracy while substantially reducing API cost. In engineering contexts, fine-tuned open-source LLMs used as ER classifiers surpass the prior SOTA (and GPT-4 with in-context learning) on benchmarking datasets, indicating practical headroom for LLM-driven ER beyond zero/few-shot architectures [31]. Moreover, active in-context learning improves cross-domain ER without task-specific fine-tuning, mitigating domain shift [16].
Most ER algorithms are based on a rich framework consisting of similarity functions, blocking methods, and classification models [26]. Similarity functions are applied to assess particular attributes like names, dates, and locations. Blocking, on the other hand, reduces the search space by dividing potential matches into subsets that enable fast processing. Most classification methodologies use supervised learning to infer whether two records refer to the same entity using previously extracted features. Recent surveys also emphasise probabilistic and Bayesian approaches for ER at scale [35].
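As an illustration of this framework, the sketch below combines a simple blocking key, attribute-level similarity functions, and a fixed decision threshold standing in for a trained classifier. The records, blocking rule, weights, and threshold are illustrative assumptions rather than a reference implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

# Illustrative customer records from two hypothetical CRM exports.
records = [
    {"id": "A-17", "name": "Jon Smith",  "city": "Leeds",  "dob": "1985-04-02"},
    {"id": "B-03", "name": "John Smith", "city": "Leeds",  "dob": "1985-04-02"},
    {"id": "B-44", "name": "Joan Smyth", "city": "London", "dob": "1990-11-23"},
]

def block_key(rec):
    """Blocking: group records sharing the surname's first letter and the city."""
    surname = rec["name"].split()[-1]
    return (surname[0].lower(), rec["city"].lower())

def similarity(r1, r2):
    """Weighted mix of name string similarity and exact date-of-birth agreement."""
    name_sim = SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()
    dob_sim = 1.0 if r1["dob"] == r2["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_sim

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

# Only compare candidates inside the same block; a trained classifier would
# normally replace this fixed threshold.
matches = [
    (r1["id"], r2["id"], round(similarity(r1, r2), 2))
    for block in blocks.values()
    for r1, r2 in combinations(block, 2)
    if similarity(r1, r2) >= 0.8
]
print(matches)  # e.g., [('A-17', 'B-03', 0.97)]
```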
Upon completion of the entity identification process, the next step is data fusion, which is the combination of similar records into a single, unified representation. This involves the resolution of inconsistencies inherent in the varying sets of attribute values. Techniques used to enable this fusion include the following:
  • Source prioritisation: Preferring a designated primary source over other sources;
  • Aggregation or voting: Combining multiple values using statistical methods;
  • Provenance-aware fusion: Using metadata to select values based on their timeliness, prevalence, or reliability.
Advanced fusion techniques have the ability to incorporate domain-specific data, ontologies, or user input to enable conflict resolution. Across many real-world applications, data fusion needs to preserve lineage and traceability, allowing users to better understand the processes by which the resultant fused representation was constructed, as well as the initial sources from which it was derived [22,23]. Modern ER and data fusion techniques are increasingly being integrated into data management systems, especially into Master Data Management (MDM) systems and data lakehouses [3]. In analytical applications that necessitate real-time or near-real-time processing performance, it is critical that such software tools improve their scalability while making it trouble-free to integrate with metadata repositories and support workflow orchestration [5,26]. The overall process of entity resolution and data fusion is summarised in Figure 1, illustrating the sequence from similarity computation through blocking and supervised classification to fusion strategies that produce unified records with preserved lineage.

2.4. Illustrative Resolution Scenarios and Workflows

This subsection provides practical examples that move beyond conceptual statements and demonstrate concrete, reproducible resolution flows for common heterogeneity issues.

2.4.1. Scenario A: Schema Name Mismatch (customer_id vs. client_no)

  • Evidence gathering: Apply name-, structure-, and instance-based matching algorithms to recommend a correspondence between customer_id (RDBMS A) and client_no (RDBMS B).
  • Candidate alignment: Check against a domain ontology (e.g., “Customer” with one main identifier) to resolve homonyms and enforce cardinality/uniqueness constraints.
  • Mapping synthesis: Develop an operational mapping (for instance, an SQL view or an ELT transformation) that presents client_no as customer_id while ensuring type harmonisation.
  • Validation and lineage: Validate the precision and recall with a duly chosen sample. Keep the mapping, quality measures, and provenance in the metadata catalogue and implement it as a reusable module in the pipeline.
Figure 2 visualises the matcher→mapping→validation workflow.
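A minimal sketch of the mapping-synthesis and validation steps is given below, assuming illustrative table, view, and column names; in a production pipeline, the mapping, the generated DDL, and the quality metrics would be registered in the metadata catalogue as described above.

```python
# Mapping-synthesis sketch for Scenario A (illustrative names throughout):
# expose client_no from the source system under the harmonised name
# customer_id, with type harmonisation to the integer type used downstream.
mapping = {
    "source_table": "crm_b.clients",
    "target_view": "integration.customers_b",
    "correspondences": {"client_no": "customer_id", "client_name": "customer_name"},
}

# Compile the correspondence into an executable SQL view definition.
ddl = f"""
CREATE OR REPLACE VIEW {mapping['target_view']} AS
SELECT
    CAST(client_no AS INTEGER) AS customer_id,   -- type harmonisation
    client_name                AS customer_name
FROM {mapping['source_table']};
"""
print(ddl)

def sample_precision(proposed_pairs, gold_pairs):
    """Validation step: precision of proposed correspondences against a curated sample."""
    proposed, gold = set(proposed_pairs), set(gold_pairs)
    return len(proposed & gold) / len(proposed)

proposed = [("client_no", "customer_id"), ("client_name", "customer_name")]
gold = [("client_no", "customer_id"), ("client_name", "customer_name")]
print("precision on sample:", sample_precision(proposed, gold))  # 1.0
# The mapping dict, the DDL, and the quality metric would then be stored in
# the metadata catalogue for lineage and reuse.
```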

2.4.2. Scenario B: Instance-Level Ambiguity (Date “01/12/2025”)

  • Policy: Adopt a canonical date format (ISO 8601, YYYY-MM-DD) as a data contract for analytical layers.
  • Enforcement (schema-on-write where feasible): At ingestion, parse source dates using locale-aware parsers with explicit day/month disambiguation. Reject or quarantine unparsable records. Normalise to ISO.
  • Hybrid fallback (schema-on-read): For immutable/legacy sources, apply a deterministic normaliser at query time with source-specific locale rules. Mark records with confidence and retain the raw value.
  • Validation and observability: Define expectations (e.g., no ambiguous DD/MM vs. MM/DD overlaps) and monitor drift. Expose freshness and conformance metrics in the catalogue/lineage system.
Table 4 summarises the policy, and Figure 3 shows the control flow.
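The following is a minimal sketch of the enforcement step, assuming two hypothetical sources with known locale rules. Unparsable values are quarantined rather than silently coerced, and in a real pipeline the raw value would be retained alongside the normalised one with lineage recorded.

```python
from datetime import datetime

# Source-specific locale rules (illustrative): system A writes DD/MM/YYYY,
# system B writes MM/DD/YYYY. Ambiguous or unparsable values are quarantined.
SOURCE_FORMATS = {"system_a": "%d/%m/%Y", "system_b": "%m/%d/%Y"}

def normalise_date(raw, source):
    """Return (iso_date, status): ISO 8601 on success, otherwise quarantine."""
    fmt = SOURCE_FORMATS.get(source)
    if fmt is None:
        return None, "quarantined: unknown source"
    try:
        parsed = datetime.strptime(raw.strip(), fmt)
    except ValueError:
        return None, "quarantined: unparsable"
    return parsed.date().isoformat(), "ok"

print(normalise_date("01/12/2025", "system_a"))  # ('2025-12-01', 'ok') -> 1 December
print(normalise_date("01/12/2025", "system_b"))  # ('2025-01-12', 'ok') -> 12 January
print(normalise_date("31/02/2025", "system_a"))  # (None, 'quarantined: unparsable')
```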

2.4.3. Scenario C: Semantic Conflict (Salary = Gross vs. Net)

  • Vocabulary alignment: Bind source attributes to ontology concepts (e.g., GrossAnnualSalary and NetMonthlySalary).
  • Normalisation: Define transformation rules (unit/calendar/period adjustments) with explicit semantics, as sketched after this list. Compute derived measures only where definable.
  • Query mediation: Expose a semantically consistent view (virtualised or materialised) that prevents cross-meaning joins. Annotate with provenance and assumptions in the catalogue.
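A minimal sketch of the normalisation step under these scenario assumptions is shown below. The source attribute names and concept bindings are illustrative, and no gross-to-net conversion is attempted because the required tax model is not defined.

```python
# Bind each source attribute to an explicit semantic concept and convert only
# across well-defined transformations. Gross and net figures remain distinct
# measures rather than being silently joined.
CONCEPTS = {
    "hr_system.salary":      {"concept": "GrossAnnualSalary", "period": "year"},
    "payroll_system.salary": {"concept": "NetMonthlySalary",  "period": "month"},
}

def to_annual(value, period):
    """Calendar adjustment only; gross/net stays part of the concept label."""
    return value * 12 if period == "month" else value

def harmonised_view(rows):
    """Emit (concept, annualised_value) pairs; cross-meaning joins stay impossible."""
    out = []
    for source_attr, value in rows:
        meta = CONCEPTS[source_attr]
        out.append((meta["concept"], to_annual(value, meta["period"])))
    return out

print(harmonised_view([("hr_system.salary", 48000), ("payroll_system.salary", 2500)]))
# [('GrossAnnualSalary', 48000), ('NetMonthlySalary', 30000)]
```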

2.5. Limitations and Possible Biases in Reviewed Methods

Despite substantial progress, each strand of data integration exhibits inherent risks and biases, which may constrain reproducibility.
Schema matching and mapping. Lexical/structural matchers inherit biases from naming conventions and language resources. Instance-based approaches are sensitive to distributional drift, identifier inconsistencies, and noisy or limited gold standards. Consequently, reported accuracy may overstate generalisability, and human validation remains indispensable, introducing subjectivity [21,33].
Entity resolution and data fusion. Blocking and filtering reduce comparison space but can depress recall. Severe class imbalance biases classifiers toward majority decisions. Fusion rules that ignore source reliability, provenance, and timeliness may propagate systematic errors [25,26,35]. These factors become more acute at scale and under streaming or near-real-time constraints.
Ontology-based and semantic integration. Alignment suffers from coverage gaps, ambiguous correspondences, and evolving vocabularies. Fully automatic matching remains elusive in heterogeneous domains, often requiring expert intervention that is costly and potentially subjective. Alignment errors can propagate downstream and bias integrated views [17,18,24].
System-level evaluation in federated/lakehouse settings. Heterogeneous connectors and partial statistics create optimiser blind spots, yielding mis-costed pushdown and inefficient cross-source joins. Materialisation and caches risk staleness if not incrementally maintained. Thus, reported performance may reflect system-specific assumptions rather than general capability [6,7].

3. Storage Architectures for Analytical Workloads

This section discusses how storage choices shape analytics, covering row vs. column stores for OLTP/OLAP trade-offs, NoSQL models (key-value, document, and graph) for semi-structured needs, and cloud-native columnar formats and lakehouse layers for scale and reliability. It sets the stage for mapping workload patterns to the right storage medium.

3.1. Row-Oriented and Column-Oriented Stores

The choice of storage structure strongly affects the scalability and efficiency of analytical systems. Traditional row-oriented relational databases like PostgreSQL, MySQL, and Oracle store data records sequentially, with the fields for each record located near each other on the storage device. This structure is a very efficient storage mechanism for the typical Online Transaction Processing (OLTP) activities like reading or updating complete records. Classic examples are updating a customer’s details and placing a new order [4,36].
Analytical queries, also known as Online Analytical Processing (OLAP), typically retrieve a few columns across very large numbers of rows (e.g., summing sales amounts over many geographic regions or filtering over integer attributes). These queries suffer from poor performance in row-oriented databases because of the large number of input/output operations they require and poor cache utilisation. A response to this weakness has been the advent of column-oriented databases. Prominent examples of this category are MonetDB, Vertica, and ClickHouse, which isolate individual columns into separate contiguous blocks, thereby allowing fast column scanning, increasing compression ratios, and encouraging better CPU vectorisation [4,37,38].
Column storage architectures use a series of encoding methods like run-length encoding (RLE), dictionary encoding, and bit-packing to store data efficiently while supporting fast filtering and aggregation. These storage frameworks for data also use advanced indexing methods like zone maps, along with bitmap indexes to boost analytical query execution performance [39,40,41]. Row-oriented databases were once predominant within traditional real-time transaction environments but no longer comprise the single type of database being used today. Column-store databases like Google BigQuery, Amazon Redshift, and Snowflake became popular within analytical environments, as well as cloud data warehousing scenarios. Figure 4 contrasts row stores and column stores and situates hybrid systems that support mixed workloads.
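As a small illustration of the encodings mentioned above, the sketch below applies run-length and dictionary encoding to a toy low-cardinality column; real columnar engines apply these schemes per column chunk, together with bit-packing and general-purpose compression.

```python
from itertools import groupby

# Illustrative column of region codes, the kind of low-cardinality attribute
# that columnar engines compress aggressively.
region = ["EU", "EU", "EU", "US", "US", "EU", "EU", "EU", "EU", "APAC"]

# Run-length encoding: store (value, run_length) pairs instead of raw values.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(region)]
print(rle)  # [('EU', 3), ('US', 2), ('EU', 4), ('APAC', 1)]

# Dictionary encoding: map distinct strings to small integer codes, which also
# enables bit-packing of the code stream.
dictionary = {value: code for code, value in enumerate(dict.fromkeys(region))}
encoded = [dictionary[value] for value in region]
print(dictionary, encoded)  # {'EU': 0, 'US': 1, 'APAC': 2} [0, 0, 0, 1, 1, 0, 0, 0, 0, 2]
```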
Modern systems like Apache HBase, along with SAP HANA, use dual-engine or hybrid configurations, which store data in either a column-oriented or row-oriented layout and meet the varied needs of heterogeneous workloads accordingly [42,43]. This blending of storage models indicates progress towards storage frameworks optimally matched to the dominant access patterns of analytical workloads.

3.2. Key-Value, Document, and Graph Databases

To manage semi-structured or non-tabular data efficiently, several NoSQL storage models have been developed, each with distinct data models and query paradigms. Most NoSQL storage systems can be classified into three main categories (Table 5): key-value stores, document stores, and graph databases [44,45,46].
Key-value stores like Redis, Amazon DynamoDB, and Riak structure data in pairs, where a unique key acts as an identifier attached to a corresponding value that can be returned in an object or an unstructured blob form. These key-value stores’ architectures have important performance attributes defined mainly by high reading capability combined with low latency, along with distributed scalability, which greatly simplifies application design in areas like session management, caching storage, and keeping huge records [44,46].
Although the primitive querying capability of key-value stores severely constrains their use in analytical scenarios unless complemented by secondary indexing or specialised processing methods, document store databases like MongoDB, Couchbase, and Amazon DocumentDB structure their data into hierarchical, JSON-like forms. A key feature of document stores is a flexible schema definition combined with nested structures and pipeline aggregation, which greatly simplifies the handling of relatively complex queries involving semi-structured data [47,48]. Document store implementations are often used in scenarios where the data structure is prone to constant change or records are of heterogeneous types, as in application areas related to content management, product catalogues, or telemetry data storage within Internet of Things scenarios.
Graph databases such as Neo4j, Amazon Neptune, and JanusGraph are designed to store entities (nodes) and their relationships (edges), along with attributes. These databases excel at handling highly interconnected datasets and support graph pattern matching via languages such as Cypher, Gremlin, or the SPARQL Protocol and Resource Description Framework (RDF) Query Language (SPARQL) [49,50]. In addition, they provide specialised features supporting a wide range of applications, ranging from social network analysis to recommendation systems, anti-fraud filters, and the querying of bioinformatics data. In spite of the attractive schema flexibility and scalability benefits inherent in these NoSQL databases, challenges arise in guaranteeing strong consistency and in matching the richer functionality of traditional relational database management systems (RDBMSs). As such, in the analytical settings considered in this research, these systems operate as adjunct resources or data sources in computational pipelines, adding value by enabling querying and aggregation in subsequent stages of computation.
To complement the categorical overview, Table 6 benchmarks the three NoSQL families on consistency/availability, scale-out, and query-path performance (including scans vs. traversals), consolidating peer-reviewed evidence.

3.3. Cloud-Native Formats and Distributed File Systems (e.g., Parquet and Delta Lake)

Recent developments in cloud-native infrastructure and data lakes have accelerated the adoption of open, columnar file formats with distributed storage systems. Apache Parquet, Apache Optimised Row Columnar (ORC), and Avro enable fast storage solutions with efficient query operations for large amounts of structured and semi-structured data in cloud-centric and distributed environments. Beyond disk-oriented formats, evaluations also consider in-memory columnar exchange frameworks such as Apache Arrow and its Feather variant. While Arrow/Feather are not designed for long-term storage, they provide extremely fast (de)serialisation and zero-copy interoperability across engines, making them an important baseline in benchmarking studies [1,51].
Parquet is a columnar storage format specially designed to support nested data structures and is heavily optimised for read-heavy workloads. The format exhibits excellent compatibility with multiple big data frameworks, such as Apache Spark, Hive, Dremio, and Amazon Web Services (AWS) Athena. Its efficient encoding, combined with rich metadata, makes query filtering and planning more effective, essentially reducing the need for full scans. Likewise, ORC is designed to enhance read and write performance and includes advanced compression mechanisms, predicate pushdown, and indexing features [1,52]. Table 7 consolidates results on Parquet, ORC, and Arrow/Feather, highlighting compression efficiency, query performance, nested data handling, and workload-specific trade-offs as reported in peer-reviewed studies.
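A brief, hedged example of these mechanisms using the pyarrow library is shown below (assuming pyarrow is installed and using an illustrative file path and schema): the read call projects only the needed columns and passes a filter that the reader can evaluate against row-group statistics, mirroring the predicate-pushdown behaviour described above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory Arrow table and persist it as Parquet. The file path
# and column names are illustrative.
table = pa.table({
    "region": ["EU", "US", "EU", "APAC"],
    "sales":  [120.0, 340.5, 99.9, 410.0],
    "year":   [2024, 2024, 2025, 2025],
})
pq.write_table(table, "sales.parquet", compression="snappy")

# Column projection and predicate filtering: only the requested columns are
# read, and row groups whose statistics rule out year == 2025 can be skipped.
subset = pq.read_table(
    "sales.parquet",
    columns=["region", "sales"],
    filters=[("year", "=", 2025)],
)
print(subset.to_pydict())  # {'region': ['EU', 'APAC'], 'sales': [99.9, 410.0]}
```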
The development of lakehouse architectures is due to the natural limitations inherent in raw data lakes—namely, their lack of sufficient data management and lack of Atomicity, Consistency, Isolation, and Durability (ACID) compliance. Lakehouse architectures take advantage of cloud-native infrastructure in providing support for transactions, versioning, incremental schema evolution, and temporal capabilities [3,10].
In addition, lakehouse architectures provide a single platform that combines the scalability features usually found with data lakes and the reliability and ease of management that define data warehouses. Current storage infrastructure enables the use of batch and streaming data, allows for unified read–write operations across multiple data nodes, and provides inherent compatibility with SQL engines, along with Spark [3,10]. Additionally, these storage platforms effectively address latent issues related to large-scale analytical systems, like keeping data fresh, imposing server-level schema constraints, ensuring consistency for update operations, and more. Usage of these storage platforms is rapidly gaining momentum for the building out of analytical pipelines for various industries, like finance, healthcare, advertising, and scientific exploration. Notably, lakehouse platforms, along with cloud-native storage platforms, provide storage-agnostic features that boost interoperability between a heterogeneous set of storage backends like Amazon S3, Google Cloud Storage, Azure Data Lake, and others. This storage-agnostic capability enables multi-cloud adoption, with streamlined integration across multiple environments. Inclusion of federated query engines like Presto, Trino, and Starburst in these platforms provides a storage-oriented foundation design for modern scale-out analytics systems [7,53,54,55].

4. Bridging Integration and Storage

Data mobility and accessibility can be understood by analysing the interplay between multiple mechanisms, such as ETL and ELT pipelines, data virtualisation, and federated SQL engines, and by the compromises involved with schema-on-read and schema-on-write. The overall effects of choosing these tools are the minimisation of redundancy and an increase in agility while maintaining performance and governance levels.

4.1. ETL/ELT Pipelines and Data Virtualisation

In heterogeneous environments, integration depends on efficiently moving and transforming data across diverse storage backends rather than enforcing a single schema upfront. Extract–Transform–Load (ETL) and Extract–Load–Transform (ELT) are the main methods used to enable this type of integration [5,56].
In conventional ETL, data is extracted, then cleaned and transformed to conform to a normalised target schema. Eventually, the data is stored in a main repository, which is most often a relational data warehouse [56]. These practices are common in enterprise business intelligence scenarios, where data quality maintenance, along with conformance with schema specifications, is an essential requirement prior to the data ingestion step. Commercial ETL offerings like Talend, Informatica, and Pentaho have powerful platforms for the design of transformation workflows with rule-based data cleansing, deduplication, and aggregation capabilities [56,57].
Nevertheless, the arrival of storage solutions featuring schema-on-read abilities, along with cloud-native data lakes, has made it possible for the ELT methodology to be used more extensively [3,5]. In such a context, unstructured or loosely structured data is first stored in a horizontally scalable data repository like Amazon S3, Hadoop Distributed File System (HDFS), or Azure Data Lake, then processed at query time or at a subsequent point in the processing pipeline. This approach increases flexibility and data ingestion rates, especially in scenarios involving streaming data, frequently changing schemas, or exploratory analysis. Both ETL and ELT practices require careful management of metadata, schema versioning, and data lineage tracking so that transformations remain repeatable, auditable, and compliant with governance requirements.
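The following is a minimal, generic ELT sketch under illustrative assumptions (a local directory stands in for cloud object storage, and the event schema is hypothetical): raw records are landed unchanged and only shaped into an analysis-ready form at a later consolidation step.

```python
import json
from pathlib import Path

# Extract-Load: land raw, semi-structured events unchanged in cheap object/file
# storage (a local directory stands in for S3/HDFS/Azure Data Lake here).
raw_zone = Path("lake/raw/events")
raw_zone.mkdir(parents=True, exist_ok=True)
events = [
    {"user": "u1", "type": "click", "ts": "2025-03-01T10:15:00Z"},
    {"user": "u2", "type": "purchase", "ts": "2025-03-01T10:16:30Z", "amount": 19.9},
]
(raw_zone / "batch_001.json").write_text(json.dumps(events))

# Transform (deferred): shape the raw records into a curated, analysis-ready
# form only when they are consolidated, keeping ingestion fast and schema-tolerant.
def transform(batch):
    for event in batch:
        yield {
            "user_id": event["user"],
            "event_type": event["type"],
            "event_time": event["ts"],
            "amount": event.get("amount", 0.0),  # tolerate missing fields
        }

curated = list(transform(json.loads((raw_zone / "batch_001.json").read_text())))
print(curated)
```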
Declarative transformation pipelines like Apache NiFi, Airbyte, dbt (data build tool), and Dagster offer features for automated testing, distribution, and scheduling. Their integration with cloud storage infrastructure, version control systems like Git, and distributed computation environments increases their contribution to modern data integration solutions [57]. Data virtualisation represents an optional or complementary alternative to traditional ETL/ELT practices, using a logical abstraction layer for the integration of heterogeneous data sources without pre-transfer or pre-transformation phases. Denodo, Red Hat JBoss Data Virtualisation, and Dremio are solutions using abstraction layers to provide unified views over heterogeneous back-end systems through federated query engines. These solutions integrate relational databases, NoSQL databases, REST APIs, and file storage within a semantic modelling context, thereby allowing end users to query the combined data using standard SQL, with the platform automatically managing data access tasks in the background [6,58]. To address heterogeneous data pipelines, one may use ETL for warehouses, ELT on cloud storage, or logical data virtualisation. Table 8 compares these approaches and tools. Virtualisation minimises redundancy, boosts query response efficiency in today’s data environments, and allows for the integration of multiple analytical models. On the downside, it can cause performance loss because of latency issues, data inconsistencies, or the unavailability of data sources at query time. For these reasons, hybrid infrastructures often employ virtualisation for on-demand querying functionality, complemented by ETL/ELT practices for optimal structured data management.

Decision Guidance: ETL vs. ELT vs. Virtualisation

ETL is preferable when conformance and upstream data quality must precede analytical use (e.g., governed marts and slowly evolving schemas) [56]. ELT is effective for high-volume, schema-evolving feeds landing in open columnar storage. Enforcement is deferred to query time or periodic consolidation in the lakehouse, leveraging vectorised execution engines [51,59]. Virtualisation/federation is appropriate where duplication is undesirable or sources must remain authoritative. Performance hinges on connector-aware pushdown and cost-guided planning [55,58]. In practice, hybrid deployments persist frequently used aggregates while federating the long tail, with promotion from late-bound to enforced schemas once contracts stabilise (see Table 8 and Figure 5).

4.2. Federated Querying and Unified Query Engines

Federated query systems are central to storage integration. Federated systems allow the end user to run analytical queries on different heterogeneous data sources through a single unified interface. As opposed to traditional integration methods based on materialised data warehouses, federated systems operate under a query-time integration model, which allows for synchronous and dynamic integration of multiple sources of data [6,55].
Prominent federated query engines include Presto, Trino (formerly PrestoSQL), Apache Drill, and Starburst Enterprise. These engines provide a unified SQL-focused querying interface by interacting with multiple relational databases, NoSQL databases, cloud object storage services, and distributed file systems through connectors [7,55]. The resulting queries can carry out joins, groupings, and filtering across several systems, such as MySQL, MongoDB, Hive, and S3, without the need for data relocation or pre-loading.
Such engines are defined by high-tech characteristics such as the following:
  • Cost-centric query optimisation over various backend systems;
  • Predicate pushdown, allowing filtering operations to be executed near their corresponding data sources;
  • Parallel execution plans, enabling distributed scalability;
  • Connector extensibility for custom data-source support.
Figure 5 shows a federated SQL engine orchestrating connectors, predicate pushdown, and parallel execution across heterogeneous sources.
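As a hedged illustration, the sketch below submits a cross-source join to a Trino coordinator via the trino-python-client package; the host, catalogue, schema, and table names are deployment-specific assumptions, and the commented predicate indicates where connector pushdown would typically apply.

```python
import trino

# Assumed deployment: a Trino coordinator at localhost:8080 with a PostgreSQL
# catalogue named "postgresql" and a Hive catalogue named "hive". Requires the
# trino-python-client package.
conn = trino.dbapi.connect(
    host="localhost", port=8080, user="analyst", catalog="hive", schema="default"
)
cur = conn.cursor()

# The WHERE predicate on o.order_date is a candidate for pushdown into the
# Hive/S3 connector, so filtering happens close to the data before the join.
cur.execute("""
    SELECT c.customer_name, SUM(o.amount) AS total_spent
    FROM postgresql.public.customers AS c
    JOIN hive.sales.orders AS o
      ON c.customer_id = o.customer_id
    WHERE o.order_date >= DATE '2025-01-01'
    GROUP BY c.customer_name
""")
for row in cur.fetchall():
    print(row)
```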
To provide a practical comparison, Table 9 benchmarks representative federated engines across optimisation, execution, and scalability features.
In independent YCSB experiments, MongoDB achieves the best overall runtime across diverse CRUD workloads, while scan-heavy workloads favour CouchDB, which also exhibits the strongest thread scale-up among the three evaluated document stores. In contrast, native graph engines optimise adjacency locality (e.g., adjacency lists and, in some designs, direct pointers), so traversal cost is governed by the visited subgraph rather than the total graph size.
The capabilities of these systems have been documented as they have evolved in production deployments [7]. For instance, a financial analyst can leverage Trino to combine reference data from a PostgreSQL table with transactional data stored in Parquet format within an Amazon S3 bucket, along with metadata from a MongoDB collection, without reorganising or duplicating data [55,58]. Federated querying reduces data transfer costs. At the same time, it brings latency issues, along with concerns around data-source availability and maintaining consistency guarantees. In addition, performance is constrained by restrictions specific to certain data sources, exacerbated by a lack of detailed indexing or statistics data. To counter these issues, systems employ different approaches, for example, caching layers, materialised views, or the reuse of query results. Orchestration platforms (for example, Airflow) and catalogue services (for example, AWS Glue and Hive Metastore) are incorporated with federated engines to manage schema definitions and enhance execution effectiveness. These components form the backbone of lakehouses, allowing for analysis at real-time or near-real-time speeds for structured and semi-structured data in formats such as Delta Lake, Apache Iceberg, or ORC [3,10].

4.3. Schema-on-Read vs. Schema-on-Write Trade-Offs

A key design trade-off is choosing between schema-on-write and schema-on-read, which reflect different philosophies for schema management, validation, and integration.
In schema-on-write, data is required to conform to a predefined schema before being ingested into the system. This approach, prevalent in traditional RDBMS and data warehouses, enforces strong consistency, data quality, and semantic clarity. It facilitates indexing, integrity checks, and query optimisation. However, it also introduces rigidity, requiring costly transformation steps and limiting the ability to incorporate evolving or irregular data sources quickly [56].
On the other hand, the schema-on-read approach postpones schema definition until query execution. This approach is commonly used with data lakes and semi-structured storage implementations using formats like JSON or Avro, thereby enabling greater flexibility and faster data ingestion [5]. It allows raw or loosely structured data to be stored while deferring the application of the schema, thereby enabling analysts and researchers to perform analyses that require little preprocessing. This method is particularly useful in scenarios like experimental studies, ad hoc analysis, and rapid prototyping.
However, schema-on-read implementation is very challenging for the following reasons:
  • Variability in schema interpretation across queries or individuals;
  • Complex transformations shift to query time, where they gain prominence;
  • Metadata management for large datasets with dynamically changing schemas requires scrupulous analysis;
  • Query performance can degrade as a result of overly extended binding times.
Proposed solutions involve hybrid approaches tackling the fundamental trade-offs inherent in data management systems, as well as improving the effectiveness of delta computation. Hybrid lakehouse architectures combine the schema adaptability of schema-on-read with properties commonly seen in data warehousing environments, including schema enforcement, version management, and governance [3,10,59]. Solutions such as Delta Lake and Apache Iceberg provide schema evolution and enforcement, data type handling, and partition pruning, in addition to supporting the manipulation of raw and semi-structured data. One key improvement in this area is the use of metadata-driven schemas, wherein data conforms to automatically deduced schemas constructed with the help of tools such as Great Expectations, Deequ, or DataHub.
These schemas evolve over time and can be used to create templates for subsequent verification or transformation processes, thereby avoiding enforcement during data ingestion. In practice, instance-level ambiguities such as locale-dependent dates (e.g., “01/12/2025”) are handled by preferring schema-on-write conformance (canonical ISO at ingestion) and, where infeasible, by a hybrid read-time normaliser with explicit lineage and quality expectations.
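In the spirit of such expectation-driven tools (without relying on any specific library API), the following plain-Python sketch shows how declarative, metadata-derived expectations can yield conformance metrics for the catalogue; the column names and rules are illustrative.

```python
import re

# Declarative expectations inferred/curated from metadata; each rule is a
# predicate over a column value. Column names and rules are illustrative.
EXPECTATIONS = {
    "event_time": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,  # ISO 8601 date
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def conformance_report(rows):
    """Return the fraction of values passing each expectation, for catalogue metrics."""
    report = {}
    for column, rule in EXPECTATIONS.items():
        values = [row[column] for row in rows if column in row]
        passed = sum(1 for v in values if rule(v))
        report[column] = passed / len(values) if values else None
    return report

rows = [
    {"event_time": "2025-12-01", "amount": 19.9},
    {"event_time": "01/12/2025", "amount": -3.0},  # violates both expectations
]
print(conformance_report(rows))  # {'event_time': 0.5, 'amount': 0.5}
```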
In summary, the decision between schema-on-write and schema-on-read need not be seen as a straightforward binary. Rather, it is guided by a number of considerations, such as the requirements of the application in question, the nature of the workload, the factors that contribute to data variability, and governance requirements. Systems that leverage a hybrid strategy and can transition smoothly between the two approaches have been seen as having flexibility and continued effectiveness over time.

5. Metadata, Lineage, and Semantic Interoperability

Metadata is the glue that holds together discoverability, governance, and trust, supported by catalogues and lineage tools, while ontologies and semantic enrichment define meaning across sources. Altogether, they form a basis for reproducibility, auditability, and deeper integration at scale.

5.1. Metadata Management Frameworks (e.g., Apache Atlas and DataHub)

The growth of large and complex data ecosystems has brought with it increased recognition of the value of metadata, that is, data describing other data, as a key factor for integration, discoverability, governance, and access control. Several metadata catalogue systems (Table 10) provide scalable repositories for technical, operational, and business metadata. In many analytical applications, keeping metadata coherent and current is important because this helps mitigate problems related to schema discrepancies, transformations, and semantic heterogeneity [5,61,62].
Modern metadata standards outline a unified structure that integrates technical metadata (schemas, types, formats, and partitions), operational metadata (recency, lineage, and access history), and business metadata (descriptions, ownership, and usage policies). Numerous mature open-source offerings like Apache Atlas, LinkedIn DataHub, and Amundsen offer horizontally scaled architectures that prove highly efficient for the intake, storage, and querying of metadata from diverse origins [5,63].
The platforms also support the integration of data-processing engines like Apache Spark and Airflow, with storage options like Hive, S3, and Delta Lake, along with governance features like Role-Based Access Control (RBAC) and audit logging. Additionally, these platforms support Application Programming Interfaces (APIs) and user interfaces built specifically for data engineers, analysts, and data stewards. Some of the most important features that significantly impact discoverability and interpretability of data include tagging, searchability, lineage visualisations, and automatic classification of data [62,63]. A centralised metadata repository greatly reduces redundancy, encourages the reuse of schemas, and enables governance for schema development. Additionally, the contribution of metadata is gaining wider prominence in data quality evaluation, audit activities related to regulatory compliance checking, and impact assessment in highly regulated industries like healthcare and financial services. As various datasets are being brought together with increasingly mature methods of integration, metadata systems become even more important for consistency and auditability across the data ecosystem [5].

5.2. Ontology-Based Integration and Semantic Enrichment

Together with structural metadata, semantic metadata clarifies what data elements mean and how they relate to each other, thereby enabling the integration of disparate and distributed systems. Ontologies formally describe domain concepts and relations, providing a foundation for semantic interoperability.
Ontologies become key tools within data integration spaces, enabling schema mapping, entity reference disambiguation, and context alignment. For example, a single entity can be represented with different terminologies within two datasets, for instance, “client” or “customer”, or appear through aggregated data at differing hierarchical levels, for instance, “monthly income” or “annual earnings”. Schema elements can be related to ontology classes using RDF or OWL. These links allow systems to infer relationships such as equivalence, subsumption, or aggregation.
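A minimal sketch of such links using the rdflib library is shown below (assuming rdflib is installed; the namespaces and terms are illustrative): two source attributes are bound to a shared ontology concept, and income-related properties are related through subsumption.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

# Illustrative namespaces: EX models a shared domain ontology, SRC_A and SRC_B
# model two source schemas whose attributes are bound to ontology concepts.
EX = Namespace("http://example.org/ontology/")
SRC_A = Namespace("http://example.org/sourceA/")
SRC_B = Namespace("http://example.org/sourceB/")

g = Graph()
# "client" in source A and "customer" in source B denote the same concept.
g.add((SRC_A.client, OWL.equivalentClass, EX.Customer))
g.add((SRC_B.customer, OWL.equivalentClass, EX.Customer))
# Income-related attributes are modelled as specialisations of a shared property.
g.add((SRC_A.monthly_income, RDFS.subPropertyOf, EX.income))
g.add((SRC_B.annual_earnings, RDFS.subPropertyOf, EX.income))

print(g.serialize(format="turtle"))
```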
The use of an ontological basis is widely applied in many scientific domains, including bioinformatics, as the Gene Ontology; environmental science, as embodied by the Semantic Web for Earth and Environmental Terminology (SWEET) ontologies; and cultural heritage, as expressed in the International Committee for Documentation Conceptual Reference Model (CIDOC CRM) [64,65,66]. In business, application domain-specific terminologies, e.g., the Financial Industry Business Ontology (FIBO) for financial services and Health Level Seven (HL7) for healthcare, are crafted to enhance data standardisation and interoperability [67,68].
Entity-linking techniques, combined with text annotation and knowledge graph enrichment, enable dynamic semantic enrichment, i.e., the process by which information is contextualised with external knowledge bases such as Wikidata, DBpedia, and domain-specific taxonomies [69,70]. Such enrichment increases data discoverability, supports querying through natural language interfaces, and enables reasoning across diverse and heterogeneous datasets. Despite their promise, ontology-based solutions face a variety of challenges, including ontology alignment difficulties, high computational costs, and reluctance to adopt them in settings with insufficient expertise.
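As a minimal sketch of dynamic semantic enrichment, the following uses the public Wikidata search API (wbsearchentities) through the requests library to retrieve candidate external identifiers for a schema term; the term, limit, and result handling are illustrative.

import requests

def link_term_to_wikidata(term: str, language: str = "en", limit: int = 3):
    # Look up candidate Wikidata entities for a free-text term.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": term,
            "language": language,
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"id": hit["id"],
         "label": hit.get("label"),
         "description": hit.get("description")}
        for hit in resp.json().get("search", [])
    ]

# Enrich a schema term or tag with external identifiers.
print(link_term_to_wikidata("customer"))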
Recent ontology alignment frameworks pair retrieval with LLM prompting to raise unsupervised matching quality while curbing calls to the model. MILA reports the top F-measure on multiple OAEI tasks and a lower runtime through a prioritised search pipeline [17]. Complementarily, LLMs4OM explores zero-shot and representation-aware prompting for ontology matching across diverse ontology views, illustrating how foundation models can support semantic interoperability [18].
Technological advances such as lightweight ontologies, schema annotation, and semantic data stores can help increase the scalability and applicability of semantic integration.

5.3. Lineage Tracking and Schema Evolution Handling

Data lineage traces the full life cycle of a dataset, from source to processing to final use. Within integrated environments, lineage analysis explains the data origin, validates its integrity, and detects anomalies in data pipelines. It also supports compliance with regulatory requirements, such as the “right to be forgotten” imposed by the General Data Protection Regulation (GDPR); improves reproducibility; and supports collaboration [23,62].
Lineage can be captured at several levels of granularity:
  • Table-level lineage describes the relationship between datasets, such as how table A is created from tables B and C.
  • Column-level lineage provides precise mappings that show how individual columns are derived through transformation processes.
  • Code-level lineage links transformations to the specific jobs or scripts that produced them, together with their execution environments and parameter settings.
Tools like Marquez, OpenLineage, and Apache Atlas offer fine-grained APIs and query frameworks for capturing and querying lineage metadata as data moves between systems. These tools integrate with pipeline orchestrators such as Dagster and Airflow, providing better visibility into data movement across platforms [5,63]. Figure 6 provides an overview of lineage tracking and schema evolution, connecting dataset lifecycle stages with lineage levels, supporting tools, and schema evolution mechanisms.
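As a minimal sketch of table-level lineage capture, the event below follows the general shape of an OpenLineage run event and is posted to a Marquez-style collection endpoint; the namespaces, job name, producer URI, and endpoint URL are placeholders rather than values taken from a specific deployment.

import datetime
import uuid
import requests

# An OpenLineage-style run event recording that analytics.table_a
# is produced from raw.table_b and raw.table_c.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "etl", "name": "build_table_a"},
    "inputs": [
        {"namespace": "warehouse", "name": "raw.table_b"},
        {"namespace": "warehouse", "name": "raw.table_c"},
    ],
    "outputs": [{"namespace": "warehouse", "name": "analytics.table_a"}],
    "producer": "https://example.org/etl",  # placeholder producer URI
}

# Placeholder endpoint for a lineage backend that accepts such events.
requests.post("http://localhost:5000/api/v1/lineage", json=event, timeout=10)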
A key issue at this level is schema evolution, i.e., changes to the data structure over time, such as adding new columns, changing data types, or updating column-naming conventions. In conventional static ETL frameworks, such changes often break processing or introduce inconsistencies. To support schema evolution, systems must do the following:
  • Detect and document schema changes;
  • Validate compatibility across pipeline stages;
  • Enable both backward- and forward-compatible reading.
Tracking historical changes has become far more practical with columnar storage formats such as Parquet, complemented by transactional table formats like Apache Iceberg and Delta Lake. These technologies provide the necessary flexibility across environments while preserving reliability and reproducibility throughout the process [2,3,10].
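A minimal sketch of tolerant schema evolution, assuming a Spark session configured with the Delta Lake package (the path and column names are illustrative): the second batch introduces a new column, and schema merging lets the table absorb it instead of failing.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("schema-evolution-demo")
    # Assumes the delta-spark package is available to the session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Initial write with two columns.
spark.createDataFrame([(1, "EUR")], ["order_id", "currency"]) \
    .write.format("delta").mode("overwrite").save("/tmp/orders_delta")

# A later batch arrives with an additional column; mergeSchema lets the
# table schema evolve rather than rejecting the write.
spark.createDataFrame([(2, "USD", 19.99)], ["order_id", "currency", "amount"]) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/orders_delta")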
At a deeper level, the elements of metadata, lineage, and semantic context become foundational cornerstones for the intelligent integration platform. These aspects outline the essential guidelines that empower systems to explore relationships between the data, correct inaccuracies, conduct automatic transformations, and improve the frameworks for governance [5,71].

6. Performance, Scalability, and Consistency Challenges

The problems involved in operating over varying backends can be broken down for analysis. These include cost-based query optimisation strategies, materialisation and caching, and the real-world implications of freshness and eventual consistency. This section provides insight into the mechanisms and trade-offs that govern performance in realistic situations.

6.1. Query Optimisation Across Heterogeneous Sources

Federated query execution poses complex optimisation challenges due to the diversity of underlying sources. Compared with traditional monolithic databases, which provide the optimiser with full control over schemas, statistical information, and execution plans, federated and multi-model databases face challenges involving stale metadata, diverse query capability, and non-uniform performance characteristics across a heterogeneous set of sources [7,14].
One of the key challenges is deriving cross-source query plans that minimise data transfer while leveraging local computational resources. For example, a naive join between a local relational database and a remote NoSQL store can generate substantial network traffic if intermediate results are not filtered early. Predicate pushdown, which delegates filters and projections to the underlying systems, is a common optimisation strategy that reduces the data to be transferred. However, its effectiveness depends on whether each source supports the pushed operators and on how well it can index the relevant data [7,60].
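A minimal sketch of predicate pushdown from a Spark session to a JDBC-accessible relational source (connection details and table names are placeholders): the filter attached to the DataFrame is a candidate for pushdown into the source as a WHERE clause, and the same effect can be forced explicitly by embedding the predicate in a source-side subquery.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

jdbc_url = "jdbc:postgresql://db-host:5432/sales"   # placeholder connection

# Spark's JDBC source can push simple filters and projections down to the
# database, so only matching rows cross the network before any join.
orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")
    .option("user", "analyst").option("password", "secret")
    .load()
    .filter("order_date >= DATE'2025-01-01'")        # pushdown candidate
    .select("order_id", "customer_id", "amount")
)

# Equivalent explicit pushdown: embed the predicate in the source query.
recent_orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable",
            "(SELECT order_id, customer_id, amount FROM public.orders "
            "WHERE order_date >= DATE '2025-01-01') AS recent")
    .option("user", "analyst").option("password", "secret")
    .load()
)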
Cost-based optimisation is also complicated by the absence of reliable and consistent statistical data. Selectivity or cardinality estimation becomes difficult in distributed environments, particularly when heterogeneous sources have varying capacities for sampling or provide metadata in different formats. Some systems use heuristic rules or runtime feedback to dynamically adjust their execution plans. Recent work explores learned estimation and optimisation—e.g., learned cardinality estimators and reinforcement learning-based join ordering—to adapt plans under uncertainty [72,73,74].
Engines like Presto/Trino and Apache Drill employ federated optimisers that account for connector-specific capabilities and support adaptive planning but still suffer from slowdowns from remote-source latency, schema mismatches, and transformation overheads [7,60]. Most recent work has explored machine learning optimisers whose performance models are learned and can steer join orders and execution paths [74].

6.2. Materialisation and Caching Strategies

To reduce repeated query costs, federated systems increasingly rely on materialisation and caching to reuse results. These approaches lead to lower latency, reduce the load on source systems, and improve the predictability of performance in exploratory data analysis and dashboard-style analytics use cases [75,76].
Materialised views are precomputed query results that are stored in advance and refreshed at regular intervals in a particular system, typically a data warehouse or analytical repository. They are useful for frequently accessed join paths, filtered aggregates, or derived measures known to be expensive to compute in real time. However, materialised views incur storage overhead, require consistency management, and can become stale if not refreshed periodically [75].
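A minimal sketch of defining and refreshing such a view, assuming the trino Python client and a catalogue whose connector supports materialised views (e.g., Iceberg); host, schema, and table names are illustrative.

import trino

conn = trino.dbapi.connect(
    host="trino-coordinator", port=8080, user="analyst",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# Precompute an expensive join/aggregation once; dashboards then read the
# stored result instead of recomputing it on every refresh.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
    SELECT o.order_date, c.region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY o.order_date, c.region
""")
cur.fetchall()

# Periodic refresh keeps the view from becoming stale.
cur.execute("REFRESH MATERIALIZED VIEW daily_revenue")
cur.fetchall()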
In query engines such as Presto/Trino or Dremio, result caching stores the output of recently executed queries in memory or on solid-state drives (SSDs). This technique significantly reduces the computation cost of similar queries. Caching intermediate outputs by reusing subquery results is very effective in scenarios with high query overlap, such as business intelligence dashboards and multi-tenant environments. Related systems show how to persist and reuse intermediate results and sub-jobs effectively [7,77].
Maintaining metadata on schema definitions, column statistics, and access patterns is key to performance improvement: it accelerates query planning and reduces the overhead of repeated schema discovery typical of querying file-based stores such as Parquet or JSON. Despite these advantages, caching methods must navigate a difficult balance between data timeliness requirements, the associated storage costs, and cache invalidation problems. Datasets undergoing perpetual and dynamic change, especially those updated in real time by streaming or modified externally through transactions, require careful synchronisation protocols. Modern engines also support materialised federation and incremental maintenance to balance fast availability of cached views with on-demand flexibility [75,76].

6.3. Data Freshness and Eventual Consistency Issues

Across integrated systems comprising a range of data stores, including relational databases, NoSQL databases, files, APIs, and streaming systems, keeping the data fresh and consistent is a key challenge. Analytical workflows often rely on data ingested or processed asynchronously, which causes the different sources to become temporally misaligned [78].
Data freshness refers to how closely the consolidated view reflects the current state of the source datasets. For periodic-batch ETL pipelines, freshness depends on the extraction frequency and the resulting update lag. In streaming or near-real-time systems, it depends on factors such as ingestion latency, event-processing delay, and the effectiveness of checkpointing. Robust watermarks and event-time semantics are important to quantify and bound lateness [78,79]. Users often have to trade off low latency against system stability, especially in pipelines with complex transformations or downstream dependencies.
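A minimal sketch of bounding lateness with event-time watermarks, assuming Spark Structured Streaming with the Kafka connector available; the broker, topic, and threshold values are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("freshness-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "orders")                       # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS payload",
                "timestamp AS event_time")
)

# The watermark bounds how late events may arrive before being dropped,
# making freshness explicit while keeping aggregation state bounded.
hourly_counts = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "1 hour"))
          .count()
)

query = (hourly_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())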
In distributed or federated scenarios, eventual consistency means that changes to the data are not immediately reflected across all instances or replicas. It is frequently observed with NoSQL stores such as DynamoDB or Cassandra and in architectures involving asynchronous replication or microservices [44,80]. An update to an order’s status in a specific service, for example, will not be immediately reflected on a user analytics screen, nor will it be simultaneously available together with customer details obtained from a separate repository.
Consistency issues are amplified by schema changes over time, out-of-order data delivery (particularly in streaming scenarios), and retry protocols that can duplicate or drop events. For integrated systems to maintain analytical integrity, they must therefore implement deduplication, temporal windowing, out-of-order data handling, and conflict resolution. To address these challenges, more platforms are adopting versioned data architectures such as Apache Iceberg and Delta Lake, which provide capabilities such as time travel, rollback, and reproducible querying [3,10]. Observability and data-quality verification frameworks help monitor freshness and correctness in production data pipelines [81]. Additionally, consistency and freshness should be treated as design principles rather than merely operational issues. Well-performing analytical systems should define Service-Level Agreements (SLAs) for data timeliness, as well as governance patterns that specify acceptable staleness, revision policies, and high-level strategies to build confidence in analytical outcomes. Figure 7 synthesises key performance challenges in heterogeneous analytics and maps them to optimisation levers and mitigating strategies across engines and storage layers.
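A minimal sketch of reproducible querying via time travel, assuming a Delta Lake table written earlier and a suitably configured Spark session; the path, version number, and timestamp are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

path = "/tmp/orders_delta"   # illustrative table location

# Read the table exactly as it was at a given version or timestamp,
# regardless of later updates, for reproducible downstream analysis.
v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2025-08-01 00:00:00")
            .load(path))

print(v3.count(), snapshot.count())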

6.4. Federation Overhead and Performance Tuning

Federated access in data lakes and polystores is concrete and implementation-specific. For example, Ontario and Squerall execute federated query processing over semantic data lakes by decomposing a SPARQL input into subqueries per dataset, then translating each subquery to the target system (e.g., Spark SQL for TSV/HDFS) using dataset profiles and rules. Squerall retrieves from CSV/Parquet, MySQL, Cassandra, and MongoDB through a mediator (high-level ontologies) and ships data via connectors (two implementations: Spark and Presto) before joining into the final result [5].
In a broader integration survey, federated query answering is explicitly defined as “a consistent way of accessing data from sources without duplicating them in a central repository”, achieved “by using sub-queries that target the data sources within the federation and evaluating their results based on predefined rules” [11]. These concrete mechanisms surface where overhead arises: (i) connector capability skew (e.g., which operators can be translated/pushed and with what plan quality), (ii) planning under partial or per-source metadata (Ontario “uses the profiles to generate subqueries” and “uses metadata… to generate optimised query plans” [5]), and (iii) movement/serialisation when subresults are shipped back to the mediator for final assembly.
In practice, the choice of connector matters: the same mediator (Squerall) reports two runtime stacks “with different data connectors: Spark and Presto” [5], anticipating distinct pushdown, transfer, and scheduling behaviour. At the orchestration layer, LLM interfaces increasingly appear in pipelines. However, even strong models show execution gaps in data tasks (e.g., GPT-4 text-to-SQL execution accuracy of 54.89% vs. human 92.96%), which cautions against uncritical delegation of query planning/translation to LLMs [19].
Tuning, in turn, follows those concrete pain points.
  • First, connector-aware pushdown is not optional but infrastructural: Ontario’s use of dataset profiles and Squerall’s mediator mapping illustrate that federation layers must know source capabilities to drive translation and decide where to execute selections/aggregations/joins [5,11].
  • Second, planning with partial statistics can still be effective if the mediator exploits metadata to derive good subquery decompositions and join orders (Ontario “uses metadata … to generate optimised query plans” [5]) and, when available, sampling or progressive execution to refine estimates (see also systematisations in [6]).
  • Third, movement minimisation is a transport problem: the choice of columnar/vectorised paths and batching reduces per-tuple overhead. Contemporary evaluations of columnar runtimes and data paths emphasise the sustained throughput advantages of vectorised processing and columnar layouts for scans and aggregations [1,2].
  • Fourth, materialisation and incremental maintenance mitigate repeated cross-source joins. Rather than fully recomputing federated joins/aggregations, incremental frameworks (e.g., DBSP) maintain views by applying deltas to compiled differential programs, reducing refresh latency and source load in steady state [76].
  • Finally, hybridisation—persisting “hot” integrated slices (lakehouse/warehouse) while federating the long tail—follows the storage–execution split documented across recent lakehouse discussions [5,82].
Table 11 maps these overheads to tuning levers with the precise loci (translation, planning, transfer, and maintenance) where they act.
Case focus (AAS–ECLASS industrial federation). In a manufacturing integration where AAS submodels act as the mediator and ECLASS serves as the external dictionary, the authors of [31] report two very specific performance levers. (i) Blocking to cut candidate space: before pairwise matching, the system narrows candidates via ANN over embeddings (open-source SFR-Embedding-Mistral) with Faiss. In their AAS–ECLASS setting, the dictionary spans 27,423 entries, so blocking is operationally decisive for both compute and downstream join fan-out. (ii) Classifier choice as a speed/accuracy knob: a fine-tuned generative LLM achieves slightly better results, whereas an encoding-based classifier enables much faster inference, and the fine-tuned LLM surpasses BERT variants and GPT-4+ICL on entity-matching benchmarks.
In the ER stage feeding federation, the authors of [15] show that batching demonstrations and questions (BATCHER) is very cost-effective for ER, outperforming both fine-tuned PLMs and manually designed LLM prompting. This directly trims external API overhead and stabilises latency. With respect to LLMs driving orchestration or NLQ, another work [19] documents execution-level gaps (GPT-4 text-to-SQL execution accuracy of 54.89%), arguing for deterministic translation paths or verification stages in production. Together with mediator-level pushdown (Ontario/Squerall) and incremental materialisation (DBSP), these concrete techniques reduce shipped data, avoid misplaced computation, and keep the AAS federation responsive under heterogeneous source capabilities [5,11,76].
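A minimal sketch of the blocking step described above, using Faiss for approximate nearest-neighbour search over embeddings; random vectors stand in for real text embeddings, and the dimensionality and candidate count are arbitrary choices.

import numpy as np
import faiss

dim = 384                                   # typical sentence-embedding size
rng = np.random.default_rng(0)

# Stand-ins for embeddings of dictionary entries (e.g., ECLASS properties)
# and of source attributes to be matched (e.g., AAS submodel elements).
dictionary_vecs = rng.standard_normal((27423, dim)).astype("float32")
query_vecs = rng.standard_normal((100, dim)).astype("float32")

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(dictionary_vecs)
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)
index.add(dictionary_vecs)

# Blocking: keep only the top-k nearest dictionary entries per attribute,
# so the expensive pairwise matcher sees k candidates instead of 27,423.
k = 10
scores, candidate_ids = index.search(query_vecs, k)
print(candidate_ids.shape)   # (100, 10)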

7. Applications and Case Studies

This section grounds the concepts in practice (applications), including enterprise lake/lakehouse deployments, scientific data integration, and public sector pipelines. It highlights domain-specific constraints (e.g., governance and standards) and architectural patterns that recur, showing how the reviewed methods translate into outcomes. Table 12 presents a cross-walk from the taxonomy to the three abovementioned application domains, indicating for each taxonomy element where it is instantiated (“where used”) and where it is analysed in the text (“where discussed”). This makes the taxonomy operational and allows readers to locate concrete occurrences and the corresponding discussion quickly.

7.1. Enterprise Data Lakes and Lakehouses

In large companies, the need for the consolidation of heterogeneous internal and external data silos has compelled the large-scale adoption of data lakes and lakehouse platforms. Traditional data warehouses are limited by rigid schemas, high scaling costs, and tedious loading. Data lakes, on the other hand, provide for the integration of raw, semi-structured, and structured data from various business areas, such as customer relationship management systems, transaction logs, sensor data, and external service providers, into horizontally scaled object storage solutions like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage [5].
To move beyond simple data storage, an increasing number of organisations are adopting lakehouses, which are unified platforms that combine the flexibility of data lakes with data warehouse functionality such as ACID transactions, time-travel capabilities, and schema enforcement. Products like Delta Lake [3], Apache Hudi, and Iceberg [10] enable end-to-end querying, incremental updates, and support for schema evolution over large datasets [13].
One such example is a worldwide retail company that aggregates sales, inventory, customer feedback, and supply chain data from over 50 systems within an enterprise-wide analytical framework. Utilising Apache Spark for distributed computation, Presto [7,55] for federated querying, and Apache Atlas for metadata management, the company enables batch and real-time analytics with traceability and governance across multiple regions and departments.

7.2. Scientific Data Integration

Scientific fields often operate at the frontier of data integration, which requires combining heterogeneous datasets across spatial, temporal, semantic, and modality dimensions. In fields such as the life sciences, environmental science, physics, and the social sciences, researchers need to integrate a range of sources, including experimental observations, sensor readings, outputs of simulation software, domain ontologies, and research articles. These sources tend to vary in format, granularity, semantic content, and update frequency, presenting major integration challenges for subsequent use.
To address these challenges, modern scientific data infrastructure increasingly relies on semantic integration, standardised metadata frameworks, and distributed computing systems [83,84]. Ontologies and controlled vocabularies are used to ensure conceptual consistency across datasets, thereby improving the accuracy with which linked variables are aligned and aggregated. Tools such as ontology mapping engines, semantic data catalogues, and knowledge graphs are used continuously to interlink data across repositories, instruments, and organisations.
In addition, many disciplines have adopted modular, workflow-based systems for data ingestion, annotation, and transformation. These systems support reproducible analyses, versioning, and shared curation, which are key attributes in domains where datasets change over time and involve large, distributed user bases. The widespread adoption of cloud-native storage formats (e.g., Parquet and Zarr [85,86]), spatio-temporal indexing, and interoperable Application Programming Interfaces (APIs) supports more scalable querying and integration of data from a wide variety of structured and unstructured sources.
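A minimal sketch of cloud-native chunked array storage with Zarr (the array shape, chunking, attributes, and local store path are illustrative; object-store backends can be substituted for the local directory).

import numpy as np
import zarr

# Chunked, self-describing array storage suited to parallel and
# cloud-native access: each chunk is stored as an independent object.
arr = zarr.open(
    "climate_demo.zarr", mode="w",
    shape=(365, 720, 1440), chunks=(1, 180, 360), dtype="f4",
)
arr.attrs["variable"] = "surface_temperature"   # embedded metadata
arr.attrs["units"] = "K"

# Write one daily slice; readers can later fetch individual chunks on demand.
arr[0, :, :] = np.random.default_rng(0).standard_normal((720, 1440)).astype("f4")
print(arr.info)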
Data integration pipelines enable many uses, including genomic discovery, climate modelling, astronomy, and disease surveillance. The generated results are often determined not only by the volume or speed of data but also by semantic matching effectiveness, provenance annotation, and contextualisation of relevance to a particular field—highlighting the essential need for strong, flexible integration systems to fuel scientific understanding.

7.3. Cross-Domain Data Pipelines in Public Sector Analytics

Public sector organisations are increasingly implementing standardised data platforms to enable evidence-based policymaking, improve the management of service delivery, and ensure transparency. An important feature of analytics in public sector organisations is the need to consolidate data from diverse silos across the organisation, such as education, health, employment, tax, and mobility.
For instance, a national statistical office might bring together data from censuses, hospitalisation rates, school performance measures, and social programme enrolment rates in order to understand disparities or design targeted interventions. Data sources can have varying identifiers, structures, update frequencies, and legally mandated access restrictions. Data integration methods must be sensitive to anonymisation requirements, potential record-linkage errors, and auditability needs, which are often governed by strict data protection legislation (e.g., GDPR) [87].
To support effective integration pipeline management, a range of tools such as OpenRefine [88], CKAN, and custom data warehouses with lineage traceability and role-based access controls are used. Additionally, using a semantic standard such as the Statistical Data and Metadata Exchange (SDMX) [89], together with linked data-based methodologies supports consistency of definitions across different agencies.
Interoperability with open data portals, city virtualisations, and cross-agency dashboards increasingly relies on integration platforms with real-time capabilities. Such platforms combine dynamic datasets, such as traffic flow patterns, energy consumption, and pollution levels, with stable statistical indicators. These conditions highlight the imperative to develop data management policies that operate across technical, institutional, and legal levels, pursuing a balance of interoperability, governance, and scalability.
Figure 8 provides an architectural comparison of integration strategies across eight layers, showing how they are instantiated across enterprise, scientific, and public sector pipelines.

8. Future Directions

Future directions point towards analytically sound frameworks, metadata-rich and self-describing workflows, and AI-enriched integration methods such as automatic mapping, entity-relationship modelling, and semantic reasoning. These can be leveraged to minimise manual effort while maintaining human supervision. This section establishes a strategic blueprint for producing more composable, governable, and intelligent data platforms.

8.1. Towards Unified Analytical Fabrics

Ongoing innovation in integration and storage demands unified analytical frameworks that enable transparent access, governance, and processing of data assets. A major goal of these frameworks is to converge the benefits of data warehouses, data lakes, and operational stores. This is achieved through a common query and metadata layer that is independent of the data format, geographic distribution, and movement rate [13,14,90].
Future analytical fabrics will provide hybrid execution models that combine batch, streaming, and federated operations. They will employ declarative metadata models for automatic workflow configuration and improved runtime performance. Rather than building monolithic platforms, organisations will move towards the adoption of composable architectures with modular components for data ingestion, cataloguing, transformation, and access control, connected through open standards and APIs [13,14].
Commercial products and open-source projects, including Dataplex (Google), Data Fabric (IBM), and LakeFS, are at the forefront of this space. Future frameworks will evolve further in terms of features, including data lineage tracing, fine-grained access control, real-time monitoring, and multi-cloud capabilities. These capabilities will be of particular significance in decentralised organisations and collaborative institutions, which require seamless and secure integration across geopolitical and technical boundaries [10].

8.2. Metadata-Driven and Self-Describing Pipelines

To counter the intrinsic brittleness and labour-intensive character of existing integration processes, the future for data management hinges on pipelines that are both self-describing and metadata-driven. Such next-generation pipelines can autonomously infer, propagate, and adapt to schema and data profile changes using integrated metadata and declarative policies [83,91].
In this context, metadata moves beyond being an auxiliary resource towards being an essential element of end-to-end pipeline design. Pipeline construction becomes schema-aware, context-adaptive, and version-controlled. Tools like dbt, Apache Hop, and Tecton enable developers to express partially declarative pipeline definitions that adapt to source schemas, data quality measures, and business rule logic [91].
Self-describing data formats such as Parquet (which embeds its schema), Avro, and Arrow make this progress possible by enabling pipelines to analyse and validate data dynamically at runtime. Prior evaluations of columnar file formats reveal trade-offs between Parquet and ORC. Additionally, the development of data contracts, i.e., structured agreements between producers and consumers that specify structure, semantics, and SLAs, reinforces this trend [2,41,92].
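A minimal sketch of runtime validation against an embedded schema, assuming pyarrow and a contract expressed as a plain dictionary; real data contracts typically also cover semantics, quality thresholds, and SLAs.

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small file so the example is self-contained.
table = pa.table({"order_id": pa.array([1, 2], pa.int64()),
                  "amount": pa.array([9.5, 12.0], pa.float64())})
pq.write_table(table, "orders.parquet")

# The schema travels with the file and can be inspected without reading data.
schema = pq.read_schema("orders.parquet")

# Illustrative data contract: required columns and expected types.
contract = {"order_id": pa.int64(), "amount": pa.float64()}

violations = []
for name, expected in contract.items():
    idx = schema.get_field_index(name)
    if idx == -1:
        violations.append(f"missing column: {name}")
    elif schema.field(idx).type != expected:
        violations.append(f"type mismatch on {name}")

print("contract violations:", violations)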
In the future, automated testing, schema drift detection, lineage impact estimation, and semantic reconciliation are expected to be built-in capabilities of data pipelines. That shift should promote reuse, modularity, and resilience—the key properties to enable maintainability in complex analytical frameworks for a long duration [83,91].

8.3. AI-Assisted Integration and Auto-Schema Mapping

One of the most high-profile and challenging areas of exploration is the application of AI to automate data integration processes. Schema mapping, ER, and the definition of transformation rules remain labour-intensive, error-prone, and difficult to scale. Natural language modelling, representation learning, and foundation models have opened new possibilities for contextually informed, intelligent assistance with integration processes [90].
AI-based tools can suggest field mappings, design transformation scripts, and identify semantic relationships by analysing schemas, instance values, and external knowledge graphs. Some specific tools like AutoMapper, Google Cloud Dataprep, and those using OpenAI Codex can execute interactive mapping and transformation according to natural language commands [90].
Furthermore, embedding-based matching methods like BERT and graph embeddings offer effective solutions for harmoniously reconciling heterogeneous schemas, especially in applications involving inconsistent labelling or heterogeneous data forms. Additionally, the integration of active learning within an interface incorporating human feedback can improve mapping accuracy through user contributions [93,94]. From a pipeline perspective, LLMs are increasingly positioned as programmable interfaces for data pipelines, synergising with KGs, XAI, and AutoML to mediate discovery, transformation, and governance [19]. Early orchestration prototypes further show LLM-assisted DAG synthesis for data enrichment pipelines [20], indicating a path from point tools to agentic, metadata-aware integration flows.
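A minimal sketch of embedding-based column-name matching, assuming the sentence-transformers package and a small general-purpose model; the column names and the acceptance threshold are illustrative.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose model

source_cols = ["cust_name", "dob", "annual_earnings"]
target_cols = ["customer_full_name", "date_of_birth", "yearly_income"]

src_emb = model.encode(source_cols, convert_to_tensor=True)
tgt_emb = model.encode(target_cols, convert_to_tensor=True)

# Cosine similarity between every source/target column-name pair.
sims = util.cos_sim(src_emb, tgt_emb)

for i, s in enumerate(source_cols):
    j = int(sims[i].argmax())
    score = float(sims[i][j])
    if score > 0.5:                                # illustrative threshold
        print(f"{s} -> {target_cols[j]} (similarity {score:.2f})")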
As foundation models evolve for structured data manipulation, future systems will ingest heterogeneous datasets and identify their internal structure and semantics. They will automatically generate integration frameworks, quality assessments, and domain-specific interpretations. This will revolutionise insight extraction and greatly lessen the challenges involved in complex data integration projects [90].
However, concerns of explainability, control, bias, and governance remain. Artificial intelligence design should be conceived to augment human capacity and not replace humans—incorporating facets of transparency, auditability, and the ability to allow human intervention across all automated decision-making systems.

8.4. Ethical and Regulatory Directions

Beyond the challenges inherent to architecture and performance, future integration systems will also need to incorporate ethical and regulatory considerations into their design. With respect to data privacy and security, Regulation (EU) 2016/679 (GDPR) provides a critical framework, requiring data processors and controllers to adhere to accountability standards, respect data subject rights (e.g., erasure and portability), and comply with purpose limitation in all processing operations, including integration [95]. Moreover, the newly ratified EU Artificial Intelligence Act (Regulation (EU) 2024/1689) provides mandatory rules for AI systems, requiring explainability, risk assessment, and human oversight, which are of particular relevance to integration involving learned or generative schema matching or entity resolution techniques [96]. In addition, domain-specific laws (e.g., HIPAA for healthcare and PSD2 for financial services) place special limits on the use and sharing of integrated data.
From both a technical and an ethical standpoint, there are serious risks of bias propagating through integration pipelines: for instance, schema matching or entity resolution driven by a model learned from a biased dataset may inadvertently retain and propagate those biases. Overcoming such challenges requires combining explainability and verification with human oversight and auditable mechanisms, particularly for transformations produced by automated inference [97]. To build trust and accountability, prospective integration systems should include “governance-aware” capabilities, including fine-grained lineage tracking, audit histories, and human oversight at critical decision points. Such design principles not only bolster transparency and ensure compliance by design but also reconcile technological advancement with legal and societal responsibility, ultimately conjoining interoperability, scalability, and robust governance.

8.5. Limitations of This Review

This work is a structured survey rather than a systematic review. It aims for representative, cross-layer coverage rather than exhaustiveness. New empirical benchmarks were not established. Performance assessments reflect published studies and production reports. The 2015–2025 focus can introduce recency bias and version drift for rapidly evolving engines and connectors. Where feasible, results were triangulated with contemporary surveys. In addition, non-archival/vendor whitepapers were excluded, which may have led to the omission of operational detail. Finally, generalisation across domains is limited, i.e., the examples in Section 7 are indicative rather than comprehensive, and some modalities (e.g., unstructured media) fall outside the scope of this work.

9. Conclusions

The current study explored the evolving dynamics of data integration and storage in analytics systems, with special focus on architectures, tools, and methodologies that target improved performance, scalability, and semantic consistency. A comparative evaluation of primary storage models, such as row-store and column-store systems, NoSQL databases, and lakehouse architectures, was conducted in terms of their suitability for different workloads. Integration patterns were studied in the scenarios of ETL/ELT pipelines, federated query workloads, and metadata-centric orchestration. In addition, the importance of semantic enrichment, data provenance, and schema evolution was highlighted as key enablers for the building of fault-tolerant and traceable data pipelines. Moreover, major challenge areas related to query optimisation, caching mechanisms, data freshness, and consistency were discussed, in addition to real-world applications in enterprise, scientific, and public sector scenarios.
To build future-ready analytical infrastructure, practitioners should adopt modular, metadata-driven architectures; leverage open standards; and invest in governance-aware integration. Schema flexibility must be balanced against lineage tracking and reproducibility, and performance tuning has to consider storage configuration along with federation overhead. Semantic interoperability and AI-powered toolsets will play an increasingly important role in reducing integration costs and enabling streamlined, self-adaptive pipelines. Data teams can ensure scalability, agility, and fault tolerance within complex analytical environments by aligning technical design with organisational needs and regulatory constraints.
Finally, this work offers a broad and multifaceted overview that connects integration methods to storage solutions. It combines a comparative perspective, highlighting trade-offs among ETL/ELT, virtualisation, and federated pushdown, with a metadata- and lineage-focused perspective that turns performance and consistency controls into practical design tools across domains. This combination is conceived as a decision aid for constructing governed hybrid pipelines that can sustain performance and reproducibility in the face of rapidly changing schemas and workloads. In contrast to prior surveys that concentrated separately on lakes, federation, or lakehouses, this contribution unifies integration mechanisms with storage and governance under actionable, reproducible workflows, augmented by the latest AI-assisted methods.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The author would like to thank the anonymous reviewers for their constructive feedback, which improved the paper substantially.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACID: Atomicity, Consistency, Isolation, and Durability
AI: Artificial Intelligence
API: Application Programming Interface
AWS: Amazon Web Services
CIDOC CRM: International Committee for Documentation Conceptual Reference Model
CPU: Central Processing Unit
ELT: Extract–Load–Transform
ER: Entity Resolution
ETL: Extract–Transform–Load
FIBO: Financial Industry Business Ontology
GDPR: General Data Protection Regulation
HDFS: Hadoop Distributed File System
HL7: Health Level Seven
JSON: JavaScript Object Notation
MDM: Master Data Management
NoSQL: Not Only SQL
OLAP: Online Analytical Processing
OLTP: Online Transaction Processing
ORC: Optimised Row Columnar
OWL: Web Ontology Language
RBAC: Role-Based Access Control
RDF: Resource Description Framework
REST: Representational State Transfer
S3: Simple Storage Service
SDMX: Statistical Data and Metadata Exchange
SIGMOD: ACM Special Interest Group on Management of Data
VLDB: Very Large Data Bases (Conference)
ICDE: IEEE International Conference on Data Engineering
TKDE: IEEE Transactions on Knowledge and Data Engineering
VLDBJ: The VLDB Journal
SLA: Service-Level Agreement
SPARQL: SPARQL Protocol and RDF Query Language
SQL: Structured Query Language
SWEET: Semantic Web for Earth and Environmental Terminology
XML: Extensible Markup Language

References

  1. Liu, C.; Pavlenko, A.; Interlandi, M.; Haynes, B. A Deep Dive into Common Open Formats for Analytical DBMSs. Proc. VLDB Endow. 2023, 16, 3044–3056. [Google Scholar] [CrossRef]
  2. Zeng, X.; Hui, Y.; Shen, J.; Pavlo, A.; McKinney, W.; Zhang, H. An Empirical Evaluation of Columnar Storage Formats. Proc. VLDB Endow. 2023, 17, 148–161. [Google Scholar] [CrossRef]
  3. Armbrust, M.; Das, T.; Sun, L.; Yavuz, B.; Zhu, S.; Murthy, M.; Torres, J.; van Hovell, H.; Ionescu, A.; Łuszczak, A.; et al. Delta lake: High-performance ACID table storage over cloud object stores. Proc. VLDB Endow. 2020, 13, 3411–3424. [Google Scholar] [CrossRef]
  4. Abadi, D.J.; Madden, S.R.; Hachem, N. Column-stores vs. row-stores: How different are they really? In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 967–980. [Google Scholar] [CrossRef]
  5. Hai, R.; Koutras, C.; Quix, C.; Jarke, M. Data Lakes: A Survey of Functions and Systems. IEEE Trans. Knowl. Data Eng. 2023, 35, 12571–12590. [Google Scholar] [CrossRef]
  6. Gu, Z.; Corcoglioniti, F.; Lanti, D.; Mosca, A.; Xiao, G.; Xiong, J.; Calvanese, D. A systematic overview of data federation systems. Semant. Web 2024, 15, 107–165. [Google Scholar] [CrossRef]
  7. Sun, Y.; Meehan, T.; Schlussel, R.; Xie, W.; Basmanova, M.; Erling, O.; Rosa, A.; Fan, S.; Zhong, R.; Thirupathi, A.; et al. Presto: A Decade of SQL Analytics at Meta. Proc. ACM Manag. Data 2023, 1, 1–25. [Google Scholar] [CrossRef]
  8. Potharaju, R.; Kim, T.; Song, E.; Wu, W.; Novik, L.; Dave, A.; Acharya, V.; Dhody, G.; Li, J.; Ramanujam, S.; et al. Hyperspace: The Indexing Subsystem of Azure Synapse. Proc. Vldb Endow. 2021, 14, 3043–3055. [Google Scholar] [CrossRef]
  9. Dong, X.L.; Srivastava, D. Big Data Integration; Synthesis Lectures on Data Management; Springer Nature Switzerland AG: Cham, Switzerland, 2015. [Google Scholar] [CrossRef]
  10. Okolnychyi, A.; Sun, C.; Tanimura, K.; Spitzer, R.; Blue, R.; Ho, S.; Gu, Y.; Lakkundi, V.; Tsai, D. Petabyte-Scale Row-Level Operations in Data Lakehouses. Proc. VLDB Endow. 2024, 17, 4159–4172. [Google Scholar] [CrossRef]
  11. Alma’aitah, W.Z.; Quraan, A.; AL-Aswadi, F.N.; Alkhawaldeh, R.S.; Alazab, M.; Awajan, A. Integration Approaches for Heterogeneous Big Data: A Survey. Cybern. Inf. Technol. 2024, 24, 3–20. [Google Scholar] [CrossRef]
  12. Pedreira, P.; Erling, O.; Basmanova, M.; Wilfong, K.; Sakka, L.; Pai, K.; He, W.; Chattopadhyay, B. Velox: Meta’s unified execution engine. Proc. VLDB Endow. 2022, 15, 3372–3384. [Google Scholar] [CrossRef]
  13. Schneider, J.; Gröger, C.; Lutsch, A.; Schwarz, H.; Mitschang, B. The Lakehouse: State of the Art on Concepts and Technologies. SN Comput. Sci. 2024, 5, 449. [Google Scholar] [CrossRef]
  14. Kaoudi, Z.; Quiané-Ruiz, J.A. Unified Data Analytics: State-of-the-Art and Open Problems. Proc. Vldb Endow. 2022, 15, 3778–3781. [Google Scholar] [CrossRef]
  15. Fan, M.; Han, X.; Fan, J.; Chai, C.; Tang, N.; Li, G.; Du, X. Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 3696–3709. [Google Scholar] [CrossRef]
  16. Zhang, Z.; Zeng, W.; Tang, J.; Huang, H.; Zhao, X. Active in-context learning for cross-domain entity resolution. Inf. Fusion 2025, 117, 102816. [Google Scholar] [CrossRef]
  17. Taboada, M.; Martinez, D.; Arideh, M.; Mosquera, R. Ontology matching with Large Language Models and prioritized depth-first search. Inf. Fusion 2025, 123, 103254. [Google Scholar] [CrossRef]
  18. Babaei Giglou, H.; D’Souza, J.; Engel, F.; Auer, S. LLMs4OM: Matching Ontologies with Large Language Models. In Proceedings of the Semantic Web: ESWC 2024 Satellite Events, Hersonissos, Greece, 26–30 May 2024; Meroño Peñuela, A., Corcho, O., Groth, P., Simperl, E., Tamma, V., Nuzzolese, A.G., Poveda-Villalón, M., Sabou, M., Presutti, V., Celino, I., et al., Eds.; Springer: Cham, Switzerland, 2025; pp. 25–35. [Google Scholar]
  19. Barbon Junior, S.; Ceravolo, P.; Groppe, S.; Jarrar, M.; Maghool, S.; Sèdes, F.; Sahri, S.; Van Keulen, M. Are Large Language Models the New Interface for Data Pipelines? In Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, Santiago, Chile, 9–15 June 2024. [Google Scholar] [CrossRef]
  20. Alidu, A.; Ciavotta, M.; Paoli, F.D. LLM-Based DAG Creation for Data Enrichment Pipelines in SemT Framework. In Proceedings of the Service-Oriented Computing—ICSOC 2024 Workshops: ASOCA, AI-PA, WESOACS, GAISS, LAIS, AI on Edge, RTSEMS, SQS, SOCAISA, SOC4AI and Satellite Events, Tunis, Tunisia, 3–6 December 2024; Springer Nature: Singapore, 2025; pp. 131–143. [Google Scholar] [CrossRef]
  21. Rahm, E.; Bernstein, P.A. A Survey of Approaches to Automatic Schema Matching. VLDB J. 2001, 10, 334–350. [Google Scholar] [CrossRef]
  22. Bleiholder, J.; Naumann, F. Data Fusion. ACM Comput. Surv. 2008, 41, 1–41. [Google Scholar] [CrossRef]
  23. Cheney, J.; Chiticariu, L.; Tan, W. Provenance in Databases: Why, How, and Where. Found. Trends Databases 2009, 1, 379–474. [Google Scholar] [CrossRef]
  24. Euzenat, J.; Shvaiko, P. Ontology Matching, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar] [CrossRef]
  25. Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar] [CrossRef]
  26. Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 2020, 53, 1–42. [Google Scholar] [CrossRef]
  27. Buneman, P.; Khanna, S.; Tan, W. Why and Where: A Characterization of Data Provenance. In Proceedings of the 8th International Conference on Database Theory (ICDT), London, UK, 4–6 January 2001; Springer: Berlin/Heidelberg, Germany, 2001; Volume 1973, pp. 316–330. [Google Scholar] [CrossRef]
  28. ISO 8601-2:2019; Date and Time—Representations for Information Interchange—Part 2: Extensions. International Organization for Standardization: Geneva, Switzerland, 2019; Confirmed 2024; Amendment 1:2025.
  29. Bellahsene, Z.; Bonifati, A.; Rahm, E. Schema Matching and Mapping. In Schema Matching and Mapping; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–20. [Google Scholar] [CrossRef]
  30. Parciak, M.; Vandevoort, B.; Neven, F.; Peeters, L.M.; Vansummeren, S. LLM-Matcher: A Name-Based Schema Matching Tool using Large Language Models. In Proceedings of the Companion of the 2025 International Conference on Management of Data, Berlin, Germany, 22–27 June 2025; pp. 203–206. [Google Scholar] [CrossRef]
  31. Shi, D.; Meyer, O.; Oberle, M.; Bauernhansl, T. Dual data mapping with fine-tuned large language models and asset administration shells toward interoperable knowledge representation. Robot. Comput. Integr. Manuf. 2025, 91, 102837. [Google Scholar] [CrossRef]
  32. Wagner, R.A.; Fischer, M.J. The String-to-String Correction Problem. J. ACM 1974, 21, 168–173. [Google Scholar] [CrossRef]
  33. Rodrigues, D.; da Silva, A. A Study on Machine Learning Techniques for the Schema Matching Network Problem. J. Braz. Comput. Soc. 2021, 27, 1–22. [Google Scholar] [CrossRef]
  34. Popa, L.; Velegrakis, Y.; Miller, R.J.; Hernández, M.A.; Fagin, R. Chapter 52—Translating Web Data. In VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, China, 20–23 August 2002; Bernstein, P.A., Ioannidis, Y.E., Ramakrishnan, R., Papadias, D., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2002; pp. 598–609. [Google Scholar] [CrossRef]
  35. Binette, O.; Steorts, R.C. (Almost) All of Entity Resolution. Sci. Adv. 2022, 8, eabi8021. [Google Scholar] [CrossRef]
  36. Kemper, A.; Neumann, T. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 11–16 April 2011; pp. 195–206. [Google Scholar] [CrossRef]
  37. Lamb, A.; Fuller, M.; Varadarajan, R.; Tran, N.; Vandiver, B.; Doshi, L.; Bear, C. The Vertica Analytic Database: C-Store 7 Years Later. Proc. Vldb Endow. 2012, 5, 1790–1801. [Google Scholar] [CrossRef]
  38. Schulze, R.; Schreiber, T.; Yatsishin, I.; Dahimene, R.; Milovidov, A. ClickHouse—Lightning Fast Analytics for Everyone. Proc. VLDB Endow. 2024, 17, 3731–3744. [Google Scholar] [CrossRef]
  39. Wang, J.; Lin, C.; Papakonstantinou, Y.; Swanson, S. An Experimental Study of Bitmap Compression vs. Inverted List Compression. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 993–1008. [Google Scholar] [CrossRef]
  40. Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. Better bitmap performance with Roaring bitmaps. Softw. Pract. Exp. 2016, 46, 709–719. [Google Scholar] [CrossRef]
  41. Ivanov, T.; Pergolesi, M. The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet. Concurr. Comput. Pract. Exp. 2020, 32, e5523. [Google Scholar] [CrossRef]
  42. Abadi, D.; Madden, S.; Ferreira, M. Integrating Compression and Execution in Column-Oriented Database Systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 27–29 June 2006; pp. 671–682. [Google Scholar] [CrossRef]
  43. Sikka, V.; Färber, F.; Lehner, W.; Cha, S.K.; Peh, T.; Bornhövd, C. Efficient transaction processing in SAP HANA database: The end of a column store myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 731–742. [Google Scholar] [CrossRef]
  44. DeCandia, G.; Hastorun, D.; Jampani, M.; Kakulapati, G.; Lakshman, A.; Pilchin, A.; Sivasubramanian, S.; Vosshall, P.; Vogels, W. Dynamo: Amazon’s highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, WA, USA, 14–17 October 2007; pp. 205–220. [Google Scholar] [CrossRef]
  45. O’Neil, P.; Cheng, E.; Gawlick, D.; O’Neil, E. The Log-Structured Merge-Tree (LSM-Tree). Acta Inform. 1996, 33, 351–385. [Google Scholar] [CrossRef]
  46. Idreos, S.; Callaghan, M. Key-Value Storage Engines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 2667–2672. [Google Scholar] [CrossRef]
  47. Alsubaiee, S.; Altowim, Y.; Altwaijry, H.; Behm, A.; Borkar, V.; Bu, Y.; Carey, M.; Cetindil, I.; Cheelangi, M.; Faraaz, K.; et al. AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow. 2014, 7, 1905–1916. [Google Scholar] [CrossRef]
  48. Carvalho, I.; Sá, F.; Bernardino, J. Performance Evaluation of NoSQL Document Databases: Couchbase, CouchDB, and MongoDB. Algorithms 2023, 16, 78. [Google Scholar] [CrossRef]
  49. Besta, M.; Gerstenberger, R.; Peter, E.; Fischer, M.; Podstawski, M.; Barthels, C.; Alonso, G.; Hoefler, T. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  50. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 1433–1445. [Google Scholar] [CrossRef]
  51. Melnik, S.; Gubarev, A.; Long, J.J.; Romer, G.; Shivakumar, S.; Tolton, M.; Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Commun. ACM 2011, 54, 114–123. [Google Scholar] [CrossRef]
  52. Rey, A.; Rieger, M.; Neumann, T. Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key Generation and Joins. Proc. ACM Manag. Data 2025, 3, 1–24. [Google Scholar] [CrossRef]
  53. Ghemawat, S.; Gobioff, H.; Leung, S. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, USA, 19–22 October 2003; pp. 29–43. [Google Scholar] [CrossRef]
  54. Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010; pp. 1–10. [Google Scholar] [CrossRef]
  55. Sethi, R.; Traverso, M.; Sundstrom, D.; Phillips, D.; Xie, W.; Sun, Y.; Yegitbasi, N.; Jin, H.; Hwang, E.; Shingte, N.; et al. Presto: SQL on Everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1802–1813. [Google Scholar] [CrossRef]
  56. Vassiliadis, P. A Survey of Extract–Transform–Load Technology. Int. J. Data Warehous. Min. 2009, 5, 1–27. [Google Scholar] [CrossRef]
  57. Almeida, J.R.; Coelho, L.; Oliveira, J.L. BIcenter: A Collaborative Web ETL Solution Based on a Reflective Software Approach. SoftwareX 2021, 16, 100892. [Google Scholar] [CrossRef]
  58. Kolev, B.; Valduriez, P.; Bondiombouy, C.; Jiménez, R.; Pau, R.; Pereira, J. Cloudmdsql: Querying heterogeneous cloud data stores with a common language. Distrib. Parallel Databases 2015, 34, 463–503. [Google Scholar] [CrossRef]
  59. Behm, A.; Palkar, S.; Agarwal, U.; Armstrong, T.; Cashman, D.; Dave, A.; Greenstein, T.; Hovsepian, S.; Johnson, R.; Sai Krishnan, A.; et al. Photon: A Fast Query Engine for Lakehouse Systems. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 2326–2339. [Google Scholar] [CrossRef]
  60. Hausenblas, M.; Nadeau, J. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data 2013, 1, 100–104. [Google Scholar] [CrossRef]
  61. Eichler, R.; Berti-Equille, L.; Darmont, J. Modeling metadata in data lakes—A generic model. Data Knowl. Eng. 2021, 134, 101931. [Google Scholar] [CrossRef]
  62. Herschel, M.; Diestelkämper, R.; Ben Lahmar, H. A survey on provenance: What for? What form? What from? Vldb J. 2017, 26, 881–906. [Google Scholar] [CrossRef]
  63. Jahnke, N.; Otto, B. Data Catalogs in the Enterprise: Applications and Integration. Datenbank-Spektrum 2023, 23, 89–96. [Google Scholar] [CrossRef]
  64. The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2020, 49, D325–D334. [Google Scholar] [CrossRef]
  65. Raskin, R.G.; Pan, M.J. Knowledge representation in the Semantic Web for Earth and Environmental Terminology (SWEET). Comput. Geosci. 2005, 31, 1119–1125. [Google Scholar] [CrossRef]
  66. Niccolucci, F.; Doerr, M. Extending, mapping, and focusing the CIDOC CRM. Int. J. Digit. Libr. 2017, 18, 251–252. [Google Scholar] [CrossRef]
  67. Petrova, G.G.; Tuzovsky, A.F.; Aksenova, N.V. Application of the Financial Industry Business Ontology (FIBO) for development of a financial organization ontology. J. Phys. Conf. Ser. 2017, 803, 012116. [Google Scholar] [CrossRef]
  68. Mandel, J.C.; Kreda, D.A.; Mandl, K.D.; Kohane, I.S.; Ramoni, R.B. SMART on FHIR: A standards-based, interoperable apps platform for electronic health records. J. Am. Med. Inform. Assoc. 2016, 23, 899–908. [Google Scholar] [CrossRef]
  69. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. Acm 2014, 57, 78–85. [Google Scholar] [CrossRef]
  70. Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; Cyganiak, R.; Hellmann, S. DBpedia—A crystallization point for the Web of Data. J. Web Semant. 2009, 7, 154–165.
  71. Moreau, L.; Groth, P.; Cheney, J.; Lebo, T.; Miles, S. The rationale of PROV. J. Web Semant. 2015, 35, 235–257.
  72. Kim, B.; Niu, S.; Ding, B.; Kraska, T.; Luo, J.; Luo, W.; Tang, C.; Wang, Z.; Zhang, C.; Zhou, J. Learned Cardinality Estimation: An In-depth Study. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD), Philadelphia, PA, USA, 12–17 June 2022; pp. 1214–1227.
  73. Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the 1st International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM@SIGMOD), Houston, TX, USA, 10 June 2018; pp. 3:1–3:4.
  74. Marcus, R.; Negi, P.; Mao, H.; Tatbul, N.; Alizadeh, M.; Kraska, T. Bao: Making Learned Query Optimization Practical. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), Xi’an, China, 20–25 June 2021; pp. 1275–1288.
  75. Ahmad, Y.; Kennedy, O.; Koch, C.; Nikolic, M. DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. Proc. VLDB Endow. 2012, 5, 968–979.
  76. Budiu, M.; Chajed, T.; McSherry, F.; Ryzhyk, L.; Tannen, V. DBSP: Automatic Incremental View Maintenance for Rich Query Languages. Proc. VLDB Endow. 2023, 16, 1601–1614.
  77. Elghandour, I.; Aboulnaga, A. ReStore: Reusing Results of MapReduce Jobs. Proc. VLDB Endow. 2012, 5, 586–597.
  78. Armbrust, M.; Das, T.; Torres, J.; Yavuz, B.; Zhu, S.; Xin, R.; Ghodsi, A.; Stoica, I.; Zaharia, M. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 601–613.
  79. Akidau, T.; Begoli, E.; Chernyak, S.; Hueske, F.; Knight, K.; Knowles, K.; Mills, D.; Sotolongo, D. Watermarks in stream processing systems: Semantics and comparative analysis of Apache Flink and Google Cloud Dataflow. Proc. VLDB Endow. 2021, 14, 3135–3147.
  80. Vogels, W. Eventually Consistent. Commun. ACM 2009, 52, 40–44.
  81. Schelter, S.; Biessmann, F.; Januschowski, T.; Salinas, D.; Seufert, S.; Krettek, A. Automating Large-Scale Data Quality Verification. Proc. VLDB Endow. 2018, 11, 1781–1794.
  82. Janssen, N.; Ilayperuma, T.; Arachchige, J.J.; Bukhsh, F.A.; Daneva, M. The evolution of data storage architectures: Examining the secure value of the data lakehouse. J. Data Inf. Manag. 2024, 6, 309–334.
  83. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018.
  84. Callahan, T.J.; Tripodi, I.J.; Stefanski, A.L.; Cappelletti, L.; Taneja, S.B.; Wyrwa, J.M.; Casiraghi, E.; Matentzoglu, N.A.; Reese, J.; Silverstein, J.C.; et al. An open source knowledge graph ecosystem for the life sciences. Sci. Data 2024, 11, 363.
  85. Gowan, T.A.; Horel, J.D.; Jacques, A.A.; Kovac, A. Using Cloud Computing to Analyze Model Output Archived in Zarr Format. J. Atmos. Ocean. Technol. 2022, 39, 449–462.
  86. Moore, J.; Basurto-Lozada, D.; Besson, S.; Bogovic, J.; Bragantini, J.; Brown, E.M.; Burel, J.; Moreno, X.C.; Medeiros, G.d.; Diel, E.E.; et al. OME-Zarr: A cloud-optimized bioimaging file format with international community support. Histochem. Cell Biol. 2023, 160, 223–251.
  87. Joyce, A.; Javidroozi, V. Smart City Development: Data Sharing vs. Data Protection Legislations. Cities 2024, 148, 104859.
  88. Ahmi, A. OpenRefine: An Approachable Tool for Cleaning and Harmonizing Bibliographical Data. AIP Conf. Proc. 2023, 2827, 030006.
  89. Willekens, F. Programmatic Access to Open Statistical Data for Population Studies: The SDMX Standard. Demogr. Res. 2023, 49, 1117–1162.
  90. Kayali, M.; Lykov, A.; Fountalis, I.; Vasiloglou, N.; Olteanu, D.; Suciu, D. Chorus: Foundation Models for Unified Data Discovery and Exploration. Proc. VLDB Endow. 2024, 17, 2104–2114.
  91. Leipzig, J.; Nüst, D.; Hoyt, C.T.; Ram, K.; Greenberg, J. The role of metadata in reproducible computational research. Patterns 2021, 2, 100322.
  92. Ahmad, T. Benchmarking Apache Arrow Flight: A wire-speed protocol for data transfer, querying and microservices. In Proceedings of the Benchmarking in the Data Center: Expanding to the Cloud, Seoul, Republic of Korea, 2–6 April 2022.
  93. Shraga, R.; Gal, A. PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching. J. Data Inf. Qual. 2022, 14, 1–27.
  94. Zhang, J.; Shin, B.; Choi, J.D.; Ho, J.C. SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching. In Advances in Databases and Information Systems (ADBIS 2021); Lecture Notes in Computer Science, Volume 12843; Springer: Berlin/Heidelberg, Germany, 2021; pp. 260–274.
  95. European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data. Off. J. Eur. Union 2016, 679, 10–13.
  96. European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 on harmonised rules on artificial intelligence. Off. J. Eur. Union 2024. Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng (accessed on 21 October 2025).
  97. Álvarez, J.M.; Colmenarejo, A.B.; Elobaid, A.; Fabbrizzi, S.; Fahimi, M.; Ferrara, A.; Ghodsi, S.; Mougan, C.; Papageorgiou, I.; Lobo, P.R.; et al. Policy advice and best practices on bias and fairness in AI. Ethics Inf. Technol. 2024, 26, 31.
Figure 1. Entity resolution and data fusion: similarity functions; blocking; supervised classification using previously extracted features; clustering; and fusion via source prioritisation, aggregation/voting, and provenance-aware selection, producing unified records that preserve lineage/traceability.
Figure 2. Schema matching → mapping → validation workflow for resolution of customer_id vs. client_no.
Figure 3. Date disambiguation pipeline with schema-on-write preference and hybrid read-time normalisation under lineage/quality monitoring.
Figure 4. Storage models and workload fit.
Figure 5. Federated SQL engine architecture and pushdown.
Figure 6. Lineage tracking and schema evolution: dataset lifecycle, lineage levels (table, column, and code), tools (Apache Atlas, Marquez/OpenLineage, Dagster, and Airflow), and schema evolution with Parquet, Delta Lake, and Apache Iceberg, ensuring compatibility and reproducibility.
Figure 7. Performance levers for heterogeneous analytics: optimisation, caching, and freshness/consistency.
Figure 8. Architectural comparison across eight layers—ingestion, storage, processing/query, metadata/lineage, semantics/standards, tools and enablers, access and use, and governance/policy—for the enterprise, scientific, and public sector integration contexts. The figure situates representative technologies and practices in each layer.
Table 1. Positioning against prior surveys: scope, coverage, and added value of this review.
Survey | Primary Focus | Coverage | What This Review Adds
[5] | Data lake functions/systems | Ingestion, metadata, governance, and lake infrastructure | Cross-layer linkage of schema/ER/semantics with storage + performance/governance levers
[6] | Federation architectures and capabilities | Connectors, query translation, and execution models | Integration of federation with storage models, ETL/ELT, and performance tuning
[13] | Lakehouse concepts/technologies | Transactional tables, open formats, and hybrid lake/warehouse design | Comparative role of lakehouses among integration strategies; lineage/governance integration
[14] | Unified analytics vision | Fabrics, abstractions, and open challenges | Operational taxonomy, workflows, domain case studies, and incorporation of 2024–2025 AI instruments
[9] | Big data integration foundations | Schema matching, mapping, and fusion | Updated synthesis with recent metadata catalogues, lakehouses, and federated engines
Table 2. Types of heterogeneity in data integration, with examples and resolution approaches.
Heterogeneity Type | Description and Example | Resolution Approach
Schema heterogeneity | Differing data schemas (e.g., customer_id vs. client_no) | Schema mapping and translation [21]
Semantic heterogeneity | Same term has different meanings (e.g., salary = gross vs. net) | Ontology mapping [24]
Instance-level heterogeneity | Different formats/values (e.g., dates 01/12/2025 vs. 2025-12-01) | Data cleaning and normalisation [25]
Table 3. Categories of schema matching algorithms, with matching strategies and examples.
Matcher Type | Matching Strategy | Representative Work
Name-based (lexical) | Compares schema element names (string similarity and synonyms) | [32]
Structure-based | Exploits schema structure (hierarchies and parent–child structures) | [29]
Instance-based | Uses actual data values (overlapping ranges and distributions) | [21]
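To make the lexical matcher category in Table 3 concrete, the following minimal sketch scores candidate attribute pairs with token-level string similarity plus a small synonym table. The synonym entries and the column names (customer_id, client_no) are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch of a name-based (lexical) matcher: token string similarity
# plus a small synonym table. Synonyms and examples are illustrative only.
from difflib import SequenceMatcher

SYNONYMS = {("customer", "client"), ("id", "no"), ("id", "number")}

def tokenise(name: str) -> list[str]:
    """Split a schema element name into lowercase tokens."""
    return [t for t in name.lower().replace("-", "_").split("_") if t]

def token_similarity(a: str, b: str) -> float:
    """Exact or synonym matches score 1.0; otherwise fall back to string similarity."""
    if a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

def name_match_score(source: str, target: str) -> float:
    """Average best-match similarity over the tokens of the source name."""
    src, tgt = tokenise(source), tokenise(target)
    if not src or not tgt:
        return 0.0
    return sum(max(token_similarity(s, t) for t in tgt) for s in src) / len(src)

if __name__ == "__main__":
    # The running example: customer_id vs. client_no scores high via synonyms.
    print(round(name_match_score("customer_id", "client_no"), 2))
    print(round(name_match_score("customer_id", "order_date"), 2))  # low score
```

In practice such scores would only shortlist candidate correspondences for structure- or instance-based matchers and for human validation, as in the workflow of Figure 2.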
Table 4. Ambiguous date inputs and canonicalisation policy for analytical layers.
Ambiguous Input | Enforced Rule (Contract) | Canonical Form | Validation Mechanism
01/12/2025 (unknown locale) | YYYY-MM-DD (ISO 8601) | 2025-12-01 or reject | Expectation checks, quarantine on ambiguity
12/01/2025 (unknown locale) | YYYY-MM-DD | 2025-01-12 or reject | Locale profile, confidence tagging
2025/12/01 | YYYY-MM-DD | 2025-12-01 | Strict pattern enforcement
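The contract in Table 4 can be operationalised as a small validation routine: year-first inputs are canonicalised strictly to ISO 8601, while day/month-first inputs are accepted only when a locale profile resolves the ambiguity and are quarantined otherwise. The function below is a minimal sketch under these assumptions; the locale_hint parameter and the status labels are illustrative, not a prescribed API.

```python
# Sketch of the Table 4 contract: canonicalise dates to ISO 8601 (YYYY-MM-DD)
# and quarantine inputs that stay ambiguous without a trusted locale profile.
from datetime import date
from typing import Optional, Tuple

def canonicalise_date(raw: str, locale_hint: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (canonical_value, status), where status is 'ok' or 'quarantined'."""
    parts = raw.strip().replace("/", "-").split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None, "quarantined"
    try:
        if len(parts[0]) == 4:                       # year-first: unambiguous
            y, m, d = (int(p) for p in parts)
            return date(y, m, d).isoformat(), "ok"
        a, b, y = (int(p) for p in parts)            # day/month-first: ambiguous
        if a > 12:                                   # first field must be the day
            return date(y, b, a).isoformat(), "ok"
        if b > 12:                                   # second field must be the day
            return date(y, a, b).isoformat(), "ok"
        if locale_hint == "en_GB":                   # trusted profile: DD/MM/YYYY
            return date(y, b, a).isoformat(), "ok"
        if locale_hint == "en_US":                   # trusted profile: MM/DD/YYYY
            return date(y, a, b).isoformat(), "ok"
        return None, "quarantined"                   # reject on ambiguity
    except ValueError:                               # impossible calendar dates
        return None, "quarantined"

print(canonicalise_date("2025/12/01"))               # ('2025-12-01', 'ok')
print(canonicalise_date("01/12/2025"))               # (None, 'quarantined')
print(canonicalise_date("01/12/2025", "en_GB"))      # ('2025-12-01', 'ok')
```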
Table 5. NoSQL database categories, their key characteristics, and example systems.
NoSQL Category | Characteristics | Examples
Key-value store | Data stored as key/value pairs, optimised for lookups | Amazon DynamoDB and Redis
Document store | Semi-structured JSON-like documents and flexible schema | MongoDB and Couchbase
Graph database | Data as nodes and edges; suited for relationships | Neo4j and Amazon Neptune
Table 6. Benchmarking synthesis across NoSQL families used for graphs: consistency/availability and scale-out mechanics, plus workload-level performance (scans vs. traversals) [44,48,49,50].
Metric | Key-Value (e.g., Dynamo) | Document (e.g., MongoDB/CouchDB/Couchbase) | Graph DBs (LPG/RDF)
Consistency vs. availability | “Always writable” design, quorum-tuned R/W, eventual consistency with vector clocks, and hinted handoff for node outages | Stronger per node consistency, cluster settings vary by product, and designed for CRUD with JSON/BSON | Transaction models vary; many support ACID for OLTP, and global analytics are typically read-only
Partitioning/scale-out | Consistent hashing, virtual nodes, sloppy quorum, and seamless node add/remove function | Sharding and replica sets are common, as well as per collection partitioning | Sharding/replication depend on the engine; many native stores optimize locality for traversals
Write path | Fast, partition-local writes; durability ensured via a configurable W | Bulk inserts and high-rate CRUD; durability ensured via journaling/replica sync | OLTP writes supported (engine-dependent), and heavy analytics often separated from the write path
Scan/range queries | Limited (key-oriented); range needs secondary/indexed paths | YCSB: MongoDB has the best overall runtime, while a scan-heavy workload makes CouchDB faster and CouchDB scales best with threads | Scans expressed via label/property predicates; the cost depends on index design; not a primary strength vs. documents
Traversals/locality | Multi-hop traversals require app-level joins or pre-materialization | Multi-hop joins across collections are costly and not traversal-centric | Native adjacency (AL and direct pointers), and traversal cost grows with the number of visited subgraphs, not graph size; suited to path/pattern queries
Indexing | Primary key and optional secondary indexes for ranges/filters | Rich secondary and compound indexes; text/geo often available | Structural (neighbourhood) + data indexes; languages expose pattern/path operators
Query languages | KV APIs and app-side composition | Aggregation pipelines and SQL-like DSLs | SPARQL (RDF), Cypher/Gremlin (LPG), and mature pattern/path semantics
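The traversals/locality row of Table 6 can be illustrated with a purely in-memory toy: when adjacency is stored natively, a bounded breadth-first traversal touches only the visited neighbourhood, whereas a document store without adjacency would require one application-side join or lookup round per hop. The follower graph below is a hypothetical example, not drawn from any benchmarked system.

```python
# Toy contrast for the "Traversals/locality" row of Table 6: with native
# adjacency (here, an in-memory adjacency list) a k-hop traversal visits only
# the reachable neighbourhood; without adjacency, each hop becomes another
# application-issued join or lookup round.
from collections import deque

# Hypothetical follower graph stored as an adjacency list (graph-style layout).
ADJACENCY = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["alice"],
}

def k_hop_neighbours(start: str, k: int) -> set[str]:
    """Breadth-first traversal bounded by k hops; cost tracks visited edges only."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in ADJACENCY.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(sorted(k_hop_neighbours("alice", 2)))   # ['bob', 'carol', 'dave', 'erin']
```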
Table 7. Benchmarking results for open columnar formats: reported performance characteristics of Parquet, ORC, and Arrow/Feather [1,2,41,52].
Metric | Parquet | ORC | Arrow/Feather
Compression ratio | Strong overall compression, especially with dictionary encoding | Often higher compression on structured/numeric workloads | Not a storage format; minimal compression; focuses on speed
Scan/decoding speed | Faster end-to-end decoding in mixed workloads | Slightly slower, but predicate evaluation is stronger | Fastest (de)serialisation throughput; zero-copy in-memory
Predicate pushdown/skipping | Effective but limited by column statistics | Fine-grained zone maps yield strong selective query performance | Not applicable (in-memory only)
Nested data handling | R/D-level encoding and efficient leaf-only access | Presence/length streams; overhead increases with depth | Dependent on producer/consumer; no disk encoding
Workload trade-offs | Performs best on wide tables and vectorised execution | Strong on narrow/deep workloads with high selectivity | Best as interchange for ML/analytics pipelines
Table 8. Comparison of data integration strategies: ETL, ELT, and data virtualisation.
Table 8. Comparison of data integration strategies: ETL, ELT, and data virtualisation.
Approach | Process | Typical Tools
ETL (Extract–Transform–Load) | Extract → Transform → Load into data warehouse | Talend, Informatica, and Pentaho
ELT (Extract–Load–Transform) | Extract → Load (raw) → Transform later in data lake | Hadoop HDFS, Spark, and dbt
Data virtualisation | Virtual layer integrates sources in real time | Denodo, Dremio, and Presto/Trino
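The ETL/ELT distinction in Table 8 reduces to where the transformation runs relative to loading. The sketch below contrasts the two orderings over the same toy records using only the Python standard library; the file paths, raw-zone layout, and the cleaning rule are illustrative assumptions and are not tied to any tool listed in the table.

```python
# Minimal sketch contrasting ETL and ELT over the same toy records. The "raw
# zone" file and the cleaning rule (trim + lowercase emails) are assumptions.
import csv
import json
import pathlib

RAW = [{"customer_id": " 42 ", "email": "ANA@EXAMPLE.COM"},
       {"customer_id": "43", "email": "bob@example.com "}]

def transform(record: dict) -> dict:
    """Illustrative cleaning step shared by both pipelines."""
    return {"customer_id": record["customer_id"].strip(),
            "email": record["email"].strip().lower()}

def etl(target_csv: pathlib.Path) -> None:
    """ETL: transform in flight, load only curated rows into the warehouse table."""
    with target_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "email"])
        writer.writeheader()
        writer.writerows(transform(r) for r in RAW)

def elt(raw_zone: pathlib.Path) -> list:
    """ELT: land raw records untouched first; transform later, inside the lake."""
    raw_zone.write_text("\n".join(json.dumps(r) for r in RAW))
    landed = [json.loads(line) for line in raw_zone.read_text().splitlines()]
    return [transform(r) for r in landed]   # schema-on-read transformation step

etl(pathlib.Path("warehouse_customers.csv"))
print(elt(pathlib.Path("lake_raw_customers.jsonl")))
```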
Table 9. Benchmarking of federated SQL engines (Presto, Trino, Drill, and Starburst) across key performance-related features [6,7,12,55,60].
Metric | Presto | Trino | Drill | Starburst
Connector support | Broad set of connectors and production deployments at scale | Broad OSS connector base; fork of Presto with added features | Schema-free connectors for JSON, NoSQL, and files | Commercial distribution; adds enterprise connectors and governance
Predicate pushdown | Supported across RDBMS, Hive, and columnar formats | Supported across most connectors | Predicate pushdown for JSON and columnar data | Extended pushdown support with enterprise optimisations
Cost-based optimisation | CBO with table/column statistics and an adaptive join order | Cost-aware planning with statistics integration | Primarily rule-based; limited CBO | Enterprise-grade CBO with workload-aware tuning
Execution model | Massively parallel and pull-based execution | Similar to Presto; optimised scheduling | Vectorised operator pipeline | Enhanced parallelism and workload management
Caching and materialisation | Result caching, materialised views, and SSD spill options | Spill to disk; MV support in OSS is limited | Reader-level pruning and limited caching features | Adds advanced caching and MV rewriting
Fault tolerance | Recoverable grouped execution; Presto-on-Spark variant | Retry-based external FT extensions | No built-in query recovery | Enterprise-level FT and workload isolation
Production use evidence | Exabyte-scale at Meta; interactive + ETL workloads | Large-scale OSS and enterprise deployments | Interactive ad hoc analysis over schema-free data | Widely adopted in regulated industries
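As a usage illustration for Table 9, the sketch below submits a cross-catalogue join to a Trino-style federated engine through the open-source trino Python client, writing selective predicates so that connectors can push them down. The coordinator address, catalogue, schema, and table names are assumptions; the query text is illustrative rather than a benchmarked workload.

```python
# Hedged sketch: a cross-catalogue join submitted to a Trino-style federated
# engine via the open-source trino Python client (pip install trino). Hosts,
# catalogues, and table names below are hypothetical.
import trino

FEDERATED_JOIN = """
SELECT o.order_id, o.amount, c.segment
FROM postgresql.sales.orders AS o          -- operational RDBMS catalogue
JOIN hive.curated.customers AS c           -- lakehouse/object-store catalogue
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2025-01-01'    -- selective predicate, pushdown-friendly
  AND c.segment = 'enterprise'
"""

def run_federated_query() -> list:
    conn = trino.dbapi.connect(
        host="trino.example.internal",     # hypothetical coordinator
        port=8080,
        user="analyst",
        catalog="hive",
        schema="curated",
    )
    cur = conn.cursor()
    cur.execute(FEDERATED_JOIN)
    return cur.fetchall()

if __name__ == "__main__":
    for row in run_federated_query()[:10]:
        print(row)
```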
Table 10. Open-source metadata catalogue frameworks and their integration capabilities.
Platform | Architecture & Integration | Key Features
Apache Atlas | Metadata repository, native to Hadoop | REST APIs, lineage, audit logging, and RBAC
LinkedIn DataHub | Distributed service; platform-agnostic | Metadata ingestion, search UI, and versioning
Lyft Amundsen | Graph-backed discovery | Lineage graphs, discovery UI, and access control
Table 11. Concrete federation overheads and tuning levers grounded in reported systems.
Overhead Locus (Where It Appears) | Tuning Lever (How It Is Mitigated)
Connector translation gaps and heterogeneous engines (Spark/Presto variants in Squerall) | Connector-aware planning, dialect rewriting, and per-connector rules/pushdown (Ontario’s profile-driven subqueries and Squerall’s mediator mapping) [5].
Partial statistics; mediator lacks global distributions | Metadata-guided plan generation (Ontario uses metadata to generate optimised plans), progressive refinement, and survey-catalogued strategies [5,6].
Row-oriented transfer and fine-grained serialisation | Vectorised/columnar paths and batching for sustained scan/aggregation throughput [1,2].
Repeated cross-source joins; freshness vs. latency | Materialised views of “hot” joins with incremental refresh (differential/delta maintenance) [76].
Orchestration via LLMs (NLQ/translation) | Guardrails: verified translations and fallbacks; LLM use where determinism is not critical (noting 54.89% text-to-SQL execution accuracy for GPT-4) [19].
Workload skew across sources | Hybridisation (persist stable, high-value slices in lakehouse/warehouse and federate the remainder) [5,11,82].
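The vectorised/columnar lever in Table 11 amounts to regrouping row-oriented results into columnar batches before they cross engine boundaries, which is the unit that Arrow-based transports such as Arrow Flight stream. The sketch below uses PyArrow for the regrouping step only; the batch size, field names, and generated rows are illustrative assumptions.

```python
# Sketch of the "vectorised/columnar paths and batching" lever from Table 11:
# rows fetched from a source are pivoted into columnar Arrow RecordBatches,
# the unit that Arrow-based transports (e.g., Arrow Flight) stream between
# engines, instead of being serialised one row at a time.
import pyarrow as pa

def to_record_batch(rows):
    """Pivot a list of row dicts into a columnar Arrow RecordBatch."""
    names = list(rows[0])
    arrays = [pa.array([row[name] for row in rows]) for name in names]
    return pa.RecordBatch.from_arrays(arrays, names=names)

def rows_to_batches(rows, batch_size=64_000):
    """Regroup a row stream into columnar batches of a configurable size."""
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            yield to_record_batch(buffer)
            buffer = []
    if buffer:
        yield to_record_batch(buffer)

# Hypothetical row stream standing in for a connector's result set.
rows = ({"customer_id": i, "amount": 1.5 * i} for i in range(200_000))
table = pa.Table.from_batches(list(rows_to_batches(rows)))  # columnar view for the consumer
print(table.num_rows, table.schema)
```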
Table 12. Taxonomy→applications cross-walk (where used and where discussed).
Taxonomy Element | Applications (Role and Where in Text)
Schema matching and mapping (Section 2.2) | Harmonise identifiers/attributes across sources. Enterprise lakehouse keys via SQL/ELT (Section 7.1). Public registries (Section 7.3).
Entity resolution and fusion (Section 2.3) | Deduplicate/link records into unified entities. Enterprise CRM + transactions (Section 7.1). Public person/org linkage (Section 7.3).
Semantic enrichment and ontologies (Section 5.2) | Disambiguation of meaning, standards-based queries, scientific knowledge graphs (Section 7.2), and public SDMX alignment (Section 7.3).
Metadata catalogues and lineage (Section 5.1 and Section 5.3) | Discoverability, governance, reproducibility, enterprise governance (Section 7.1), and scientific provenance (Section 7.2).
Storage models (row/column/NoSQL; Section 3) | Fit workloads, hybrid query plans, enterprise columnar lakehouse, and scientific/public doc/graph as adjunct (Section 7.1, Section 7.2 and Section 7.3).
Federated SQL and virtualisation (Section 4.2) | Cross-store analytics without relocation, enterprise Trino-based joins (Section 7.1), and public inter-agency dashboards (Section 7.3).
Schema-on-read/write and hybrid (Section 4.3) | Contracts vs. flexibility, canonicalisation (e.g., dates), and public regulated pipelines (Section 7.3).
Performance levers (Section 6) | Cost/latency optimisation, freshness, dashboards, and SLAs (Section 7).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
