1. Introduction
Maintaining consistency in complex systems is a continuous challenge that requires active coordination. In data-intensive systems, inconsistencies may arise in formats, resolutions, latencies, structures, access, and other aspects. When such hurdles arise, the losses due to integration delays or failures can be severe. Data silos have been blamed for delays and unnecessary barriers that negatively impact patients with rare diseases and slow drug development [
1], as well as for disruptions of critical workflows [
2]. In this work, we explore the potential of systematically leveraging generative artificial intelligence (AI) to mitigate such issues and expedite the fusion process to the point of demand. The goal is to ultimately enhance preparedness for emerging data silos and modalities, and to expedite statistical inference and experimentation through a streamlined data pipeline.
There are many technical and organizational explanations behind the emergence of a data silo. In this work, we focus on the multimodality aspect, since new silos often emerge precisely to handle different or emerging modalities, which in turn poses the challenge of latent integration. Establishing a connection between inherently differing data can be less intuitive. However, recent advances in machine learning embeddings and diffusion algorithms [
3,
4] have led to novel methods for handling multimodality, encompassing both structured and unstructured data. For example, image-to-text captioning and mappings continue to improve, producing reliable results in domains such as robotics and image retrieval [
5].
Here, we propose an iterative strategy for integrating newly emerging silos as they arise. Building on recent advances in multimodal processing, we demonstrate how such models can significantly accelerate integration, while also preventing delays and eliminating key obstacles to achieving an efficient real-time response. By following a principled system design, it becomes possible to establish and sustain a dynamic fusion layer that provides robust support for decision-making at the critical milestones of system evolution.
The proposed iterative process begins with the initialization of a unified metamodel upon the introduction of the first silo. At this stage, reasoning procedures infer the schema and establish normalization as a prerequisite to full integration. Once this foundation is prepared, the variables, entities, and relations of each newly introduced silo undergo an additional reasoning step to evaluate their alignment with the existing unified metamodel. The process identifies suitable candidates for mapping new elements; when alignment proves insufficient, the unified metamodel is extended to incorporate the necessary concepts. In cases involving the removal of a previously involved silo, the reasoning process assesses the implications of deleting or modifying the remaining structures. After each cycle, a revised version of the metamodel is generated and preserved, thereby documenting both its evolution and the progression of the unification process over time.
By exposing this procedure to multimodal inference, reasoning over the features collected from silo databases makes derived knowledge more accessible through advanced query procedures. With intelligent augmentation of missing priors, adaptive fusion draws on previously learned architectures to compensate with quantifiable accuracy and transparent reasoning. Formal grounding of the multiple layers allows for a combined neuro-symbolic approach to the inherent complexity of data growth. Executing complex queries can benefit from inference at multiple levels, starting from the schema and combining observed data with prior knowledge embedded in generative models.
In the following sections, we discuss background on existing techniques for data integration and schema inference. We also discuss recent AI models that can be applied to achieve and enhance this goal. In
Section 3, we describe the methods in which multimodality support can be leveraged. We provide the foundational mathematical formulation in
Section 4. In
Section 5, we provide a detailed description of the flow and outline the key implementation aspects. In
Section 5.4, we demonstrate the feasibility of the results with examples using real data.
Section 6 demonstrates the approach more empirically with additional results for validation and quantification of accuracy. Having shown how to maintain consistency in complex data systems in the face of silo emergence, we consider implications for the management of complex systems more generally. We discuss the problem within the context of a System of Systems (SoS), where siloed databases share SoS traits—such as independence, distribution, and emergent behavior—but also pose unique integration challenges.
2. Background
We present an overview of methods used in the existing literature for data integration and statistical inference. We then provide a brief background on the candidate AI models that have recently achieved substantial improvements in addressing multimodality.
2.1. Data Integration
There have been numerous techniques in the industry for handling large datasets with heterogeneous shapes and topologies, utilizing data virtualization and management tools. Data warehouses, lakes, buckets, and meshes have been widely used in practice to provide speed and efficiency in both construction and usage. Each product aims to provide specific benefits tailored to specific needs, according to well-established trade-offs. Some offer products to speed up business intelligence packages while maintaining stronger governance and compliance [
6]. Others provide more flexible object storage with less restrictive techniques, such as open file formats [
7]. Several approaches have emerged for handling table formats, columns, schemas, and typing. For example, Apache Iceberg [
8] provides schema evolution for large analytics datasets in data lakes, while Snowflake Open Catalog [
9] serves as a catalog to facilitate reading, writing, and other types of queries across different engines, thereby increasing interoperability. Different solutions enforce different rules. Some support auto-detection of features such as types and formats. Solutions are typically created to address specific emerging needs that can vary over time.
Fundamentally, these solutions employ a range of underlying technical, statistical, and algorithmic methods to manage large-scale operations and facilitate seamless growth. Probabilistic models provide type and schema inference using Bayesian networks [
10]. Other approaches use logic-based systems and constraint solving over observed data, leveraging satisfiability solvers [
11]. Databases have also been cast as graphs, enabling neural network methods [
12] to propagate signals and thereby support learning and prediction. Matching entities and columns across different datasets [
13,
14] has been examined through techniques that quantify similarity and learn and compare distributions. A variety of logic-based inferences have been applied to establish mappings and examine semantic consistency across domains [
15].
2.2. Applicable AI Models
Recent advances in various AI models can be leveraged to further facilitate the integration of data across systems, thereby enabling the construction of unified views. Transformer-based models with integrated attention mechanisms [
4] can facilitate semantically rich reasoning by extracting features across diverse data topologies, including text and images. Embeddings of different data and system representations into a unified form (e.g., Word2Vec [
16]) enable alignment, search, and quantification of similarity (e.g., cosine similarity for text [
17]). Neural auto-encoders [
18] can produce a variety of statistical checks and insights, assisting in restoring data quality and handling growth. Generative adversarial networks (GANs) [
19] generate high-quality synthetic data that can be utilized in data augmentation. Diffusion models [
3] have the potential to substantially extend existing data integration techniques and methodologies by enabling the seamless generation and synthesis of multimodal data. They can be used to structure various modalities into an organized textual description and to augment missing modalities, providing a consistent and unified view for users and decision-makers. When connected to live data streams, whether from social media feeds or security imagery, these models can mitigate barriers in interpretation and transformation.
In this work, we aim to leverage these advances for complex reasoning tasks involving segregated and heterogeneous data silos. Our goal is to achieve reliable data integration and more advanced knowledge extraction while accounting for variability in the confidence of generative model outputs. Several lines of work have emerged to accommodate such variability and add verification mechanisms, such as claim verification [
20] and retrieval-augmented generation [
21,
22]. In our proposed framework, we formalize a distinction between prior knowledge and latent data. We then demonstrate, using public datasets, the use of AI models for data integration and their reasoning capabilities in a controllable and transparent manner.
3. Methods
3.1. Ontological Schema Inference
Figure 1 illustrates the proposed integration strategy. Each silo first undergoes a schema construction phase that includes data cleanup, normalization, and, where necessary, imputation (i.e., filling in missing data). Once prepared, the newly inferred schema is aligned and mapped against the existing unified store. During this stage, the reasoning engine identifies potential joins, which are then validated to ensure global consistency with the system’s overarching design rules. The outcome is a fully specified and deployable encoding of the unified metamodel, now extended to incorporate the most recent silo.
Reasoning about an interconnected schema can be challenging, especially as it grows in size and complexity. Evolving changes and requirements necessitate accurate representations and handling in the system. Delays or discrepancies in capturing such requirements can be costly.
Figure 1 also gives an overall view of this streamlined, multimodality-aided automation built around the AI model: emerging data, tables, and variables can be seamlessly integrated into the existing system while enabling traceability.
We demonstrate this inference using an LLM (large language model) to initialize the schema by reasoning about objects, data types, and relational semantics, using either the entire dataset or a subset of the given samples. Once inferred, the schema can evolve automatically to account for emerging data that were not available or were deemed irrelevant during initialization. Additionally, the approach facilitates exploratory analysis across unrelated domains (e.g., healthcare, agriculture, finance, supply chains). The latter may not necessarily yield an integrated schema, but it can provide valuable insights into the data and latent correlations and expedite statistical studies and analytics. It can also be used for a domain branch-off, in which a new schema is fitted to serve different needs within a well-defined cycle.
3.2. Format Heterogeneity
With the preliminary schema construction in place, we explore fusion across heterogeneous domains (e.g., healthcare and supply chains), various formats (e.g., CSV, XML, JSON), and diverse structures (e.g., text and images). At this stage, basic symbolic reasoning might be insufficient, and other models (e.g., GANs and diffusion models) might be necessary to address multimodality.
Our approach is to guide the generic AI model toward extracting more relevant, semantically rich alignment by combining inference and validation through a multi-layer scaffold. By doing so, the framework lays the foundation for integrating live data streams in heterogeneous formats, enabling the discovery of latent mappings and relationships. For example, the established unified metamodel can account for connections to publicly available data streams. Then, the dynamic formation and fusion of silos can be utilized to integrate these streams in the same manner, leveraging prior knowledge in other models to learn and discover their content and structure. Leveraging multimodality in this way benefits many applications, such as combining sensor feeds. It can also help handle missing data, especially in real-time environments where some nodes or data sources may not function as expected.
4. Mathematical Formulation for the Evolving Unified Metamodel
We describe the underlying mathematical formulation of this approach in two parts. The first part describes the foundational formalism, while the second part describes the methodology and the potential uses of prior knowledge to enrich the data fusion procedure for the data under study.
4.1. The Foundational Formalism
4.1.1. Defining Silo as a Typed Information Space
We model each data silo as a typed multi-dimensional space (e.g., a tensor or a manifold), and the integration process as a cognitive expansion of a unified metamodel space that grows through alignment, transformation, and composition. Each silo $S$ is represented as a tuple
$$S = (E, A, \tau, \iota),$$
where $E$ is the set of entities (or records, objects, etc.), $A$ the set of attributes, $\tau$ a typing function assigning each attribute a data type or modality (e.g., numeric, text, image, graph, acoustic, electromagnetic, electrochemical, etc.), and $\iota$ is an interpretation function providing semantic context (e.g., units, ontology tags, distributions, probabilistic models, etc.).
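For concreteness, a minimal Python sketch of this tuple representation is given below; the class and field names (Silo, typing, interpretation) and the sample values are illustrative assumptions rather than part of the formal definition.
from dataclasses import dataclass

@dataclass
class Silo:
    """A data silo S = (E, A, tau, iota): entities, attributes,
    a typing function, and an interpretation function."""
    name: str
    entities: list[dict]             # E: records/objects
    attributes: list[str]            # A: attribute names
    typing: dict[str, str]           # tau: attribute -> data type or modality
    interpretation: dict[str, dict]  # iota: attribute -> units, ontology tags, ...

# Example: a small availability silo (values are illustrative).
fruitveg = Silo(
    name="fruitveg.csv",
    entities=[{"Commodity": "Apples", "Year": 2021, "Value": 15.9}],
    attributes=["Commodity", "Year", "Value"],
    typing={"Commodity": "text", "Year": "numeric", "Value": "numeric"},
    interpretation={"Value": {"unit": "Pounds per capita"}},
)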
4.1.2. Unified Metamodel Space
Across all emerged silos, the unified metamodel evolves as
$$\mathcal{U} = (C, F, R),$$
where $C$ is a set of canonical entities, $F$ an extensible feature space, and $R$ a set of relations between entities, features, and modalities. Each silo is embedded into this space via an integration map
$$\phi_S : S \rightarrow \mathcal{U},$$
which reconciles its schema with the unified metamodel schema.
4.1.3. Cross-Silo Correspondence
Given two silos $S_i$ and $S_j$, integration requires discovering structural correspondences
$$M_{ij} = \{(a, b, \lambda) \mid a \in A_i,\ b \in A_j\},$$
where each element associates an attribute $a$ in $S_i$ with an attribute $b$ in $S_j$ through a transformation
$$\lambda : \mathrm{dom}(a) \rightarrow \mathrm{dom}(b),$$
where $\lambda$ can represent unit conversion, normalization, or modality transformation.
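As a small illustration, a correspondence can be held as a (source, target, transformation) triple whose transformation is an executable function; the unit-conversion example below is a hypothetical sketch, not the system's implementation.
from typing import Callable, NamedTuple

class Correspondence(NamedTuple):
    source_attr: str                       # attribute a in silo S_i
    target_attr: str                       # attribute b in silo S_j
    transform: Callable[[float], float]    # lambda: dom(a) -> dom(b)

# lambda as a simple unit conversion (ounces -> pounds).
oz_to_lb = Correspondence(
    source_attr="eggs.csv:Value_oz",
    target_attr="fruitveg.csv:Value_lb",
    transform=lambda x: x / 16.0,
)

print(oz_to_lb.transform(32.0))   # 2.0 lb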
4.1.4. Compositional Closure and Derived Features
New knowledge emerges through feature synthesis. For attributes $a_1, \dots, a_k$ drawn from any silos, we define a derived feature
$$f = g(a_1, \dots, a_k),$$
where $g$ is any executable operation (e.g., algebraic, logical, statistical). Thus, the feature space $F$ is closed under composition:
$$g(f_1, \dots, f_k) \in F \quad \text{for all } f_1, \dots, f_k \in F.$$
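The sketch below illustrates such synthesis: an executable operation g over attributes yields a derived feature, and composing derived features stays within the feature space; the spending example and its values are purely illustrative.
# g is any executable operation over attributes; composing derived
# features yields new members of the feature space F.
def spend_per_capita(availability_lb_per_capita: float,
                     price_usd_per_lb: float) -> float:
    """Derived feature f = g(availability, price)."""
    return availability_lb_per_capita * price_usd_per_lb

# A further composition over the derived feature stays inside F.
def annual_spend(spend: float, population_millions: float) -> float:
    return spend * population_millions * 1e6

print(annual_spend(spend_per_capita(15.9, 1.58), 331.9))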
4.1.5. Cognitive Expansion of the Metamodel Space
Let $\mathcal{U}_t$ denote the unified metamodel space at iteration $t$. Integrating a new silo $S_{t+1}$ expands the space as
$$\mathcal{U}_{t+1} = \mathcal{U}_t \cup \phi_{S_{t+1}}(S_{t+1}) \cup \Delta F_{t+1},$$
where $\Delta F_{t+1}$ are newly derivable features enabled only after the integration of $S_{t+1}$. This models the emergence of new insights from cross-silo fusion.
4.1.6. Minimal Realization
The cognitive expansion defined above describes the potential knowledge space after integrating a new silo. However, not all newly derivable features or relations need to be instantiated. Instead, minimal realization is derived as the smallest concretization of the unified metamodel space. Newly observed features and modalities are merged with earlier ones. Transformation, conversion, and mapping are introduced as needed to accurately maintain each silo representation in the new unified space while avoiding superfluous expansion.
Let $\mathcal{U}_{t+1}^{+}$ denote the potential expanded space,
$$\mathcal{U}_{t+1}^{+} = \mathcal{U}_t \cup \phi_{S_{t+1}}(S_{t+1}) \cup F^{*},$$
where $F^{*}$ is the set of all features derivable through compositional closure. A realization is any instantiated subset
$$\mathcal{U}_{t+1} \subseteq \mathcal{U}_{t+1}^{+}$$
such that every integrated silo remains faithfully representable in it. We define the minimal realization as the smallest such realization,
$$\mathcal{U}_{t+1}^{\min} = \arg\min_{\mathcal{U} \subseteq\, \mathcal{U}_{t+1}^{+}} |\mathcal{U}|.$$
It retains only the subset of derived features across modalities that are functionally necessary to ensure interoperability between the newly integrated silo and existing ones to satisfy reasoning or analytical queries.
4.1.7. Modalities and Cross-Modal Mappings
The unified space may be decomposed by modality as a disjoint union
$$\mathcal{U} = \bigsqcup_{m \in \mathcal{M}} \mathcal{U}_m,$$
with cross-modal transformations
$$\psi_{m \rightarrow m'} : \mathcal{U}_m \rightarrow \mathcal{U}_{m'},$$
which may be learned (e.g., via diffusion models) or specified.
Together, the above formalism treats data silos as information manifolds, integrating them to construct structural correspondences that enable compositional knowledge expansion and efficient neuro-symbolic reasoning in a unified multimodal space.
4.2. Separating Prior Knowledge from Observed Data for the Integration and Reasoning Queries
We distinguish between prior/common knowledge (e.g., general facts, units, rates, taxonomies) that an LLM or curated knowledge source can provide and data/evidence from a concrete database. The former can be subject to error or uncertainty due to model inference or partial inputs. The system fuses both to answer queries and construct derived indicators.
4.2.1. Knowledge Decomposition
Let $K$ denote the prior knowledge, while $D$ is the set of observed data arriving via silo emergence. $K$ informs the multidimensional structures and layers of the unified metamodel space, and an AI model (e.g., an LLM) can safely provide it within a negligible error threshold. Analogous to a prior in Bayesian inference, this stage provides a cognitive ground for stratifying meta layers and interrelationships based on the complex neural architectures learned from the concrete training corpus. It identifies axioms and definitions, such as units of time, algebra, and calendar identities. It can also be used to identify unit canons, constants, conversions, generic ontologies, demographics, taxonomies, classes, soft constraints, ranges, logical implications, etc.
4.2.2. A Stochastic Oracle for Priors
We treat the LLM or the trained module as a stochastic oracle returning possibly uncertain priors. A query to the oracle (e.g., a needed constant, a conversion rule, a time resolution) yields an estimate with, optionally, variance and a confidence score.
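A minimal sketch of such an oracle interface is shown below; the function name, prompt wording, and returned fields (value, variance, confidence) are assumptions for illustration rather than a prescribed API, and the client is stubbed.
import json
from dataclasses import dataclass

@dataclass
class PriorEstimate:
    value: float
    variance: float | None
    confidence: float          # 0.0 - 1.0

def query_prior_oracle(llm_call, question: str) -> PriorEstimate:
    """Ask the LLM for a constant, conversion rule, or other prior,
    requesting a structured answer with an uncertainty estimate."""
    prompt = (
        "Answer as JSON with keys value, variance, confidence. "
        f"Question: {question}"
    )
    raw = llm_call(prompt)            # any text-in/text-out client
    payload = json.loads(raw)
    return PriorEstimate(
        value=float(payload["value"]),
        variance=payload.get("variance"),
        confidence=float(payload.get("confidence", 0.0)),
    )

# Example usage with a stubbed client:
stub = lambda p: '{"value": 0.4536, "variance": null, "confidence": 0.98}'
print(query_prior_oracle(stub, "How many kilograms are in one pound?"))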
4.2.3. Normalization and Grounding
Given an attribute $a$ in some silo with a specific unit/type $\tau(a)$, we normalize it to a canonical form using the unit graph and prior functions from $K$, obtained by querying the knowledge oracle (e.g., an LLM). If required prior data is missing, it can be requested and added to $K$ with a confidence weight when applicable.
4.2.4. Query Reasoning and Semantics
Let a user query be specified as an expression $Q$. We define $Q$ as a typed program over $(K, D)$, similar to structured queries (SQL). The program consists of normalized attributes that may be pulled from the knowledge base and from different silos. $K$ instantiates the prior reasoning and inference for functions and rules, possibly after consulting an LLM. The program can also involve composition, projection, or joins with aggregations over different dimensions or partitions.
Thus, this framework abstracts the system’s cognition as typed program evaluation over two sources: priors (general rules, units, taxonomies, ranges) and evidence (silo data). The LLM provides on-demand stochastic priors and transformations, while concrete data provide grounded observations in a transparent, controllable reasoning process.
5. Configuration
The system’s main catalog contains essential information about the metamodel, including modalities. After initialization, the catalog update process can be invoked upon the emergence of a new silo and immediately after a preprocessing step. In the following subsections, we will provide a detailed description of the configuration steps.
5.1. Schema Inference and Normalization
We use an LLM (i.e., GPT-5 [
23]) to infer the overall schema from the entire dataset or a sample file. Data types, units, sample values, and ranges inform the overall design. Once the schema is inferred, normalization takes effect, yielding a model that can be incorporated into the unified metamodel. We note that this process can also capture the multimodality inherent in the LLM’s underlying neural architecture. It can be bootstrapped with minimal inputs, which is particularly beneficial for silos with limited access, sparse data, partial observability, or significant interoperability barriers. This allows the system to evolve over time from diverse input artifacts and ensures an expedited implementation with the prospect of efficient deployment.
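The sketch below shows one way such a schema-inference call could be structured, assuming a generic text-in/text-out LLM client; the prompt wording and profile fields are illustrative, and the requested keys mirror the schema_profile JSON shown in Appendix A. The client name my_llm_client in the usage comment is hypothetical.
import json
import pandas as pd

def infer_schema(llm_call, csv_path: str, sample_rows: int = 20) -> dict:
    """Profile a file and ask an LLM to propose a normalized schema.
    `llm_call` is any text-in/text-out client (e.g., a GPT-5 wrapper)."""
    df = pd.read_csv(csv_path)
    profile = {
        "path": csv_path,
        "shape": list(df.shape),
        "cols": [{"name": c, "dtype": str(df[c].dtype)} for c in df.columns],
        "sample": df.head(sample_rows).to_dict(orient="records"),
    }
    prompt = (
        "Given this data profile, return JSON with keys 'columns' "
        "(name, semantic_type, dtype, unit, candidate_roles), "
        "'suggested_dimensions', and 'suggested_measures'.\n"
        + json.dumps(profile, default=str)
    )
    return json.loads(llm_call(prompt))

# Example (client omitted): schema = infer_schema(my_llm_client, "fruitveg.csv")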
5.2. Mapping and Extending
Once the normalized schema has been constructed, a mapping procedure checks each variable or attribute for potential matches in the unified metamodel catalog. The attributes can be divided into two sets. The first is the set of source attributes inferred from the new silo. The second one is the set of attributes that already exist in the unified metamodel. Each attribute is characterized by type (e.g., numeric, text, image), unit (e.g., lb, dollar, mile, byte), domain or topic (e.g., human health, transportation, food resilience [
24]), timing (e.g., years, minutes, timestamps, interval), name, and a key set of available joins. For a target attribute
t to be a potential candidate for a source
s, the reasoning takes into account the commonality across all of these characteristics. We quantify the overall feasibility of a prospective mapping via a weighted confidence score. If this confidence is low, the reasoner expands the metamodel with a new relation or dimension.
5.2.1. Mapping
In the case of mapping,
s must be of the same type as
t. Otherwise, there should exist a model (e.g., a diffusion model) that transforms the type of
s into that of
t (e.g., fruit name, image, nutritional facts, etc.). Similarly, the unit should be of a compatible form, or a plausible conversion should exist. The domain can also contribute to quantifying the confidence score, while accounting for the approximation inherent in names and categories. We draw on the topics from [
24] and attempt to work through the encountered multimodalities. The descriptive name or column title can also be informative with a given weight (e.g., cosine similarity). Although we opt to use LLM-based approximation for such assessments due to known limitations in previous methods, work to rigorously quantify these weights is ongoing.
Timing is also a key aspect of the transformation, informed by entries to guide the mapping and time-resolution procedures. Different time resolutions can be processed during the reasoning step to harmonize data coming from different sources. Additionally, the intersection of time intervals in historical data can inform the fusion process and yield insights and analytics into existing overlaps. We also plan to use this dimension as a stepping stone toward simulations in future work.
The selection of join keys between the source and target can be informed by the overlap in the underlying distribution and quality of the entries. Diffusion models can be particularly useful for exploring potential matches in cases of high-dimensional data. The final confidence score for each pair can then be determined by several factors, including similarity in attribute names or semantics, unit conversion, join keys, and distribution, with tunable weights assigned to each factor.
The result is a JSON mapping specification of the form:
{
"source": "<file>:<var_s>",
"best_target": "<file>:<var_t*>",
"join_keys": [...J*(s,t*)...],
"conversion": "<lambda*>",
"confidence": 0.0-1.0
}
The results indicate the best target candidate for the source attribute, along with the join keys. In the case of conversion or transformation, the reasoning selects the most suitable model (lambda*), which can be a closed-form conversion in simple cases or a diffusion model in more complex cases.
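A possible weighted scoring scheme is sketched below; the factor weights, scores, and threshold are placeholders to be tuned or learned, as noted above, and are not the values used in our experiments.
# Illustrative weighted confidence for mapping a source attribute s
# onto a target attribute t; each factor score is assumed to lie in [0, 1].
WEIGHTS = {"type": 0.25, "unit": 0.2, "domain": 0.15,
           "name": 0.2, "join_keys": 0.1, "distribution": 0.1}

def mapping_confidence(factor_scores: dict[str, float]) -> float:
    return sum(WEIGHTS[f] * factor_scores.get(f, 0.0) for f in WEIGHTS)

def decide(factor_scores: dict[str, float], threshold: float = 0.7) -> str:
    conf = mapping_confidence(factor_scores)
    return "map" if conf >= threshold else "extend_metamodel"

print(decide({"type": 1.0, "unit": 0.9, "domain": 0.8,
              "name": 0.75, "join_keys": 1.0, "distribution": 0.6}))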
5.2.2. Adding New Relations and Concepts
In cases of low mapping confidence scores, we consider extending the metamodel with a new concept or relation. This indicates that the newly introduced silo does not closely align with the existing unified schema, necessitating further extension.
Algorithm 1 illustrates the reasoning procedure in each iteration. An iteration takes place upon the emergence of a new silo. We replicate this in our experiment by segregating the data into separate folders and creating a new folder at each iteration. In practice, this can be done by introducing new data files, streams, or other forms of data virtualization. Benefiting from the reasoning and multimodal capacity of the underlying model significantly mitigates barriers to building a more seamless, automated integration of emerging silos. We discuss details of the conducted experiment in
Section 5.4.
| Algorithm 1 Iterative Multimodal Silo Integration |
- 1: Profile each segment in S: extract attribute names, types, units, and basic statistics.
- 2: Schema Inference: query an LLM with the profiles and sample rows to obtain a JSON schema_profile describing candidate dimensions, measures, units, and semantic types.
- 3: Store schema_profile in the catalog.
- 4: for each source attribute a do
- 5:   Compare a with the existing unified schema U.
- 6:   If similarity ≥ threshold and join keys match: map a with a transformation.
- 7:   Otherwise, extend the schema.
- 8: end for
- 9: Apply Proposals: update U with new measures, dimensions, or relations.
- 10: Save the updated catalog with a version tag.
- 11: Repeat for each new silo.
|
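A compact sketch of one iteration of this loop is given below; the catalog layout and the externally supplied scoring function are simplifying assumptions, and versioning is reduced to an integer tag for brevity.
def integrate_silo(catalog: dict, inferred_schema: dict, source_name: str,
                   score_fn, threshold: float = 0.7) -> dict:
    """One iteration of Algorithm 1 (sketch): map or extend, then version."""
    catalog.setdefault("sources", {})[source_name] = inferred_schema      # step 3
    unified = catalog.setdefault("unified_schema",
                                 {"measures": {}, "dimensions": {}})
    mappings = catalog.setdefault("mappings", [])
    for col in inferred_schema.get("columns", []):                        # steps 4-8
        best, conf = score_fn(col, unified)          # caller-supplied scorer
        if best is not None and conf >= threshold:
            mappings.append({"source": f"{source_name}:{col['name']}",
                             "best_target": best, "confidence": conf})
        else:                                        # step 7: extend the schema
            bucket = ("measures" if col.get("semantic_type") == "measure"
                      else "dimensions")
            unified[bucket][col["name"]] = {"type": col.get("dtype"),
                                            "unit": col.get("unit")}
    catalog["version"] = catalog.get("version", 0) + 1                    # steps 9-10
    return catalog

# Example: an empty catalog yields no matches, so both columns extend the schema.
naive_scorer = lambda col, unified: (None, 0.0)
cat = integrate_silo({}, {"columns": [
    {"name": "Commodity", "semantic_type": "dimension", "dtype": "text"},
    {"name": "Value", "semantic_type": "measure", "dtype": "num",
     "unit": "Pounds per capita"}]}, "fruitveg.csv", naive_scorer)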
5.3. Ablation
We address the case of ablation, which can arise for various reasons. The degree of fusion across data coming from isolated sources needs to be controlled both early on and throughout the lifecycle. The former, early and preemptive, approach might be more straightforward, since it records the lineage of data silos and their impact on the unified metamodel. The latter can be more challenging, as it involves unified metamodels with insufficient lineage information. Another reason is to handle growth and prevent the metamodel from becoming too large. The size of the metamodel can be quantified using metrics such as the number of attributes and relationships. Ablation can therefore serve as a control mechanism to evolve the metamodel toward an optimized state. It also lays the foundation for testing the necessity of a specific silo and its overall added value in the unified metamodel, particularly in light of efficient handling of multimodality. Having a tool for examining the quality of a metamodel, with and without a specific silo, can be indispensable. However, this can be challenging, especially when the degree of integration is less well known.
5.3.1. Preemptive Approach
The preemptive approach is simpler, but it can only be used under restrictive conditions: the fusion of the silo has been documented, the introduced data has been tracked in a manner similar to timestamping, and no significant changes have occurred after the introduction. Ablation can then take place deterministically by backtracking to the earliest version before the silo in question, and the fusion procedure can continue with the remaining silos.
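Under these conditions, preemptive ablation reduces to rolling back to the last catalog version recorded before the silo was introduced and replaying the later silos; the sketch below assumes a versioned history in which each entry records the silo it introduced, which is a hypothetical lineage layout rather than a prescribed one.
def preemptive_ablation(history: list[dict], silo_name: str, replay_fn) -> dict:
    """history[i] is the catalog after iteration i; each entry records the
    silo it introduced under 'introduced'. Returns a rebuilt catalog (sketch)."""
    cut = next(i for i, v in enumerate(history) if v["introduced"] == silo_name)
    catalog = dict(history[cut - 1]) if cut > 0 else {}
    for version in history[cut + 1:]:              # replay the later silos
        catalog = replay_fn(catalog, version["introduced"])
    return catalog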
5.3.2. Reasoning Approach
Alternatively, we employ a reasoning procedure to ablate any particular silo without resorting to lineage data or reverting to an earlier metamodel version. We rely on the LLM to process the different metamodel versions and extract the relevant data to achieve a reasonably successful ablation. This approach has the advantage of handling large metamodels that lack sufficient information about every silo emergence. It can also be useful in cases where data becomes increasingly difficult to manage with deterministic backtracking due to intertwining and other complexities in data and model growth.
Enabling reasoning in this approach can serve as an instrument to tame complexity, since added fusions might otherwise lead to an infeasible increase in size and data dimensionality. We propose using this means both for experimental purposes and in actual deployment.
5.4. Example Experiment
In this section, we will use an example with real data from data.gov [
24] and USDA.gov [
25] data sources to demonstrate the approach. We select some publicly available datasets and gradually add them to observe the results. The data is usually available in different formats. In our test, we used data in CSV, XLSX, and JSON with different structures and types. There are three main processes to demonstrate: The first one is the initialization process, where no unified model has been established yet. The second process involves adding an emerging silo to the existing unified schema. The third demonstrates the ablation process. We will also demonstrate various actions that may occur, such as adding a new relation or object, or simply mapping or concatenating values and entries into an existing table.
We devise this simple experiment to clearly illustrate the iterations through the proposed procedure in Algorithm 1, covering various cases of alignment and multimodality support. In the first two rounds, we experiment with two CSV files that are closely aligned from a metamodel perspective, differing only in some metrics and values. Then, we gradually move toward more diverse, heterogeneous data by introducing new dimensions (e.g., price). We then demonstrate the handling and leveraging of multimodality by introducing data in various forms. Finally, we demonstrate two cases of ablation.
Figure 2 illustrates the main steps of the experiment. In
Section 6, we demonstrate the procedure more empirically.
5.4.1. Adding Food Availability Data
Executing the procedure in Algorithm 1 produced the following schema shown in
Table 1 when applied to the file
fruitveg.csv from Food Availability [
25] at the first iteration. The full JSON snippet is provided in
Appendix A.
Then, we upload the file
eggs.csv from the same source. The JSON file has been thoroughly updated, with notable changes to measures shown in
Table 2.
5.4.2. Introducing Price Data
In this step, we experiment with a new dimension by introducing the Food Prices dataset [
24].
Table 3 displays the inferred schema prior to its integration into the unified metamodel.
This iteration extended the unified metamodel by adding the price, given the commodity and year as join keys. It also expanded the list of measures with yield, cup equivalent price, and cup equivalent size.
5.4.3. Adding Images
In this iteration, we present an alternative perspective by introducing images, which enable various use cases. Examples in the food domain include quality control, price validation and analytics, defect and anomaly detection, dietary assessment, waste reduction, and many others [
26,
27]. We continue the experiment by introducing scanned images of sample grocery receipts. The following snippet illustrates the changes in the unified metamodel after introducing images using symbolic reasoning and diffusion.
{
"source": "receipts.jsonl::line_items.total_price",
"best_target": "measures.price_usd_per_lb",
"join_keys": ["Commodity","Time","Retailer","Geo"],
"transform": "DIFFUSION_PIPELINE( steps=[
VISION_EMBED_MATCH(desc_raw,image_path)->Commodity,
OCR_DENOISE(ocr_text), UNIT_NORMALIZE(qty,unit)->pounds,
PRICE_PER_LB(total_price,pounds) ] )"
}
The reasoner inferred a new schema for the receipt, which we omit for brevity. We show only the extension snippet that establishes the mapping between the scanned images and the corresponding variables in the unified view. The reasoner also extended the schema with retailer and geo dimensions.
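The transform string above names a multi-step pipeline. A plain-Python sketch of the same sequence is given below, with the embedding-match and OCR-cleanup stages reduced to caller-supplied placeholders, since the actual vision and diffusion models are outside the scope of this snippet; the sample values are illustrative.
def receipt_to_price_per_lb(line_item: dict, embed_match, ocr_denoise) -> dict:
    """Sketch of the receipt pipeline: match the commodity from the image
    and description, clean the OCR text, normalize units, derive USD/lb."""
    commodity = embed_match(line_item["desc_raw"], line_item["image_path"])
    cleaned_text = ocr_denoise(line_item["ocr_text"])   # retained for auditability
    unit_to_lb = {"lb": 1.0, "oz": 1 / 16.0, "kg": 2.2046}
    pounds = line_item["qty"] * unit_to_lb[line_item["unit"].lower()]
    return {
        "Commodity": commodity,
        "price_usd_per_lb": line_item["total_price"] / pounds,
        "ocr_text_clean": cleaned_text,
    }

# Example with stubbed model callables:
item = {"desc_raw": "GALA APPLES", "image_path": "r1.png",
        "ocr_text": "GALA  APPLES 2LB  3.16", "qty": 2, "unit": "lb",
        "total_price": 3.16}
print(receipt_to_price_per_lb(item, lambda d, i: "Apples", lambda t: t.strip()))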
5.4.4. Example Queries
After the unification process, we demonstrate two example queries over the established dataset. The first inquires about total spending, using a simple aggregation after combining availability and price data. The second queries per capita nutritional consumption, derived from the same data and combined with inference. The second query, for example, requires further reasoning over prior knowledge to retrieve facts such as nutrition and consumption units. The results of such a query, and the associated reasoning, can be valuable for informing general statistics on nutritional availability and macroeconomics for the whole population, supported by evidence from the integrated data. We will demonstrate further examples in
Section 6.
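For instance, the first (data-only) query can be expressed as a join and aggregation over the fused relations, as sketched below with pandas; the column names follow the unified schema described earlier, while the frames and values are illustrative stand-ins for the integrated silos.
import pandas as pd

# Stand-ins for the fused availability and price relations.
availability = pd.DataFrame({"Commodity": ["Apples", "Bananas"],
                             "Year": [2021, 2021],
                             "availability_per_capita": [15.9, 26.9]})
prices = pd.DataFrame({"Commodity": ["Apples", "Bananas"],
                       "Year": [2021, 2021],
                       "price_usd_per_lb": [1.58, 0.62]})

# Query 1: total per capita spending, joining on the shared keys.
fused = availability.merge(prices, on=["Commodity", "Year"])
fused["spend_per_capita"] = (fused["availability_per_capita"]
                             * fused["price_usd_per_lb"])
print(fused.groupby("Year")["spend_per_capita"].sum())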
5.4.5. Ablation Example
We ablate the part related to food availability to examine the system’s robustness when a foundational component is removed, as the availability silo was incorporated early in the process. The goal is to demonstrate that the integration of the price and receipt silos, and the unified metamodel as a whole, remains intact. Some analytics and measures are no longer available, such as per capita availability. The results of this step inform decision-makers, highlighting which analytics are preserved and which are lost.
We also experiment with ablating the price silo to examine the degree of loss, given that the receipt silo includes high-granularity price data. In this case, the analytical view remains intact if receipts can supplement the missing price data. For example, the pricing dataset can be inferred from receipts, provided there is sufficient data to cover all price combinations; the results from the ablation reasoning inform that assessment. A decision-maker, for example, may choose one data source over another. Therefore, from an analytical standpoint, the loss due to the ablation of one silo can be quantified in terms of its effect on other thematic or analytical views. This case demonstrates the added value of supporting multimodality, combined with semantic reasoning, for cost savings.
6. Empirical Assessment with Sample Public Health Data
We conduct an empirical assessment of the proposed approach with a sample of public health-related data [
24]. We downloaded 10 datasets and ran the procedure sequentially to perform inference. After each iteration, we report the dataset name and other numerical information. The full results are reported in the code repository.
Table 4 shows the reported results and updates of the schema inference after each iteration.
6.1. Queries and Derived Health Indicators
After each iteration, the newly introduced data enables more queries based on the derived features. In
Table 5, we report two sample queries generated by GPT-5 at each integration step, demonstrating the two types of queries that can be performed on the fused data. The first type relies only on the observed data. The second type also relies on the data but may additionally draw on prior knowledge provided by the corresponding AI model. We note that the second query may include insights and related variables or metrics that an expert can further examine to conduct a full statistical inquiry. The generated insights can, however, lay the groundwork for a technical assessment and setup of the experiment and the system under study, enabling the expert to more easily utilize technical scaffolds within an intelligent framework.
Table 6 shows some of the derived health indicators after performing the 10 iterations.
6.2. Validation
We perform this demonstration for query execution with and without priors. We use the results obtained with priors for validation against an established ground truth. In this example, we execute a query about the top counties in the US by person-days with a maximum 8-h average ozone concentration over the National Ambient Air Quality Standard (NAAQS) in the year 2001. Then, we normalize the results using queries with priors by sending an inquiry to GPT-5 about the population, to produce days as opposed to person-days. For validation, we check the accuracy of the results against the number of days with a maximum 8-h average ozone concentration over the NAAQS, which is already available in the public dataset. The results of executing the first basic query (no prior or GPT call) are shown in
Table 7. Now, we assume that the data about days is held out and demonstrate how to acquire it through a combined query over the observed dataset together with the population of each county in the year 2001. Algorithm 2 describes the overall procedure. The results of this query are shown in
Table 8, along with a quantified uncertainty. Finally, the results are checked against the holdout from the dataset to quantify the accuracy of the second query, which uses GPT calls for priors.
| Algorithm 2 Query with Prior Procedure |
Input: Dataset D (e.g., air quality measures), target year y, prior source (e.g., GPT-5).
Output: Derived query results combining data and priors.
- 1: Identify the target measure in D (e.g., MeasureID = 293 for ozone exceedance).
- 2: Extract observations for each county c in year y.
- 3: Query the prior source for missing contextual variables (e.g., the population pop(c, y) of each county).
- 4: For each county c, compute derived measures such that days(c, y) = person-days(c, y) / pop(c, y).
- 5: Record each prior’s uncertainty (e.g., model confidence or interpolation error).
|
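A compact sketch of this procedure is shown below; the prior lookup is reduced to a callable returning a population estimate and a confidence value, mirroring the stochastic oracle of Section 4.2, and the sample figures are illustrative rather than actual county data.
def days_with_prior(person_days: dict[str, float], year: int,
                    population_prior) -> dict[str, dict]:
    """For each county, divide observed person-days by a prior
    population estimate to derive days above the standard (sketch)."""
    results = {}
    for county, pd_value in person_days.items():
        pop, confidence = population_prior(county, year)   # e.g., a GPT-5 lookup
        results[county] = {
            "days_est": pd_value / pop,
            "prior_confidence": confidence,
        }
    return results

# Example with a stubbed prior source and illustrative numbers:
obs = {"Maricopa, AZ": 3_210_000.0}
stub_prior = lambda county, year: (3_072_000.0, 0.95)
print(days_with_prior(obs, 2001, stub_prior))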
To assess how accurately the GPT-augmented query, with population data as a prior, reproduces ozone exceedance counts, we compared the derived average days per person against the CDC Tracking Network’s reported number of days above the NAAQS threshold (Measure 292) in the holdout. We computed the mean absolute error (MAE) and mean absolute percentage error (MAPE) across all counties for 2001. Results in
Table 9 show a near-perfect linear fit and a low relative error (<1%), indicating how well the population-adjusted estimates reproduce the official measures despite using GPT-5-generated priors.
Figure 3 shows the linear fit of the results produced by the second query against the ground-truth values.
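The error metrics themselves can be computed directly from the paired estimates and holdout values, as in the short NumPy sketch below; the arrays shown are hypothetical stand-ins for the county-level results.
import numpy as np

def mae_mape(estimates: np.ndarray, ground_truth: np.ndarray) -> tuple[float, float]:
    """Mean absolute error and mean absolute percentage error."""
    err = np.abs(estimates - ground_truth)
    return float(err.mean()), float((err / np.abs(ground_truth)).mean() * 100)

# Hypothetical county-level values for illustration only.
est = np.array([42.1, 17.8, 30.4])
ref = np.array([42.0, 18.0, 30.0])
print(mae_mape(est, ref))   # (MAE in days, MAPE in %)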
Note on Some Encountered Issues
While running the experiments, we encountered several technical issues. The performance of answering queries with prior knowledge can be sensitive to the mechanisms used to formulate and batch such queries. While multiple optimizations are possible, variations in context and technical configuration may significantly affect performance. The speed of such queries can be highly sensitive to local and external factors in the models and the surrounding technical setup. In practice, justifying this approach is subject to the trade-off between speed and adaptability in evolving systems. Resorting to an authoritative source can be more efficient, but maintaining acceptable information freshness may necessitate future migration overhead and complex modality adaptations. The computational and interface overhead of accessing prior knowledge can largely determine feasibility, especially if the approach is to be used in real-time settings.
6.3. Limitations
While the approach seems promising, significant challenges and limitations remain. A clear distinction between prior and observed data is necessary for practical applications and use cases. The reliability of prior data remains an open question, even for simplistic tasks and resolutions. A temporary fix can rely on a simple quantification procedure to ensure proper assessment and that confidence thresholds are met, consistent with the domain requirements. For example, critical domains require a higher confidence threshold, and the derived insights can take an informative role. Queries and statistical analyses can be substantially assisted and accelerated in these contexts, but they also demand further examination, investigation, and assessment.
Some limitations of this study include the limited scale and types of datasets examined. Examining more complex data with higher dimensionality could also strengthen the results and observations. We plan to continue with computationally intensive experiments and to expand the work by verifying and tuning the approach to enhance reliability, and by testing it in online deployments with live data streams. We are also working on ways to quantify the confidence threshold obtained during alignment. Researchers have pointed out the sensitivity of AI models in general to spurious correlations [
28]. We acknowledge this to be a challenging issue that demands careful assessment during results evaluation, with proper tooling and expertise. The proposed approach can help address this challenge.
7. Discussion: Towards a Multimodal Unified View
Currently, various AI models with learned neural architectures deliver varying improvements, particularly on well-defined tasks with known inputs and prompts. However, the quality of the results is variable and may degrade as the metamodeling layer becomes more complex. The structured results are still semantically enriched but may not be readily applicable to the domain of interest. At this layer, more precise instructions and nuanced corrections are necessary to elicit essential improvements in the reasoning process. It remains an open question whether the underlying model can effectively address this challenge. In this work, we address it through multiple layers of scaffolding for the data, schema, and query results, along with thorough validation and examination of the generated code and results, utilizing both statistical and symbolic reasoning.
Figure 4 shows a generalized view of the iterations demonstrated in
Section 5.4.
Another prospect is integrating live streams. As native multimodality evolves in models, incorporating live data streams becomes increasingly accessible. Service and data virtualization techniques can benefit from real-time access to the latest data, with fewer barriers and less need to conform to strict requirements (e.g., APIs), especially for real-time dashboards, monitoring, and security systems.
Relation to Systems of Systems
Systems of Systems exhibit unique features that are not present in any single component. Emergent behavior arises in ways that demand a combined reasoning capacity across multiple dimensions. While each constituent needs to fulfill its own mission, the system as a whole must also operate and behave consistently. Preserving some degree of independence in management and governance across parts is also necessary. Such systems also evolve over time as new requirements emerge and existing components are removed or modified.
Treating siloed databases and warehouses as SoS opens the way to applying well-established methodologies proven to address integration barriers iteratively and more formally. Such an approach enables more SoS engineering deliberation to be cast onto unpredictable and emerging data silos. We recognize the unique challenges posed by data silos. However, a proper application of methodical SoS approaches, combined with the leverage of recent AI models, can enhance confidence and preparedness to tackle newly emerging modalities and needs.
While the employed technologies can resolve significant barriers, such as interface rigidity, data migration, and some degree of semantic divergence, another set of challenges may persist, including ownership and conventional legacy constraints. The role of generative AI in resolving these challenges might be limited. However, it can still assist decision-making by expediting several processes and thereby informing stakeholders more efficiently. By exposing schema and semantic mappings to the learning process, the model can infer correspondences and establish connections between various modalities in heterogeneous systems. The flexibility of the transformations inherently relaxes harmonization constraints, to the degree that they can be applied on demand when possible.
8. Conclusions
In this paper, we discussed how to leverage advances in recent AI models to effectively handle multimodality in large data collections. We focused on the potential and impact of such leverage in tackling interoperability within heterogeneous data silos. The literature is rich in approaches that use statistical inference to manage large-scale data integration. We selected a set of models that have advanced in recent years and can be used to resolve key interoperability issues. We outlined an approach for applying such models and demonstrated, using a realistic case, the reasoning steps that can be applied to the collected data as they emerge. We also highlighted the major limitations of this study, particularly regarding scale and interpretability, and pointed out ongoing challenges in quantifying uncertainty and accuracy, as well as verifying the correctness of the overall results.
Nevertheless, combining the outputs of recent models could offer a pathway to continuous, dynamic fusion of segregated data through multimodal semantic reasoning, enabling a more resilient system that can overcome traditional integration and data migration barriers. The inherent handling of less intuitive transformations in these models enhances preparedness for emerging needs, forms, and modalities. It mitigates barriers to data analysis and expedites experiments and statistical analyses through advanced reasoning queries. Our validation results indicate that the prior-integration pipeline can faithfully reconstruct validated environmental metrics from silo data using only public priors. Systems can thus evolve more rapidly in particular aspects, and models can be equipped with native features to mitigate delays or disruptions, in accordance with modular design principles and transparent, controllable reasoning.
Author Contributions
Conceptualization, B.P.Z. and A.A.; methodology, A.A.; software, A.A.; validation, B.P.Z. and A.A.; formal analysis, A.A.; investigation, B.P.Z.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, B.P.Z. and A.A.; visualization, A.A.; supervision, B.P.Z.; project administration, B.P.Z. and A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Conflicts of Interest
Authors Abdurrahman Alshareef and Bernard Phillip Zeigler were employed by the company RTSync Corp. The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AI | Artificial Intelligence |
| GAN | Generative Adversarial Networks |
| GPT | Generative Pre-trained Transformer |
| LLM | Large Language Model |
| MAE | Mean Absolute Error |
| MAPE | Mean Absolute Percentage Error |
| SoS | Systems of Systems |
Appendix A. JSON Responses
| Listing A1. Full JSON representation of the unified schema, source profile, and inferred schema. |
{ "version": "0.0.1", "unified_schema": { "dimensions": { "Commodity": { "type": "string", "id_key": "commodity_id" }, "Time": { "type": "temporal", "grain": "year", "id_key": "time_id" } }, "measures": { "availability_per_capita": { "type": "numeric", "unit": "Pounds per capita" } }, "relations": { "food_data": { "keys": ["Commodity", "Time"], "measures": ["availability_per_capita"] } } }, "sources": { "fruitveg.csv": { "profile": { "path": "fruitveg.csv", "shape": [884, 4], "cols": [ { "name": "Commodity", "type": "text" }, { "name": "Year", "type": "num" }, { "name": "Attribute", "type": "text" }, { "name": "Value", "type": "num" } ], "time_cols": ["Year"] }, "inferred_schema": { "columns": [ { "name": "Commodity", "semantic_type": "dimension", "dtype": "text", "time": false, "candidate_roles": ["Commodity"] }, { "name": "Year", "semantic_type": "dimension", "dtype": "num", "time": true, "candidate_roles": ["Time"] }, { "name": "Attribute", "semantic_type": "dimension", "dtype": "text", "time": false, "candidate_roles": ["Commodity", "Form", "Unit"] }, { "name": "Value", "semantic_type": "measure", "dtype": "num", "unit": "Pounds per capita", "time": false } ], "suggested_dimensions": ["Commodity", "Year", "Attribute"], "suggested_measures": ["Value"] } } } }
|
| Listing A2. Detailed measure definitions with semantic units. |
{ "measures": { "availability_per_capita": { "type": "numeric", "unit": "Pounds per capita" }, "eggs_shell_equiv_million_dozen": { "type": "numeric", "unit": "Shell-egg equivalent, Million dozen" }, "eggs_total_million_pounds": { "type": "numeric", "unit": "Million pounds" }, "eggs_lbs_per_capita": { "type": "numeric", "unit": "Pounds per capita" }, "population_millions": { "type": "numeric", "unit": "Millions" } } }
|
| Listing A3. Full JSON for Vegetable-Prices-2022.csv: profile and inferred schema. |
{ "sources": { "Vegetable-Prices-2022.csv": { "profile": { "path": "Vegetable-Prices-2022.csv", "shape": [93, 8], "cols": [ "Vegetable", "Form", "RetailPrice", "RetailPriceUnit", "Yield", "CupEquivalentSize", "CupEquivalentUnit", "CupEquivalentPrice" ] }, "inferred_schema": { "columns": [ { "name": "Vegetable",; "semantic_type": "dimension", "candidate_roles": ["Commodity"] }, { "name": "Form", "semantic_type": "dimension", "candidate_roles": ["Form"] }, { "name": "RetailPrice", "semantic_type": "measure", "unit": "varies; see RetailPriceUnit" }, { "name": "Yield", "semantic_type": "measure", "unit": "dimensionless" }, { "name": "CupEquivalentSize", "semantic_type": "measure", "unit": "Pounds" }, { "name": "CupEquivalentPrice","semantic_type": "measure", "unit": "USD per cup equivalent" } ], "suggested_dimensions": ["Vegetable", "Form"], "suggested_measures": ["RetailPrice", "Yield", "CupEquivalentSize", "CupEquivalentPrice"] } } } }
|
References
- Denton, N.; Molloy, M.; Charleston, S.; Lipset, C.; Hirsch, J.; Mulberg, A.E.; Howard, P.; Marsh, E.D. Data silos are undermining drug development and failing rare disease patients. Orphanet J. Rare Dis. 2021, 16, 161. [Google Scholar] [CrossRef] [PubMed]
- IBM. What Are Data Silos? 2025. Available online: https://www.ibm.com/think/topics/data-silos (accessed on 1 September 2025).
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Advances in Neural Information Processing Systems. Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Ming, Y.; Hu, N.; Fan, C.; Feng, F.; Zhou, J.; Yu, H. Visuals to text: A comprehensive review on automatic image captioning. IEEE/CAA J. Autom. Sin. 2022, 9, 1339–1365. [Google Scholar] [CrossRef]
- Google LLC. Google Cloud Platform. 2025. Available online: https://cloud.google.com/ (accessed on 1 September 2025).
- Apache Software Foundation. Apache Parquet. 2025. Available online: https://parquet.apache.org/ (accessed on 1 September 2025).
- Apache Software Foundation. Apache Iceberg. 2025. Available online: https://iceberg.apache.org/ (accessed on 1 September 2025).
- Snowflake Inc. Snowflake Open Catalog. 2025. Available online: https://www.snowflake.com/en/ (accessed on 1 September 2025).
- Wuillemin, P.H.; Torti, L. Structured probabilistic inference. Int. J. Approx. Reason. 2012, 53, 946–968. [Google Scholar] [CrossRef]
- Bouquet, P.; Magnini, B.; Serafini, L.; Zanobini, S. A SAT-based algorithm for context matching. In Proceedings of the International and Interdisciplinary Conference on Modeling and Using Context, Stanford, CA, USA, 23–25 June 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 66–79. [Google Scholar]
- Fey, M.; Hu, W.; Huang, K.; Lenssen, J.E.; Ranjan, R.; Robinson, J.; Ying, R.; You, J.; Leskovec, J. Relational deep learning: Graph representation learning on relational databases. arXiv 2023, arXiv:2312.04615. [Google Scholar] [CrossRef]
- Köpcke, H.; Rahm, E. Frameworks for entity matching: A comparison. Data Knowl. Eng. 2010, 69, 197–210. [Google Scholar] [CrossRef]
- Barlaug, N.; Gulla, J.A. Neural networks for entity matching: A survey. ACM Trans. Knowl. Discov. Data 2021, 15, 1–37. [Google Scholar] [CrossRef]
- Brachman, R.; Levesque, H. Knowledge Representation and Reasoning; Elsevier: Amsterdam, The Netherlands, 2004. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; Advances in Neural Information Processing Systems. Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 26, pp. 3111–3119. [Google Scholar]
- Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—a survey. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
- Chen, S.; Guo, W. Auto-encoders in deep learning—A review with new perspectives. Mathematics 2023, 11, 1777. [Google Scholar] [CrossRef]
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
- Zhang, X.; Gao, W. Towards llm-based fact verification on news claims with a hierarchical step-by-step prompting method. arXiv 2023, arXiv:2310.00305. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Arslan, M.; Ghanem, H.; Munawar, S.; Cruz, C. A Survey on RAG with LLMs. Procedia Comput. Sci. 2024, 246, 3781–3790. [Google Scholar] [CrossRef]
- OpenAI. GPT-5. 2025. Available online: https://platform.openai.com/docs/models/gpt-5 (accessed on 1 September 2025).
- U.S. General Services Administration. Data.gov—The Home of the U.S. Government’s Open Data. Available online: https://data.gov (accessed on 1 September 2025).
- U.S. Department of Agriculture. USDA.gov—United States Department of Agriculture. Available online: https://www.usda.gov (accessed on 1 September 2025).
- Liu, D.; Zuo, E.; Wang, D.; He, L.; Dong, L.; Lu, X. Deep Learning in Food Image Recognition: A Comprehensive Review. Appl. Sci. 2025, 15, 7626. [Google Scholar] [CrossRef]
- Selvaraj, M.G.; Vergara, A.; Montenegro, F.; Ruiz, H.A.; Safari, N.; Raymaekers, D.; Ocimati, W.; Ntamwira, J.; Tits, L.; Omondi, A.B.; et al. Detection of banana plants and their major diseases through aerial images and machine learning methods: A case study in DR Congo and Republic of Benin. ISPRS J. Photogramm. Remote Sens. 2020, 169, 110–124. [Google Scholar] [CrossRef]
- Ye, W.; Zheng, G.; Cao, X.; Ma, Y.; Zhang, A. Spurious correlations in machine learning: A survey. arXiv 2024, arXiv:2402.12715. [Google Scholar] [CrossRef]