Big Data and Cognitive Computing
  • Editor’s Choice
  • Article
  • Open Access

13 October 2022

A Probabilistic Data Fusion Modeling Approach for Extracting True Values from Uncertain and Conflicting Attributes

1
Management Information Systems Department, College of Business Administration, American University of the Middle East, Egaila 54200, Kuwait
2
School of Architecture, Computing, and Engineering, University of East London, E16 2RD London, UK
3
Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Terengganu 21030, Terengganu, Malaysia
*
Author to whom correspondence should be addressed.

Abstract

Real-world data obtained from integrating heterogeneous data sources are often multi-valued, uncertain, imprecise, error-prone, outdated, and have different degrees of accuracy and correctness. It is critical to resolve data uncertainty and conflicts to present quality data that reflect actual world values. This task is called data fusion. In this paper, we deal with the problem of data fusion based on probabilistic entity linkage and uncertainty management in conflict data. Data fusion has been widely explored in the research community. However, concerns such as explicit uncertainty management and on-demand data fusion, which can cope with dynamic data sources, have not been studied well. This paper proposes a new probabilistic data fusion modeling approach that attempts to find true data values under conditions of uncertain or conflicted multi-valued attributes. These attributes are generated from the probabilistic linkage and merging alternatives of multi-corresponding entities. Consequently, the paper identifies and formulates several data fusion cases and sample spaces that require further conditional computation using our computational fusion method. The identification is established to fit with a real-world data fusion problem. In the real world, there is always the possibility of heterogeneous data sources, the integration of probabilistic entities, single or multiple truth values for certain attributes, and different combinations of attribute values as alternatives for each generated entity. We validate our probabilistic data fusion approach through mathematical representation based on three data sources with different reliability scores. The validity of the approach was assessed via implementation into our probabilistic integration system to show how it can manage and resolve different cases of data conflicts and inconsistencies. The outcome showed improved accuracy in identifying true values due to the association of constructive evidence.

1. Introduction

In this era of information technology advancement, individuals and organizations must manage, process, exchange, and share information in heterogeneous environments [1,2]. To address this need, the foremost requirement is to integrate data originating from autonomous and heterogeneous sources. Data integration is the general process of producing a unified (mediated) repository from a set of heterogeneous and autonomous sources that may contain (semi-)structured or unstructured data [3,4,5,6]. Data integration represents a significant part of the activities of the global data management industry [3,7]. It provides a comprehensive yet concise overview of all available data without requiring the user to view each data source individually [1,3].
The need to access information from different sources through a uniform interface has been the driving force behind much research in the area of data integration [1]. Many approaches and tools have been proposed in recent decades to achieve integration at different schema and instance levels, with varying degrees of accuracy and success [8,9,10,11]. Such proposals have been presented in data integration systems, data management tools and techniques, and conflict resolution approaches [12,13,14,15]. The diversity of these technological solutions results from their adaptation to diverse application domains with different needs and degrees of heterogeneity, such as scientific data, online data, big data, and combinations of structured and unstructured data [13,16]. Issues related to overcoming data heterogeneity, or the consequences of having the same information stored in different ways, remain major concerns for data integration researchers [3,8,9,10,17,18]. According to Magnani and Montesi [19], Jaradat, Halimeh, Deraman and Safieddine [3], Bakhtouchi [8], and Papadakis, Skoutas, Thanos and Palpanas [9], schema mapping and entity linkage (also known as “duplicate detection”) across diverse sources have been recognized as two major issues that must be addressed in order to integrate data into a single, consistent representation. A third major issue, named “data fusion” or “data conflict resolution”, has also been recognized; it requires further research to achieve consistent, integrated data [8,9,10,20]. Data fusion concerns how to find the true value among contradicting attribute values when integrating data from several sources [17,21,22].
Although the resolution of data conflicts has been mostly neglected, several approaches and techniques have evolved [2,17,23,24,25]. Many of them attempt to “prevent” data conflicts by focusing solely on the uncertainty of missing values, while others employ a variety of resolution strategies to “resolve” conflicts [3,9,23,24]. Accordingly, the lack of a proper data conflict resolution modeling approach with explicit uncertainty management and correlations has been identified in [3,8,19,24,26]. In particular, an approach is required that tackles the data fusion problem based on probabilistic data linkage results and addresses its representational and computational uncertainty management challenges. This motivates us to study the probabilistic data fusion problem based on source accuracy, probabilistic entity merging, data conflict, and uncertainty management.
This paper looks closely at a probabilistic data fusion approach for extracting true values from uncertain and conflicting attributes. To this end, we propose a new probabilistic data fusion modeling approach. The approach represents and manages the uncertain and conflicting multi-valued attributes of a generated entity named “network digital object” (nDO). An obtained nDO, with its possible alternative, is generated from the probabilistic merging of multi-corresponding entities. Our approach considers both the Open-World Assumption (OWA) and Closed-World Assumption (CWA) while constructing the possible worlds of the fused values and their probabilistic formulations. It also considers both single and multiple truth assumptions. Based on these considerations, more meaningful probabilistic data fusion answers can be obtained to fit real-world data integration and fusion scenarios. The main contributions of this paper are as follows:
(1) Handling the data fusion representational challenge of accepting probabilistic data, i.e., reliability scores of data values and probabilistic similarity value sets, to generate probabilistic global entities with their fused data value alternatives.
(2) A formal representation of varied data fusion cases to fit into the probabilistic data fusion problem. The formulation of data fusion cases is outlined by observing the origin of the generated data values and their uncertainty.
(3) Incorporating data lineage into the data fusion model to trace the source of data that offers additional information with which to understand the conflict and uncertainty among the observed data and to facilitate the on-demand data fusion process.
(4) The construction of a data fusion computational method that conditionally calculates the posterior/updated reliability score for a possible world of true fused value(s).
(5) The implementation of the data fusion method within the probabilistic data integration system that can show the applicability of addressing on-demand probabilistic fusion for different data conflict types based on the merging of probabilistic entity alternatives.
(6) Finally, our proposed data fusion approach is designed to cope with a dynamic, volatile and on-demand fusion environment and to support efficient modification re-execution. It provides a new Decision Model (DM) logic that isolates the matching logic from the decision logic, yielding an additional efficiency advantage when dealing with dynamic data. The constructed model offers a two-fold benefit: it can work as a complete probabilistic DM or replace the manual DM sub-stage in traditional DM logic.
The rest of the paper is organized as follows. Section 2 briefly reviews some literature and related work. Section 3 introduces the proposed approach. Section 4 describes our data fusion problem based on the probabilistic integration of heterogeneous data sources. In Section 5, we develop our probabilistic data fusion model and its constructed computational method. Section 6 presents a proof of concept through system implementation and mathematical proof to illustrate the validity of the proposed model. Section 7 outlines the limitations related to the proposed approach. Finally, the conclusions of this paper are given in Section 8, and some future directions for our work are identified.

3. Preliminaries

This section introduces some background of the proposed model, including the definitions of the informative digital object (iDO) concept, the integration framework, the lineage, and the rules of possible-world semantic.

3.1. Informative Digital Object (iDO) Concept

The term informative Digital Object (iDO) describes a Real-World Object (RWO), such as a person or a place, that has a distinct identity locally. The iDO is the foundation for the network Digital Object (nDO) concept, which represents a particular global entity generated from the probabilistic merging of its corresponding local iDOs.
We can define the iDO concept as a uniquely identifiable container that aggregates and presents relevant multi-entity components in terms of their actual content and in the form of a compound digital existence, by mapping the relationships among them into a coherent and cohesive representation of the information context of a specific RWO [87]. The iDO is an extended content-based model of the Digital Object (DO) concept; it consists of diverse information units constructed through a sequence of transformations of the meta-level components of relevant data. The actual contents are grouped under varied categories to give an organized representation of an RWO’s features and to convey meaningful and comprehensive information. Thus, multi-part components can be managed and viewed as a single entity [87].
In the iDO model, the object’s features are classified under three major categories, i.e., identification, descriptive, and supportive, and each category consists of various subcategories according to the correlation and mapping type between a specific attribute and its corresponding iDO [3]. This categorization provides domain-independent entity resolution rules [3]. Detailed information about the use of the iDO concept to represent entities from the participated sources within a probabilistic data integration framework, the attribute types and categories, and the possible-world generation rules and mapping types can be found in our previous research [3,88]. The iDO representation for the RWOs is given in the following definitions.
Definition 1.
The participated data sources to be integrated form a set of n sources, where each source is either (semi-)structured $(tp_1)$ or unstructured $(tp_2)$, i.e., $S = \{S_1, S_2, \ldots, S_n\} : \forall i \in [1, n],\ S_i.tp \in \{tp_1, tp_2\}$. This representation helps distinguish the matching comparison process for each of these two types. For example, the participated $(tp_1)$ sources are assumed to contain unique objects. Hence, internal matching comparison is not required, as no duplicated objects can be found. In contrast, $(tp_2)$ sources may contain duplicated objects, so the comparison process also proceeds internally. Detailed information about this pair-wise source-to-target matching process can be found in [3].
Definition 2.
Entities of the participated data sources for integration are considered to belong to the same domain. They are assumed to be modeled as iDOs that represent persons, restaurants, or any other RWOs. Each source $S_i$ contains a set of m iDOs, where m may differ from one participated source to another, i.e., $S_i = \{iDO_{i1}, iDO_{i2}, \ldots, iDO_{im}\} : 1 \le h \le m$. An $(iDO_{ih})$ is a triple comprising a set of attribute names $(A_k^{ih} : 1 \le k \le c)$, attribute values $(a_{k.g}^{ih} : 1 \le g \le q)$, and attribute types $(ty \in \{Idn, Desc, Supp\})$, i.e., $A_k^{ih}.a_{k.g}^{ih}.ty$. These attributes describe the shared features of iDOs, where an attribute may contain single or multiple data values and belongs to a specific category that encodes the attribute type $(ty)$. According to [3], a specific mapping rule based on the attribute type can be applied to obtain the possible world.
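To make Definitions 1 and 2 more concrete, the following is a minimal Python sketch of how sources, iDOs, and typed multi-valued attributes could be represented. The class and function names (`SourceType`, `AttrType`, `Attribute`, `IDO`, `needs_internal_matching`) are hypothetical and are not taken from the authors' system; only the structure (source type, attribute name, value list, attribute type) follows the definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceType(Enum):
    STRUCTURED = "tp1"    # (semi-)structured: objects assumed unique
    UNSTRUCTURED = "tp2"  # unstructured: may contain duplicated objects


class AttrType(Enum):
    IDN = "Idn"    # identification
    DESC = "Desc"  # descriptive
    SUPP = "Supp"  # supportive


@dataclass
class Attribute:
    name: str          # attribute name A_k
    values: list[str]  # one or more values a_{k.1}, ..., a_{k.q}
    ty: AttrType       # attribute type ty


@dataclass
class IDO:
    source: int  # index i of the source S_i
    index: int   # index h of the object inside S_i
    attributes: list[Attribute] = field(default_factory=list)

    def identifier(self) -> str:
        """Return the single value (g = 1) of the Idn attribute."""
        for attr in self.attributes:
            if attr.ty is AttrType.IDN:
                return attr.values[0]
        raise ValueError("iDO has no identification attribute")


def needs_internal_matching(tp: SourceType) -> bool:
    """Only tp2 sources may hold duplicates, so only they are compared internally."""
    return tp is SourceType.UNSTRUCTURED
```

For instance, an iDO with one identification attribute and a two-valued descriptive attribute can be built as `IDO(1, 1, [Attribute("name", ["John Smith"], AttrType.IDN), Attribute("phone", ["768 4001", "567 3211"], AttrType.DESC)])`.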

3.2. The Best-Effort Data Integration Framework

Our probabilistic data fusion approach is formulated based on the best-effort data integration framework presented in [3]. This framework assumes that global schema generation and its mapping production are performed a priori. Hence, a global schema $(GS)$ consisting of global attributes is obtained before the instance integration process is initiated, i.e., every source attribute $A_k^{ih}$ is mapped to one of $(A_1^{GS}.ty, A_2^{GS}.ty, \ldots, A_c^{GS}.ty)$. Each generated global attribute corresponds to a specific attribute that exists in the participated data sources. Each $(A_k^{GS})$, together with its mapped source attributes, must belong to a specific attribute type $(ty)$, such that one of these global attributes represents the main parameter, such as the name of an author or a restaurant. Thus, the type of this main parameter attribute is $(ty = Idn)$.
Despite the precise integration at the schema level, the framework treats instance integration as a non-trivial process that requires probability management capabilities. Therefore, a probabilistic global entity named network digital object $(nDO_w : 1 \le w \le z)$ is added to the traditional framework formulation. This framework aims to remove manual interventions by allowing less precise but automatic instance integration (i.e., entity linkage and data fusion) answers. It also corresponds to the pair-wise source-to-target matching process. In this process, a participated data source $(S_i)$ can be presented as a target data source $(Ts_t : Ts_t \in S)$ or a local source $(Ls_s : Ls_s \in S)$. Accordingly, an $iDO_{ih}$ that belongs to a $Ts_t$ data source is denoted as a reference entity/instance, i.e., $rDO_w^{ih} : 1 \le w \le z$, while an $iDO_{ih}$ that belongs to an $Ls_s$ data source is denoted as a possible local entity/instance that may link with a specific $rDO_w^{ih}$ entity/instance, i.e., $pDO_{wx}^{ih} : 1 \le x \le y$. This matching process allows a set of possible local instances to be compared and probabilistically matched against a specific reference instance based on their shared attribute values. In correspondence with the global schema generation, each participated $iDO_{ih}$ must have an attribute that represents the main parameter for the matching process between pairs of reference and local instances, i.e., $(A_k^{ih}.Idn)$. Figure 2 illustrates the matching formulation and process between three iDOs obtained from three structured data sources $(tp = tp_1)$.
Figure 2. The pair-wise source-to-target matching and formulation process based on the best-effort data integration framework. (a) The participated iDO instances as observed from three structured data sources $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (b) The pair-wise source-to-target matching process for the three structured data sources $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (c) The pair-wise source-to-target matching process for the three iDO instances according to their local instance $(pDO_{wx}^{ih})$ or reference instance $(rDO_w^{ih})$ formulations.
From the matching process, a probabilistic pair-wise entity linkage result is obtained as $rDO_w = \{rDO_w^{ih} : pDO_{w1}^{ih}[Pr(L_{w1})],\ pDO_{w2}^{ih}[Pr(L_{w2})],\ \ldots,\ pDO_{wy}^{ih}[Pr(L_{wy})]\} : 1 \le x \le y,\ 0 \le Pr(L_{wx}) \le 1$, where $Pr(L_{wx})$ is the linkage probability that a pair of instances $(rDO_w^{ih} : pDO_{wx}^{ih})$ represents the same RWO. By considering the possible-world generation rules and the probabilistic distribution, the probabilistic entity merging can be computed to generate a global merged entity denoted as $(nDO_w)$. Correspondingly, the best-effort data integration framework is a four-tuple $(Ls_s, Ts_t, M_{s,t}, nDO_w)$, where:
  • $Ts_t$ is a target data source that belongs to a set $Ts$ of n target sources, i.e., $Ts = (Ts_1, Ts_2, \ldots, Ts_n) : t \in [1, n],\ Ts_t \in Ts \subseteq S$. A $Ts_t$ source can be of type $tp_1$ or $tp_2$. A reference instance is denoted as $rDO_w^{ih} = \{a_1^{ih}.idn,\ a_{2.1}^{ih}.ty,\ a_{2.2}^{ih}.ty,\ \ldots,\ a_{2.q}^{ih}.ty,\ a_{3.1}^{ih}.ty,\ \ldots,\ a_{c.q}^{ih}.ty\} : g = 1$ for $a_{1.g}^{ih}.idn$.
  • $Ls_s$ is a local data source that belongs to a set $Ls$ of n local sources, i.e., $Ls = (Ls_1, Ls_2, \ldots, Ls_n) : s \in [1, n],\ Ls_s \in Ls \subseteq S$. An $Ls_s$ source can be of type $tp_1$ or $tp_2$. A local instance is denoted as $pDO_{wx}^{ih} = \{a_1^{ih}.idn,\ a_{2.1}^{ih}.ty,\ a_{2.2}^{ih}.ty,\ \ldots,\ a_{2.q}^{ih}.ty,\ a_{3.1}^{ih}.ty,\ \ldots,\ a_{c.q}^{ih}.ty\} : g = 1$ for $a_{1.g}^{ih}.idn$.
  • $M_{s,t}$ is a triple $(Ts_t.rDO_w^{ih}.(a_{k.g}^{ih}.idn, \ldots, a_{c.q}^{ih}.ty);\ Ls_s.pDO_{wx}^{ih}.(a_{k.g}^{ih}.idn, \ldots, a_{c.q}^{ih}.ty);\ m_{s.k \sim t.k})$. The $M_{s,t}$ mapping is a set of one-to-one probabilistic matchings of each reference attribute value $a_{k.g}^{ih}.ty \in rDO_w^{ih}$ against a local attribute value $a_{k.g}^{ih}.ty \in pDO_{wx}^{ih}$. Matching is performed only if the initial similarity value $(m_{s.k \sim t.k})$ between the main parameter attribute values of a reference instance and its corresponding local instance reaches a specified threshold, i.e., $m_{s.k \sim t.k}\big(pDO_{wx}^{ih}(a_k^{ih}.idn) \sim rDO_w^{ih}(a_k^{ih}.idn)\big) \ge \delta : rDO_w^{ih} \ne pDO_{wx}^{ih}$, where $(\delta)$ is the similarity threshold for matching the pairs of main parameter data values. Thus, for each pair of instances from $iDO_{ih} = \bigcup_{i=1}^{n} \bigcup_{h=1}^{m} \bigcup_{k=1}^{c} \bigcup_{g=1}^{q} (iDO_{ih}.a_{k.g}^{ih}.ty)$ there is an $Ls_s.iDO_{ih}$ against $Ts_t.iDO_{ih}$ local source-to-target entity matching of the form $pDO_{wx}^{ih}.a_{k.g}^{ih}.ty \sim rDO_w^{ih}.a_{k.g}^{ih}.ty : rDO_w^{ih} \ne pDO_{wx}^{ih}$, where $(\sim)$ denotes the pair-wise matching operation.
  • $nDO_w$ is a set of z mutual probabilistic global entity merging alternatives that are generated by merging their possible corresponding iDOs, given the pair-wise linkage results between a reference instance and its possible local instances, i.e., $rDO_w = (rDO_w^{ih} : pDO_{w1}^{ih}[Pr(L_{w1})],\ pDO_{w2}^{ih}[Pr(L_{w2})],\ \ldots,\ pDO_{wy}^{ih}[Pr(L_{wy})]) : 1 \le w \le z,\ 1 \le x \le y$. A probabilistic global entity is a set of instances merged from a reference instance with its possible local instances, i.e., $nDO_w = \{(nDO_{w.1}, Pr(M_{w.1})),\ (nDO_{w.2}, Pr(M_{w.2})),\ \ldots,\ (nDO_{w.f}, Pr(M_{w.f}))\} : 1 \le j \le f,\ \sum_{j=1}^{f} Pr(M_{w.j}) = 1$. Each possible entity merging alternative has an assigned probability value obtained by multiplying the linkage probabilities of its linked instances, i.e., $nDO_{w.j} = ((rDO_w^{ih} : pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}),\ Pr(M_{w.j}))$. For each requested attribute and within each possible merge, there can be a multi-valued attribute in which each possible value alternative is assigned a probabilistic data fusion value obtained by updating and conditionally computing the reliability scores of its attribute values, i.e., $nDO_{w.j}.A_k^{Gs} = \{(a(Pws_{tv.1}), \mu(a(Pws_{tv.1}))),\ (a(Pws_{tv.2}), \mu(a(Pws_{tv.2}))),\ \ldots,\ (a(Pws_{tv.P}), \mu(a(Pws_{tv.P})))\} : 1 \le l \le L,\ \sum_{p=1}^{P} \mu(a(Pws_{tv.p})) = 1$. A possible world $a(Pws_{tv.p})$ may contain a single or multiple possible true values, i.e., $a(Pws_{tv.p}) = \{a_{k.1}^{w.j}, a_{k.2}^{w.j}, \ldots, a_{k.g}^{w.j}\}$.
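The threshold test on the main parameter (identifier) values described for the mapping above can be sketched as follows. This is only an illustration: the similarity measure (`difflib.SequenceMatcher`), the function names, the example names, and the threshold value are all assumptions, since the text does not specify which similarity function the framework uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Hypothetical similarity between two identifier values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_links(reference_id: str, local_ids: dict[str, str],
                    delta: float = 0.8) -> dict[str, float]:
    """Keep local instances whose identifier similarity to the reference
    reaches the threshold delta; only these pairs go on to attribute-level
    matching."""
    links = {}
    for key, value in local_ids.items():
        score = similarity(reference_id, value)
        if score >= delta:
            links[key] = score
    return links


# Hypothetical reference instance and three possible local instances.
links = candidate_links(
    "John Smith",
    {"p1": "John Smith", "p2": "Jon Smith", "p3": "Mary Jones"},
    delta=0.8,
)
print(sorted(links))
```

In this sketch the similarity score is simply carried along with each surviving pair; how such a score is turned into a linkage probability is left to the framework described in [3].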

3.3. Data Lineage

Data lineage represents a very important perspective in the integration process, in particular when trying to resolve and manage uncertain and conflicting data contributed by matched and integrated entities that originate from heterogeneous and volatile data sources [89,90]. Data lineage provides not only information on the origin of the entities and data values, but also explanations for any generated information and returned results. In this approach, we combine lineage and uncertainty management into one data model. Lineage is closely related to uncertainty and conflicts because it is a powerful mechanism for tracing the origin of uncertainty [89].
A model that supports lineage can trace an iDO and its data values back to their origin in a data source. It offers additional information that helps in understanding conflict and uncertainty. It also facilitates identifying correlations among the participated iDOs. For instance, suppose that the entities generated from a particular structured source are distinct real-world objects. If two matched iDOs originate from the same structured source, then we know that a possible world cannot contain both references in one possible merged alternative. Thus, impossible worlds are excluded by constructing rules that consider the lineage of the data source and its type.
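The lineage rule in the preceding paragraph can be illustrated with a short Python sketch. The function name and the dictionary-based lineage lookup are hypothetical; the check itself only encodes the stated rule that two iDOs from the same structured source cannot appear in one merged alternative.

```python
def is_possible_merge(members, source_of, structured_sources):
    """Return False if two members share the same structured source.

    members: identifiers of the iDOs in one merging alternative
    source_of: lineage lookup, iDO identifier -> source name
    structured_sources: names of sources whose objects are assumed unique
    """
    seen = set()
    for ido in members:
        src = source_of[ido]
        if src in structured_sources:
            if src in seen:
                return False  # two distinct RWOs cannot be merged
            seen.add(src)
    return True


# Hypothetical lineage: S1 and S2 are structured, S3 is unstructured.
source_of = {"r1": "S1", "p1": "S2", "p2": "S2", "p3": "S3", "p4": "S3"}
structured = {"S1", "S2"}
print(is_possible_merge(["r1", "p1", "p3"], source_of, structured))
print(is_possible_merge(["r1", "p1", "p2"], source_of, structured))
```

The second call combines two iDOs from the structured source S2, so that alternative is treated as an impossible world and would be dropped before any probabilities are assigned.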
The proposed approach uses lineage to identify and resolve conflicts that arise in the linkage and data fusion tasks. Data lineage is a convenient mechanism for computing the probability of fused data values where multiple values can exist at different sources [89,90]. We can obtain the correct reliability score $\mu(a_{k.g}^{ih})$ of a data value by attaching its lineage to the actual value. For example, a data value with lineage from two data sources is a combined value, so its probabilistic fusion needs to be computed accordingly. $\ell^{ih}$ depicts the lineage of a specific data value $(a_{k.g}^{ih})$ as obtained from certain $S_i.tp$ data source(s) and belonging to a certain $iDO_{ih}$ object. When the same $a_{k.g}^{ih}$ value is obtained from multiple merged iDOs, $\ell^{ih}$ may indicate the union of the lineages of these iDOs. Thus, the lineage of a data value is $\ell^{ih} = (\ell_1, \ell_2, \ldots, \ell_{|\ell^{ih}|})$, associated with $a_{k.g}^{ih}$.

3.4. Possible-World Semantic

Possible world is a fundamental concept in uncertainty management research, as most of the related works are based on it. It helps manage the odds of matching outputs, linkage’s answers, merging results, and multi-valued data [19,26,91]. As the participated data sources and the matching outputs contain incomplete, imprecise or uncertain information, they implicitly represent a collection of possible appearances in a sample space, called possible-worlds or alternatives (Pws). Possible worlds describe an object or item where many possibilities may exist as an answer for that description [3].
A possible world is a hypothetical state of an object or item that represents certain and ordinary information. It is obtained by choosing one alternative among the collection of representations that form the total sample space for each item containing uncertain data [52,92]. If it is possible that none of the given answers exists or is true, this should also be treated as an alternative. For a database to recognize this alternative, the OWA should be taken into consideration [55]. For instance, referring to the example presented in Figure 3, the disjunctive information (John teaches Physics or Calculus) gives imprecise information about the subject that John teaches, and thus expresses uncertainty about John’s attributes. If the disjunction is interpreted inclusively under the CWA, it represents three possible worlds: one in which John teaches Physics, one in which he teaches Calculus, and one in which he teaches both. Under the OWA, however, there is a fourth possible world, i.e., John teaches neither Calculus nor Physics. This means there is Some Other unknown Value (SOV) that John teaches. The disjunctive facts state that one of these worlds represents the actual situation, but it is unknown which one. Assigning probability values to these worlds gives them confidence degrees that measure how certain it is that each alternative is true. For example, if we know that John teaches Physics with probability 0.8 or Calculus with probability 0.6, then John most likely teaches both subjects, as this alternative has the highest confidence. Figure 3 illustrates these possible-world examples under both the CWA and the OWA.
Figure 3. Possible world’s example with associated probability values.
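As a rough check of the John example, the possible worlds can be enumerated in a few lines of Python. The sketch assumes that the two facts are independent and that, under the CWA, the three remaining worlds are renormalized to sum to 1; both are assumptions made here for illustration, and the resulting numbers are not claimed to match those shown in Figure 3.

```python
def possible_worlds(p_physics: float, p_calculus: float, open_world: bool):
    """Enumerate possible worlds for two independent uncertain facts."""
    worlds = {
        "Physics only": p_physics * (1 - p_calculus),
        "Calculus only": (1 - p_physics) * p_calculus,
        "Physics and Calculus": p_physics * p_calculus,
    }
    if open_world:
        # OWA: a fourth world in which John teaches some other value (SOV).
        worlds["Some other value (SOV)"] = (1 - p_physics) * (1 - p_calculus)
        return worlds
    # CWA: the "none of them" world is excluded, so renormalize (assumption).
    total = sum(worlds.values())
    return {name: p / total for name, p in worlds.items()}


for label, flag in (("OWA", True), ("CWA", False)):
    worlds = possible_worlds(0.8, 0.6, open_world=flag)
    best = max(worlds, key=worlds.get)
    print(label, {k: round(v, 3) for k, v in worlds.items()}, "->", best)
```

Under both assumptions the world in which John teaches both subjects receives the highest value, which agrees with the conclusion stated in the text.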
The presence of a null value in the outcome means we have a missing value, i.e., a value that exists in the real world but for some reason is unavailable or unknown. Moreover, the missing value is characterized with respect to the presence and meaning of null values and to the validity of the CWA or the OWA [52,92]. The CWA states that only the values actually observed in the participated data sources, and no other values, represent facts of the real world [55]. Thus, the correct value must be contained in the participated data sources. In contrast, the OWA asserts neither the truth nor the falsity of facts that are not represented in the participated data sources. Therefore, primitive sources are not necessarily complete, as the correct value may not be contained in these sources [52,92].
Null or missing values come in five different types, as stated in [52]. These types represent different interpretations: a value can be missing because it exists but is unknown, because it does not exist at all, or because it may exist but it is not actually known whether it exists or not. Given the scope of this research and the nature of multi-valued attributes, this research restricts null value manipulation to the atomic existential null value, which corresponds to the uncertainty conflict situation based on the CWA and OWA. An atomic existential null means that only one value is possible, such as when the age of a person is unknown $(Null)$. For an existential null, there might be supplementary knowledge about the unknown value, such as a domain of possible values that the attribute may take [3,52]. As this paper deals with uncertain data and results, several sample spaces $(\Omega_{w.j.k})$ and possible worlds $(Pws_{TvCase.w.j.k})$ will be obtained to manage the uncertainty and conflicts of the data fusion results.

3.5. Probabilistic Entity Linkage Definition

In this research paper, instance integration is approached as a non-trivial integration problem in which a collective iteration process is practically inapplicable and prohibitively expensive, and manual intervention is unacceptable (impossible or hard to achieve). The paper also incorporates uncertainty management into the entity linkage and data fusion tasks by using probability theory to manage uncertain and conflicting items.
Probabilistic management is the process of starting with imperfect data and manipulating correlated data to generate a new probabilistic global entity by probabilistically linking and merging its reference entity with its possible local entities and by probabilistically fusing its generated attribute values (i.e., single-valued and multi-valued attributes) [3]. The probabilistic entity is a generated global entity that contains a set of possible instances merged from the reference instance with its possible local instances, i.e., $nDO_w = \{(nDO_{w.1}, Pr(M_{w.1})),\ (nDO_{w.2}, Pr(M_{w.2})),\ \ldots,\ (nDO_{w.f}, Pr(M_{w.f}))\}$. Each possible merge has an assigned probability value $Pr(M_{w.j})$, generated by multiplying the linkage probabilities of its possible linked local instances, i.e., $nDO_{w.j} = \{(rDO_w^{ih} : pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}),\ Pr(M_{w.j})\} : \sum_{j=1}^{f} Pr(M_{w.j}) = 1$ [3].
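The generation of merging alternatives and their probabilities can be sketched as below. The text only states that each alternative's probability is a product of linkage probabilities and that the alternatives sum to 1; the sketch additionally assumes that linkage decisions are independent and that an unlinked local instance contributes the complement $1 - Pr(L_{wx})$. The function name and the input values are hypothetical.

```python
from itertools import product
from math import isclose, prod


def merge_alternatives(link_probs: dict[str, float]):
    """Enumerate merging alternatives of one reference instance.

    link_probs maps each possible local instance to its linkage
    probability Pr(L_wx). Each alternative is the tuple of linked local
    instances together with its probability Pr(M_wj).
    """
    ids = list(link_probs)
    alternatives = []
    for choice in product([True, False], repeat=len(ids)):
        # Assumption: linked instances contribute p, unlinked ones 1 - p.
        p = prod(
            link_probs[i] if linked else 1 - link_probs[i]
            for i, linked in zip(ids, choice)
        )
        members = tuple(i for i, linked in zip(ids, choice) if linked)
        alternatives.append((members, p))
    return alternatives


alts = merge_alternatives({"p1": 0.9, "p2": 0.6})
for members, p in alts:
    print(("r",) + members, round(p, 2))
print(isclose(sum(p for _, p in alts), 1.0))
```

With two possible local instances this yields four alternatives (reference merged with both, with either one, or with none), and the probabilities sum to 1 as required by the definition.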
The probabilistic data fusion task that this research paper deals with explicitly considers the pair-wise probabilistic entity linkages and their corresponding probabilistic entity merging results. Section 4 covers the probabilistic data fusion problem based on these linkage and merging results.

3.6. Probabilistic Data Fusion Assumptions

The participated sources are assumed to have independent causes of error. This is a valid condition since the participated sources are independently maintained, so the data values in one source are not derived from the data values in other sources. In addition, the data values obtained for multiple merged instances follow the categorical value condition, in which values that do not match exactly are considered distinct. Thus, the accuracy of the supplied data values in our fusion problem implies the assumptions below:
Assumption 1.
Each participated data value is assumed to be associated with a confidence value that indicates its probability degree of correctness, i.e., $\mu(a_{k.g}^{ih})$.
Assumption 2.
Since data source correlation in the form of data copying is outside the scope of this research, the errors of attribute values are independent across different data sources. In other words, for any distinct data sources $(S_1, \ldots, S_n)$, the value $a_{k.1}^{1h}$ recorded in $S_1$ does not depend on the values $(a_{k.2}^{2h}, \ldots, a_{k.q}^{ih})$ recorded in $(S_2, \ldots, S_n)$ (and so on, and vice versa), once we know the true value $a_{k.Tv}^{w.j}$ of the $A_k^{w.j}$ attribute. This means the reliability score of an attribute value does not depend on the reliability scores of other values once we know the event that corresponds to the true world of the fused value, i.e., $(a_{k.Tv}^{w.j} = a_{k.q}^{ih}.T \oplus a_{k.Tv}^{w.j} = a_{k.q}^{ih}.F)$, where $a_{k.Tv}^{w.j}$ is the possible true data value, T denotes being true, F denotes being false, and $\oplus$ represents an exclusive OR (XOR).
Assumption 3.
The prior probabilities of $(a_{k.Tv}^{w.j} = a_{k.1}^{1h}.T = a_{k.1}^{1h}.F,\ a_{k.Tv}^{w.j} = a_{k.2}^{2h}.T = a_{k.2}^{2h}.F,\ \ldots,\ a_{k.Tv}^{w.j} = a_{k.q}^{ih}.T = a_{k.q}^{ih}.F)$ are the same and independent if we have no reason a priori to believe that one value is more likely to occur. These probabilities are assigned to $a_{k.Tv}^{w.j}$ before the reliability scores of $A_k^{w.j}.a_{k.g}^{ih}$ are considered.
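To illustrate how Assumptions 1 to 3 interact, the following Python sketch computes a posterior over candidate true values for a single-truth attribute under the CWA. It is a generic naive-Bayes style update written for illustration only and is not the computational method developed in Section 5. It assumes that a source with reliability $\mu$ reports the true value with probability $\mu$ and a wrong value with probability $1 - \mu$; the function name and the example scores are hypothetical.

```python
from math import prod


def posterior_single_truth(claims):
    """Posterior over candidate true values.

    claims: list of (value, reliability) pairs, one per independent
    source (Assumptions 1 and 2).
    """
    candidates = {value for value, _ in claims}
    prior = 1 / len(candidates)  # Assumption 3: equal prior probabilities
    scores = {}
    for cand in candidates:
        # Independence lets the per-source terms be multiplied.
        likelihood = prod(
            mu if value == cand else 1 - mu for value, mu in claims
        )
        scores[cand] = prior * likelihood
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}


# Three hypothetical sources report a person's age with different reliability.
post = posterior_single_truth([(35, 0.9), (37, 0.6), (35, 0.7)])
print({value: round(p, 3) for value, p in post.items()})
```

Because the priors are equal, they cancel in the normalization, so the ranking of candidate values is driven entirely by the reliability scores of the sources that support or contradict each value.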

4. The Probabilistic Data Fusion Problem

The probabilistic data fusion problem that this paper aims to address explicitly considers entity linkage and merging decisions that are resolved probabilistically. This means that the answer to an entity merge can produce multiple possible alternatives, where each alternative may have a different combination of possible linked instances, which in turn may contain similar and/or distinct data values in their shared attributes. Moreover, as an attribute’s values originate from heterogeneous data sources, they can be distinct, misspelt, incomplete, outdated, or may even contain typographical errors. They can also be more or less certain and reliable.
In fact, the quality of the participating sources affects our belief in the correctness of their information items. Often, those sources provide a degree of confidence for their contributed information, generated from statistical tools, thereby lending their predictions a probabilistic interpretation [24,25]. Consequently, a global attribute can hold multiple valid or invalid values. The data values assigned to each global attribute can differ from one alternative to another, and each distinct value can originate from one or multiple data sources. This consideration brings an additional representational and computational challenge to the development of the data fusion model. Accordingly, there is a need to consider both the accuracy of the attribute values as they originate from their data sources and their combined reliability scores based on the instances included in an entities merging alternative.
To make a clear presentation of our fusion problem and to facilitate the understanding of the following solution space, we first highlight the data fusion cases that this paper aims to identify and address. Then, we provide a formal definition of our data fusion problem, i.e., the probabilistic fusion for multi-valued attributes based on probabilistic entities merging alternatives. The obtained attributes may contain valid or invalid data as they originated from different sources or instances with different reliability scores.

4.1. Data Fusion Cases

Among the different values that could be merged for a particular global attribute $(A_k^{Gs})$ within a specific entities merging alternative $(nDO_{w.j})$, i.e., $Dom\,A_k^{w.j}$, either only one value can be true or multiple values can be true [7,8,19,71,77]. For example, a person has one true age but can have multiple true phone numbers. These differences reflect two fusion cases: (i) an attribute with multiple true values, such as the phone numbers (768 4001, 567 3211) of a global $(nDO_{w.j})$ entity; and (ii) an attribute with inconsistent values. The second case forms a conflict that previous research classifies into two types: uncertainty and contradiction [8,23,37]. Contradiction is a conflict between two or more inconsistent non-null values that refer to the same object attribute, such as merged Age values of (35, 37) for a person from two sources, where a person can only have one true age. Uncertainty is a conflict between non-null value(s) and one or more null values that refer to the same object attribute.
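The distinction between the two conflict types can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the proposed model: the function name `classify_conflict`, the use of `None` for a null value, and the rule that contradiction takes precedence when both conditions hold are assumptions made here, and the function applies only to attributes that admit a single true value.

```python
from typing import Optional, Sequence


def classify_conflict(values: Sequence[Optional[str]]) -> str:
    """Classify the merged values of one single-truth attribute.

    None stands for a null value (an assumption of this sketch).
    """
    non_null = {v for v in values if v is not None}
    has_null = any(v is None for v in values)
    if len(non_null) >= 2:
        # Two or more inconsistent non-null values for the same attribute.
        return "contradiction"
    if non_null and has_null:
        # A non-null value conflicting with one or more nulls.
        return "uncertainty"
    return "no conflict"


print(classify_conflict(["35", "37"]))
print(classify_conflict(["35", None]))
```

With the Age example from the text, the values (35, 37) are classified as a contradiction, while a value merged with a null is classified as uncertainty.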
Considering the above discussion of data fusion cases, non-null and null values, and the closed-world (CWA) and open-world (OWA) assumptions, the data fusion challenge can be split into six different cases, as shown in Table 1. Each case depicts a specific multi-valued attribute challenge that must be handled accordingly. Each case is also given a specific label: for example, (C1.1) denotes the multi-true value case under the closed-world assumption (i.e., MTC-CWA), while (C2.1.2) denotes the contradiction case under the open-world assumption (i.e., ITCC-OWA).
Table 1. The data fusion cases are based on the CWA or the OWA assumptions.

4.2. Problem Formulation

In this paper, the data fusion problem addresses the probabilistic fusion of multiple valid or invalid attribute values within a particular $nDO_w$ set. This set may contain multiple alternatives of probabilistic entities merging subsets, i.e., $nDO_w = \{(nDO_{w.1}, \Pr(M_{w.1})), (nDO_{w.2}, \Pr(M_{w.2})), \dots, (nDO_{w.f}, \Pr(M_{w.f}))\} : 1 \le j \le f$, and the task is to decide which attribute value, or alternative of values, is most likely the correct answer for the $nDO_{w.j}$ entities merging alternative. The $nDO_w$ alternatives set is obtained from the pair-wise probabilistic linkage decisions of local instances $(pDO_{wx}^{ih})$ to a reference instance $(rDO_w^{ih})$, i.e., $(rDO_w^{ih} : pDO_{w1}^{ih}[\Pr(L_{w1})], pDO_{w2}^{ih}[\Pr(L_{w2})], \dots, pDO_{wy}^{ih}[\Pr(L_{wy})]) : 1 \le w \le z,\ 1 \le x \le y$. Here, $pDO_{wx}^{ih}$ is a possible instance from the $Ls_s$ local data source that is probably linked to a particular $rDO_w^{ih}$ reference instance, and $rDO_w^{ih}$ is an iDO instance from the $Ts_t$ target data source. Moreover, because the actual data values are assigned reliability scores, an $A_k^{Gs}.a_{k.g}$ global attribute value that is populated from similar values originating from multiple data sources, i.e., $A_k^{Gs}.a_{k.g} = A_k^{ih}.a_{k.g}^{ih}$, may have multiple reliability scores at once $(\mu A_k^{Gs}.a_{k.g} = \mu a_{k.g}^{ih})$ [3]. Identifying the probabilistic true value or values of data items that are obtained from independent data sources and populated based on multiple probabilistic entities merging alternatives starts from the following inputs:
  • A set of participated   S = { S 1 , , S n } sources. Each S i , i [ 1 , n ] source can be either a local source L s s , or a target source T s t , and it contains a set of i D O i h objects that represent a particular aspect of a RWO.
  • A participated   i D O i h   object is a triple of a comprise set of attribute’s names, values, and types, i.e., ( A k i h ,   a k . g i h , t y ) , where the type of a participated attribute is obtained based on the type of its corresponding global attribute, i.e., A k i h . t y = A k G s . t y :   t y { I d n , D e s c , S u p p } , and A k i h . t y = A k G S . I d n . An attribute may have a single data value, i.e., ( A k i h . a k . g i h = A k i h . a k . 1 i h = A k i h . a k i h ) , or multiple data values, i.e., ( A k . a k . g : 1 < g q ) .
  • n D O w entities merging set obtained from r D O w   linkage set. n D O w   consists of all the possible subsets of the entities merging alternatives { ( n D O w . 1 , Pr ( M w . 1 ) ) ,   ( n D O w .2 , Pr ( M w .2 ) ) , ,   ( n D O w . f , Pr ( M w . f ) ) } . A n D O w . j subset is depicted as ( ( r D O w i h : p D O w 1 i h , p D O w 2 i h , , p D O w y i h ) , Pr ( M w . j )   ) , and j = 1 f Pr ( M w . j ) = 1 . The assigned p D O w x i h instances in each alternative differ from one to another. Due to the assigned p D O w x i h instances in each entity merging’s alternative, each attribute may include different sets of data values.
  • A confidence degree, referred to as a reliability score and denoted by $\mu a_{k.g}^{ih}$, which indicates the probability that a specific value provided by the $A_k^{ih}$ attribute is true and associated with a particular $A_k^{Gs}$ global attribute. Accordingly, a reliability score $\mu a_{k.g}^{ih}$ is associated with each $a_{k.g}^{ih}$ data value.
  • A matching function returning a precise (Match, or Not Match) decision between a pair of participated data values obtained from iDOs that belong to a particular n D O w merging set. For a specific attribute, the generated data values are the union of all distinct values. Each obtained data value could be derived from multiple similar values originating from multiple iDOs as they are observed from a participating r D O w i h entity with its corresponding p D O w x i h instances. The matching outputs produce a data values’ domain based on the obtained global schema/attributes and the participated data sources. This domain of data values is depicted below and Figure 4d shows an example of this domain generation:
    Figure 4. Example of the probabilistic data fusion problem under the existence of multi-valued attributes. (a) The participated iDO instances. (b) The pair-wise probabilistic Linkages. (c) The probabilistic entities merging alternatives from 4b instances. (d) The populated data values for the global attributes of the participated instances in 4c.
$$DOM\,A_k^{Gs} = \bigcup_{g^{w}=1}^{q^{w}} \left( a_{k.g}^{w.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} \mu a_{k.g}^{w.ih} \right) : i \in (Ts_t, Ls_s) \tag{1}$$
where, within a specific $A_k^{Gs}$ global attribute and $nDO_w$ entities merging set:
- $a_{k.g}^{w.ih}$ depicts a populated data value from its corresponding data values that exist at the participating $Ts_t$ and $Ls_s$ sources. It may be a single value, $a_{k.g}^{w.ih} = a_{k.1}^{ih}$, or a combined data value from its corresponding similar values, $a_{k.g}^{w.ih} = (a_{k.1}^{ih}, \dots, a_{k.q}^{ih}) : \forall a_{k.g}^{ih} \in nDO_w,\ a_{k.1}^{ih} = a_{k.2}^{ih} = \dots = a_{k.q}^{ih}$.
- $ih$ depicts the data lineage of each iDO in the $rDO_w$ set that has the attribute value. For each generated global data value, $ih$ indicates the union of the data lineages of the similar data values that originate from the participating data sources and relate to one $a_{k.g}^{w.ih}$ global data value, i.e., $ih = (1, \dots, |ih|),\ \forall ih \in a_{k.g}^{w.ih}$. Because of similar $a_{k.g}^{ih}$ values, $ih$ could indicate one or many lineages.
- $\bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g}^{w.ih})$, or $\mu a_{k.g}^{w.ih}$ for simplicity, represents the set of reliability scores for all iDO data values included in an $a_{k.g}^{w.ih}$ global data value. Depending on the observed $ih$ data lineage for the $a_{k.g}^{w.ih}$ data value, the $\mu a_{k.g}^{w.ih}$ set may have a single or multiple reliability scores. That is, a generated global attribute value can be obtained from one or more data sources or iDOs, and hence it can be assigned one or multiple reliability scores.
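The population of a global attribute domain in Equation (1) can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the authors' implementation: the matching function is reduced to exact string equality, the function name `populate_domain` and the dictionary layout are hypothetical, and the sample values are the Address values and reliability scores of the restaurant example discussed in this section.

```python
def populate_domain(instances, attribute):
    """Group matching values of one attribute across iDOs.

    Returns a list of (value, lineage, reliability_scores) triples.
    Matching is plain string equality (an assumption of this sketch).
    """
    domain = {}
    for lineage, record in instances.items():
        if attribute not in record:
            continue  # this iDO has no corresponding attribute
        value, score = record[attribute]
        entry = domain.setdefault(value, {"lineage": [], "scores": []})
        entry["lineage"].append(lineage)
        entry["scores"].append(score)
    return [
        (value, tuple(e["lineage"]), tuple(e["scores"]))
        for value, e in domain.items()
    ]


# iDO lineage -> {attribute: (value, reliability score)}
instances = {
    "13": {"Address": ("12224 Ventura Blvd", 0.9)},
    "21": {"Address": ("12224 Ventura Blvd", 0.6)},
    "31": {"Address": ("12335 Fienega Blvd", 0.45)},
}

for triple in populate_domain(instances, "Address"):
    print(triple)
```

Similar values are merged into one global value that carries the union of their lineages and reliability scores, so the first value ends up with two lineages and two scores while the second keeps one of each.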
Considering a specific A k G S attribute and a n D O w . j merging alternative, the actual combination of its data values contains the union of all distinct values that are generated from the matching function and that belonged to the assigned instances in a specific n D O w . j alternative. This combination of data values for an attribute produces a data values domain as denoted by the following formulation:
$$Dom\,A_k^{w.j} = \bigcup_{g^{w.j}=1^{w.j}}^{q^{w.j}} \left( a_{k.g}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g}^{w.j.ih}) \right) : Dom\,A_k^{w.j} \subseteq Dom\,A_k^{Gs.w} \tag{2}$$
In this domain, $a_{k.g}^{w.j.ih}$, $ih$, and $\mu a_{k.g}^{w.j.ih}$ refer, respectively, to the populated data values, lineage, and reliability scores for a specific $nDO_{w.j} = ((rDO_w^{ih} : pDO_{w1}^{ih}, pDO_{w2}^{ih}, \dots, pDO_{wy}^{ih}), \Pr(M_{w.j}))$ merging alternative. Null values pose an additional challenge when generating the $Dom\,A_k^{w.j}$ domain. This occurs when one of the data values that belongs to $Dom\,A_k^{w.j}$ is an atomic null value, i.e., $(a_{k.g.Nl}^{w.j.ih})$. The null value is then denoted by a domain of possible values, $Dom\,a_{k.g.Nl}^{w.j.ih}$, such that either one or none of these possible values can be the true value. Thus, $Dom\,A_k^{w.j}$ in Equation (2) can be reformulated as stated in Equation (3).
$$\begin{aligned} Dom\,A_k^{w.j} &= \left( \bigcup_{g^{w.j}=1}^{q^{w.j}-|g.Nl^{w.j}|} \left( a_{k.g}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g}^{w.j.ih}) \right),\ \left( a_{k.g.Nl}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g.Nl}^{w.j.ih}) \right) \right) \\ &= \left( \bigcup_{g^{w.j}=1}^{q^{w.j}-|g.Nl^{w.j}|} \left( a_{k.g}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g}^{w.j.ih}) \right),\ \bigcup_{Nl_g=Nl_1}^{Nl_q} \left( a_{k.g.Nl_g}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g.Nl_g}^{w.j.ih}) \right) \right) : \exists\, a_{k.g.Nl}^{w.j.ih} \in Dom\,A_k^{w.j} \end{aligned} \tag{3}$$
An example of a probabilistic instance integration process for three restaurant instances is given in Figure 4 to illustrate our data fusion problem. These iDOs originate from three structured sources ($S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$), where the reliability scores for their data values are $(\mu a_{1.1}^{13} = \mu a_{2.1}^{13} = \mu a_{3.1}^{13} = 0.9,\ \mu a_{1.1}^{21} = \mu a_{2.1}^{21} = \mu a_{3.1}^{21} = 0.6,\ \text{and}\ \mu a_{1.1}^{31} = \mu a_{3.1}^{31} = 0.45)$. Moreover, the Name attribute has been recognized as the main parameter for the instance matching and integration process; hence, $ty = Idn$ for this attribute (i.e., $Name_{1.1}^{GS.idn}$). In this example, the participating iDO instances are shown in Figure 4a, whereas the produced probabilistic entity linkages and merging alternatives are shown in Figure 4b,c, respectively. In addition, the data value domains for the global Phone and Address attributes are presented in Figure 4d.
From Figure 4a, we see that the Name and Address global attributes are obtained from the corresponding restaurant Name and Address attributes that exist in the three participating iDOs, where each iDO originates from a different data source, i.e., $(Name_1^{13}, Name_1^{21}, Name_1^{31})$ correspond to $Name_{1.1}^{Gs.idn}$, and $(Address_3^{13}, Address_3^{21}, Address_2^{31})$ correspond to $(Address_3^{Gs.ty})$. In contrast, the global Phone attribute is generated from the corresponding Phone attributes that exist in two participating iDOs, where each iDO belongs to a different source, i.e., $iDO^{13}$ from $S_1$ and $iDO^{21}$ from $S_2$: $(Phone_2^{13}, Phone_2^{21})$ correspond to $(Phone_2^{Gs.ty})$. Due to these correspondences, we note that $Dom\,A_2^{Gs}$ in Figure 4d does not contain a null value, since $iDO^{31}$ from $S_3$ does not have a Phone attribute; rather, it consists of the single value $(818/762\ 1221_{2.1}^{(13,21)}, (0.9, 0.6))$. This value comes with two lineages and two reliability scores, obtained from the $iDO^{13}$ and $iDO^{21}$ instances. The $Dom\,A_2^{Gs}$ domain presents the combined data values for the global Phone attribute as populated using Equation (1). In contrast, $Dom\,A_3^{Gs}$ has two data values with different cardinalities of lineages and reliability scores, i.e., $((\text{12224 Ventura Blvd}_{3.1}^{(13,21)}), (0.9, 0.6))$ and $((\text{12335 Fienega Blvd}_{3.2}^{31}), (0.45))$. The $Dom\,A_3^{Gs}$ domain depicts the combined data values for the global Address attribute as populated using Equation (1). These domains comprise all distinct Phone and Address values obtained from running a matching function over the instance pairs $(iDO^{13} \sim iDO^{21})$ and $(iDO^{13} \sim iDO^{31})$ to generate the $\{rDO_1^{13} : (pDO_{11}^{21}, 0.97), (pDO_{12}^{31}, 0.85)\}$ linkage set shown in Figure 4b.
These reference and local instances correspond to the participating iDOs as follows: $rDO_1^{13}$ corresponds to $iDO^{13}$, $pDO_{11}^{21}$ to $iDO^{21}$, and $pDO_{12}^{31}$ to $iDO^{31}$. Based on our data fusion problem, the data values in these domains need to be fused, i.e., the true data value or alternative values must be found based on the entities merging alternatives in Figure 4c. Accordingly, the $Dom\,A_k^{w.j}$ data value domains to be fused are populated based on Equation (2) and are shown in Table 2. If the data fusion case belongs to the ITCU categories previously presented in Table 1, the $Dom\,A_k^{w.j}$ data value domains to be fused are populated based on Equation (3).
Table 2. Example of the probabilistic data fusion problem under the existence of multi-valued attributes.
From $Dom\,A_3^{Gs}$ in Figure 4d and $Dom\,A_3^{1.3}$ in Table 2, we note that these domains share the same values, yet the lineages for their first value differ. The lineage for $(\text{12224 Ventura Blvd}_{3.1}^{ih})$ in $Dom\,A_3^{Gs}$ is $ih = (13, 21)$, but in $Dom\,A_3^{1.3}$ it is $ih = (13)$ only. This occurs because $pDO_{11}^{21}$ does not exist in the $nDO_{1.3}$ merging alternative, as Figure 4c shows. The data fusion computation method is executed on these domains to obtain the possible true fused value or values.
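The restriction of a global domain to one merging alternative, as in Equation (2), can be sketched as follows. This is an illustrative sketch under assumptions: the function name `restrict_domain` is hypothetical, and the membership of the $nDO_{1.3}$ alternative (instances 13 and 31) is inferred from the lineage discussion in the preceding paragraph, since Table 2 is not reproduced here.

```python
def restrict_domain(domain, alternative):
    """Keep only lineages (and their scores) whose iDO is in the alternative.

    domain: list of (value, lineage, scores) triples for a global attribute.
    alternative: set of iDO lineage identifiers in the merging alternative.
    """
    restricted = []
    for value, lineage, scores in domain:
        kept = [(l, s) for l, s in zip(lineage, scores) if l in alternative]
        if kept:  # drop values contributed by no instance of the alternative
            kept_lineage, kept_scores = zip(*kept)
            restricted.append((value, kept_lineage, kept_scores))
    return restricted


# Global Address domain from the restaurant example.
dom_a3_global = [
    ("12224 Ventura Blvd", ("13", "21"), (0.9, 0.6)),
    ("12335 Fienega Blvd", ("31",), (0.45,)),
]

# Assumed content of the nDO_1.3 alternative: instances 13 and 31 only.
print(restrict_domain(dom_a3_global, {"13", "31"}))
```

The values are unchanged, but the first value keeps only lineage 13 and its score 0.9, which mirrors the difference between the two domains described above.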
It is worth stating that this problem definition copes with dynamic and volatile data values that may evolve over time. It could also correspond to the online data fusion problem since the prior entities’ linkage, and merging stages are processed separately, where each stage keeps and stores the obtained probability outcomes alongside their actual data. The fusion process is carried out on these outcomes upon a user’s request, as the obtained probabilities are stored alongside the actual data values.

5. The Probabilistic Data Fusion Model

In this section, we formally describe the data fusion solution and show how we leverage the trustworthiness of data sources and their values in truth discovery. To determine $a_{k.Tv}^{w.j}$, which might be the value(s) observed from the participating sources, the production of the data fusion sample space $(\Omega_{w.j.k})$ and possible worlds $(Pws_{Tv}^{case.w.j.k})$ for a particular $A_k^{w.j}$ attribute's values and $nDO_{w.j}$ merging alternative is discussed next. Then, the probabilistic data fusion method is constructed to compute the conditional probability (i.e., the updated reliability score) of a possible world of data values being the true fused answer, using the $\mu a_{k.g}^{w.j.ih}$ scores and given a data fusion case that yields a possible-worlds set over a $Dom\,A_k^{w.j}$ domain.

5.1. The Probabilistic Data Fusion Sample Space and Possible-Worlds Generation

The form of the data value and its associated reliability scores affect the data fusion sample space production. In addition, recognizing a possible-worlds set over its sample space depends on the applied data fusion case. Accordingly, the data fusion sample space production and possible-world generation are discussed next based on the identified cases in Table 1.

5.1.1. The Data Fusion Sample Space Production

Each value in $Dom\,A_k^{w.j}$ is assigned a reliability score $(0 \le \mu a_{k.g}^{ih} \le 1)$ that indicates the probability of its being a true fused value, i.e., $(a_{k.g}^{ih}.T : \mu a_{k.g}^{ih}.T = \mu a_{k.g}^{ih})$. Since $\mu a_{k.g}^{ih} \le 1$, the complement $(\neg\mu a_{k.g}^{ih} = 1 - \mu a_{k.g}^{ih})$ indicates the probability of the data value being a false fused value, i.e., $(a_{k.g}^{ih}.F : \mu a_{k.g}^{ih}.F = \neg\mu a_{k.g}^{ih}.T = \neg\mu a_{k.g}^{ih})$. Thus, a pair $(a_{k.g}^{w.j.ih}, \mu a_{k.g}^{w.j.ih})$ in $Dom\,A_k^{w.j}$ can be interpreted as a set of two mutually exclusive events. This is depicted as:
$$\left( a_{k.g}^{w.j.ih},\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} \mu a_{k.g}^{w.j.ih} \right) \Rightarrow \left\{ \left( a_{k.g}^{w.j.ih}.T,\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} \mu a_{k.g}^{w.j.ih}.T \right),\ \left( a_{k.g}^{w.j.ih}.F,\ \bigcup_{g^{ih}=1^{11}}^{q^{nm}} \mu a_{k.g}^{w.j.ih}.F \right) \right\} \tag{4}$$
where
-
a k . g w . j . i h . T implies the event when the generated data value that belongs to D o m   A k w . j is a true fused value, i.e., a k . T v w . j = a k . g w . j . i h a k . T v w . j D o m   A k w . j ,   with a probability equals to the union of the original reliability scores for all i h lineage exited in a k . g w . j . T event, i.e., μ a k . g w . j . i h . T = ( μ a k . 1 i h , μ a k .2 i h , , μ a k . q i h ) ,     i h a k . g w . j . i h .
-
a k . g w . j . i h . F implies the event when the generated data value that belongs to D o m   A k w . j is a false fused value, i.e., a k . T v w . j a k . g w . j . i h a k . T v w . j D o m   A k w . j , with probability equals to the union of the reliability score’s complements for all i h existed in a k . g w . j . F , i.e., μ a k . g w . j . i h . F = ¬ μ a k . g w . j . i h . T = ( ¬ μ a k . 1 i h , , ¬ μ a k . q i h ) ,   i h = a k . g w . j . i h .
-
μ a k . g w . j . i h . T \ F set indicates either the original reliability scores’ set or the complement scores’ set for its associated data value’s event a k . g w . j . i h . T \ F .
Due to the i h   observed, the data value in an event may resemble a single data value, i.e.,   a k . g w . j . T \ F = a k . g w . j . i h . T \ F , or a combined data value from its similar a k . g i h values, i.e., a k . g w . j . T \ F = a k . g w . j . i h . T \ F = ( a k . 1 i h . T | F , a k .2 i h . T | F , , a k . q i h . T \ F ) ,     a k . g i h = a k . g w . j . i h n D O w . j . The reliability scores set assigned to a data value’s event   ( a k . g w . j . T \ F ) may also have a single reliability score of μ a k . g w . j . i h . T \ F = μ a k . g w . j . i h . T \ F , or multiple reliability scores of μ a k . g w . j . i h . T \ F = ( μ a k . 1 i h . T \ F , μ a k .2 i h . T \ F ,   ,   μ a k . q i h . T \ F ) ,   a k . g i h = a k . g w . j . i h n D O w . j .
For a domain holding a single data value pair, i.e., $Dom\,A_k^{w.j} = \{(a_{k.g}^{w.j.ih}, \mu a_{k.g}^{w.j.ih})\}$, the two events $(a_{k.g}^{w.j.ih}.T,\ a_{k.g}^{w.j.ih}.F)$ comprise the data fusion sample space, where each event represents a possible world of the true fused data value, i.e., $(a_{k.Tv}^{w.j} = a_{k.g}^{w.j.ih}.T \oplus a_{k.Tv}^{w.j} = a_{k.g}^{w.j.ih}.F)$, such that $(\oplus)$ represents an exclusive OR (XOR). This follows because the general data fusion sample space of true alternatives is obtained as $\Omega_{w.j.k} = \bigcup_{l=1}^{L} \Omega_{tv.l} : 1 \le l \le L,\ L = 2^{|Dom\,A_k^{w.j}|}$ [3].
As multi-distinct data values are being observed in D o m   A k w . j , the generation of the data fusion’s sample space requires considering the Cartesian product operation over the events sets that could be obtained from D o m   A k w . j in Equation (4). Therefore, the produced sample space over D o m   A k w . j is the result of the Cartesian product operation over the events’ sets of its data values pairs, as presented below in Equation (5):
$$\begin{aligned} \Omega_{w.j.k} &= \{ (a_{k.1}^{w.j.ih}.T, \mu a_{k.1}^{w.j.ih}.T), (a_{k.1}^{w.j.ih}.F, \mu a_{k.1}^{w.j.ih}.F) \} \times \dots \times \{ (a_{k.q}^{w.j.ih}.T, \mu a_{k.q}^{w.j.ih}.T), (a_{k.q}^{w.j.ih}.F, \mu a_{k.q}^{w.j.ih}.F) \} \\ \Omega_{w.j.k} &= \left( \mathop{\times}\limits_{g^{w.j}=1}^{q^{w.j}} \{ (a_{k.g}^{ih}.T, \mu a_{k.g}^{ih}.T), (a_{k.g}^{ih}.F, \mu a_{k.g}^{ih}.F) \} \;\middle|\; a_{k.g}^{w.j} \in Dom\,A_k^{w.j},\ |ih| \ge 1 \right) \end{aligned} \tag{5}$$
$\Omega_{w.j.k}$ contains multiple mutually exclusive data fusion worlds $\Omega_{tv.l}$, such that each data value pair included in an $\Omega_{tv.l}$ world is encoded by either its true or its false event. Moreover, since each event pairs the true or false form of the actual data value with its reliability scores, an $\Omega_{tv.l}$ world is depicted as a pair that comprises multiple data value events with their associated reliability score sets. This pair representation of alternatives is depicted in Equation (6), and the mutually exclusive alternatives produced for a particular data fusion sample space are as shown below:
$$\begin{aligned} \Omega_{w.j.k} &= \left( \{ a(\Omega_{tv.l}), \mu(\Omega_{tv.l}) \} \;\middle|\; a(\Omega_{tv.l}) = \bigcup_{g^{w.j}=1}^{q^{w.j}} (a_{k.g}^{w.j.ih}.T\backslash F) \ \wedge\ \mu(\Omega_{tv.l}) = \bigcup_{g^{w.j}=1}^{q^{w.j}} \bigcup_{g^{ih}=1^{11}}^{q^{nm}} (\mu a_{k.g}^{w.j.ih}.T\backslash F) \right) \\ \Omega_{w.j.k} &= \{ \{ (a_{k.1}^{ih}.T, \dots, a_{k.q-1}^{ih}.T, a_{k.q}^{ih}.T), (\mu a_{k.1}^{ih}.T, \dots, \mu a_{k.q-1}^{ih}.T, \mu a_{k.q}^{ih}.T) \}, \dots, \\ &\quad\ \{ (a_{k.1}^{ih}.F, \dots, a_{k.q-1}^{ih}.F, a_{k.q}^{ih}.T), (\mu a_{k.1}^{ih}.F, \dots, \mu a_{k.q-1}^{ih}.F, \mu a_{k.q}^{ih}.T) \}, \\ &\quad\ \{ (a_{k.1}^{ih}.F, \dots, a_{k.q-1}^{ih}.F, a_{k.q}^{ih}.F), (\mu a_{k.1}^{ih}.F, \dots, \mu a_{k.q-1}^{ih}.F, \mu a_{k.q}^{ih}.F) \} \} \end{aligned} \tag{6}$$
where
- $a(\Omega_{tv.l})$ denotes the union of the true and false $(a_{k.g}^{w.j.ih}.T\backslash F)$ data value events contained in an $\Omega_{tv.l}$ world. Depending on $Dom\,A_k^{w.j}$, the $a(\Omega_{tv.l})$ set denotes one or more of the actual data value events, such that none, some, or all of them can be combined data values.
- $\mu(\Omega_{tv.l})$ denotes the reliability scores set for an $\Omega_{tv.l}$ world, obtained from the union of the reliability score sets of all the distinct $a_{k.g}^{w.ih}.T\backslash F$ events included in the $\Omega_{tv.l}$ world. Depending on the participating events in an $\Omega_{tv.l}$ alternative, i.e., all are true, some are true, or none is true (all false), the reliability score set $\mu(\Omega_{tv.l})$ of a data fusion alternative may contain the original and/or the complement reliability score sets.
Table 3 illustrates the sample space production and is outlined based on D o m   A 3 1.1 from Table 2.
Table 3. An example for sample space production.
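The sample space construction in Equations (4) to (6) can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `events` and `sample_space` are hypothetical, each world is represented as a pair of (values encoded as true, reliability scores), and complements are rounded to avoid floating-point noise. The input is the $Dom\,A_3^{1.1}$ domain used in this section.

```python
from itertools import product


def events(value, scores):
    """Split one (value, scores) pair into its true and false events."""
    true_event = (value, True, tuple(scores))
    # The false event carries the complement 1 - mu of every score.
    false_event = (value, False, tuple(round(1 - s, 10) for s in scores))
    return [true_event, false_event]


def sample_space(domain):
    """Cartesian product over the event sets of all domain values."""
    worlds = []
    for combo in product(*(events(v, s) for v, s in domain)):
        true_values = tuple(v for v, is_true, _ in combo if is_true)
        scores = tuple(s for _, _, event_scores in combo for s in event_scores)
        worlds.append((true_values, scores))
    return worlds


dom_a3 = [
    ("12224 Ventura Blvd", (0.9, 0.6)),
    ("12335 Fienega Blvd", (0.45,)),
]

for world in sample_space(dom_a3):
    print(world)
```

For a domain of two values, the product yields $2^2 = 4$ worlds; a world with no true value corresponds to the "Unknown" alternative, and its score set holds only complements.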
Due to null value existence, an additional challenge to generate the Ω w . j . k sample space is observed. This challenge is related to D o m   A k w . j generation in Equation (3). The sample space for this domain is generated throughout the Cartesian product operations over the events pair’s sets of the non-null and the null data values’ sets based on Equation (5). However, since the atomic null indicates that at most one data value can be true, then any generated world that represents more than one data value by its true event would be an impossible world, i.e., I w s w . j . k ( N u l l ) , and it should be eliminated from the ( Ω w . j . k ( N u l l ) ) sample space production. Thus, the sample space production due to null value presence is formulated in Equation (7).
$$\begin{aligned} \Omega_{w.j.k} &= \left( \Omega_{w.j.k}^{(NonNull)} \times \Omega_{w.j.k}^{(Null)} \right) : \forall\, \Omega_{tv.l}^{w.j.k(Null)} \in \Omega_{w.j.k}^{(Null)},\ \left( \left| \bigcup_{Nl_g=Nl_1}^{Nl_q} a_{k.g.Nl_g}^{w.j.ih}.T \right| \in a(\Omega_{tv.l}^{w.j.k(Null)}) \right) \le 1, \\ \Omega_{w.j.k}^{(Null)} &= \bigcup_{l_{Nl}=1}^{L_{Nl}} \Omega_{tv.l_{Nl}}^{w.j.k},\quad L_{Nl} = |Dom\,a_{k.g.Nl}^{w.j.ih}| + 1, \end{aligned} \tag{7}$$
where:
- $\Omega_{w.j.k}^{(NonNull)} = \mathop{\times}\limits_{g^{w.j}=1}^{q^{w.j}-|g.Nl^{w.j}|} \{ (a_{k.g}^{w.j.ih}.T, \mu a_{k.g}^{w.j.ih}.T), (a_{k.g}^{w.j.ih}.F, \mu a_{k.g}^{w.j.ih}.F) \}$.
- $\Omega_{w.j.k}^{(Null)} = \mathop{\times}\limits_{Nl_g=Nl_1}^{Nl_q} \{ (a_{k.g.Nl_g}^{w.j.ih}.T, \mu a_{k.g.Nl_g}^{w.j.ih}.T), (a_{k.g.Nl_g}^{w.j.ih}.F, \mu a_{k.g.Nl_g}^{w.j.ih}.F) \} - Iws^{w.j.k(Null)} : \forall\, a(\Omega_{tv.l}^{w.j.k(Null)}) \left( \left| \bigcup_{Nl_g=Nl_1}^{Nl_q} a_{k.g.Nl_g}^{w.j.ih}.T \right| \ge 2 \right),\ \Omega_{tv.l}^{w.j.k(Null)} \in Iws^{w.j.k(Null)}$.
Each obtained world from this operation will be represented as a pair of multi-possible values’ events with their associated reliability scores sets as previously stated in Equation (6), such that each world consists of a k . g . N l g w . j . i h . T / F and μ a k . g . N l g w . j . i h . T / F events.
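The effect of the atomic null constraint in Equation (7) can be sketched in Python. This is an illustrative sketch under assumptions: the function names `null_worlds` and `combine` are hypothetical, every candidate of the null value inherits the single reliability score of its source, and the candidate values (114 Eqaila Blvd, 105 Eqaila Blvd) are those used later in the uncertainty example of this section.

```python
from itertools import product


def null_worlds(candidates, score):
    """Worlds over the candidates of an atomic null: at most one is true."""
    worlds = []
    for flags in product([True, False], repeat=len(candidates)):
        if sum(flags) > 1:
            continue  # impossible world: more than one candidate is true
        true_values = tuple(c for c, f in zip(candidates, flags) if f)
        scores = tuple(score if f else round(1 - score, 10) for f in flags)
        worlds.append((true_values, scores))
    return worlds


def combine(non_null, null):
    """Cartesian product of the non-null and the null sample spaces."""
    return [(a + b, sa + sb) for (a, sa), (b, sb) in product(non_null, null)]


# Non-null part: the single value 12224 Ventura Blvd, true or false.
non_null = [
    (("12224 Ventura Blvd",), (0.9, 0.6)),
    ((), (0.1, 0.4)),
]
null = null_worlds(["114 Eqaila Blvd", "105 Eqaila Blvd"], 0.45)

for world in combine(non_null, null):
    print(world)
```

With two candidates, the null part contributes $L_{Nl} = 2 + 1 = 3$ worlds instead of four, so the combined sample space holds $2 \times 3 = 6$ worlds.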

5.1.2. The Obtained Possible-Worlds Based on the Data Fusion Cases

The sample spaces stated earlier in Equations (6) and (7) are used to obtain the possible-worlds sets for the data fusion cases. A possible-worlds set can be equal to or smaller than its sample space because of the impossible worlds $(Iws^{Case.w.j.k})$. For instance, alternatives containing multiple true events are considered possible worlds under the MTC cases only, whereas only alternatives that have at most one true event are recognized as possible worlds under the ITC cases. The production of a possible-worlds set out of an $\Omega_{w.j.k}$ sample space for a given data fusion case is presented in Equation (8), while Table 4 shows the possible-worlds production rules for each data fusion case presented earlier in Table 1.
Table 4. The possible-worlds generations given a data fusion case.
$$\begin{aligned} Pws_{Tv}^{Case.w.j.k} &= \Omega_{w.j.k} - Iws^{Case.w.j.k} \\ Pws_{Tv}^{Case.w.j.k} &= \bigcup_{p=1}^{P} Pws_{tv.p} : 1 \le p \le P,\ Pws_{Tv}^{Case.w.j.k} \subseteq \Omega_{w.j.k},\ \&\ \forall\, Pws_{tv.p} \in (\Omega_{tv.l} - Iws^{w.j.k}), \\ &\quad (Pws_{Tv}^{Case.w.j.k} \ni Pws_{tv.p} \in \Omega_{w.j.k}),\ Pws_{tv.p} \notin Iws^{w.j.k},\ \text{and}\ a(\Omega_{tv.l}) \equiv a(Pws_{tv.p}),\ \mu(\Omega_{tv.l}) \equiv \mu(Pws_{tv.p}) \\ &\quad \iff \Omega_{tv.l} = Pws_{tv.p}\ \&\ \Omega_{tv.l} \in Pws_{Tv}^{Case.w.j.k} \end{aligned} \tag{8}$$
Example 2: To illustrate the sample space and possible-worlds production based on the data fusion cases, the $Dom\,A_3^{1.1}$ data value domain from Table 2 and its sample space production from Table 3 are used. From these, the following sample space alternatives are generated: $\Omega_{1.1.3} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.3}, \Omega_{tv.4}\}$, where
  • $\Omega_{tv.1} = ((a_{3.1}^{1.1.(13,21)}.T, a_{3.2}^{1.1.(31)}.T), (\mu a_{3.1}^{1.1.(13,21)}.T, \mu a_{3.2}^{1.1.(31)}.T)) = (a(\Omega_{tv.1}), \mu(\Omega_{tv.1})) = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}, \text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.9, 0.6, 0.45))$.
  • $\Omega_{tv.2} = ((a_{3.1}^{1.1.(13,21)}.T, a_{3.2}^{1.1.(31)}.F), (\mu a_{3.1}^{1.1.(13,21)}.T, \mu a_{3.2}^{1.1.(31)}.F)) = (a(\Omega_{tv.2}), \mu(\Omega_{tv.2})) = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$.
  • $\Omega_{tv.3} = ((a_{3.1}^{1.1.(13,21)}.F, a_{3.2}^{1.1.(31)}.T), (\mu a_{3.1}^{1.1.(13,21)}.F, \mu a_{3.2}^{1.1.(31)}.T)) = (a(\Omega_{tv.3}), \mu(\Omega_{tv.3})) = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$.
  • $\Omega_{tv.4} = ((a_{3.1}^{1.1.(13,21)}.F, a_{3.2}^{1.1.(31)}.F), (\mu a_{3.1}^{1.1.(13,21)}.F, \mu a_{3.2}^{1.1.(31)}.F)) = (a(\Omega_{tv.4}), \mu(\Omega_{tv.4})) = ((\text{Unknown}), (0.1, 0.4, 0.55))$.
After producing the above sample space, the possible-worlds production can be generated for a data fusion case based on the production rules presented in Table 4. Accordingly, the possible-worlds sets that can be observed in a given data fusion case are listed below:
- If the data fusion case is MTC-OWA (C1.2), then multiple data values can be true at the same time, and the true value may not exist in the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C1.2).(1.1.3)})$ are equal to the sample space $\Omega_{1.1.3}$, as shown below:
  • $Pws_{Tv}^{(C1.2).(1.1.3)} = \Omega_{1.1.3} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.3}, \Omega_{tv.4}\} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}, Pws_{tv.4}\}$: $Pws_{tv.1} = \Omega_{tv.1} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}, \text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.9, 0.6, 0.45))$, $Pws_{tv.2} = \Omega_{tv.2} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$, $Pws_{tv.3} = \Omega_{tv.3} = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$, and $Pws_{tv.4} = \Omega_{tv.4} = ((\text{Unknown}), (0.1, 0.4, 0.55))$.
- If the data fusion case is MTC-CWA (C1.1), then multiple data values can be true at the same time, and the true value cannot come from outside the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C1.1).(1.1.3)})$ are as shown below:
  • $Pws_{Tv}^{(C1.1).(1.1.3)} = \Omega_{1.1.3} - Iws^{(C1.1).(1.1.3)} : Iws^{(C1.1).(1.1.3)} = \{\Omega_{tv.4}\}$, so $Pws_{Tv}^{(C1.1).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}\}$: $Pws_{tv.1} = \Omega_{tv.1} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}, \text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.9, 0.6, 0.45))$, $Pws_{tv.2} = \Omega_{tv.2} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$, and $Pws_{tv.3} = \Omega_{tv.3} = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$.
- If the data fusion case is ITCC-CWA (C2.1.1), then only one data value can be true at a time, and the true value cannot come from outside the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C2.1.1).(1.1.3)})$ are as shown below:
  • $Pws_{Tv}^{(C2.1.1).(1.1.3)} = \Omega_{1.1.3} - Iws^{(C2.1.1).(1.1.3)} : Iws^{(C2.1.1).(1.1.3)} = \{\Omega_{tv.1}, \Omega_{tv.4}\}$, so $Pws_{Tv}^{(C2.1.1).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}\}$: $Pws_{tv.1} = \Omega_{tv.2} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$ and $Pws_{tv.2} = \Omega_{tv.3} = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$.
- If the data fusion case is ITCC-OWA (C2.1.2), then only one data value can be true at a time, and the true value may come from outside the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C2.1.2).(1.1.3)})$ are as shown below:
  • $Pws_{Tv}^{(C2.1.2).(1.1.3)} = \Omega_{1.1.3} - Iws^{(C2.1.2).(1.1.3)} : Iws^{(C2.1.2).(1.1.3)} = \{\Omega_{tv.1}\}$, so $Pws_{Tv}^{(C2.1.2).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}\}$: $Pws_{tv.1} = \Omega_{tv.2} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$, $Pws_{tv.2} = \Omega_{tv.3} = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$, and $Pws_{tv.3} = \Omega_{tv.4} = ((\text{Unknown}), (0.1, 0.4, 0.55))$.
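The four listings above amount to filtering the sample space by the number of values that a world encodes as true. The Python sketch below illustrates this under assumptions: the rule table `RULES` is inferred from the four cases worked through in Example 2 (Table 4 itself is not reproduced here), the function name `possible_worlds` is hypothetical, and the uncertainty cases are not covered.

```python
# Sample space of Example 2: (values encoded as true, reliability scores).
OMEGA = [
    (("12224 Ventura Blvd", "12335 Fienega Blvd"), (0.9, 0.6, 0.45)),  # tv.1
    (("12224 Ventura Blvd",), (0.9, 0.6, 0.55)),                       # tv.2
    (("12335 Fienega Blvd",), (0.1, 0.4, 0.45)),                       # tv.3
    ((), (0.1, 0.4, 0.55)),                                            # tv.4 (Unknown)
]

# Admissible number of true values per case (inferred from Example 2).
RULES = {
    "C1.1": lambda n: n >= 1,    # MTC-CWA: multi-truth, truth inside the domain
    "C1.2": lambda n: True,      # MTC-OWA: every world is possible
    "C2.1.1": lambda n: n == 1,  # ITCC-CWA: exactly one true value
    "C2.1.2": lambda n: n <= 1,  # ITCC-OWA: at most one true value
}


def possible_worlds(sample_space, case):
    """Remove the impossible worlds of the given data fusion case."""
    keep = RULES[case]
    return [world for world in sample_space if keep(len(world[0]))]


for case in RULES:
    print(case, len(possible_worlds(OMEGA, case)))
```

Under these rules the closed-world cases discard the all-false "Unknown" world, and the single-truth cases discard the world in which both addresses are true.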
To illustrate the uncertainty data fusion cases, we assume that the data value $a_{3.2}^{31}$ in the $DOM\,A_3^{1.1}$ domain is null and comes with additional knowledge about its possible domain of values, $Dom\,a_{3.2.Nl}^{1.1.(31)} = \{a_{3.2.Nl_1}, a_{3.2.Nl_2}\}$, such that $a_{3.2.Nl_1} = \text{114 Eqaila Blvd}$ and $a_{3.2.Nl_2} = \text{105 Eqaila Blvd}$. Based on Equation (7), the data fusion sample space sets under uncertainty are outlined below:
$\Omega^{1.1.3} = (\Omega^{1.1.3}_{(NonNull)} \times \Omega^{1.1.3}_{(Null)})$, where

$\Omega^{1.1.3}_{(NonNull)} = \{(a_{3.1}^{1.1.T}, \mu a_{3.1}^{1.1.T}), (a_{3.1}^{1.1.F}, \mu a_{3.1}^{1.1.F})\} = \{(\text{12224 Ventura Blvd}_{3.1}^{1.1}, (0.9, 0.6)), (\text{Unknown}, (0.1, 0.4))\}$

$\Omega^{1.1.3}_{(Null)} = \{((a_{3.2.Nl_1}^{1.1.(31).T}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.2.Nl_1}^{1.1.(31).T}, \mu a_{3.2.Nl_2}^{1.1.(31).F})), ((a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).T}), (\mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).T})), ((a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).F}))\}$

$\Omega^{1.1.3}_{(Null)} = \{((\text{114 Eqaila Blvd}_{3.2.Nl_1}), (0.45, 0.55)), ((\text{105 Eqaila Blvd}_{3.2.Nl_2}), (0.55, 0.45)), ((\text{Unknown}), (0.55, 0.55))\}$

$\Omega^{1.1.3} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.3}, \Omega_{tv.4}, \Omega_{tv.5}, \Omega_{tv.6}\}$:
-
$\Omega_{tv.1} = ((a_{3.1}^{1.1.(13,21).T}, a_{3.2.Nl_1}^{1.1.(31).T}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.1}^{1.1.(13,21).T}, \mu a_{3.2.Nl_1}^{1.1.(31).T}, \mu a_{3.2.Nl_2}^{1.1.(31).F})) = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}, \text{114 Eqaila Blvd}_{3.2.Nl_1}^{1.1.(31)}), (0.9, 0.6, 0.45, 0.55))$.
-
$\Omega_{tv.2} = ((a_{3.1}^{1.1.(13,21).T}, a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).T}), (\mu a_{3.1}^{1.1.(13,21).T}, \mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).T})) = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}, \text{105 Eqaila Blvd}_{3.2.Nl_2}^{1.1.(31)}), (0.9, 0.6, 0.55, 0.45))$.
-
$\Omega_{tv.3} = ((a_{3.1}^{1.1.(13,21).T}, a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.1}^{1.1.(13,21).T}, \mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).F})) = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55, 0.55))$.
-
$\Omega_{tv.4} = ((a_{3.1}^{1.1.(13,21).F}, a_{3.2.Nl_1}^{1.1.(31).T}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.1}^{1.1.(13,21).F}, \mu a_{3.2.Nl_1}^{1.1.(31).T}, \mu a_{3.2.Nl_2}^{1.1.(31).F})) = ((\text{114 Eqaila Blvd}_{3.2.Nl_1}^{1.1.(31)}), (0.1, 0.4, 0.45, 0.55))$.
-
$\Omega_{tv.5} = ((a_{3.1}^{1.1.(13,21).F}, a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).T}), (\mu a_{3.1}^{1.1.(13,21).F}, \mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).T})) = ((\text{105 Eqaila Blvd}_{3.2.Nl_2}^{1.1.(31)}), (0.1, 0.4, 0.55, 0.45))$.
-
$\Omega_{tv.6} = ((a_{3.1}^{1.1.(13,21).F}, a_{3.2.Nl_1}^{1.1.(31).F}, a_{3.2.Nl_2}^{1.1.(31).F}), (\mu a_{3.1}^{1.1.(13,21).F}, \mu a_{3.2.Nl_1}^{1.1.(31).F}, \mu a_{3.2.Nl_2}^{1.1.(31).F})) = ((\text{Unknown}), (0.1, 0.4, 0.55, 0.55))$.
Below are possible worlds for the data fusion’s uncertainty cases as observed based on the above sample space sets:
-
If the data fusion case is ITCU-CWA (C2.2.1), then only one data value can be true at a time, and the true value cannot come from outside the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C2.2.1).(1.1.3)})$ would be as shown below:
$Pws_{Tv}^{(C2.2.1).(1.1.3)} = \Omega^{1.1.3} \setminus Iws^{(C2.2.1).(1.1.3)}$, where $Iws^{(C2.2.1).(1.1.3)} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.6}\}$ and $Pws_{Tv}^{(C2.2.1).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}\}$:
  • $Pws_{tv.1} = \Omega_{tv.3} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55, 0.55))$,
  • $Pws_{tv.2} = \Omega_{tv.4} = ((\text{114 Eqaila Blvd}_{3.2.Nl_1}^{1.1.(31)}), (0.1, 0.4, 0.45, 0.55))$,
  • $Pws_{tv.3} = \Omega_{tv.5} = ((\text{105 Eqaila Blvd}_{3.2.Nl_2}^{1.1.(31)}), (0.1, 0.4, 0.55, 0.45))$.
-
If the data fusion case is ITCU-OWA (C2.2.2), then only one data value can be true at a time, and the true value may come from outside the $DOM\,A_3^{1.1}$ domain. Therefore, the generated possible worlds $(Pws^{(C2.2.2).(1.1.3)})$ would be as shown below:
$Pws_{Tv}^{(C2.2.2).(1.1.3)} = \Omega^{1.1.3} \setminus Iws^{(C2.2.2).(1.1.3)}$, where $Iws^{(C2.2.2).(1.1.3)} = \{\Omega_{tv.1}, \Omega_{tv.2}\}$ and $Pws_{Tv}^{(C2.2.2).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}, Pws_{tv.4}\}$:
  • $Pws_{tv.1} = \Omega_{tv.3} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55, 0.55))$,
  • $Pws_{tv.2} = \Omega_{tv.4} = ((\text{114 Eqaila Blvd}_{3.2.Nl_1}^{1.1.(31)}), (0.1, 0.4, 0.45, 0.55))$,
  • $Pws_{tv.3} = \Omega_{tv.5} = ((\text{105 Eqaila Blvd}_{3.2.Nl_2}^{1.1.(31)}), (0.1, 0.4, 0.55, 0.45))$,
  • $Pws_{tv.4} = \Omega_{tv.6} = ((\text{Unknown}), (0.1, 0.4, 0.55, 0.55))$.
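The construction of the uncertain sample space as a Cartesian product, and the filtering that yields the ITCU possible worlds, can be sketched as follows. This is a minimal illustration under our own assumptions: the list layout, the variable names, and the helper `itcu_worlds` are hypothetical, and an empty tuple stands for "no value in this component is true".

```python
from itertools import product

# Each entry: (values taken as true, component scores). Empty tuple = none true.
non_null = [
    (("12224 Ventura Blvd",), (0.9, 0.6)),
    ((), (0.1, 0.4)),
]
null_candidates = [
    (("114 Eqaila Blvd",), (0.45, 0.55)),
    (("105 Eqaila Blvd",), (0.55, 0.45)),
    ((), (0.55, 0.55)),
]

# The Cartesian product of the two component spaces gives the six worlds,
# in the same order as Omega_tv.1 ... Omega_tv.6.
omega = [(v1 + v2, s1 + s2)
         for (v1, s1), (v2, s2) in product(non_null, null_candidates)]

def itcu_worlds(worlds, open_world):
    """Keep worlds with exactly one true value (CWA) or at most one (OWA)."""
    low = 0 if open_world else 1
    return [w for w in worlds if low <= len(w[0]) <= 1]

cwa = itcu_worlds(omega, open_world=False)  # three worlds
owa = itcu_worlds(omega, open_world=True)   # four worlds, incl. "none true"

for world in owa:
    print(world)
```

Under the closed-world assumption the world in which no value is true is excluded together with the two multi-true worlds; under the open-world assumption only the multi-true worlds are excluded.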
Therefore, a probabilistic global entity with its actual data values and reliability scores is created, and the updated reliability scores for those data values can be computed based on the obtained possible-worlds sets and their recognized data fusion cases. The next section shows the constructed probabilistic data fusion computational method.

5.2. The Probabilistic Data Fusion Computational Method

After conceptually representing the merging of actual data values for a probabilistic global entity as multiple alternatives of possible true data fusion answers within a $Pws_{Tv}^{Case.w.j.k}$ set, this section formally constructs the data fusion computational method and shows how we leverage the trustworthiness of sources in truth discovery.
The data fusion method computes the conditional probability, i.e., the updated reliability score, of each possible world of data value(s) being the true fused answer, using the $\mu(\Omega_{tv.l}^{p})$ scores and given a data fusion case. Thus, the conditional probability form $(a_{k.Tv}^{w.j} = a(\Omega_{tv.l}) \mid Case),\ \mu(a_{k.Tv}^{w.j} = a(\Omega_{tv.l}) \mid Case)$, in which each $\Omega_{tv.l} \in Pws_{Tv}^{Case.w.j.k}$ is relabeled as a possible world $Pws_{tv.p}$ and $a(\Omega_{tv.l})$ and $\mu(\Omega_{tv.l})$ are carried over as $a(Pws_{tv.p})$ and $\mu(Pws_{tv.p})$, is used to represent a possible world of data values that is most likely recognized as the true fusion answer, together with its conditional probability value, which requires further computation. In this form, the $Case$ condition denotes a given possible-worlds set with regard to the fusion cases presented in Table 1, $Case \in \{\text{MTC-CWA}, \text{MTC-OWA}, \text{ITCC-CWA}, \text{ITCC-OWA}, \text{ITCU-CWA}, \text{ITCU-OWA}\}$. To this end, the assumptions listed in Section 3.5 are considered.
To obtain a single reliability score, i.e., the conditional probability value, for each possible data fusion alternative within a particular $Dom\,A_k^{w.j}$ and $nDO_{w.j}$ alternative, the data fusion computational formula is constructed in Equation (9). This formula is constructed using Bayes’ theorem, the probability distribution from $\mu(Pws_{tv.p})$, and assumptions 1 to 3 (refer to Section 3.6) (the detailed derivations are given in Appendix A):
$$
\begin{aligned}
\mu\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right)
&= \frac{\mu\left((a_{k.Tv}^{w.j} = a(Pws_{tv.p})) \cap Pws_{Tv}^{Case.w.j.k}\right)}{\mu\left(Case = Pws_{Tv}^{Case.w.j.k}\right)} \\
&= \frac{\mu\left((a_{k.Tv}^{w.j} = a(Pws_{tv.p})) \cap Pws_{Tv}^{Case.w.j.k}\right)}{\mu\left(Pws_{Tv}^{Case.w.j.k}\right)} \\
&= \frac{\mu(Pws_{tv.p})}{\sum_{p=1}^{P} \mu(Pws_{tv.p})} \\
&= \frac{\mu(Pws_{tv.p})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \cdots + \mu(Pws_{tv.P})}
\end{aligned}
$$

Substituting for $\mu(\Omega_{tv.l}^{w.j.k})$, i.e., $\mu(Pws_{tv.p})$, from Equations (6) and (8), we get:

$$
\mu\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right)
= \frac{\prod_{g^{w.j}=1}^{q^{w.j}} \prod_{g^{ih}=1^{11}}^{q^{nm}} \left(\mu a_{k.g}^{w.j.ih.T \backslash F}\right)_{tv.p}}
{\prod_{g^{w.j}=1}^{q^{w.j}} \prod_{g^{ih}=1^{11}}^{q^{nm}} \left(\mu a_{k.g}^{w.j.ih.T \backslash F}\right)_{tv.1} + \prod_{g^{w.j}=1}^{q^{w.j}} \prod_{g^{ih}=1^{11}}^{q^{nm}} \left(\mu a_{k.g}^{w.j.ih.T \backslash F}\right)_{tv.2} + \cdots + \prod_{g^{w.j}=1}^{q^{w.j}} \prod_{g^{ih}=1^{11}}^{q^{nm}} \left(\mu a_{k.g}^{w.j.ih.T \backslash F}\right)_{tv.P}}
\tag{9}
$$

where $\sum_{p=1}^{P} \mu\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right) = 1$.
This computation is applied to all worlds $Pws_{tv.1}, Pws_{tv.2}, \ldots, Pws_{tv.P} \in Pws_{Tv}^{Case.w.j.k}$ that are observed from the $\Omega^{w.1.k}, \ldots, \Omega^{w.f.k}$ sets. Thus, the possible-worlds sets of correct values and their posterior reliability scores are obtained for all merging alternatives that make up the processed $nDO_w$ entity. As a result, a probabilistic global entity with its actual values is obtained, and the updated reliability scores for the requested attribute values are computed accordingly.
Example 3. To illustrate the computation method for finding the probabilistic true data fusion answer within a specific $Pws_{Tv}^{Case.w.j.k}$ set, this example continues from the possible-worlds sets obtained in Example 2, using the ITCC-OWA (C2.1.2) data fusion case.
Since $Pws^{(C2.1.2).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}\}$, where $Pws_{tv.1} = ((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$, $Pws_{tv.2} = ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$, and $Pws_{tv.3} = ((\text{Unknown}), (0.1, 0.4, 0.55))$, the conditional probability, i.e., the updated reliability score, of a specific possible world’s data value being true is computed as below.
$Pws_{Tv}^{(C2.1.2).(1.1.3)} = \{a(Pws_{tv.p}), \mu(Pws_{tv.p})\} = \{((\text{12224 Ventura Blvd}_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55)), ((\text{12335 Fienega Blvd}_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45)), ((\text{Unknown}), (0.1, 0.4, 0.55))\}$
  • $\mu(a_{k.Tv}^{w.j} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}) = \dfrac{\mu(Pws_{tv.1})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \dfrac{0.9 \times 0.6 \times 0.55}{(0.9 \times 0.6 \times 0.55) + (0.1 \times 0.4 \times 0.45) + (0.1 \times 0.4 \times 0.55)} = \dfrac{0.297}{0.297 + 0.018 + 0.022} = \dfrac{0.297}{0.337} \approx 0.881$.
  • $\mu(a_{k.Tv}^{w.j} = a(Pws_{tv.2}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}) = \dfrac{\mu(Pws_{tv.2})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \dfrac{0.018}{0.337} \approx 0.053$.
  • $\mu(a_{k.Tv}^{w.j} = a(Pws_{tv.3}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}) = \dfrac{\mu(Pws_{tv.3})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \dfrac{0.022}{0.337} \approx 0.065$.
Based on the updated reliability score for each data value, the probability of 12224 Ventura Blvd being the true value equals 0.881, the probability of 12335 Fienega Blvd being the true value equals 0.053, and the probability of none of these values being true equals 0.065. We conclude that the most likely correct address value, i.e., the data fusion answer for the global entity $nDO_{1.1} = (rDO_1^{13} : pDO_{11}^{21}, pDO_{12}^{31})[0.8245]$, that is, (Arts Delicate(13, 21, 31), 0.8245), is 12224 Ventura Blvd with a probability of 0.881.
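The normalization in Example 3 can be checked with a short Python sketch. The function name `updated_reliability` and the dictionary layout are illustrative assumptions of ours; the sketch simply multiplies the component scores of each possible world and divides by the sum over all possible worlds of the case, in the spirit of Equation (9).

```python
from math import prod

def updated_reliability(worlds):
    """Normalize each world's product of component scores over all worlds.

    `worlds` maps a label for the world's true value(s) to the tuple of
    component scores attached to that world (hypothetical layout).
    """
    weights = {value: prod(scores) for value, scores in worlds.items()}
    total = sum(weights.values())
    return {value: weight / total for value, weight in weights.items()}

# Possible worlds of the ITCC-OWA (C2.1.2) case from Example 3.
itcc_owa = {
    "12224 Ventura Blvd": (0.9, 0.6, 0.55),
    "12335 Fienega Blvd": (0.1, 0.4, 0.45),
    "Unknown": (0.1, 0.4, 0.55),
}

posterior = updated_reliability(itcc_owa)
for value, p in posterior.items():
    print(f"{value}: {p:.3f}")
```

Because the scores are normalized over the possible worlds of the selected case only, the resulting values always sum to one, regardless of how many worlds the case excludes.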

5.3. Probability to Possibility Transformation Method

The main purpose of the transformation method is to allow users to choose the range of true fused value alternatives that they are willing to retrieve and view. This method makes retrieval more effective: only the fewer, more likely alternatives are retrieved, using a threshold value selected by the user. In addition, possibility theory has the advantage over probability theory of providing a more efficient information retrieval and ranking strategy by using a possibility threshold value instead of a probability threshold value [93,94].
In fact, determining a probabilistic threshold value for retrieving some alternatives is very difficult because the ranges of probability distribution values vary. With a possibility threshold, the user can arbitrarily select a possibility value as the threshold value ($\beta$) for retrieving the data fusion answer’s alternatives. Only alternatives whose possibility values equal or exceed the selected threshold are then retrieved.
This transformation method divides the probability value of each true fused value alternative that belongs to a particular possible-worlds set of $(nDO_{w.j})$ by the highest probability value among them. Thus, the transformed possibility value of the alternative with the maximum probability value equals one. The formula below shows the transformation computation.
$$
Pos\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right) = \frac{\mu\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right)}{Max\left(\mu\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\right)\right)} : Pws_{tv.p} \in Pws_{Tv}^{Case.w.j.k}
\tag{10}
$$
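The transformation and the threshold-based retrieval can be sketched as follows. The function names `to_possibility` and `retrieve` and the threshold value used in the usage line are hypothetical; the input probabilities are the rounded values from Example 3.

```python
def to_possibility(posterior):
    """Divide every probability by the largest one, so the top alternative gets 1."""
    top = max(posterior.values())
    return {value: p / top for value, p in posterior.items()}

def retrieve(posterior, beta):
    """Return alternatives whose possibility is at least beta, best first."""
    possibility = to_possibility(posterior)
    hits = [(value, pos) for value, pos in possibility.items() if pos >= beta]
    return sorted(hits, key=lambda item: item[1], reverse=True)

posterior = {
    "12224 Ventura Blvd": 0.881,
    "12335 Fienega Blvd": 0.053,
    "Unknown": 0.065,
}

# With beta = 0.07, only alternatives at least 7% as likely as the best
# alternative are returned.
print(retrieve(posterior, beta=0.07))
```

A practical consequence of this design is that the threshold is relative to the best alternative rather than absolute, so the top-ranked alternative is always retrieved for any $\beta \le 1$.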

6. Proof of Concept: Model Implementation and Discussion

The data fusion computational method has been mathematically proved, as discussed in Appendix A. The proof considers the integration of data values that are associated with different reliability scores and obtained from three different data sources. The data fusion method has been implemented in our probabilistic integration system as extended merging and computation functions that operate at the attribute level to handle the two inconsistent true value cases under the CWA (i.e., ITCC-CWA (C2.1.1) and ITCU-CWA (C2.2.1)). In this implementation, the data values for a global attribute are matched and grouped under one domain. The decision model for this data fusion method generates a set of possible worlds of fused data value alternatives, in which each possible world is associated with an updated reliability value, i.e., $(a(Pws_{tv.p}), \mu(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}))$. The updated value is computed based on Equation (9). The updated reliability values within each entity merging alternative are transformed into possibility values using Equation (10). This transformation facilitates the retrieval process based on the user’s possibility threshold values.
To demonstrate the feasibility of the proposed data fusion approach and how the data fusion results may change when new pieces of evidence are obtained, Figure 5 and Figure 6 are presented. Figure 5 illustrates the decision outputs of the data fusion method under the ITCC-CWA case. Figure 5c shows the possible fused values for the City attribute at each possible entity merging alternative, as produced from the entity merging set {(RO1(1,1,1)): PO1(2,1,1) [1.0], PO2(3,1,2) [1.0]}; the participating entities with their original matching tree are presented in Figure 5a, and their probabilistic entity merging result is shown in Figure 5b. These entities originate from three participating data sources: the reliability score of the “Amsterdam” data value from the first data source is 0.8, that of the “Enschede” data value from the second data source is 0.8, and that of the “Georgia” data value from the third data source is 0.4, i.e., $a_{2.1}^{11} = \text{Amsterdam}$, $\mu a_{2.1}^{11} = 0.8$, $a_{2.2}^{21} = \text{Enschede}$, $\mu a_{2.2}^{21} = 0.8$, and $a_{2.3}^{32} = \text{Georgia}$, $\mu a_{2.3}^{32} = 0.4$. Using the merged entity alternative nDO1.1.1 and the reliability scores of the City data values that correspond to it, the updated reliability score for each possible fused alternative is computed using Equations (9) and (10), and the corresponding values are presented in the attribute’s probability and possibility columns, as shown in Figure 5c.
Figure 5. City data fusion alternatives’ example with their updated reliability score based on the C2.1.1 case. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for City attribute at each possible entities merging alternative as produced based on Figure 5b.
Figure 6. The updated-City data fusion alternatives’ example is based on the newly added evidence from the fourth data source. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for City attribute at each possible entities merging alternative as produced based on Figure 6b.
From Figure 5b,c, we can state that RO1, PO1, and PO2 represent the same RWO named “Peter Pan”, with three different possible alternatives for the city where “Peter Pan” lives. The original reliability scores, as obtained from the data sources, are 0.8 for “Amsterdam” and “Enschede” and 0.4 for “Georgia”. The updated reliability score, which is the conditional probability of a certain attribute value being the true fused value as obtained from Equation (9), is approximately 0.46 for each of “Amsterdam” and “Enschede” and approximately 0.08 for “Georgia”. Based on these updated reliability scores, the possibility values have been obtained using Equation (10); they indicate that “Amsterdam” and “Enschede” share the same possibility of being the true fused value of where “Peter Pan” lives, i.e., $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.2}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = 1$. They also indicate that the possibility of “Peter Pan” living in “Georgia” is very low, as $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.3}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = 0.08$. Based on the given evidence from the participating entities and the reliability scores of their attribute values, we can state that “Peter Pan” most likely lives in either “Amsterdam” or “Enschede”, each with a probability of 0.46 and a possibility of 1.0.
To show how the data fusion computational results may change in response to new information or evidence, a new data source that contains information about a person named “P. Perter” is added to the participating data sources and entities of the example presented in Figure 5. This new entity has a city value of “Amsterdam” with an original reliability score of 0.7. The data fusion alternatives obtained after considering the additional information from the new data source are presented in Figure 6.
Figure 6 illustrates the decision outputs of the data fusion method under the ITCC-CWA case. Figure 6c shows the possible fused values for the City attribute at each possible entity merging alternative, as produced from the entity merging set {(RO1(1,1,1)): PO1(2,1,1) [1.0], PO2(3,1,2) [1.0], PO3(4,1,2) [0.96]}; the participating entities with their original matching tree are presented in Figure 6a, and their probabilistic entity merging result is shown in Figure 6b. These entities originate from four participating data sources: the reliability score of the “Amsterdam” data value is 0.8 from the first data source and 0.7 from the fourth, that of the “Enschede” data value from the second source is 0.8, and that of the “Georgia” data value from the third source is 0.4, i.e., $a_{2.1}^{11,42} = \text{Amsterdam}$, $\mu a_{2.1}^{11,42} = (0.8, 0.7)$, $a_{2.2}^{21} = \text{Enschede}$, $\mu a_{2.2}^{21} = 0.8$, and $a_{2.3}^{32} = \text{Georgia}$, $\mu a_{2.3}^{32} = 0.4$. Using the merged entity alternatives nDO1.1.1 and nDO1.1.2 and the reliability scores of the City data values that correspond to them, the updated reliability score under each entity merging alternative and for each possible fused alternative is computed using Equations (9) and (10), and the corresponding values are presented in the attribute’s probability and possibility columns, as shown in Figure 6c.
Figure 6c shows two possible entity merging alternatives, i.e., nDO1.1.1 and nDO1.1.2. For nDO1.1.1, the four participating entities correspond to the same person, “Peter Pan”, where the “Amsterdam” value is obtained from both the first source and the newly added one, with a union of reliability scores of (0.8, 0.7). Due to the new information from the fourth data source, the updated reliability scores shown in Figure 6c differ from those presented in Figure 5c. Accordingly, the updated reliability score of “Amsterdam” being the true fused city value for the “Peter Pan” entity is approximately 0.67, while the probability for “Enschede” becomes approximately 0.28 and that for “Georgia” approximately 0.08. Therefore, “Amsterdam” has the highest possibility of being the true fused value of where “Peter Pan” lives, i.e., $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = 1$. On the other hand, the possibility of “Enschede” or “Georgia” being the true fused value of where “Peter Pan” lives becomes 0.4 or 0.07, respectively. Based on this information, we can state that “Peter Pan” is more likely to live in “Amsterdam”, with a probability of 0.67 and a possibility of 1.0. For the second entity merging alternative, nDO1.1.2, three participating entities correspond to the same person, “Peter Pan”, while the PO3(4,1,2) entity, which originates from the fourth source, does not belong to the generated global entity of “Peter Pan”, i.e., {(RO1(1,1,1)): PO1(2,1,1), PO2(3,1,2)},{PO3(4,1,2)}[0.037]. The participating entities in nDO1.1.2 are the same as those presented for nDO1.1.1 in Figure 5; hence, the updated reliability scores for the City value alternatives within the nDO1.1.2 merging alternative are the same as those presented in Figure 5c.
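The effect of the additional evidence can be illustrated with a small Python sketch of the single-truth, closed-world computation. This is a sketch under stated assumptions rather than the system's code: the function name `fuse_single_truth_cwa` and the input layout are ours, sources are assumed independent, and each world assumes exactly one City value is true while all other values are false.

```python
def fuse_single_truth_cwa(values):
    """Posterior of each value being the single true value (closed world).

    `values` maps each distinct value to the reliability scores of the
    sources that report it (hypothetical layout).
    """
    weights = {}
    for candidate in values:
        weight = 1.0
        for value, scores in values.items():
            for s in scores:
                # Supporting sources count with s; sources that report a
                # different value count with (1 - s).
                weight *= s if value == candidate else (1 - s)
        weights[candidate] = weight
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}

three_sources = {"Amsterdam": (0.8,), "Enschede": (0.8,), "Georgia": (0.4,)}
four_sources = {"Amsterdam": (0.8, 0.7), "Enschede": (0.8,), "Georgia": (0.4,)}

p3 = fuse_single_truth_cwa(three_sources)
p4 = fuse_single_truth_cwa(four_sources)
for label, result in (("three sources", p3), ("four sources", p4)):
    print(label, {city: round(p, 3) for city, p in result.items()})
```

Under these assumptions the sketch gives roughly 0.46, 0.46, and 0.08 for the three-source case, and roughly 0.67, 0.29, and 0.05 once the fourth source is added. Where these differ from the rounded figures quoted in the text, the difference may come from rounding or from details of the implemented system that this sketch does not model.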
Figure 7 illustrates the decision outputs of the data fusion method under the ITCU-CWA case for the Phone attribute, where two possible entity merging alternatives were observed, i.e., nDO1.3.1 and nDO1.3.2. Figure 7c shows the possible fused values for the Phone attribute for each possible entity merging alternative, as produced from the entity merging set {(RO3(1,1,3)): PO1(2,1,3) [1.0], PO2(3,1,3) [0.73]}; the participating entities with their original matching tree are presented in Figure 7a, and their probabilistic entity merging result is shown in Figure 7b. These entities originate from three participating data sources: the reliability score of the “622,222,222” data value is 0.8 from the first data source and 0.7 from the second, while the phone value of the record obtained from the third data source is unknown, i.e., NULL, and its reliability score is 0.8, i.e., $a_{3.1}^{13} = a_{3.1}^{23} = 622222222$, $\mu a_{3.1}^{13} = 0.8$, $\mu a_{3.1}^{23} = 0.7$, $a_{3.2}^{33} = \text{NULL}$, $\mu a_{3.2}^{33} = 0.8$. Using the merged entity alternatives nDO1.3.1 and nDO1.3.2 and the reliability scores of the Phone data values that correspond to them, the updated reliability score under each entity merging alternative and for each possible fused alternative is computed using Equations (9) and (10), and the corresponding values are presented in the attribute’s probability and possibility columns, as shown in Figure 7c.
Figure 7. Phone data fusion alternatives’ example with their updated reliability score based on the C2.2.1 case. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for Phone attribute at each possible entities merging alternative as produced based on Figure 7b.
In Figure 7c, two possible entity merging alternatives are stated, and the data fusion value for the Phone attribute is conditionally computed for each. For instance, the nDO1.3.1 alternative indicates that the three participating entities correspond to the same person, “John Doe”, with an approximate probability value of 0.73 and a possibility value of 1.0. In this alternative, the “622,222,222” phone value is obtained from the two records that belong to the first and second sources; hence, its original reliability scores are (0.8, 0.7), while the score of the unknown “Null” value is 0.8, as it is obtained from the third data source. Accordingly, the updated reliability scores, as computed from Equation (9), are 0.7 for the “622,222,222” value and 0.3 for the unknown “Null” value. Based on these updated scores, the possibility values have been obtained from Equation (10); they indicate that “622,222,222” has the highest possibility of being the true phone number for the “John Doe” global entity, i.e., $Pos(a_{3.Tv}^{3.1} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.2.1).3.1.3}) = 1$. The alternative of not knowing John Doe’s number has a low possibility of being true, i.e., $Pos(a_{3.Tv}^{3.2} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.2.1).3.2.3}) = 0.43$. However, the Phone data fusion value under the nDO1.3.2 alternative differs from the first alternative, since the third record, i.e., PO2(3,1,3), does not belong to the generated global entity of “John Doe”. This global entity is generated from the {(RO3(1,1,3)): PO1(2,1,3) [1.0]} records only; hence, “622,222,222” is the only Phone value observed, with (0.8, 0.7) reliability scores. Accordingly, the updated reliability score as computed from Equation (9) is 1.0 for the “622,222,222” value.
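As a hedged check of the figures quoted for the nDO1.3.1 alternative, the following worked arithmetic applies Equations (9) and (10) under the assumption (ours, not stated explicitly in the text) that the NULL record is treated as a single “unknown” alternative with reliability 0.8 and that exactly one alternative is true:

$$
\begin{aligned}
\mu(\text{622222222}) &= \frac{0.8 \times 0.7 \times (1 - 0.8)}{0.8 \times 0.7 \times (1 - 0.8) + (1 - 0.8) \times (1 - 0.7) \times 0.8} = \frac{0.112}{0.160} = 0.7, \\
\mu(\text{NULL}) &= \frac{0.048}{0.160} = 0.3, \\
Pos(\text{NULL}) &= \frac{0.3}{0.7} \approx 0.43.
\end{aligned}
$$

These values agree with the updated reliability scores and the possibility value reported for this alternative.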
Based on the implementation of our fusion method within our probabilistic integration system and using a sample dataset, we showed how our data fusion method operates in different fusion cases and how it copes with the dynamic arrival of new information or evidence. We also conclude that, with the proposed fusion method, a data value with a higher confidence score and higher cardinality is more likely to be the true value. This was observed by running varied examples related to the data conflict cases. The claim holds because independence among the participating sources is assumed, so that positive evidence is obtained from data values that have high confidence scores and exist in multiple data sources. Moreover, by applying the possibility transformation, a better retrieval mechanism for the probable true data fusion answers can be achieved.
In terms of considering source accuracy, the probabilistic nature of entity linkage, and the on-demand fusion process, our data fusion approach is related to the approaches proposed in [26,32,48,73,76,82]. However, we identified several aspects that distinguish it from previous works. First, while our approach follows a quality-based strategy, it manages and resolves two major fusion cases, multi-true values (i.e., multiple truth assumptions) and inconsistent-true values (i.e., single truth assumptions), based on closed-world and open-world assumptions. Second, the data fusion is processed over probabilistic entity linkage and multiple merging alternatives. Our approach can also support on-demand fusion and cope with dynamic, volatile, conflicting, and uncertain data by storing the reliability scores alongside the actual data values and by treating matching and computation as a chain of separate processes, each with its own input and output data [3]. Upon a user request, the fusion process for a selected attribute over a probabilistic global entity is initiated by matching its values from their corresponding sources and entities to form a global domain of merged attribute value pairs, and then the computational fusion is executed separately.

7. Limitations

It is worth noting that entity integration with uncertainty management is hard in general and comes in many forms, owing to the variety of ways in which it can be defined and processed and the variety of uncertainty types that might appear; hence, no single solution addresses all challenges [19]. It does not seem possible to address the computational and representational challenges in general. Yet, we can still study these challenges for the problem with uncertainty management under specific formalizations, uncertainty cases, and scope.
Accordingly, our proposed model is constructed on a precise, mediated, and centralized schema structure; probabilistic schema integration and decentralized integration are outside this paper’s scope. Moreover, while the experiment demonstrated that our proposed process works in theory, the formula needs to be implemented in several real case scenarios, such as scientific collaborations or personal information, to determine any challenges, accuracy, and the margin of error. Another limitation of our proposed model is that it disregards correlation among data sources; data sources can copy from each other, and errors can propagate quickly. Therefore, ignoring possible dependency among data sources can lead to biased decisions. A further limitation is that the implementation of our proposed model uses a text matching function only. It could be enhanced by including a function to compare and match images, so that fused data values can be found for different forms of data.

8. Conclusions

This paper presented a new probabilistic data fusion model. It described a specific scenario of a probabilistic fusion problem and solution space, where several representational and computational challenges have been identified and formulated. The problem scenario relates to the need to manage and resolve uncertain and conflicting data for multiple attribute values. These values arise from the outcomes of probabilistic entity integration over heterogeneous and autonomous data sources. The proposed data fusion method is implemented within the probabilistic integration system to verify its efficiency and feasibility in resolving different data conflict cases. This implementation demonstrated the ability of the system to handle both static and dynamic environments when managing data from a variety of sources. The method automates the homo economicus decision making in selecting the most probable and true value [95]. While the experiment demonstrated that our proposed process works in theory, the formula needs to be implemented in several real case scenarios to determine any challenges, accuracy, and the margin of error.
Several challenges related to the data integration and fusion problem under uncertainty management, data correlations, multiple correspondences, and probabilistic merging of correlated entities still need to be addressed, and they will continue to occupy the information integration community for a long time to come. Future work includes exploring our method with other data fusion strategies, such as capturing source and attribute correlations to identify positive and negative evidence while conditionally computing the probability of a data value being true. It also includes implementing other fusion cases and identifying a suitable benchmark for evaluation purposes. With emerging approaches to data fusion, the industry needs a standardized testing mechanism, which could also be explored. This testing mechanism would assess the output quality of such approaches in the form of an index of success rate or margin of error for a given fusion process.

Author Contributions

Conceptualization, A.J. and A.D.; methodology, A.J.; software, A.J.; validation, A.J., F.S., O.A., and A.A.-A.; formal analysis, A.J.; investigation, A.J. and A.D.; resources, A.J.; data curation, A.J. and A.A.-A.; writing—original draft, A.J., F.S., O.A., A.A.-A., and Y.I.A.; writing—review and editing, A.J., F.S., A.D., and A.A.-A.; visualization, A.J. and A.A.-A.; supervision, A.D. and A.J.; project administration, A.J.; funding acquisition, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data were obtained from RIDDLE: Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty, available at https://www.cs.utexas.edu/users/ml/riddle/data.html (accessed on 1 April 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of Data Fusion Computation Formula

Consider three data sources $S_1$, $S_2$, and $S_3$. We denote by $A_{s_1}^{w.j.k}$ the value of attribute $A_k$ for a particular $nDO^{w.j}$ entity-merging alternative as observed in $S_1$, by $A_{s_2}^{w.j.k}$ the value of attribute $A_k$ for the same alternative as observed in $S_2$, and by $A_{s_3}^{w.j.k}$ the value of attribute $A_k$ for the same alternative as observed in $S_3$. For instance, we may find $A.a_{k.g}^{w.j} = a_x$ in $S_1$ ($A_{s_1} = a_x$), $A.a_{k.g}^{w.j} = a_z$ in $S_2$ ($A_{s_2} = a_z$), and $A.a_{k.g}^{w.j} = a_x$ in $S_3$ ($A_{s_3} = a_x$). For any number of reasons, the data in these sources may be wrong. Hence, $\mu_{a_x.T} = (\mu_{A_{s_1}}, \mu_{A_{s_3}})$, $\mu_{a_x.F} = (\neg\mu_{A_{s_1}}, \neg\mu_{A_{s_3}})$, $\mu_{a_z.T} = \mu_{A_{s_2}}$, and $\mu_{a_z.F} = \neg\mu_{A_{s_2}}$. We would like to determine the probability that a specific value or set of values (which may or may not be the value observed from a data source) is indeed the true fused value(s) of an attribute.
Depending on $Dom\,A_k^{w.j}$ and the set $Pws_{TvCase.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ observed over the sample space $\Omega_{w.j.k}$ for the given fusion case, the required probability term can be expressed as:
$$
\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid A_{s_1} = a_x, A_{s_2} = a_z, A_{s_3} = a_x, Pws_{TvCase.w.j.k}\big)
= \frac{\mu\big(A_{s_{1,3}} = a_x, A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big) \cdot \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big)}{\sum_{p=1}^{P} \mu\big(A_{s_{1,3}} = a_x, A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big) \cdot \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big)}
$$
What is available?
Given the background information of the data fusion’s sub-cases and situations we have:
  • $\mu\big(A_{s_{1,3}}^{w.j.k} = a_x \mid a_{k.Tv} = a_x, Pws_{TvCase.w.j.k}\big) = (\mu_{A_{s_1}}, \mu_{A_{s_3}}) = (\mu_{a_x^{1}.T}, \mu_{a_x^{3}.T}) = \prod\mu_{a_x^{1,3}.T}$.
  • $\mu\big(A_{s_{1,3}}^{w.j.k} = a_x \mid a_{k.Tv} \neq a_x, Pws_{TvCase.w.j.k}\big) = (\neg\mu_{A_{s_1}}, \neg\mu_{A_{s_3}}) = (\mu_{a_x^{1}.F}, \mu_{a_x^{3}.F}) = \prod\mu_{a_x^{1,3}.F}$.
  • $\mu\big(A_{s_2}^{w.j.k} = a_z \mid a_{k.Tv} = a_z, Pws_{TvCase.w.j.k}\big) = \mu_{A_{s_2}} = \mu_{a_z.T}$.
  • $\mu\big(A_{s_2}^{w.j.k} = a_z \mid a_{k.Tv} \neq a_z, Pws_{TvCase.w.j.k}\big) = \neg\mu_{A_{s_2}} = \mu_{a_z.F}$.
From Assumption 3 and within a data fusion case, we also have:
  • $\mu\big(a_{k.Tv}^{w.j} = a_{x.T\backslash F}, Pws_{TvCase.w.j.k}\big) = \mu\big(a_{k.Tv}^{w.j} = a_{z.T\backslash F}, Pws_{TvCase.w.j.k}\big) = (\mu_{a_{k.Tv}}) = 0.5$.
Based on the above Bayes' formula, the given reliability information about an attribute in the three sources, Assumption 2, and Assumption 3, the data fusion method presented in Equation (9) is derived as follows:
I. From Assumption 2,
$$
\begin{aligned}
\mu\big(A_{s_{1,3}} = a_x, A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big)
&= \mu\big(A_{s_{1,3}} = a_x \mid a_{k.Tv}^{w.j} = a_{x.T\backslash F}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a_{z.T\backslash F}, Pws_{TvCase.w.j.k}\big) \\
&= \prod\mu_{a_x^{1,3}.T\backslash F} \cdot \mu_{a_z.T\backslash F}
\end{aligned}
$$
where
$$
\prod\mu_{a_x^{1,3}.T\backslash F} = \Big(\prod\mu_{a_x^{1,3}.T}\Big) \lor \Big(\prod\mu_{a_x^{1,3}.F}\Big) = \prod\mu_{a_x^{1,3}.T} \lor \prod\mu_{a_x^{1,3}.F},
$$
with $\prod\mu_{a_x^{1,3}.T} = \prod(\mu_{a_x^{1}.T}, \mu_{a_x^{3}.T}) = \mu_{a_x^{1}.T} \cdot \mu_{a_x^{3}.T}$ and $\prod\mu_{a_x^{1,3}.F} = \prod(\mu_{a_x^{1}.F}, \mu_{a_x^{3}.F}) = \mu_{a_x^{1}.F} \cdot \mu_{a_x^{3}.F}$.
II. From Assumption 3,
$$
\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big) = \Big(\mu\big(a_{k.Tv}^{w.j} = a_{x.T\backslash F}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(a_{k.Tv}^{w.j} = a_{z.T\backslash F}, Pws_{TvCase.w.j.k}\big)\Big) = \mu_{a_{k.Tv}}^{2} = 0.5^{2}
$$
III. From both Assumptions 2 and 3 we have:
$$
\begin{aligned}
&\sum_{p=1}^{P} \mu\big(A_{s_{1,3}} = a_x, A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big) \cdot \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big) \\
&\quad = \Big(\mu\big(A_{s_{1,3}} = a_x \mid a_{k.Tv}^{w.j} = a_{x.T}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a_{z.T}, Pws_{TvCase.w.j.k}\big) \cdot \mu_{a_{k.Tv}}^{2}\Big) \\
&\qquad + \Big(\mu\big(A_{s_{1,3}} = a_x \mid a_{k.Tv}^{w.j} = a_{x.T}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a_{z.F}, Pws_{TvCase.w.j.k}\big) \cdot \mu_{a_{k.Tv}}^{2}\Big) \\
&\qquad + \Big(\mu\big(A_{s_{1,3}} = a_x \mid a_{k.Tv}^{w.j} = a_{x.F}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a_{z.T}, Pws_{TvCase.w.j.k}\big) \cdot \mu_{a_{k.Tv}}^{2}\Big) \\
&\qquad + \Big(\mu\big(A_{s_{1,3}} = a_x \mid a_{k.Tv}^{w.j} = a_{x.F}, Pws_{TvCase.w.j.k}\big) \cdot \mu\big(A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a_{z.F}, Pws_{TvCase.w.j.k}\big) \cdot \mu_{a_{k.Tv}}^{2}\Big) \\
&\quad = \Big(\prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.F} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.F}\Big) \cdot \mu_{a_{k.Tv}}^{2}
\end{aligned}
$$
Substituting for $\mu\big(A_{s_{1,3}} = a_x, A_{s_2} = a_z \mid a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big)$ from I, for $\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}), Pws_{TvCase.w.j.k}\big)$ from II, and for the sum $\sum_{p=1}^{P}\mu(\cdot)$ from III, we get:
$$
\begin{aligned}
\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid A_{s_1} = a_x, A_{s_2} = a_z, A_{s_3} = a_x, Pws_{TvCase.w.j.k}\big)
&= \frac{\Big(\prod\mu_{a_x^{1,3}.T\backslash F} \cdot \mu_{a_z.T\backslash F} \cdot \mu_{a_{k.Tv}}^{2}\Big)_{tv.p}}{\Big(\prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.F} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.F}\Big) \cdot \mu_{a_{k.Tv}}^{2}} \\
&= \frac{\Big(\prod\mu_{a_x^{1,3}.T\backslash F} \cdot \mu_{a_z.T\backslash F}\Big)_{tv.p}}{\prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.T} \cdot \mu_{a_z.F} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.T} + \prod\mu_{a_x^{1,3}.F} \cdot \mu_{a_z.F}} \\
&= \frac{\Big(\prod\mu_{a_x^{1,3}.T\backslash F} \cdot \mu_{a_z.T\backslash F}\Big)_{tv.p}}{\sum_{p=1}^{P} \Big(\prod\mu_{a_x^{1,3}.T\backslash F} \cdot \mu_{a_z.T\backslash F}\Big)_{tv.p}}
\end{aligned}
$$
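The cancellation of the uniform prior in the last step can be checked numerically. The following Python sketch is an illustration only. It assumes that $\neg\mu$ is the complement $1 - \mu$; the function name `three_source_posterior` and the reliability scores 0.8, 0.6, and 0.7 are hypothetical and are not values reported in this paper.

```python
from itertools import product


def three_source_posterior(r1, r2, r3, prior=0.5):
    """Posterior over the four truth-value worlds of a_x (S1, S3) and a_z (S2).

    r1, r2, r3 are assumed source reliabilities in (0, 1); the complement
    1 - r stands in for the negated reliability (an assumption of this sketch).
    """
    # Joint weight of the a_x claim, asserted by S1 and S3, being true or false.
    x = {"T": r1 * r3, "F": (1 - r1) * (1 - r3)}
    # Weight of the a_z claim, asserted by S2 only, being true or false.
    z = {"T": r2, "F": 1 - r2}
    # Uniform prior per claim (Assumption 3); the same factor multiplies every world.
    prior_sq = prior ** 2
    weights = {
        (tx, tz): x[tx] * z[tz] * prior_sq
        for tx, tz in product("TF", repeat=2)
    }
    total = sum(weights.values())
    # Normalizing over the four worlds removes the common prior factor.
    return {world: w / total for world, w in weights.items()}


posterior = three_source_posterior(0.8, 0.6, 0.7)
for world, p in posterior.items():
    print(world, round(p, 4))
```

Because the prior multiplies every world by the same constant, calling the function with any other positive `prior` yields the same normalized result, which mirrors the simplification from the first to the second line of the derivation.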
Based on Equations (6) and (8), we get the following:
$$
\Big(\prod\mu_{a_x^{1,3}.T\backslash F}\, \mu_{a_z.T\backslash F}\Big)_{tv.p} = \mu(Pws_{tv.p}) = \Bigg(\prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)\Bigg) = \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)_{tv.p}
$$
and
$$
\begin{aligned}
\sum_{p=1}^{P} \Big(\prod\mu_{a_x^{1,3}.T\backslash F}\, \mu_{a_z.T\backslash F}\Big)_{tv.p} &= \sum_{p=1}^{P} \big(\mu(Pws_{tv.p})\big) \\
&= \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T}\Big)_{tv.1} + \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T}\Big)_{tv.2} \\
&\quad + \cdots + \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.F}\Big)_{tv.P}
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid A_{s_1} = a_x, A_{s_2} = a_z, A_{s_3} = a_x, Pws_{TvCase.w.j.k}\big)
&= \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{TvCase.w.j.k}\big) \\
&= \frac{\Big(\prod\mu_{a_x^{1,3}.T\backslash F}\, \mu_{a_z.T\backslash F}\Big)_{tv.p}}{\sum_{p=1}^{P} \Big(\prod\mu_{a_x^{1,3}.T\backslash F}\, \mu_{a_z.T\backslash F}\Big)_{tv.p}} \\
&= \frac{\mu(Pws_{tv.p})}{\mu(Pws_{tv.1}) + \cdots + \mu(Pws_{tv.P})} \\
&= \frac{\prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)_{tv.p}}{\prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)_{tv.1} + \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)_{tv.2} + \cdots + \prod_{g^{w.j}=1}^{q_g^{w.j}} \prod_{i_h=1_{11}}^{q_{nm}} \Big(\mu_{a_{k.g}^{w.j.i_h}.T\backslash F}\Big)_{tv.P}}
\end{aligned}
$$
where
$$
\sum_{p=1}^{P} \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{TvCase.w.j.k}\big) = 1, \quad \text{and} \quad P = 4.
$$
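The normalization over possible worlds can also be sketched for an arbitrary number of conflicting values. The Python fragment below is one possible reading of the double product, not a definitive implementation: it assumes that each distinct claimed value is independently either true or false (so that two distinct values give $P = 4$ worlds), that the weight of a world is the product over values and over the sources asserting each value of the corresponding reliability, and that $\neg\mu = 1 - \mu$. The name `world_posteriors`, the dictionary layout, and the reliability scores are hypothetical.

```python
from itertools import product
from math import prod


def world_posteriors(claims):
    """Normalized probability of every true/false assignment over claimed values.

    claims maps each distinct claimed value to the list of reliabilities of the
    sources asserting it (a hypothetical input layout for this sketch).
    """
    values = list(claims)
    weights = {}
    # Enumerate every world: one truth flag per distinct claimed value.
    for truth in product((True, False), repeat=len(values)):
        weight = 1.0
        for value, is_true in zip(values, truth):
            reliabilities = claims[value]
            if is_true:
                # All asserting sources are right about this value.
                weight *= prod(reliabilities)
            else:
                # All asserting sources are wrong about this value.
                weight *= prod(1 - r for r in reliabilities)
        weights[truth] = weight
    total = sum(weights.values())
    # Dividing by the sum over all worlds makes the probabilities add up to 1.
    return {truth: w / total for truth, w in weights.items()}


# S1 and S3 assert a_x, S2 asserts a_z (illustrative reliabilities).
result = world_posteriors({"a_x": [0.8, 0.7], "a_z": [0.6]})
for truth, p in result.items():
    print(truth, round(p, 4))
```

With two distinct values the function enumerates four worlds, and the resulting probabilities sum to 1, in line with the closing condition of the derivation.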

References

  1. Almutairi, M.M.; Yamin, M.; Halikias, G. An Analysis of Data Integration Challenges from Heterogeneous Databases. In Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021; pp. 352–356.
  2. Aggoune, A. Intelligent data integration from heterogeneous relational databases containing incomplete and uncertain information. Intell. Data Anal. 2022, 26, 75–99.
  3. Jaradat, A.; Halimeh, A.A.; Deraman, A.; Safieddine, F. A best-effort integration framework for imperfect information spaces. Int. J. Intell. Inf. Database Syst. 2018, 11, 296–314.
  4. Beneventano, D.; Bergamaschi, S.; Gagliardelli, L.; Simonini, G. Entity resolution and data fusion: An integrated approach. In Proceedings of the SEBD 2019: 27th Italian Symposium on Advanced Database Systems, Grosseto, Italy, 16–19 June 2019.
  5. Sampri, A.; Geifman, N.; Le Sueur, H.; Doherty, P.; Couch, P.; Bruce, I.; Peek, N. Probabilistic Approaches to Overcome Content Heterogeneity in Data Integration: A Study Case in Systematic Lupus Erythematosus. Stud. Health Technol. Inform. 2020, 270, 387–391.
  6. Zhao, X.; Jia, Y.; Li, A.; Jiang, R.; Song, Y. Multi-source knowledge fusion: A survey. World Wide Web 2020, 23, 2567–2592.
  7. Zhang, M.; Wang, H.; Li, J.; Gao, H. One-pass inconsistency detection algorithms for big data. IEEE Access 2019, 7, 22377–22394.
  8. Bakhtouchi, A. Data reconciliation and fusion methods: A survey. Appl. Comput. Inform. 2020, 18, 182–194.
  9. Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 31.
  10. Papadakis, G.; Ioannou, E.; Palpanas, T. Entity resolution: Past, present and yet-to-come: From structured to heterogeneous, to crowd-sourced, to deep learned. In Proceedings of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, 30 March 2020.
  11. Munir, A.; Blasch, E.; Kwon, J.; Kong, J.; Aved, A. Artificial intelligence and data fusion at the edge. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 62–78.
  12. Stonebraker, M.; Bruckner, D.; Ilyas, I.F.; Beskales, G.; Cherniack, M.; Zdonik, S.B.; Pagan, A.; Xu, S. Data Curation at Scale: The Data Tamer System. In Proceedings of the CIDR, Asilomar, CA, USA, 6–9 January 2013.
  13. Golshan, B.; Halevy, A.; Mihaila, G.; Tan, W.-C. Data integration: After the teenage years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Raleigh, CA, USA, 14–19 May 2017; pp. 101–106.
  14. De Sa, C.; Ratner, A.; Ré, C.; Shin, J.; Wang, F.; Wu, S.; Zhang, C. Deepdive: Declarative knowledge base construction. ACM SIGMOD Rec. 2016, 45, 60–67.
  15. Stonebraker, M.; Ilyas, I.F. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 2018, 41, 3–9.
  16. Miller, R.J. Open data integration. Proc. VLDB Endow. 2018, 11, 2130–2139.
  17. Lau, B.P.L.; Marakkalage, S.H.; Zhou, Y.; Hassan, N.U.; Yuen, C.; Zhang, M.; Tan, U.-X. A survey of data fusion in smart city applications. Inf. Fusion 2019, 52, 357–374.
  18. Blanco, L.; Crescenzi, V.; Merialdo, P.; Papotti, P. Probabilistic models to reconcile complex data from inaccurate data sources. In Proceedings of the International Conference on Advanced Information Systems Engineering, Hammamet, Tunisia, 7–9 June 2010; pp. 83–97.
  19. Magnani, M.; Montesi, D. A survey on uncertainty management in data integration. J. Data Inf. Qual. (JDIQ) 2010, 2, 1–33.
  20. Liu, Y.; Bao, T.; Sang, H.; Wei, Z. A Novel Method for Conflict Data Fusion Using an Improved Belief Divergence Measure in Dempster–Shafer Evidence Theory. Math. Probl. Eng. 2021, 2021, 6558843.
  21. Yuan, Q.; Pi, Y.; Kou, L.; Zhang, F.; Li, Y.; Zhang, Z. Multi-source data processing and fusion method for power distribution internet of things based on edge intelligence. arXiv 2022, arXiv:2203.17230.
  22. Barbedo, J.G.A. Data Fusion in Agriculture: Resolving Ambiguities and Closing Data Gaps. Sensors 2022, 22, 2285.
  23. Dong, X.L.; Naumann, F. Data fusion: Resolving data conflicts for integration. Proc. VLDB Endow. 2009, 2, 1654–1655.
  24. Dong, X.L.; Berti-Equille, L.; Srivastava, D. Data fusion: Resolving conflicts from multiple sources. In Handbook of Data Quality; Springer: Berlin/Heidelberg, Germany, 2013; pp. 293–318.
  25. Pochampally, R.; Das Sarma, A.; Dong, X.L.; Meliou, A.; Srivastava, D. Fusing data with correlations. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 433–444.
  26. Ioannou, E.; Nejdl, W.; Niederée, C.; Velegrakis, Y. LinkDB: A probabilistic linkage database system. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 12–16 June 2011; pp. 1307–1310.
  27. Wang, H.; Ding, X.; Li, J.; Gao, H. Rule-based entity resolution on database with hidden temporal information. IEEE Trans. Knowl. Data Eng. 2018, 30, 2199–2212.
  28. Halevy, A.; Rajaraman, A.; Ordille, J. Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 12–15 September 2006; pp. 9–16.
  29. Papadakis, G.; Ioannou, E.; Palpanas, T. Entity Resolution: Past, Present and Yet-to-Come. In Proceedings of the EDBT, Lisbon, Portugal, 26–29 March 2020; pp. 647–650.
  30. Li, L.; Wang, H.; Li, J.; Gao, H. A Survey of Uncertain Data Management. Front. Comput. Sci. 2020, 4, 162–190.
  31. Dumpa, I.K.; Kota, R.S.; Sadri, F. Information Integration with Uncertainty: Performance. DBKDA 2014 2014, 15, 15.
  32. Sarma, A.D.; Dong, X.L.; Halevy, A.Y. Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping; Springer: Berlin/Heidelberg, Germany, 2011; pp. 75–108.
  33. Deng, D.; Fernandez, R.C.; Abedjan, Z.; Wang, S.; Stonebraker, M.; Elmagarmid, A.K.; Ilyas, I.F.; Madden, S.; Ouzzani, M.; Tang, N. The Data Civilizer System. In Proceedings of the CIDR, Chaminade, CA, USA, 8–11 January 2017.
  34. Bilke, A.; Bleiholder, J.; Böhm, C.; Draba, K.; Naumann, F.; Weis, M. Automatic Data Fusion with HumMer; Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II: Trondheim, Norway, 2005.
  35. Bleiholder, J.; Draba, K.; Naumann, F. FuSem-Exploring Different Semantics of Data Fusion. In Proceedings of the VLDB, Vienna, Austria, 23–27 September 2007; pp. 1350–1353.
  36. Mirza, A.; Siddiqi, I. Data level conflicts resolution for multi-sources heterogeneous databases. In Proceedings of the 2016 Sixth International Conference on Innovative Computing Technology (INTECH), Dublin, Ireland, 24–26 August 2016; pp. 36–40.
  37. Dong, X.L.; Berti-Equille, L.; Srivastava, D. Integrating conflicting data: The role of source dependence. Proc. VLDB Endow. 2009, 2, 550–561.
  38. Ioannou, E.; Garofalakis, M. Query analytics over probabilistic databases with unmerged duplicates. IEEE Trans. Knowl. Data Eng. 2015, 27, 2245–2260.
  39. Papadakis, G.; Ioannou, E.; Niederée, C.; Palpanas, T.; Nejdl, W. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, New York, NY, USA, 8–12 February 2012; pp. 53–62.
  40. Papadakis, G.; Ioannou, E.; Palpanas, T.; Niederée, C.; Nejdl, W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 2012, 25, 2665–2682.
  41. Papadakis, G.; Koutrika, G.; Palpanas, T.; Nejdl, W. Meta-blocking: Taking entity resolution to the next level. IEEE Trans. Knowl. Data Eng. 2013, 26, 1946–1960.
  42. Papenbrock, T.; Heise, A.; Naumann, F. Progressive duplicate detection. IEEE Trans. Knowl. Data Eng. 2014, 27, 1316–1329.
  43. Papadakis, G.; Svirsky, J.; Gal, A.; Palpanas, T. Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 2016, 9, 684–695.
  44. Papadakis, G.; Tsekouras, L.; Thanos, E.; Giannakopoulos, G.; Palpanas, T.; Koubarakis, M. The return of jedai: End-to-end entity resolution for structured and semi-structured data. Proc. VLDB Endow. 2018, 11, 1950–1953.
  45. Panse, F.; Naumann, F. Evaluation of Duplicate Detection Algorithms: From Quality Measures to Test Data Generation. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2373–2376.
  46. Panse, F.; Düjon, A.; Wingerath, W.; Wollmer, B. Generating Realistic Test Datasets for Duplicate Detection at Scale Using Historical Voter Data. In Proceedings of the EDBT, Nicosia, Cyprus, 23–26 March 2021; pp. 570–581.
  47. Vidal, M.-E.; Jozashoori, S.; Sakor, A. Semantic data integration techniques for transforming big biomedical data into actionable knowledge. In Proceedings of the 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 5–7 June 2019; pp. 563–566.
  48. Ayat, N.; Akbarinia, R.; Afsarmanesh, H.; Valduriez, P. Entity resolution for probabilistic data. Inf. Sci. 2014, 277, 492–511.
  49. Motro, A. Imprecision and uncertainty in database systems. In Fuzziness in Database Management Systems; Springer: Berlin/Heidelberg, Germany, 1995; pp. 3–22.
  50. Clark, D.A. Verbal uncertainty expressions: A critical review of two decades of research. Curr. Psychol. 1990, 9, 203–235.
  51. Smets, P. Imperfect information: Imprecision and uncertainty. In Uncertainty Management in Information Systems; Springer: Berlin/Heidelberg, Germany, 1997; pp. 225–254.
  52. Zimanyi, E.; Pirotte, A. Imperfect knowledge in relational databases. In Uncertainty Management in Information Systems; Motro, A., Smets, P., Eds.; Springer: Boston, MA, USA, 1997; pp. 35–87.
  53. Suciu, D. Probabilistic databases for all. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Portland, OR, USA, 14–19 June 2020; pp. 19–31.
  54. Suciu, D.; Olteanu, D.; Ré, C.; Koch, C. Probabilistic Databases, Synthesis Lectures on Data Management; Morgan Claypool: San Rafael, CA, USA, 2011.
  55. Ceylan, I.I.; Darwiche, A.; Van den Broeck, G. Open-world probabilistic databases: Semantics, algorithms, complexity. Artif. Intell. 2021, 295, 103474.
  56. Sarma, A.D.; Benjelloun, O.; Halevy, A.; Widom, J. Working models for uncertain data. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 3–7 April 2006; p. 7.
  57. Chen, R.; Mao, Y.; Kiringa, I. GRN model of probabilistic databases: Construction, transition and querying. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 291–302.
  58. Dalvi, N.; Suciu, D. Management of probabilistic data: Foundations and challenges. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 26–28 June 2007; pp. 1–12.
  59. Sen, P.; Deshpande, A.; Getoor, L. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 2009, 18, 1065–1090.
  60. Mauritz, R.; Nijweide, F.; Goseling, J.; van Keulen, M. Autoencoder-Based Cleaning in Probabilistic Databases. ACM J. Data Inf. Qual 2021. Available online: https://ris.utwente.nl/ws/portalfiles/portal/256093655/arxiv_preprint_2106.09764.pdf (accessed on 26 September 2022).
  61. Antova, L.; Koch, C.; Olteanu, D. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. VLDB J. 2009, 18, 1021–1040.
  62. Widom, J. Trio: A System for Integrated Management of Data, Accuracy, and Lineage; Stanford InfoLab: Stanford, CA, USA, 2004.
  63. Jampani, R.; Xu, F.; Wu, M.; Perez, L.L.; Jermaine, C.; Haas, P.J. Mcdb: A monte carlo approach to managing uncertain data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2008; pp. 687–700.
  64. De Keijzer, A.; Van Keulen, M. IMPrECISE: Good-is-good-enough data integration. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Washington, DC, USA, 7–12 April 2008; pp. 1548–1551.
  65. Van Keulen, M.; De Keijzer, A. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB J. 2009, 18, 1191–1217.
  66. Grohe, M.; Lindner, P. Infinite probabilistic databases. arXiv 2020, arXiv:2011.14860.
  67. Li, Y.; Li, Q.; Gao, J.; Su, L.; Zhao, B.; Fan, W.; Han, J. Conflicts to harmony: A framework for resolving conflicts in heterogeneous data by truth discovery. IEEE Trans. Knowl. Data Eng. 2016, 28, 1986–1999.
  68. Xu, J.; Zadorozhny, V.; Grant, J. IncompFuse: A logical framework for historical information fusion with inaccurate data sources. J. Intell. Inf. Syst. 2020, 54, 463–481.
  69. Panse, F.; Ritter, N. Relational data completeness in the presence of maybe-tuples. Ingénierie Systèmes D’information (2001) 2010, 15, 85–104.
  70. Yong-Xin, Z.; Qing-Zhong, L.; Zhao-Hui, P. A novel method for data conflict resolution using multiple rules. Comput. Sci. Inf. Syst. 2013, 10, 215–235.
  71. Cooper, R.; Devenny, L. A Database System for Absorbing Conflicting and Uncertain Information from Multiple Correspondents. In Proceedings of the British National Conference on Databases, Birmingham, UK, 7–9 July 2009; pp. 199–202.
  72. Dong, X.L.; Gabrilovich, E.; Heitz, G.; Horn, W.; Murphy, K.; Sun, S.; Zhang, W. From data fusion to knowledge fusion. arXiv 2015, arXiv:1503.00302.
  73. Liu, X.; Dong, X.L.; Ooi, B.C.; Srivastava, D. Online data fusion. Proc. VLDB Endow. 2011, 4, 932–943.
  74. Singh, Y.; Kaur, A.; Suri, B.; Singhal, S. Systematic Literature Review on Regression Test Prioritization Techniques. Informatica 2012, 36, 379–408.
  75. Zhang, L.; Xie, Y.; Xidao, L.; Zhang, X. Multi-source heterogeneous data fusion. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May 2018; pp. 47–51.
  76. Yang, Y.; Gu, L.; Zhu, X. Conflicts Resolving for Fusion of Multi-source Data. In Proceedings of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019; pp. 354–360.
  77. Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. (CSUR) 2009, 41, 1–41.
  78. Yin, X.; Han, J.; Philip, S.Y. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 2008, 20, 796–808.
  79. Jiang, Z. Reconciling Continuous Attribute Values from Multiple Data Sources. PACIS 2008 Proc. 2008, 264. Available online: https://aisel.aisnet.org/pacis2008/264/ (accessed on 26 September 2022).
  80. Dellis, E.; Seeger, B. Efficient Computation of Reverse Skyline Queries. In Proceedings of the VLDB, Vienna, Austria, 16 February 2007; pp. 291–302.
  81. Slaney, J.; Paleo, B.W. Conflict resolution: A first-order resolution calculus with decision literals and conflict-driven clause learning. J. Autom. Reason. 2018, 60, 133–156.
  82. Maunder, M.N.; Piner, K.R. Dealing with data conflicts in statistical inference of population assessment models that integrate information from multiple diverse data sets. Fish. Res. 2017, 192, 16–27.
  83. Pasternack, J.; Roth, D. Making better informed trust decisions with generalized fact-finding. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011.
  84. Yin, X.; Tan, W. Semi-supervised truth discovery. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 217–226.
  85. Zhao, B.; Rubinstein, B.I.; Gemmell, J.; Han, J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 2012, 5, 550–561.
  86. Galland, A.; Abiteboul, S.; Marian, A.; Senellart, P. Corroborating information from disagreeing views. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, NY, USA, 3–6 February 2010; pp. 131–140.
  87. Jaradat, A.; Deraman, A.; Idris, S.; Din, L.; Said, N. Pemodelan maklumat biodiversiti: Pendekatan objek digital informative. In Proceedings of the 6th ITB-UKM joint Seminar on Chemistry, Bali, Indonesia, 17–18 May 2005.
  88. Deraman, A.; Yahaya, J.; Salim, J.; Idris, S.; Jambari, D.I.; Komoo, A.J.I.; Leman, M.S.; Unjah, T.; Sarman, M.; Sian, L.C. The development of myGeo-RS: A knowledge management system of geodiversity data for tourism industries. Commun. IBIMA 2009, 8, 142–146.
  89. Peng, L. Research on Data Uncertainty and Lineage Through Trio. In Proceedings of the 2019 The World Symposium on Software Engineering, Wuhan, China, 20–23 September 2019; pp. 73–77.
  90. Roy, S. Uncertain Data Lineage. Encycl. Database Syst. 2018, 4280–4286.
  91. Kimmig, A.; De Raedt, L. Probabilistic logic programs: Unifying program trace and possible world semantics. In Proceedings of the Workshop on Probabilistic Programming Semantics, Paris, France, 1 January 2017.
  92. Fan, W.; Geerts, F.; Tang, N.; Yu, W. Conflict resolution with data currency and consistency. J. Data Inf. Qual. (JDIQ) 2014, 5, 1–37.
  93. Klir, G.J. Uncertainty and Information: Foundations of Generalized Information Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
  94. Kuicheu, N.C.; Wang, N.; Fanzou Tchuissang, G.N.; Xu, D.; Dai, G.; Siewe, F. Managing uncertain mediated schema and semantic mappings automatically in dataspace support platforms. Comput. Inform. 2013, 32, 175–202.
  95. Doucouliagos, C. A note on the evolution of homo economicus. J. Econ. Issues 1994, 28, 877–883.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
