2.2. Intelligent Semantic Data Agents
The introduction of software Intelligent Semantic Data Agents (ISDAs) (Figure 2) is a concept that has the potential to revolutionize the interaction between users and data. The principal idea is comparable to the human agent of, e.g., a movie star, whose task is to promote the actor and to negotiate new engagements. Ideally, the human agent has all relevant information about the actor, including past engagements, preferred partners, limitations, and preferences, and fully understands the capabilities of the actor. Similarly, an ISDA has all relevant information about a dataset, including comprehensive provenance, related datasets, models and applications to be used by users, user types that might be interested, applicability and limitations, quality and uncertainties, and more. The ISDA has the task to promote the dataset actively to potential users (thus making progress toward the Data Discover Users (DDU) concept), to respond to queries, to inform about the dataset, to provide derived information (e.g., selected statistics, subsets, etc.), to receive feedback from users, and to learn from user interactions so as to be better prepared for future users.
From a semantic point of view, the knowledge base will formulate the semantics of the domain, such that each data product has a meaning attached to it. However, it will go beyond the semantics of datasets to a pragmatic approach, in which a data product is represented by an agent that is aware of the data product’s meaning and is capable of learning potential use cases of the data product. Thus, data products will be represented by agents (the ISDAs) that can act on knowledge within the knowledge base and generate new knowledge.
Data products present in the graphs of the knowledge base will be represented by ISDAs that act on their behalf. The ISDAs are purposive software agents whose aim is to facilitate the interaction between users and the data product. In particular, an ISDA will be able to respond to questions about its data product, provide access to parts or all of the data product, and solicit feedback on the data product. Initially, the ISDAs will be goal-based agents [50], but they will have to evolve into learning agents. The ISDAs can request specific analytics from the knowledge base to discover potential users and to enter into communication with them. In particular, an ISDA can find users with the skills and interest to use the data, or who might need these data to corroborate a published study, even if these potential users did not know of the existence of the data. The ISDAs will be able to use the social media and contact information of users in the knowledge base to enter into communication with them. A core research question on the path to implementation is how rich the data description will have to be to enable these capabilities.
The ISDAs are capable of executing complex transaction patterns with users, such as granting access; executing custom queries to aggregate, truncate, convert, or randomly sample data; and providing references or metadata. For that, the agents will adopt a transaction processing framework to manage their interactions with other agents and users [54]. The concept of rough sets [55] can also be considered as a capability of the ISDAs, allowing them to handle imprecise or incomplete descriptions of data.
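As a minimal illustration of the transaction patterns mentioned above (a sketch only; the class and method names are hypothetical and not part of an existing implementation), an ISDA could dispatch incoming user requests to handlers that aggregate, subset, or randomly sample its dataset, or return metadata, while logging every interaction:

```python
import random
from statistics import mean

class ISDA:
    """Hypothetical sketch of an Intelligent Semantic Data Agent wrapping one dataset."""

    def __init__(self, dataset_id, records, metadata):
        self.dataset_id = dataset_id      # unique identifier of the data product
        self.records = records            # list of numeric observations (simplified)
        self.metadata = metadata          # provenance, quality, access conditions, ...
        self.transaction_log = []         # track record of all user interactions

    def handle_request(self, user_id, action, **params):
        """Dispatch a user transaction and keep a record of it."""
        handlers = {
            "metadata": lambda: self.metadata,
            "aggregate": lambda: mean(self.records),
            "subset": lambda: self.records[: params.get("limit", 10)],
            "sample": lambda: random.sample(self.records,
                                            k=min(params.get("k", 5), len(self.records))),
        }
        if action not in handlers:
            raise ValueError(f"unsupported transaction: {action}")
        result = handlers[action]()
        self.transaction_log.append({"user": user_id, "action": action, "params": params})
        return result

# Usage: an agent representing a small synthetic dataset
agent = ISDA("sea-level-demo", records=[3.1, 3.3, 2.9, 3.4], metadata={"units": "mm/yr"})
print(agent.handle_request("user-42", "aggregate"))   # -> 3.175
print(agent.handle_request("user-42", "sample", k=2))
```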
The ISDAs will be able to grow from initial “seeds” with very limited capabilities into fully developed “adult” agents that have access to all the information related to the dataset, including all uses, experiences, and feedback. Thus, the agents gain in knowledge as the knowledge base becomes more complete. A deep-learning algorithm will be used to further enrich the information available to an ISDA about the represented dataset so that it can link to users with potentially matching interests and needs and inform users about products of potential interest to them, including the data sharing and access conditions.
The ISDAs will also benefit from a generalization of the concept of the digital object identifier that comprehensively identifies a dataset, including the relevant metadata, the ISDA, and derived datasets, in a consistent identification scheme. Having the main identifier point to the ISDA instead of the dataset itself will ensure that a user who aims to access the data will always have access to the full history of transformations and applications of the data.
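A minimal sketch of such an identification scheme (the identifiers and endpoint URL are hypothetical, for illustration only) could resolve a persistent identifier to the ISDA record rather than to the raw dataset, so that the agent, its metadata, and derived datasets remain reachable from a single entry point:

```python
# Hypothetical resolver: the persistent identifier resolves to the ISDA record,
# which in turn references the dataset, its metadata, and derived datasets.
REGISTRY = {
    "isda:10.9999/demo.001": {
        "agent_endpoint": "https://agents.example.org/demo-001",   # hypothetical URL
        "dataset": "sea-level-demo",
        "metadata": {"provenance": "...", "license": "..."},
        "derived_datasets": ["isda:10.9999/demo.001.a"],
    }
}

def resolve(identifier):
    """Return the ISDA record, not the raw dataset, so the full history stays reachable."""
    record = REGISTRY.get(identifier)
    if record is None:
        raise KeyError(f"unknown identifier: {identifier}")
    return record

print(resolve("isda:10.9999/demo.001")["agent_endpoint"])
```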
2.3. The Knowledge Base
The knowledge base is envisioned as an extended version of the existing Socio-Economic and Environmental Information Needs Knowledge Base (SEE-IN KB), whose main function is to construct and analyze graph data capturing the connections between datasets, products, applications, user types, and other elements in scientific communities and society at large. To the extent permissible under privacy and personal data protection regulations, such as the European General Data Protection Regulation (GDPR), individual persons can be integrated into the graph data. This knowledge base provides the graph data and analytical tools to connect users and facilitate collaborations.
Graph data consist of two basic elements: the nodes (or vertices) and the links (or edges) between these nodes. Both nodes and links are objects characterized by a set of properties. Each link is associated with two nodes. Links can be directional, with head and tail nodes, or bidirectional. In the SEE-IN KB, the nodes are not limited in terms of what objects can constitute a node. For example, nodes can be as diverse as a specific person, a group or type of humans (e.g., a user type), a dataset, an information need, a societal goal, a modeling software package, or a specific observation sensor. The set of properties for each class of nodes and links is dynamic and can be extended as more information about an object becomes available. Importantly, each node and link has a unique identifier.
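The following sketch (hypothetical class names, in Python) illustrates this graph model: nodes and links are objects with an extensible property set, an optional direction on links, and a unique identifier for every element:

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Dict

_ids = count(1)   # simple source of unique identifiers for the sketch

@dataclass
class Node:
    """A knowledge-base node: any object (person, user type, dataset, sensor, ...)."""
    label: str
    node_class: str
    properties: Dict[str, Any] = field(default_factory=dict)    # extensible property set
    uid: int = field(default_factory=lambda: next(_ids))        # unique identifier

@dataclass
class Link:
    """A link between two nodes; 'directed' distinguishes head-to-tail from bidirectional."""
    head: Node
    tail: Node
    link_type: str
    directed: bool = True
    properties: Dict[str, Any] = field(default_factory=dict)
    uid: int = field(default_factory=lambda: next(_ids))

# Usage: a dataset node linked to a user-type node that needs it
ds = Node("Global SST fields", "dataset", {"variable": "sea surface temperature"})
ut = Node("Coastal planners", "user_type")
needs = Link(head=ut, tail=ds, link_type="has_information_need")
print(ds.uid, ut.uid, needs.link_type)
```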
The knowledge base uses big data analysis techniques to map the user landscape in the communities engaged in research and applications and to identify their knowledge and information needs. It generates graph data that describe user types and their potential needs based on publications and social media communications and links them to tools and datasets. In utilizing published information on persons, such as paper authorship and ownership of data and processing tools, it will be important to ensure compliance with privacy and personal data protection regulations, such as the GDPR. Individual persons can be integrated as nodes into the graphs. During the development of the Global Earth Observation System of Systems (GEOSS) User Requirements Registry (URR), which initially only captured user types, users of the URR repeatedly requested the possibility to link themselves to user types and to establish a social network of users within the URR [30]. It is expected that similar requests will be made for the knowledge base. The knowledge base also maps the Earth observation (EO) landscape in terms of available datasets, products, and processing tools. The research communities are being mapped in terms of research topics, needs, and challenges, as well as the tools available to process and analyze data and to use data for modeling and simulation. An important source for mapping research communities is the comprehensive publication and citation data compiled in rapidly expanding research knowledge hubs. Increasingly, journals require information on the data and tools used for the research published in a paper, see, e.g., [59]. This information can be exploited to inform the construction of graph data and to increase the knowledge and skills of the ISDAs. The development of the graph data is also based on deep searches and deep learning from scholarly and other publications, social networks, etc. In particular, the knowledge base will employ parallel crawlers to inform the construction of graph data.
The knowledge base requires the capability to provide the information needed to bring data and products to potential users. This capability has to be based on the full spectrum of graph theory. This includes the detection of components and communities, applying, e.g., search algorithms such as depth-first search (DFS) [60] and Kosaraju's algorithm, see, e.g., [61], as well as the concepts of weakly connected components, label propagation, and sparsification [62]. Evaluating community structures can focus on conductance, modularity, and clustering coefficients [63], which provides a basis to identify collaboration potentials between research groups and individuals. Ranking and walking along graphs provide a basis for prioritization as well as discovery of relevant nodes in support of data promotion, and can be based on algorithms applying PageRank and different centralities, see, e.g., [64], as well as random walks and sampling. Path finding facilitates the identification of users whose requirements could be a match for a dataset, applying, e.g., Dijkstra's [65] and Bellman-Ford's [61] algorithms. Importantly, detection of unreliable or fake information [66] has to be integrated into the graph development processes.
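As an illustration of how such graph-analytical building blocks could be combined (a sketch assuming the networkx library and a toy graph of users and datasets, not an operational component of the knowledge base):

```python
import networkx as nx

# Toy graph linking users (u*) and datasets (d*); edge weights act as relevance scores.
G = nx.Graph()
G.add_weighted_edges_from([
    ("u1", "d1", 1.0), ("u2", "d1", 0.5), ("u2", "d2", 1.0),
    ("u3", "d3", 1.0), ("d2", "d3", 0.2), ("u4", "d4", 1.0),
])

# Component detection: u4/d4 form a separate component in this toy graph
components = list(nx.connected_components(G))

# Community detection by label propagation
communities = list(nx.algorithms.community.label_propagation_communities(G))

# Ranking nodes (PageRank) for prioritization in data promotion
ranks = nx.pagerank(G)

# Path finding from a dataset to a potential user (Dijkstra on weighted edges)
path = nx.dijkstra_path(G, "d3", "u1")

print(len(components), ranks["d1"], path, sep="\n")
```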
The SEE-IN KB provides extensive search and feedback utilities, and the analysis of both searches and feedback with deep-learning methods can further improve the capability to add intelligence to the ISDAs. Crowd-sourcing opportunities can be used to gather both primary graph data and feedback on data and on the performance of the ISDAs. The lexicon (ontology) contained in the SEE-IN KB, as the primary source for all semantic aspects, will grow based on deep learning from other registries and from user interactions. The SEE-IN KB provides access to a large set of user needs (originally collected in the GEOSS User Requirements Registry (URR) [68]) and observational requirements (partly harvested from OSCAR, see http://www.wmo-sat.info/oscar). The SEE-IN KB explores existing and new data repositories in an effort to link Earth observations (EOs) and the global community of potential users.
Big data analytics on the graph data in an extended version of the SEE-IN KB is at the core of the DAS concept. In the current DPO concept, datasets are passive and isolated in repositories. In contrast, the DAS approach will create the graph data of a “Web of things” in which each dataset is represented by a node with semantic and pragmatic descriptors and is meaningfully interconnected with other entities (other datasets, users, models, instruments, etc.) through complex and dynamic relations, which will be updated as users and ISDAs interact with the graph data and provide feedback.
The graph data require a generic model for metadata (referred to below as the metamodel) that enables the networked representation of a population of entities and their mutual relations. Since the system is open-ended, and the final extent of all datasets that may be added is not known at inception, it would be illusory to attempt to create a fixed and comprehensive ontology that would encompass every future addition of datasets to the knowledge base.
A dataset provides a partial, biased, and time-bounded description of an object of interest in the real world. This means that the dataset expresses a reference in a semiotic relationship that involves the real-world object as referent and the specific form of the data as symbol. The data provider and data users relate to the dataset both at a semantic level, to uncover the meaning expressed in it, and at a pragmatic level, to achieve some practical ends, communicative or otherwise. In this sense, datasets are arguably more complex objects to manipulate and recommend automatically than products on Amazon or videos on YouTube. Even the individuation of the real-world object to which the data point is subject to the researcher's interests and underlying theories or to a user's preconceptions and world view. Similarly, the characteristics of the object represented by the dataset depend on the technical means of observation, on the methodology adopted, and on the level of fidelity decided by the data provider.
Other aspects to be covered in the DAS approach involve the origin of the data (which actors made it available) and how it was obtained, for instance, whether the measurement is a one-time snapshot or longitudinal, whether the data originated from a model (and what kind of model), a survey, or observations (and what kind of sensor), and what use cases the data can support. The SEE-IN KB will also have to enforce integrity rules through mechanisms like reputation management, voting, and read/copy/write access rules to make sure that datasets are not tampered with and that single-source-of-truth principles are maintained for every data entity.
An important step towards the implementation of the DAS concept is the introduction of an extensible metamodel that covers these aspects of the graph data, so that the ISDAs initiated by data providers can represent their associated datasets as precisely as possible, advanced search capabilities can be implemented, and the big data algorithms have a rich basis upon which to analyze a continuously growing knowledge base and ultimately bring the data to those data users who need it.
Besides the graph-data metamodel, an important ingredient of the DAS concept is the introduction of advanced machine learning algorithms to bring the data to potential users. Broadly speaking, machine learning refers to the capability of a computer program to learn a knowledge-intensive task while improving its performance on the task as it gains more experience [70]. The task at hand is the suggestion of datasets and potential collaborators to a set of users. The performance corresponds to the practical value of the suggested datasets to the users, while the experience is derived from the feedback obtained from users regarding the quality of the suggestions. The machine learning algorithms will take advantage of the underlying structure of the graph data, the similarity between datasets, and the similarity between users as obtained from social media and scholarly publications. The machine learning techniques that can be used to achieve this include clustering, collaborative filtering, case-based reasoning, and deep learning.
Clustering is a computing task in which a set of objects is segmented into subsets such that the objects in one cluster are more similar to each other than to objects outside the cluster [70]. Clustering can be used to create categories of datasets on the one hand and categories of users and applications on the other hand. The clustering of datasets can be performed by applying the highly connected subgraph algorithm [71] to the graph data. Datasets will be found in the same cluster if they are highly connected in the graph data, which means that the datasets within one cluster share relevant variables and methodological features. The similarity metric of the clustering algorithm will be continuously adapted based on the feedback received from users. Thus, as the algorithm gains in experience, the clustering of the datasets will result in increasingly homogeneous groups, thereby enabling more customized suggestions. Since the graph links have different semantics, the same dataset element can belong to multiple clusters, for instance geographic clusters, data fidelity clusters, topical clusters, etc.
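A simplified version of highly connected subgraph (HCS) clustering on a toy dataset-similarity graph might look as follows (a sketch assuming the networkx library; the recursion and the "highly connected" threshold follow the standard formulation of the algorithm):

```python
import networkx as nx

def highly_connected(G):
    """A subgraph is 'highly connected' if its edge connectivity exceeds half its node count."""
    return nx.edge_connectivity(G) > G.number_of_nodes() / 2

def hcs_clusters(G):
    """Simplified highly connected subgraph (HCS) clustering: split along a minimum
    edge cut until every remaining subgraph is highly connected."""
    if G.number_of_nodes() <= 2 or highly_connected(G):
        return [set(G.nodes())]
    if not nx.is_connected(G):
        return [c for comp in nx.connected_components(G)
                for c in hcs_clusters(G.subgraph(comp).copy())]
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))
    return [c for comp in nx.connected_components(H)
            for c in hcs_clusters(H.subgraph(comp).copy())]

# Toy similarity graph between datasets: two tight groups joined by one weak link.
G = nx.Graph([("d1", "d2"), ("d2", "d3"), ("d1", "d3"),
              ("d4", "d5"), ("d5", "d6"), ("d4", "d6"),
              ("d3", "d4")])
print(hcs_clusters(G))   # expected: clusters {d1, d2, d3} and {d4, d5, d6}
```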
Using social network data (such as Facebook posts or Twitter hashtags), parsed publications, research knowledge hubs with citation data, newspaper articles (particularly those discussing science-related topics), co-citation analysis, as well as past patterns of dataset search and use, it will be possible to similarly cluster the users into multiple groups based on their scientific disciplines, their application domains of interest, their geographic area of focus, etc. Here again, as the algorithm learns more about the relevant properties that users share, users will be placed in clusters that become more and more specific, so that the recommendations will become more accurate.
Collaborative filtering uses the ratings and feedback provided by users of a product to recommend the same product to users with a high level of similarity. Commonly used similarity metrics are the Pearson correlation [72] and the vector cosine-based similarity [73]. In this approach, crowd-sourced user feedback is exploited to provide better suggestions. This method may be inadequate at the beginning, when user feedback data are sparse, but improves quickly as user data become more widespread [74]. Collaborative filtering works well in combination with the clustering method described before, since, initially, recommendations may be forwarded to users in the same cluster, as they share some similarity.
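A minimal user-based collaborative filtering sketch with the Pearson correlation as similarity metric (synthetic ratings, NumPy only; not an operational recommender) could look as follows:

```python
import numpy as np

# Rows: users, columns: datasets; entries are feedback ratings (0 = not yet rated).
R = np.array([
    [5, 4, 2, 1, 0],   # user 0 has not rated dataset 4 yet
    [5, 3, 1, 0, 5],   # user 1 has similar tastes and rated dataset 4 highly
    [1, 0, 2, 5, 4],
    [0, 1, 2, 4, 5],
], dtype=float)

def pearson(u, v):
    """Pearson correlation between two users, restricted to co-rated datasets."""
    both = (u > 0) & (v > 0)
    if both.sum() < 2:
        return 0.0
    return float(np.corrcoef(u[both], v[both])[0, 1])

def predict(R, user, item, k=2):
    """Predicted rating: similarity-weighted average over the k most similar users."""
    sims = np.array([pearson(R[user], R[other]) if other != user else -np.inf
                     for other in range(R.shape[0])])
    neighbours = np.argsort(sims)[::-1][:k]
    num = sum(sims[n] * R[n, item] for n in neighbours if sims[n] > 0 and R[n, item] > 0)
    den = sum(sims[n] for n in neighbours if sims[n] > 0 and R[n, item] > 0)
    return num / den if den else 0.0

# Dataset 4 is a strong suggestion for user 0, because the most similar user rated it highly.
print(round(predict(R, user=0, item=4), 2))   # -> 5.0
```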
In case-based reasoning, properties of datasets and of user entities are utilized to match users and products. The cases encode knowledge such as “users sufficiently similar to user u who accessed a dataset with property x also used a dataset with property y.” As such, case-based reasoning will exploit the results of the clustering algorithms. Case-based reasoning algorithms are often based on decision trees [75] and have some major benefits: they are suitable for non-formalized knowledge domains, they are robust and easy to maintain, and they allow for incremental improvement. However, just as with collaborative filtering, the approach becomes computationally inefficient when the domain is too dynamic and when the number of cases becomes very large [76].
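As a toy illustration of encoding such cases (assuming scikit-learn is available; the features and cases are invented for illustration), a decision tree can learn rules of the form “users from this cluster who accessed a dataset with property x also used a dataset with property y”:

```python
from sklearn.tree import DecisionTreeClassifier

# Each case: (user cluster, accessed a dataset with property x?, dataset fidelity level)
# Label: did the user subsequently use a dataset with property y? (1 = yes)
cases = [
    (0, 1, 2, 1),
    (0, 1, 1, 1),
    (0, 0, 2, 0),
    (1, 1, 2, 0),
    (1, 0, 1, 0),
    (1, 1, 1, 1),
]
X = [c[:3] for c in cases]
y = [c[3] for c in cases]

# Decision trees keep the learned rules easy to inspect and to update incrementally
# as new cases are added, which matches the maintainability argument above.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A new user from cluster 0 who accessed a dataset with property x, at fidelity level 2:
print(tree.predict([[0, 1, 2]]))   # -> [1]: suggest datasets with property y
```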
To remedy these shortcomings, deep-learning techniques based on restricted Boltzmann machines [77] are emerging as very promising for data-intensive learning tasks, owing to the availability of parallelized computational resources. These techniques use successive layers of neural networks and perform computations of increasing levels of abstraction to discover a hierarchy of features, from low-level features to higher-level ones [78], i.e., a bottom-up approach. Deep-learning algorithms have been successfully applied to computer vision and language processing and have only recently begun to be used in commercial recommender systems [79]. As shown in [80], deep-learning algorithms can be used to learn about the attitudes of a user toward a dataset from the review text posted by users and the features of the product itself, and thereby match datasets with types of users to maximize the utility of a dataset for a certain type of user.
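A minimal restricted Boltzmann machine trained with one-step contrastive divergence on a tiny, synthetic user–dataset interaction matrix (NumPy only; a sketch of the general technique, not of the system described in [80]) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary restricted Boltzmann machine trained with CD-1 (one-step contrastive divergence)."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def _hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def _visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def fit(self, V, epochs=500):
        for _ in range(epochs):
            ph = self._hidden_probs(V)                         # positive phase
            h = (rng.random(ph.shape) < ph).astype(float)      # sample hidden units
            pv = self._visible_probs(h)                        # one Gibbs step back to visible
            ph2 = self._hidden_probs(pv)                       # negative phase
            self.W += self.lr * (V.T @ ph - pv.T @ ph2) / len(V)
            self.b_v += self.lr * (V - pv).mean(axis=0)
            self.b_h += self.lr * (ph - ph2).mean(axis=0)

    def score(self, v):
        """Reconstruction probabilities: high values suggest datasets this user may engage with."""
        return self._visible_probs(self._hidden_probs(v))

# Rows: users, columns: datasets (1 = used / positively reviewed).
V = np.array([[1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)

rbm = RBM(n_visible=5, n_hidden=2)
rbm.fit(V)
# Score a user who has only used the first dataset; datasets co-used with it in the
# training data are expected to tend toward higher reconstruction scores.
print(np.round(rbm.score(np.array([1, 0, 0, 0, 0], dtype=float)), 2))
```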
2.4. Interaction Platform
The interaction platform is the space in which users and ISDAs interact with each other (Figure 2) and where a track record of these interactions is kept. Users of the platform can take on the role of data providers, who want to make datasets available to a community of users, or data users, who may be scientists who need some data in the context of their research or other social agents (individuals, governmental bodies, NGOs) who may have an interest in knowledge derived from the data to answer practical questions relevant to their problems.
Experience and events should be captured in schemes that provide a complete history of a given dataset. While such a scheme for the recording of transactions could be based on blockchains, there are concerns that this would be far too demanding in terms of energy, see, e.g., [81]. Blockchain is an emerging interaction paradigm for the transmission and storage of information without centralized control. It is a secure and distributed database that is hosted locally by the human or software agents engaged in a transaction. It contains the history of all transactions performed by these agents, without a centralized intermediary, thereby allowing each participant to independently verify the validity of a chain of interactions. Furthermore, blockchains can be made public or can limit access to users with specified credentials.
The first blockchain was introduced by Bitcoin [82], but its use as an architectural model for secure user interaction has now expanded beyond the domain of digital currencies [83]. User transactions are structured in blocks. Each block is validated by an algorithmic key or “proof of work.” Once a block is validated, it is timestamped, added to the chain of blocks, and becomes publicly visible to the members of the network. The decentralized, transparent, and robust nature of blockchain makes it particularly well adapted for a distributed and intelligent data search system. However, the choice of whether to use one of the existing blockchains (for a discussion of potential candidates, see, e.g., [84]) or to develop a new blockchain dedicated to data and knowledge-related transactions would be a difficult one. In addition, there are concerns that the trust in blockchains is not fully justified [85]. An important application of blockchains is to provide provenance, particularly with respect to transfers of ownership. This comes with a very high use of resources. In fact, a white paper developed by the World Economic Forum states that the energy consumed in the blockchain network is unsustainable [81]. Energy consumption can be reduced significantly depending on the consensus algorithms used [86], and replacing the “proof-of-work” algorithms with “proof-of-stake” or “proof-of-authority” algorithms drastically reduces energy consumption and decouples it from the number of users engaged in a blockchain [87]. For access to data, tools to process the data, information derived from data, and knowledge created using the data, the ownership in general remains with the originator, and only the rights to access, processing, use, and further distribution are points of negotiation. For this purpose, provenance may be achieved without blockchains. However, a distributed ledger that validates and records transactions between several ISDAs as well as between ISDAs and human agents seems to be mandatory for the interaction platform.
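To make the trade-off concrete, the following sketch (hypothetical, standard-library Python only) shows a minimal hash-chained ledger in which the consensus step is pluggable: the energy-hungry proof-of-work search for a nonce can be replaced by a proof-of-authority check by a trusted validator:

```python
import hashlib, json, time

def block_hash(block):
    """Deterministic hash of a block's contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

class Ledger:
    """Minimal hash-chained transaction ledger with a pluggable consensus step."""

    def __init__(self, consensus="proof-of-authority", difficulty=3, authorities=("isda-registry",)):
        self.chain = []
        self.consensus = consensus
        self.difficulty = difficulty
        self.authorities = set(authorities)

    def add_block(self, transactions, validator=None):
        block = {
            "index": len(self.chain),
            "timestamp": time.time(),
            "transactions": transactions,           # e.g., data-access grants issued by ISDAs
            "previous_hash": block_hash(self.chain[-1]) if self.chain else "0",
            "nonce": 0,
        }
        if self.consensus == "proof-of-work":
            # energy-hungry: search for a nonce giving a hash with leading zeros
            while not block_hash(block).startswith("0" * self.difficulty):
                block["nonce"] += 1
        else:
            # proof-of-authority: a trusted validator signs off instead of burning compute
            if validator not in self.authorities:
                raise PermissionError("validator not authorised")
            block["validator"] = validator
        self.chain.append(block)

    def verify(self):
        """Every participant can independently re-check the chain of hashes."""
        return all(self.chain[i]["previous_hash"] == block_hash(self.chain[i - 1])
                   for i in range(1, len(self.chain)))

ledger = Ledger()
ledger.add_block([{"isda": "sea-level-demo", "user": "user-42", "action": "access_granted"}],
                 validator="isda-registry")
print(ledger.verify())   # -> True
```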
For the management of interactions between agents (data agents, models, persons, repositories, etc.), a concept similar to that of “smart contracts” could be developed. These “smart contracts” would automatically perform delegated terms of a contract without user intervention. The traceability of blockchains or a similar distributed ledger would allow the capture of events and user experiences in ledger-based schemes that provide a complete history of dataset additions, access, purchases, updates, etc. To the extent possible, protocols would facilitate, verify, or enforce the negotiation or performance of a “contract” between a user and the ISDA representing the data product. With this concept, many aspects of the transactions could be made partially or fully self-executing, self-enforcing, or both. Conceptually, this approach provides security superior to traditional, more open transactions. The “smart contract” concept seamlessly interfaces with a distributed ledger.
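A minimal sketch of such a self-executing agreement (hypothetical class and field names; the distributed ledger is stood in for by a simple list) illustrates how access could be granted automatically once the declared terms match the agreed conditions:

```python
class DataAccessContract:
    """Hypothetical self-executing agreement between an ISDA and a user: once the agreed
    terms are met, access is granted and the transaction is recorded without manual steps."""

    def __init__(self, isda_id, conditions):
        self.isda_id = isda_id
        self.conditions = conditions     # e.g., {"purpose": "research", "cite_doi": True}
        self.record = []                 # stands in for the platform's distributed ledger

    def request_access(self, user_id, declared_terms):
        if all(declared_terms.get(k) == v for k, v in self.conditions.items()):
            self.record.append({"user": user_id, "isda": self.isda_id,
                                "action": "access_granted", "terms": declared_terms})
            return "access granted"
        return "access denied: agreed terms not met"

contract = DataAccessContract("sea-level-demo", {"purpose": "research", "cite_doi": True})
print(contract.request_access("user-42", {"purpose": "research", "cite_doi": True}))
```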
However, as noted above, blockchains are very demanding in terms of computational resources and energy, and a careful assessment of the trade-off between the amount of resources needed and the level of security, persistence, and documentation achieved needs to be carried out to inform the design of the interaction platform.