Frequent Itemset Mining and Multi-Layer Network-Based Analysis of RDF Databases

Abstract: Triplestores, or resource description framework (RDF) stores, are purpose-built databases used to organise, store and share data with context. Knowledge extraction from a large amount of interconnected data requires effective tools and methods to address the complexity and the underlying structure of semantic information. We propose a method that generates an interpretable multilayered network from an RDF database. The method utilises frequent itemset mining (FIM) of the subjects, predicates and objects of the RDF data, and automatically extracts informative subsets of the database for the analysis. The results are used to form layers in an analysable multidimensional network. The methodology enables consistent, transparent, multi-aspect-oriented knowledge extraction from the linked dataset. To demonstrate the usability and effectiveness of the methodology, we analyse how the science of sustainability and climate change is structured using the Microsoft Academic Knowledge Graph. In the case study, FIM forms networks of disciplines to reveal the significant interdisciplinary science communities in sustainability and climate change. The constructed multilayer network then enables an analysis of the significant disciplines and interdisciplinary scientific areas. To demonstrate the proposed knowledge extraction process, we search for interdisciplinary science communities and then measure and rank their multidisciplinary effects. The analysis identifies discipline similarities, pinpointing the similarity between atmospheric science and meteorology as well as between geomorphology and oceanography. The results confirm that frequent itemset mining provides informative sampled subsets of RDF databases, which can be simultaneously analysed as layers of a multilayer network.


Introduction
Linked data (LD) represent an essential tool used to organise, store and share data with context [1]. Datasets that are published as LD form the Semantic Web. The part of the Semantic Web that is freely accessible is called the linked open data cloud (LODC). The driver of LD is the resource description framework (RDF) data model [2], which is standardised by the World Wide Web Consortium (W3C). Databases following the RDF standard are called triplestores or RDF stores. The naming is intuitive: the atomic form of the RDF is an RDF triplet in the form "subject-predicate-object" (s,p,o), which states that "an object o has a relationship p with subject s" [3]. There is little work on the formalisation of the RDF besides the official documents of the W3C, particularly RDF Concepts and Abstract Syntax [4] and RDF Semantics [5], due to its flexibility and extensibility [6]. There are formalisations towards special representations, such as bipartite graphs as an intermediate model for RDF [7]. The main concepts of the RDF are self-descriptive data, data about data [4], machine readability [8] and extendibility [9]. LOD offers large quantities of freely available, interconnected, statistical (linked open statistical data (LOSD)) [10], governmental [11], scientific [12,13] and other annotated data [14]. The collection of such [...] cloud by multiple systematic SPARQL queries. This step can be seen as a series of bipartite projections and is described in more detail in the next section. The overall final step is ranking in the multilayer network, as ranking can be considered a translation of highly complex phenomena into short, simple messages that can be easily digested [42]. Ranking, however, not only describes but also prescribes [43]; therefore, a very careful criteria selection method must be used. Ranking interconnections in the network has also been investigated for finding relevant relationships [44].
Network-based techniques are readily understandable; by following a ranking [45] and including the statistically relevant layers, the relevant relationships are guaranteed to be captured. The aim of a complex knowledge exploration method in the LODC that takes into account the known hierarchies of the data (e.g., ontologies and taxonomies) as well as their interconnections is thereby achievable. Ultimately, the knowledge extraction performed in this way is a multicriteria, multi-objective ranking system, in contrast to single-aspect rankings and rankings based only on an analysis of the structure.
To test and demonstrate the applicability of our methodology, we use the Microsoft Academic Knowledge Graph (MAKG) [46] to investigate the scientific realms of climate change and sustainability. The discovery process also includes a ranking of authors and institutes. The multi-aspect ranking also includes the layer similarities, determining the similarities among research fields and their combinations, which act as the dimensions of the network. The MAKG describes research fields hierarchically. The specialisation of a layer can be determined by incrementing the number of elements in the itemset, interconnecting more disciplines or stepping downwards in the hierarchy tree. The incrementation of the specification yields a lower entity count and increased density and modularity. We inspect both layers and both types of community similarity to reveal and explore overlaps and gain insight into the specifics of climate change and sustainability.
According to the main contributions, the paper is organised as follows.
• The RDF databases are represented as multidimensional networks in Section 2.
• We propose a frequent itemset mining-based method to extract information from the multidimensional network in Section 3.
• The resultant frequent itemsets of multidimensional networks can be represented as a multi-layer network that can be analysed by metrics presented in Section 4.
• We present the methodology through an example in which we uncover the scientific realms of climate change and sustainability, including an alternative co-author, co-organisational network ranking used to measure the impact of authors in multiple disciplines in Section 5.

Multidimensional Network-Based Representation of RDF Databases
Linked Data can be seen as multiple interconnected datasets in RDF format. The atomic form of an LD dataset or an RDF dataset is the RDF triplet: "an object o has a relationship p with subject s". A triplet can be seen as a single edge in a network that connects two entities, the nodes s and o, with a labelled attribute p. For example, Isaac Asimov (s) wrote (p) The Foundation (o).
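The mapping from triplets to labelled edges can be sketched in a few lines. This is only an illustration under simple assumptions: a plain Python tuple stands in for an RDF triplet, and the names `Edge` and `triples_to_edges` are hypothetical helpers, not part of any RDF library.

```python
from collections import namedtuple

# Hypothetical container for one labelled edge: source node, target node, label.
Edge = namedtuple("Edge", ["u", "v", "label"])


def triples_to_edges(triples):
    """Map (subject, predicate, object) triplets to labelled edges."""
    return [Edge(u=s, v=o, label=p) for (s, p, o) in triples]


# Toy data following the example in the text.
triples = [("Isaac Asimov", "wrote", "The Foundation")]
edges = triples_to_edges(triples)

# The node set is recovered from the end points of the edges.
nodes = {e.u for e in edges} | {e.v for e in edges}
```

The subject and object become the end points of the edge, and the predicate becomes its label, so a set of triplets directly yields an edge-labelled multi-graph.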
The classic multi-dimensional networks are edge-labelled multi-graphs, which are described as $G = (V, E, D)$, where $V$ represents the set of nodes, $E$ the set of edges and $D$ the set of dimensions. The set of edges can be described as connections between nodes ($u$ and $v$) along a dimension ($d$). The set can be written as

$$E = \{(u, v, d)\ ;\ u, v \in V,\ d \in D\}.$$

The nodes in LD are often enriched by properties and descriptions. In the example, Isaac Asimov is both a person and a writer, and "The Foundation" is a fiction novel. These properties, such as "The Foundation" is "fiction", are also described by triplets. These triplets can be merged into a simple node or skipped if they contain irrelevant pieces of information. These ontological properties often act as dimensions. Therefore, to simplify the ideas, the notation and ultimately the analysis, we extend the description of a dimension with two sets, the dimensions of the nodes ($D_V$) and the dimensions of the edges ($D_E$). The union of the sets results in the dimension set ($D_V \cup D_E = D$). Then, the notation of the edges is described as

$$E = \{(u, v, d_u, d_v, d_e)\ ;\ u, v \in V,\ d_u, d_v \in D_V,\ d_e \in D_E\},$$

where $u$ and $v$ are the nodes as before and $d_u$ and $d_v$ are their dimensions, respectively; $d_e$ represents the dimension of the edge.
A multidimensional edge is represented as

$$(u, v, D_\alpha),\qquad u, v \in V,\ D_\alpha \subseteq D,$$

where $D_\alpha$ refers to the simultaneously matching node and edge dimensions, $D_\alpha = D_{\alpha,V} \cup D_{\alpha,E}$. This selection is a direct reference to a layer in a multiplex network,

$$\mathcal{G} = \{G_\alpha\ ;\ \alpha \in \{1, \dots, M\}\},\qquad G_\alpha = (V_\alpha, E_\alpha).$$

The network $G_\alpha$ is a network with dimension selection $\alpha$, where the nodes $V_\alpha$ and edges $E_\alpha$ are the nodes and edges that carry the selected dimensions $D_\alpha$. $M$ corresponds to the number of created layers.
A multiplex network is a particular multidimensional network in which every layer contains every node and the cross-layer edges are identifier edges, which refer to the same cross-layer node. We use this notation, with the addition that not every layer will include every node; therefore, an activity check in a multiplex network (checking whether a node is connected or disconnected in a layer) will effectively be an existence check. Extending the multiplex notation with simultaneous dimension selection, we build the edges as

$$E_\alpha = \{(u, v, D_{\alpha,u}, D_{\alpha,v}, D_{\alpha,e})\ ;\ u, v \in V_\alpha\},$$

where $D_{\alpha,u}$, $D_{\alpha,v}$ and $D_{\alpha,e}$ refer to the simultaneously matching node and edge dimensions, respectively. Returning to the example of Asimov, dimension selection would work for the network of books with the simultaneously matching attributes "fiction" and "robots". The expected result would be a network of books containing every book from the Elijah series, such as "The Caves of Steel" and "Robots and Empire", together with the important layer-constructing and non-layer-constructing properties, such as the author "Isaac Asimov" and the main protagonist "R. Daneel Olivaw". This means that the created network is explainable by the writer or the protagonist and, of course, by the layer-constructing properties "fiction" and "robots".
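A minimal sketch of simultaneous dimension selection, assuming that nodes carry a set of dimensions and that each edge carries a single dimension. The function name `select_layer`, the "same author" edge label and the toy data are illustrative assumptions; they are not taken from the case study.

```python
def select_layer(edges, node_dims, d_nodes, d_edges):
    """Return (V_alpha, E_alpha) for one simultaneous dimension selection.

    edges     : iterable of (u, v, edge_dimension) tuples
    node_dims : dict mapping each node to its set of node dimensions
    d_nodes   : node dimensions that a node must carry to exist in the layer
    d_edges   : admissible edge dimensions
    """
    # A node exists in the layer only if it carries every selected dimension.
    v_alpha = {n for n, dims in node_dims.items() if d_nodes <= dims}
    # An edge is kept if its dimension is selected and both ends exist.
    e_alpha = [(u, v, d) for (u, v, d) in edges
               if d in d_edges and u in v_alpha and v in v_alpha]
    return v_alpha, e_alpha


# Toy data loosely following the Asimov example (edge label is hypothetical).
node_dims = {
    "The Caves of Steel": {"fiction", "robots"},
    "Robots and Empire": {"fiction", "robots"},
    "The Foundation": {"fiction"},
}
edges = [
    ("The Caves of Steel", "Robots and Empire", "same author"),
    ("The Caves of Steel", "The Foundation", "same author"),
]
v_alpha, e_alpha = select_layer(edges, node_dims,
                                {"fiction", "robots"}, {"same author"})
```

Because a node that lacks a selected dimension is simply absent from the layer, checking whether a node is active in a layer reduces to a membership test on `v_alpha`.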
The number of layers of a fully defined multilayer network is large if we consider $n_e = |D_E|$ as the number of all edge dimensions and $n_v = |D_V|$ as the number of node dimensions. Then, the number of possible dimension selections, and hence of possible layers, is $\sum_{k=1}^{n_e} \binom{n_e}{k} \sum_{j=1}^{n_v} \binom{n_v}{j}$. However, the selection of significant layers is the key to reducing the space of the analysis. The previously introduced variable $M$ as the number of layers refers to the number of selected layers.
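To get a feeling for how quickly this count grows, the short sketch below evaluates it, reading the expression as the product of two binomial sums over the edge and node dimensions. By the binomial theorem, each sum equals $2^n - 1$, so the product has the closed form $(2^{n_e} - 1)(2^{n_v} - 1)$. The function name `layer_count` is a hypothetical helper.

```python
from math import comb


def layer_count(n_e, n_v):
    """Number of non-empty edge-dimension subsets times node-dimension subsets."""
    edge_choices = sum(comb(n_e, k) for k in range(1, n_e + 1))
    node_choices = sum(comb(n_v, j) for j in range(1, n_v + 1))
    return edge_choices * node_choices


# The explicit sums agree with the closed form (2^n_e - 1)(2^n_v - 1).
small = layer_count(3, 4)        # 7 * 15 = 105
larger = layer_count(10, 20)
closed_form = (2 ** 10 - 1) * (2 ** 20 - 1)
```

Even a few dozen labels therefore yield far more candidate layers than can be inspected manually, which motivates restricting the analysis to the significant ones.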
The transactions can be extended with the environment and the reachable labels and properties in the RDF. The right side of Figure 2, $G^{(2)}_\alpha$, takes the reachable tags into account. The reachability states also enumerate all node attributes, and they take the neighbour attributes into account. Beyond the simple mapping of the dataset, they can also be used to extract information. In the example below, this means that Stefano inherits all attributes of the paper and his institute.
To examine the significant layer selection, we have to understand the reachability concept in the dataset.
To visualise reachability, Figure 2 represents the core idea. In Figure 2, $G_\alpha$ represents an RDF dataset. It can be translated as stating that "The structure and dynamics of multilayer networks" ($v_1$), which is a review article ($d_{v1}$) in network science ($d_{v2}$), is written by ($d_{e1}$) Stefano Boccaletti ($v_2$), who is a physicist ($d_{v3}$). Stefano is affiliated with ($d_{e2}$) the Institute for Complex Systems in Florence ($v_3$), which is a research institution ($d_{v5}$).
The presented procedure creates multi-links as well as networks for a given set of attributes. The procedure is similar to multidimensional network-based analysis methods, where RDF databases were analysed in thematic dimensions [47]. The bipartite network-based analysis of RDF datasets has also been proven useful [7]. Bipartite networks are excellent to study the connections of two sets of objects. However, for multi-objective analysis, a more complex model representation is needed, which motivates the development of our method that forms sets of layers of networks where the layers represent significant subsets of the dimensions of the RDF model. According to this, the next step of the proposed method is selecting these significant sets of dimensions, which will be presented in the following section.
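The reachability idea can be sketched as follows, assuming that a node inherits the labels of its direct neighbours and of its incident edges (one hop only; deeper traversal would be a design choice not fixed by the text). The function name `reachable_transaction` and the shortened node names are hypothetical.

```python
def reachable_transaction(node, node_dims, edges):
    """Collect a node's own labels plus those of incident edges and neighbours."""
    items = set(node_dims.get(node, ()))
    for u, v, d in edges:
        if node in (u, v):
            neighbour = v if u == node else u
            items.add(d)                              # label of the incident edge
            items |= node_dims.get(neighbour, set())  # labels of the neighbour
    return items


# Toy version of the example described for Figure 2.
node_dims = {
    "paper": {"review article", "network science"},
    "Stefano Boccaletti": {"physicist"},
    "Institute for Complex Systems": {"research institution"},
}
edges = [
    ("paper", "Stefano Boccaletti", "written by"),
    ("Stefano Boccaletti", "Institute for Complex Systems", "affiliated with"),
]
transaction = reachable_transaction("Stefano Boccaletti", node_dims, edges)
```

The resulting set of labels is what later serves as one transaction for frequent itemset mining: the author carries his own label together with the labels reachable through the paper and the institute.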

Frequent Itemset Mining in Multidimensional Networks
Frequent itemset mining (FIM) is a mining technique used to uncover frequent correlations in transactional datasets [48]. In classic FIM, the itemset $I = \{I_1, \dots, I_{n_e+n_v}\}$ represents the products; in LD, the itemset represents the dimensions $D = \{d_1, d_2, \dots, d_{n_e+n_v}\}$, or more specifically, the labels of the RDF. A transaction is defined as $\tau = (tid, X)$, where $tid$ is the transaction identifier and $X$ is a set of items over $I$ ($X \subseteq I$).
The database is the set of all transactions $P = \{\tau_1, \tau_2, \dots, \tau_n\}$. The support of an itemset is equal to the count of the constellation of dimensions.
As stated before, we are interested in significant layers. We measure significance with the support of the itemsets. The set $F = [\alpha, \dots, M]$ holds the frequent itemsets. The support of an itemset $D_\alpha$ is the number of transactions that contain it,

$$\mathrm{supp}(D_\alpha) = |\{\tau = (tid, X) \in P\ ;\ D_\alpha \subseteq X\}|.$$

A frequent itemset is called a closed or frequent closed itemset if no proper superset of it has the same support, that is, if it cannot be extended by any dimension without losing support. Table 1 shows the technique and its multidimensional counterpart.

Table 1. Summary of the frequent itemset mining (FIM) technique notation and its multidimensional counterpart.
Frequent Itemset Mining | Multi-Dimensional Network
Items | The labels of the RDF

An effective representation of the layer selection in a multidimensional network is the multi-link. The multi-link $\vec{m}$ is an enumeration of the selected layers:

$$\vec{m} = (m_1, m_2, \dots, m_M),\qquad m_\alpha \in \{0, 1\}.$$

We can now introduce the multi-adjacency matrix ($A^{\vec{m}}$), with elements $a^{\vec{m}}_{ij}$ that are equal to 1 if there is a multi-link $\vec{m}$ between node $i$ and node $j$, and zero otherwise [31].
Thus, multi-adjacency matrices satisfy the condition $\sum_{\vec{m}} a^{\vec{m}}_{ij} = 1$. The enumeration of the layers where nodes are active can serve as the input to most of the frequent itemset algorithms, as they effectively represent the itemsets. The methodology works with any FIM algorithm, including CHARM [49], FPclose [50] and FP-Growth [51]; for an exhaustive list, see the work of Chee, which also studies the scalability of FIM algorithms [52].
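As an illustration of how supports and frequent itemsets can be obtained from such transactions, the following is a compact, unoptimised Apriori-style sketch in plain Python. It is not the implementation used in the case study; the function name `apriori`, the toy transactions and the threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset reaching min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Support = number of transactions that contain the itemset.
        return sum(1 for t in transactions if itemset <= t)

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counted = {c: support(c) for c in level}
        survivors = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(survivors)
        k += 1
        # Join step: merge surviving itemsets into candidates of size k.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))}
    return frequent


# Toy transactions: the fields of study attached to four hypothetical papers.
papers = [
    {"climatology", "meteorology"},
    {"climatology", "meteorology", "hydrology"},
    {"climatology", "hydrology"},
    {"meteorology"},
]
frequent = apriori(papers, min_support=2)
```

With a minimum support of two, the pair of climatology and meteorology is frequent, whereas meteorology and hydrology co-occur only once and are discarded together with all of their supersets. Each surviving itemset is a candidate layer of the multilayer network.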

Analysis of the Resulting Multilayer Network
The union (the logical aggregation of layers) can be best expressed by the overlapping edges [31] ($O_{\alpha,\beta}$),

$$O_{\alpha,\beta} = \sum_{i<j} a^{\alpha}_{ij}\, a^{\beta}_{ij},$$

where $a^{\alpha}_{ij}$ expresses a simple edge in layer $\alpha$ connecting the nodes $i$ and $j$, and the edges counted in this way form $G_\gamma$, the layer obtained by combining $G_\alpha$ and $G_\beta$. The count of the overlapping edges corresponds to the support of the combined layer. The aggregated layers are also frequent, as every subset of a frequent itemset is frequent [48]. Logically aggregating the layers is also an efficient technique in data discovery.
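The edge overlap can be computed directly from two edge lists. The sketch below assumes undirected, unweighted layers and counts the node pairs connected in both, which corresponds to the common definition $O_{\alpha,\beta} = \sum_{i<j} a^{\alpha}_{ij} a^{\beta}_{ij}$; the function name `edge_overlap` is a hypothetical helper.

```python
def edge_overlap(layer_a, layer_b):
    """Count the undirected edges that are present in both layers."""
    def as_pairs(edges):
        # frozenset ignores the direction in which an edge was written down.
        return {frozenset(edge) for edge in edges}

    return len(as_pairs(layer_a) & as_pairs(layer_b))


# Toy layers sharing two edges: a-b (written in both directions) and c-d.
layer_a = [("a", "b"), ("b", "c"), ("c", "d")]
layer_b = [("b", "a"), ("c", "d"), ("d", "e")]
overlap = edge_overlap(layer_a, layer_b)
```

Dividing the overlap by the size of the smaller or the larger layer gives a normalised similarity, which is one possible way to compare layers of very different sizes.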
In the upcoming example of author networks, authors interact on the layer of climatology, on the layers of climatology and meteorology, and in every other layer. In this case, it is difficult to keep track of all the different types of multi-links. Therefore, we can calculate the multiplicity of the overlap $\nu_{ij}$ between nodes $i$ and $j$, which indicates the total number of layers in which the two nodes are connected,

$$\nu_{ij} = \sum_{\alpha=1}^{M} m^{ij}_{\alpha},$$

where the nodes $i$ and $j$ are linked by the multi-link $\vec{m} = \vec{m}^{ij}$. In weighted multidimensional networks, the weights might be correlated with the structure in a nontrivial way. To study the weights, there are two new measures: the multi-strength ($s^{\vec{m}}_{i,\alpha}$) and the inverse multi-participation ratio ($Y^{\vec{m}}_{i,\alpha}$). The multi-strength measures the total weight of the links incident to node $i$ in layer $\alpha$ that form a multi-link $\vec{m}$. The inverse multi-participation ratio is a measure of the inhomogeneity of the weights of the links that are incident to node $i$ in layer $\alpha$ and are also part of the corresponding multi-link. Thus far, we have covered some indicators for multidimensional activities, which are very useful for dealing with many layers. The final step of knowledge extraction is ranking. Before turning to the ranking, we recall that the density, modularity and other structural measures are very different from layer to layer.
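The multi-link of a node pair and the multiplicity of its overlap can be sketched as follows, assuming undirected layers stored as edge lists in a fixed order. The names `multilink` and `overlap_multiplicity` are hypothetical helpers.

```python
def multilink(i, j, layers):
    """Return the 0/1 vector marking the layers in which i and j are connected."""
    pair = frozenset((i, j))
    return tuple(int(any(frozenset(edge) == pair for edge in edges))
                 for edges in layers)


def overlap_multiplicity(i, j, layers):
    """Number of layers in which i and j are connected (sum of the multi-link)."""
    return sum(multilink(i, j, layers))


# Three toy layers; a and b are connected in the first two.
layers = [
    [("a", "b")],
    [("b", "a"), ("b", "c")],
    [("c", "d")],
]
m_ab = multilink("a", "b", layers)
nu_ab = overlap_multiplicity("a", "b", layers)
```

The multi-link vector is exactly the kind of enumeration of active layers that can be passed to a frequent itemset miner as one transaction.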
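Therefore, we can write an $N \times M$ activity matrix ($B$) of elements $b_{i,\alpha}$, indicating whether node $n_i$ is present in layer $\alpha$:

$$b_{i,\alpha} = \begin{cases} 1 & \text{if node } n_i \text{ is present in layer } \alpha, \\ 0 & \text{otherwise.} \end{cases}$$

In this way, we can measure the number of layers where $i$ is present and active [30] as

$$B_i = \sum_{\alpha=1}^{M} b_{i,\alpha}.$$

Additionally, the number of nodes present and active in a layer can be given by $N_\alpha$:

$$N_\alpha = \sum_{i=1}^{N} b_{i,\alpha}.$$

The correlation between layers can be given by $Q_{\alpha,\beta}$, quantifying the fraction of nodes that are present in layer $\alpha$ as well as in layer $\beta$:

$$Q_{\alpha,\beta} = \frac{1}{N} \sum_{i=1}^{N} b_{i,\alpha}\, b_{i,\beta}.$$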
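These activity measures are simple to compute once the layers are available as edge lists. The sketch below assumes that a node counts as present in a layer when it has at least one edge there, and that the layer correlation is normalised by the total number of nodes; all function names are hypothetical.

```python
def activity_matrix(nodes, layers):
    """b[i][alpha] = 1 if node i has at least one edge in layer alpha."""
    active = [{n for edge in edges for n in edge} for edges in layers]
    return [[int(n in act) for act in active] for n in nodes]


def node_activity(b):
    """Number of layers in which each node is present (row sums)."""
    return [sum(row) for row in b]


def layer_activity(b):
    """Number of nodes present in each layer (column sums)."""
    return [sum(col) for col in zip(*b)]


def layer_correlation(b, alpha, beta):
    """Fraction of all nodes that are present in both layer alpha and beta."""
    return sum(row[alpha] * row[beta] for row in b) / len(b)


nodes = ["a", "b", "c", "d"]
layers = [[("a", "b")], [("a", "b"), ("b", "c")]]
b = activity_matrix(nodes, layers)
```

For this toy input, nodes a and b are present in both layers, c only in the second and d in none, so half of the nodes are shared by the two layers.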
A straightforward ranking in a network is obtained by calculating the centralities of the nodes, reflecting their importance from different viewpoints. In multidimensional networks, the most common approach is to calculate the centralities in each layer and then aggregate them according to certain weights [31]. Both the aggregation (maximum selection, minimum selection, summation, etc.) and the centrality measure used depend on the interpretation. With summation as the aggregation, the centrality of node $i$ reads

$$\theta_i = \sum_{\alpha=1}^{M} w_\alpha\, \theta^{\alpha}_i,$$

where $\theta^{\alpha}_i$ is the calculated centrality measure of node $i$ in layer $\alpha$ and $w_\alpha$ indicates the importance of layer $\alpha$.
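One possible instance of such a ranking is sketched below. It assumes eigenvector centrality as the per-layer measure, computed by power iteration on the adjacency matrix shifted by the identity (a common trick to avoid oscillation on bipartite layers), and a weighted sum as the aggregation. The function names, the iteration count and the layer weights are illustrative assumptions.

```python
def eigenvector_centrality(nodes, edges, iterations=200):
    """Power iteration on (A + I) for an undirected layer, scaled to max = 1."""
    if not nodes:
        return {}
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds the scores of its neighbours.
        new = {n: score[n] + sum(score[m] for m in neighbours[n]) for n in nodes}
        norm = max(new.values())
        score = {n: value / norm for n, value in new.items()}
    return score


def aggregate_centrality(layer_scores, weights):
    """Weighted sum of per-layer centralities; absent nodes contribute zero."""
    nodes = set().union(*layer_scores)
    return {n: sum(w * s.get(n, 0.0) for s, w in zip(layer_scores, weights))
            for n in nodes}


# Layer 1 is a star centred on A; layer 2 contains only the edge B-C.
theta_1 = eigenvector_centrality(["A", "B", "C"], [("A", "B"), ("A", "C")])
theta_2 = eigenvector_centrality(["B", "C"], [("B", "C")])
theta = aggregate_centrality([theta_1, theta_2], weights=[1.0, 0.5])
```

In this toy case, A leads the first layer, yet B and C overtake it in the aggregate because they are also active in the second layer. This illustrates why the choice of layer weights matters for the final ranking.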
Now that the methodology has been described, the next section demonstrates the applicability of the methodology.

Results
The programs of the following case study are available on GitHub (https://github.com/abonyilab/aprioriSPARQL (accessed on 15 January 2021)), and the raw dataset is available on the Microsoft Academic Knowledge Graph homepage (http://ma-graph.org/rdf-dumps/ (accessed on 15 January 2021)) as well as through the SPARQL endpoint (http://ma-graph.org/sparql (accessed on 15 January 2021)). The goal of this demonstration is to showcase knowledge extraction from vast linked data. Therefore, we selected the LOD catalogue for scientific publications from Microsoft, the MAKG [46]. The MAKG itself contains definitions for 209,792,741 papers and 253,641,783 authors; in RDF terms, more than eight billion triplets. The papers are categorised into 229,716 fields of study. For the relevant results, we selected the date range 2010 to 2017; the catalogue was last updated in late 2018 [46]. Our aim was twofold: to study the realms of sustainability and climate change based on the MAKG dataset, and to showcase the importance of a proper focus in order not to get lost at scale. The applied frequent itemset mining pinpoints the important areas of the data and keeps them understandable.
The first test on the dataset is reachability, which determines how to treat the dataset; in other words, what are the dimensions that we can analyse? For example, in the catalogue, the authors can be connected to universities, research organisations and industrial laboratories. The dataset describes the connections from rdf:type Article to rdf:type FieldOfStudy through the connection of fabio:hasDiscipline. The article connects to an rdf:type Author through the connection of dcterms:creator. To reach the rdf:type Affiliation, an org:memberOf connection is needed.
An rdf:type Affiliation can be connected to an external data source, the Global Research Identifier Database (GRID), to extend the affiliations with the geo-coordinates, regions, establishment dates, etc. Therefore, regional and institutional categorisation could be one aspect of the data. Another, more straightforward analysis is the analysis of the author network. The articles are sorted into multiple categories (rdf:type FieldOfStudy) according to the hierarchical ontology created by the MAKG. The ontology contains five levels of depth: the top level (level zero) is the major category (e.g., mathematics, medicine, engineering, chemistry, etc.), and the next levels are their descendants, the more specialised categories (e.g., nuclear medicine, applied mathematics, etc.). We take the ontology elements and the constellations of the ontology elements as the layers or dimensions of the network. Going downwards in the ontology, increasing the specification of a layer also increases the density of the layer. Not every paper is categorised into as many matching ontology elements as possible. Therefore, the lower levels (three, four and five) are ignored, and the density does not increase. However, it is true that the more specialised a layer is, the denser it becomes, even for horizontal extensions of layers, meaning the extension of an element with another element that is on the same ontological level.
In this paper, we focus on sustainability science and climate change. Therefore, we choose the ontological element "Climatology" as the starting point for our analysis to observe the advancements and analyse the social background of humanity's major problem, climate change. We also restrict ourselves to analysing only the author and organisation networks within the second ontological level, which is easy to understand and sophisticated enough to investigate. Now that we have a rough idea about what we want to do, we execute FIM on the dataset to sample it from multiple angles. With this technique, we want to uncover significant dimensions for the analysis and common constellations of disciplines that go hand-in-hand with the previously selected ontology element, "Climatology". For discovery, we propose to load and execute FIM on the whole data space, as the linked data are large on average.
In this study, the Apriori FIM algorithm was used on the offline dump of the RDF database, and SPARQL-based queries were utilised for the validation of the results.
FIM was first executed on a low setting, with a minimum support of five, to probe the dataset; over the selected timescale (2010-2018), this corresponds to slightly less than one article per year in the given frequent constellation of fields of study. Figure 3 shows the FIM results. The results also show the optimal minimum support of 10, which is selected for the next steps; at this value, there is a significant drop in small itemsets, but not as significant as in the longer itemsets. The lack of longer itemsets is due to the categorisation of the fields of study in the dataset, as an article is categorised into 1.52 fields of study, on average, in the second ontological level. The other ontological levels have much the same statistics: 1.01 on the first (top) level, 3.18 on the third, 1.75 on the fourth, 1.32 on the fifth and 1.31 on the sixth. A length of one for the itemset indicates that climatology can be connected with another ontology element; two indicates that it can be connected with two other elements, and three with three, while still reaching the minimum support. No more extensions than three reach the minimum support. The choice could be made here to lower the minimum support below five, or we could be satisfied with the choice and the count of networks, in this case, 665. This is quite a manageable number of networks, and it is also worth mentioning that the edge count of the networks is approximately 5000.
The next step is the network creation of authors and organisations on the layers. If the layers do not have enough nodes, then they significantly influence the ranking; therefore, our selection requirement for a layer is at least 40 different contributors, and that for edge formation is at least two contributions between organisations and one for authors where prescribed, based on the correlation measures between the layers (Equation (9)). Figure 4 shows the similarities between the layers using the edge overlap metric.
In Figure 4, the darker a region is, the more similar the layers are. The dendrograms on the edge of the figure show the distances between the networks. We see that the extensions of the layers are clustered together, as are similar studies. The top left segment of the figure, including meteorology, atmospheric sciences, hydrology, oceanography, ecology and geomorphology, shows the starting points for the extensions. Those are the closest fields to climatology, which also have the most substantial support from FIM. We can also observe different views; for example, the cluster in the middle, containing agroforestry, environmental planning, economics, soil science and environmental engineering, is formed around the economic side of climatology. The cluster in the bottom right, including remote sensing combined with meteorology, artificial intelligence and mathematical optimisation, is formed around computer-based observations and modelling. A natural question about these clusters is: why are there not more extensions? Remote sensing and artificial intelligence would be a perfect match. There are such extensions, but their support is below the minimum support. The support of artificial intelligence itself is small, 482. Artificial intelligence and remote sensing together have the support of 44 papers from 2010 to 2018; however, their node count is below the selected minimum node count (10 for country networks; 40 for organisation and author networks). The other combinations show the same phenomenon.
Table 2 shows the important metrics of the significant institutional layers: the number of nodes, number of edges, density, modularity and average clustering coefficient. The average clustering coefficient represents the likelihood that two neighbours of a node are connected, while modularity informs us about the community structure of the network. The higher the modularity is, the more community-centred the graph.
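Two of these layer metrics, the density and the average clustering coefficient, can be sketched in a few lines for an undirected, unweighted layer. The definitions below are the standard ones (nodes with fewer than two neighbours contribute zero to the average), and the function names and the toy layer are assumptions for the example; modularity is omitted because it requires a community detection step.

```python
from itertools import combinations


def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))


def average_clustering(nodes, edges):
    """Mean over nodes of the share of neighbour pairs that are connected."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        if u != v:  # self-loops do not take part in triangles
            neighbours[u].add(v)
            neighbours[v].add(u)
    total = 0.0
    for n in nodes:
        k = len(neighbours[n])
        if k < 2:
            continue  # fewer than two neighbours: coefficient is zero
        links = sum(1 for a, b in combinations(neighbours[n], 2)
                    if b in neighbours[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


# Toy layer: a triangle A-B-C with a pendant node D attached to A.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
layer_density = density(len(nodes), len(edges))
layer_clustering = average_clustering(nodes, edges)
```

Here the density is 2/3 and the average clustering coefficient is 7/12: B and C sit in a closed triangle, A has one connected neighbour pair out of three, and D has a single neighbour.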
We see here that the more specific a layer is, the more community-centred it is and the higher its modularity. This can be seen in atmospheric sciences by extending it with geophysics. The modularity of atmospheric sciences is 0.2989, while the modularity of the extended layer is 0.9436. The same phenomenon can be found in all other layers and their extensions, which means that more specific layers and disciplines are owned by more interconnected communities.
Figure 5 shows the multilayer visualisation of (1) atmospheric sciences, (2) meteorology and (3) their interconnection, giving insight into the data. For this visualisation, enrichment of the data was needed to locate the research institutions on the map. The enrichment was performed with the GRID. Not every research institution could be mapped into the GRID, and therefore the unmapped research institutions are not counted in the country-level aggregation; however, this does not influence the overall ranking. With enrichment, we can easily observe the clusters both inside a country and across countries. The following rankings are based on these insights into the data. The multilayer representation clearly shows that the more specialised a topic is, the fewer contributors there are, but the more connected they are. The aggregated networks are denser with a higher modularity, as observed previously. The next artefact of knowledge extraction is the ranking. For the ranking, we calculate the importance of a country, institution and author with the multilayer eigenvector centrality (Equation (10)).
The ranking mostly depends on the layers in which an entity (country, organisation or author) takes part. This is why a strong minimum support and minimum node count are needed for the analysis; otherwise, the very specialised layers, which have very few nodes with very high ranks, will dominate. Therefore, we can use weightings according to the correlations of the layers and the nodes, as described in the methodology section, or other subjective criteria to balance the sparseness of the very specific layers. The top list is presented in Table 3. Next to the ordering in the table, the most important layer column shows where the organisation or author obtained the highest rank, and the "Agg. eigen. centr." column shows the aggregated eigenvector centrality of the entity. With the aid of this toolset, we can observe specific connections between research areas and pinpoint research constellations describing sectors. The multi-aspect ranking provides the flexibility to take significant topics into account and refine the ranking. The different metrics are the searchlights of importance and focus.
Table 4 compares the publication count-based ranks in sustainability science and climate change with the multidimensional network-based ranking. The selected topics are a subset of the FIM-selected topics presented in Table 2. The Chinese Academy of Sciences has the most publications in sustainability science and climate change, has the most publications in most of the layers and is also highly cooperative; therefore, the Academy occupies first place. In the comparison, we see that, interestingly, the National Oceanic and Atmospheric Administration is very highly ranked, although its publication count in the shown layers would predict otherwise. The Administration is highly embedded in the oceanic (e.g., fishery) and atmospheric sciences (e.g., remote sensing and geophysics), as its name would suggest.
Thanks to the substantial cooperation of the institute, this organisation plays a central role in sustainability science and climate change, which would not be highlighted in classical analysis techniques.
Table 4. Comparison between the ranks based on the publication count in sustainability science and climate change and the multi-objective rank created by the multidimensional network.

Discussion and Conclusions
Our work contributes to the knowledge extraction of linked data. It also contributes to the notation of multidimensional networks by extending the nodes with dimensions, in contrast to the formal labelled network notation. This extension is useful in high-dimensional data analysis, such as for linked open data, as the nodes are often extended with hierarchical properties and ontologies. The extraction of useful data is validated with on-demand, online, iterative SPARQL-based sampling of the dataset with frequent itemset mining.
We demonstrated the applicability of the methodology through an interesting scientometric example, co-author and co-organisation rankings in sustainability and climate change. The source of the analysis was the linked open database of the Microsoft Academic Knowledge Graph. We discovered multidisciplinary science boards using the proposed multidimensional network-based approach. We showed similarities between disciplines and the layers of the network. We also discovered that the aggregation of the layers in a multidimensional network does not always result in the loss of information, and in contrast, the aggregation of the layers results in denser, more modular information. Finally, we ranked authors and organisations with multidimensional centrality rankings and showed where sustainability and climate change are major research topics and who and which organisations are the main contributors.
The proposed methodology generates a compact and interpretable multilayered network from a linked dataset or another multidimensional network. The methodology is applicable when there are a large number of edge and node labels; the current reference is the dataset of the Microsoft Academic Knowledge Graph, with eight billion triplets. The scalability of the methodology is not limited in principle; however, it is more an engineering challenge than a research objective. The most time- and memory-consuming operation is the frequent itemset mining, where serious advances have already been made by GPU acceleration [54], Hadoop-based partitioning [55] and Spark-based parallelism [56]. The endpoint capabilities limit the scalability of FIM against the SPARQL endpoint; since the communication takes place through an application programming interface, its parallelism and effective scalability have already been proven by all modern web browsers.
With the aid of the proposed methodology and toolset, we can observe, select and analyse particular connections between entities in linked data, taking ontological dimensions and specific properties into account. The multi-aspect ranking provides the flexibility to refine the ranking, while the other proposed tools act as searchlights of focus to interpret a whole set of linked data, with all its extensions and possible enrichments.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: http://ma-graph.org/rdf-dumps/ (accessed on 15 January 2021).

Conflicts of Interest:
The authors declare no conflict of interest.