A Brief Survey of Methods for Analytics over RDF Knowledge Graphs

Abstract: There are several Knowledge Graphs expressed in RDF (Resource Description Framework) that aggregate/integrate data from various sources for providing unified access services and enabling insightful analytics. We observe this trend in almost every domain of our life. However, the provision of effective, efficient, and user-friendly analytic services and systems is quite challenging. In this paper we survey the approaches, systems and tools that enable the formulation of analytic queries over KGs expressed in RDF. We identify the main challenges, and we distinguish two main categories of analytic queries (domain-specific and quality-related) and five kinds of approaches for analytics over RDF. Then, we briefly describe the works of each category and related aspects, such as efficiency and visualization. We hope this collection will be useful to researchers and engineers for advancing the capabilities and user-friendliness of methods for analytics over knowledge graphs.


Introduction
To leverage large-scale data for gaining new insights, a recent and promising practice in various domains (environment, health, economy, culture and others), adopted by both academia and industry, is to construct a Knowledge Graph (KG) [1] that aggregates and integrates data from several datasets, as illustrated in Figure 1. The value of such KGs is that they provide a unified view of the domain and enable unified browsing, querying, question answering and analytics. Indeed, there are several KGs expressed in the W3C standard RDF (Resource Description Framework), including general-purpose KGs, like DBpedia [2] and Wikidata [3]; domain-specific KGs [4], like Europeana [5] for culture, DrugBank [6] for drugs, GRSF [7] for stocks and fisheries, ORKG [8] and OpenAIRE [9] for scholarly work, and WarSampo [10] and SeaLiT [11] for historical research; more recently, KGs for research related to COVID-19, such as [12], the COVID-19 Open Research Dataset (https://github.com/allenai/cord19, accessed on 1 January 2023) and the CORD-19 Named Entities Knowledge Graph (https://zenodo.org/record/3827449, accessed on 1 January 2023); and finally KGs derived from enterprise relational databases [13]. However, the analysis of big and complex KGs is still challenging, as also stated in [14]. In particular, users have difficulty in analyzing complex KGs, since this requires knowledge of the data terminology (which is broad in the case of KGs that integrate data from several datasets) and of the syntax of the query language. From a system perspective, efficiency is hard to achieve for big KGs, while from an application/domain perspective users usually face completeness and freshness issues [14]. To better understand the situation, in this paper we review the work that has been done in this area, focusing on KGs expressed in RDF.
The rest of this paper is organized as follows: Section 2 provides the required background and refers to past surveys, while Section 3 identifies challenges and provides a categorization of the existing works. Subsequently, Section 4 surveys particular works and systems, whereas Section 5 discusses related aspects, including efficiency and visualization. Finally, Section 6 concludes the paper and identifies directions for further research.

Background and Related Surveys
This section provides a background for RDF (in Section 2.1), for SPARQL (in Section 2.2), for the possible access methods over RDF (in Section 2.3), for OLAP (in Section 2.4), and finally it discusses related surveys (in Section 2.5).

Resource Description Framework (RDF)
The Resource Description Framework (RDF) [15,16] is a graph-based data model proposed for the realization of the Semantic Web vision, and it is a key format of the Linked Data publishing method. It uses triples, i.e., statements of the form subject-predicate-object, where the subject corresponds to an entity (e.g., a product, a company, etc.), the predicate to a characteristic of the entity (e.g., the price of a product, the location of a company) and the object to the value of the predicate for the specific subject (e.g., "300", "US"). Triples relate Uniform Resource Identifiers (URIs) or anonymous resources (blank nodes) with other URIs, blank nodes or constants (literals). Formally, a triple is any element of $T = (U \cup B) \times U \times (U \cup B \cup L)$, where $U$, $B$ and $L$ denote the sets of URIs, blank nodes and literals, respectively. Any finite subset of $T$ constitutes an RDF graph (or RDF dataset).
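The formal definition above can be made concrete with a minimal Python sketch. The class names, the helper `is_valid_triple` and the `http://example.org/` URIs are illustrative assumptions, not part of any RDF library; the sketch only checks whether a candidate statement belongs to $T$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def is_valid_triple(s, p, o) -> bool:
    """Check membership in T = (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (
        isinstance(s, (URI, BlankNode))              # subject: URI or blank node
        and isinstance(p, URI)                       # predicate: URI only
        and isinstance(o, (URI, BlankNode, Literal)) # object: any of the three
    )


# Hypothetical data: a product with a price and an anonymous manufacturer.
product = URI("http://example.org/product1")
price = URI("http://example.org/price")
graph = {
    (product, price, Literal("300")),
    (product, URI("http://example.org/manufacturer"), BlankNode("b0")),
}

# An RDF graph is a finite set of valid triples.
all_valid = all(is_valid_triple(*t) for t in graph)

# A literal is not allowed in the subject position.
literal_subject_ok = is_valid_triple(Literal("300"), price, product)
```

Representing the graph as a Python set mirrors the definition of an RDF graph as a finite subset of $T$: duplicates collapse and the order of triples carries no meaning.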
RDF Schema. RDF Schema (https://en.wikipedia.org/wiki/RDF_Schema, accessed on 1 January 2023) (RDFS) is a special vocabulary that comprises a set of classes with certain properties based on the RDF extensible knowledge representation data model. Its intention is to structure RDF resources, since even though RDF uses URIs to uniquely identify resources, it lacks semantic expressiveness. It uses classes to indicate where a resource belongs, as well as properties to build relationships between entities in a class and to model constraints. For example, a KG with information about products is shown in Figure 2 (for reasons of brevity namespaces are not shown). The upper part illustrates the schema, while the bottom part illustrates the data.

SPARQL
RDF data are mainly queried through structured query languages, chiefly SPARQL (https://www.w3.org/TR/rdf-sparql-query/, accessed on 1 January 2023), which is the standard query language for RDF data. Since version 1.1, SPARQL also supports complex querying using regular path expressions, grouping, aggregation, etc. In particular, as regards analytic queries, SPARQL supports the GROUP BY modifier and various aggregate functions, including COUNT, SUM, AVG, MIN, MAX, and GROUP_CONCAT.
For example, the query "total quantities of products released by company", over the KG of Figure 2, can be expressed in SPARQL as shown in Figure 3. We should note that, apart from SPARQL, there are a few other languages for querying knowledge graphs, in particular property graphs, such as Cypher [17] (a declarative language implemented as part of the Neo4j graph database), Gremlin [18] (the graph traversal language of Apache TinkerPop, which focuses on navigational queries rather than pattern matching), PGQL [19] (an SQL-like pattern-matching query language) and G-CORE [20] (a graph query language that integrates features of the graph query languages Cypher [17] and PGQL [19]).
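To make the grouping and aggregation semantics of such a query tangible, the following Python sketch shows one possible SPARQL 1.1 formulation of "total quantities of products released by company" and an equivalent computation in plain Python. The prefix `ex:` and the property names `ex:manufacturer` and `ex:quantity` are assumptions made for illustration; they are not taken from Figure 2 or Figure 3, and the sample triples are invented.

```python
from collections import defaultdict

# One possible SPARQL 1.1 formulation (property names are hypothetical).
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?company (SUM(?qty) AS ?total)
WHERE {
  ?product ex:manufacturer ?company .
  ?product ex:quantity ?qty .
}
GROUP BY ?company
"""

# Invented sample data, written as (subject, predicate, object) tuples.
triples = [
    ("ex:p1", "ex:manufacturer", "ex:c1"),
    ("ex:p1", "ex:quantity", 100),
    ("ex:p2", "ex:manufacturer", "ex:c1"),
    ("ex:p2", "ex:quantity", 50),
    ("ex:p3", "ex:manufacturer", "ex:c2"),
    ("ex:p3", "ex:quantity", 70),
]


def total_quantity_by_company(triples):
    """Join the two triple patterns on ?product, then SUM per ?company."""
    manufacturers = defaultdict(list)  # product -> companies
    quantities = defaultdict(list)     # product -> quantities
    for s, p, o in triples:
        if p == "ex:manufacturer":
            manufacturers[s].append(o)
        elif p == "ex:quantity":
            quantities[s].append(o)

    totals = defaultdict(int)
    for product, companies in manufacturers.items():
        for company in companies:
            # A product without a quantity yields no solution, as in SPARQL.
            for qty in quantities.get(product, []):
                totals[company] += qty
    return dict(totals)


totals = total_quantity_by_company(triples)
```

Because properties can be multivalued, the join iterates over every (company, quantity) combination of a product, which is also how a SPARQL engine would enumerate the solutions before grouping.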

Access Methods over RDF
Apart from structured query languages (i.e., SPARQL), there are Keyword Search systems over RDF (like [21]) that enable users to search using the familiar method they use for Web searching. We can also identify the category Interactive Information Access, which refers to access methods beyond the simple "query-and-response" interaction, i.e., methods that offer more interaction options to the user. In this category, there are methods for RDF Browsing (plain or similarity-based, like [22]), methods for Faceted Search over RDF [23], as well as methods for Assistive (SPARQL) Query Building (e.g., [24]). Finally, in the category of natural language interfaces, there are methods for question answering, dialogue systems, and conversational interfaces (e.g., see [25] for a survey). Figure 4 illustrates the above methods and the distinctive characteristics of each one.

OLAP (OnLine Analytical Processing)
OLAP is a special case of materialized data integration [26], where the data are described by using a star schema, while "data are organized in cubes (or hypercubes), which are defined over a multidimensional space, consisting of several dimensions" [27]. In the era of big data, data is often produced faster than it can be consolidated and analyzed. The data cube was designed to avoid slow processing times for complex data analysis: it aggregates relevant data in advance, so that few time-consuming calculations are needed when an end-user query is processed. Essentially, a data cube is used to understand and analyze, quickly and easily, large amounts of data that are too complex to be interpreted through a flat table. The preaggregated values within the cells of a cube are called measures, and they are the values of interest. The measures are aggregated according to dimensions, i.e., attributes of the data, and they show the relationship between dimensions. The data in the cube can be viewed from different angles, and a number of OLAP operations produce these different views, allowing interactive querying and exploration of the data at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis. The basic OLAP operations are: roll-up (aggregate data by ascending a concept hierarchy), drill-down (navigate from less detailed data to more detailed data), slice (perform a selection on one dimension of the given cube), dice (define a subcube by performing a selection on two or more dimensions), and pivot (provide an alternative presentation of the data).
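The OLAP operations listed above can be sketched over a tiny in-memory fact table. The following Python example is only an illustration under simplifying assumptions: the facts, the dimension names and the city-to-country hierarchy are invented, SUM is the only aggregate, and pivot is omitted since it only changes the presentation.

```python
from collections import defaultdict

# Invented fact table: three dimensions (city, year, product), one measure.
facts = [
    {"city": "Athens", "year": 2021, "product": "laptop", "quantity": 10},
    {"city": "Heraklion", "year": 2021, "product": "laptop", "quantity": 5},
    {"city": "Paris", "year": 2021, "product": "phone", "quantity": 7},
    {"city": "Paris", "year": 2022, "product": "laptop", "quantity": 3},
]

# Hypothetical concept hierarchy for the location dimension.
CITY_TO_COUNTRY = {"Athens": "Greece", "Heraklion": "Greece", "Paris": "France"}


def aggregate(facts, dims, measure="quantity"):
    """Build cube cells: SUM of the measure for each combination of dims."""
    cube = defaultdict(int)
    for fact in facts:
        cube[tuple(fact[d] for d in dims)] += fact[measure]
    return dict(cube)


def roll_up(facts, dim, hierarchy, new_dim):
    """Replace a dimension by its parent level in the concept hierarchy."""
    return [
        {**{k: v for k, v in f.items() if k != dim}, new_dim: hierarchy[f[dim]]}
        for f in facts
    ]


def slice_cube(facts, dim, value):
    """Selection on a single dimension."""
    return [f for f in facts if f[dim] == value]


def dice(facts, allowed):
    """Selection on two or more dimensions (dim -> set of allowed values)."""
    return [f for f in facts if all(f[d] in vals for d, vals in allowed.items())]


by_city = aggregate(facts, ["city", "year"])
by_country = aggregate(
    roll_up(facts, "city", CITY_TO_COUNTRY, "country"), ["country", "year"]
)
laptops = aggregate(slice_cube(facts, "product", "laptop"), ["city"])
diced = aggregate(
    dice(facts, {"year": {2021}, "city": {"Athens", "Paris"}}), ["city", "product"]
)
```

Drill-down is the inverse of the roll-up shown here: moving from `by_country` back to the finer-grained `by_city` view.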

Related Work: Past Surveys
There are several surveys available for RDF KGs. In particular, ref. [28] surveys approaches for large scale semantic integration of linked data, by giving emphasis on how to integrate multiple RDF datasets. Moreover, ref. [29] offers a survey of the RDF dataset profile features and methods, by also mentioning vocabularies for publishing RDF statistical data (which are also described later in this survey). Furthermore, ref. [30] surveys techniques and systems for querying RDF datasets, by mainly focusing on storage, indexing and query processing techniques for evaluating SPARQL queries, while [31] surveys RDF graph generation approaches from heterogeneous data, by focusing on existing mapping languages for schema and data transformations. Moreover, ref. [26] surveys and categorizes OLAP approaches that leverage semantic web technologies according to several criteria, including materialization, transformations and extensibility. Finally, there are also available surveys [32,33] that describe visualization approaches for RDF KGs and surveys for summarization for semantic RDF graphs, e.g., see [34].
All the mentioned surveys can be of primary importance for generating, integrating, querying and visualizing RDF KGs, which are usually prerequisite steps for producing analytics over RDF KGs. In contrast, to the best of our knowledge, there is no survey yet that provides an overview of analytics over RDF KGs, which is the core objective of this survey.

RDF and Analytics: Challenges and General Approaches
Section 3.1 identifies the major challenges that are related to analytics over RDF, Section 3.2 provides a categorization of the existing works on this topic, and Section 3.3 presents the different types of analytic queries by providing indicative examples.

Challenges
A KG that integrates data from several datasets tends to have a complex structure, in comparison to multidimensional data, since: (i) different resources may have different sets of properties (from different schemas), (ii) properties can be multivalued (i.e., there can be triples where the subject and predicate are the same but the objects are different) and (iii) resources may or may not have types. We should note here that the typical methods for analytics (i.e., over multidimensional data), are not adequate since they presuppose a single homogeneous data set, something that is not the case for RDF data, e.g., as it is stated in [14]: "Analytic tasks would be straightforward, using SQL or SPARQL queries and data-science tools, if the underlying data were stored in a single database or knowledge base. Unfortunately, this is not the case". Furthermore, the analysis of RDF graphs should leverage the semantics of RDF(S), i.e., the inference based on rdfs:subClassOf and rdfs:subPropertyOf, and in many cases quality, completeness and freshness issues should be tackled.
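The effect of RDFS semantics on an analytic result can be illustrated with a small Python sketch that counts instances per class with and without rdfs:subClassOf inference. The class hierarchy, the instance URIs and the function names are invented for illustration; a real system would typically rely on a triple store with RDFS reasoning enabled.

```python
from collections import defaultdict

# Invented schema and data triples (prefixed names kept as plain strings).
triples = [
    ("ex:Laptop", "rdfs:subClassOf", "ex:Computer"),
    ("ex:Computer", "rdfs:subClassOf", "ex:Product"),
    ("ex:l1", "rdf:type", "ex:Laptop"),
    ("ex:c1", "rdf:type", "ex:Computer"),
    ("ex:p1", "rdf:type", "ex:Product"),
]


def superclasses(cls, parents):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instance_counts(triples, infer=True):
    """COUNT of distinct instances per class, optionally with inference."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            parents[s].add(o)

    members = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            classes = superclasses(o, parents) if infer else {o}
            for c in classes:
                members[c].add(s)
    return {c: len(m) for c, m in members.items()}


with_inference = instance_counts(triples, infer=True)
without_inference = instance_counts(triples, infer=False)
```

In this toy graph, the count for `ex:Product` is 3 with inference and 1 without it, which shows why an aggregate computed over explicit triples alone can silently underreport.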

Categories of Works (Related to RDF and Analytics)
We categorize the related works in five basic categories (C1-C5), illustrated in Figure 5. In brief, there are works that focus on the formulation of analytic queries directly over RDF (C1, described in Section 4.2), works that first define Data Cubes over RDF (C2, Section 4.3), and works that define domain-specific Pipelines that produce RDF and provide analytic services (C3, Section 4.4). Finally, there are works that focus only on the publishing of statistical Data in RDF (C4, Section 4.5), and approaches that combine data from multiple sources for producing quality analytics (C5, Section 4.6).

Categories of Analytic Queries
Here, we present the two main categories of analytic queries, by providing some indicative examples:
(A) Domain-specific analytics, i.e., analytic queries over the content of a KG of a particular domain, such as the query "total quantities of products released by company" over the KG of Figure 2. They are mainly used in categories C1-C3.
(B) Quality-related analytics (e.g., connectivity, data uniqueness, data verification) of one or more KGs, e.g., through statistics or specialized metrics. They are mainly used in categories C4-C5. Examples of such queries are given below:
- Coverage of a dataset: "How many unique triples does DBpedia offer for the entity Aristotle?"
- Connectivity between datasets: "Give me the number of common entities among DBpedia, Wikidata and the National Library of France."
- Distribution of specific elements, such as properties, classes and namespaces, for detecting power-law cases in a KG or in the whole Linked Open Data (LOD) Cloud: "Is there a power-law distribution for the ontologies that are used by the LOD Cloud datasets?"
- Dataset discovery: "Which dataset is the most relevant for the entity Socrates (e.g., offering the most triples)?"
- URI quality: "What is the percentage of URIs that are dereferenceable and not broken?"
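Several of the quality-related queries above reduce to set operations once entities are identified consistently across datasets. The following Python sketch illustrates connectivity, coverage and dataset discovery over three invented datasets. It assumes that equivalent entities already share the same identifier and, as a simplification, treats only triple subjects as entities; the dataset names, triples and function names are hypothetical.

```python
# Invented datasets; identifiers are assumed to be already aligned.
datasets = {
    "D1": {
        ("ex:Aristotle", "ex:bornIn", "ex:Stagira"),
        ("ex:Aristotle", "ex:field", "ex:Logic"),
        ("ex:Socrates", "ex:bornIn", "ex:Athens"),
    },
    "D2": {
        ("ex:Aristotle", "ex:bornIn", "ex:Stagira"),
        ("ex:Plato", "ex:bornIn", "ex:Athens"),
    },
    "D3": {
        ("ex:Aristotle", "ex:influenced", "ex:Averroes"),
        ("ex:Plato", "ex:teacherOf", "ex:Aristotle"),
    },
}


def entities(triples):
    """Entities of a dataset (simplification: subjects only)."""
    return {s for s, _, _ in triples}


def common_entities(datasets, names):
    """Connectivity: entities that occur in every listed dataset."""
    return set.intersection(*(entities(datasets[n]) for n in names))


def unique_triples_for(datasets, name, entity):
    """Coverage: triples about an entity offered by one dataset only."""
    others = set().union(*(t for n, t in datasets.items() if n != name))
    return {t for t in datasets[name] if t[0] == entity and t not in others}


def most_relevant_dataset(datasets, entity):
    """Dataset discovery: the dataset offering the most triples for an entity."""
    return max(datasets, key=lambda n: sum(1 for t in datasets[n] if t[0] == entity))


shared = common_entities(datasets, ["D1", "D2", "D3"])
unique_in_d1 = unique_triples_for(datasets, "D1", "ex:Aristotle")
best_for_aristotle = most_relevant_dataset(datasets, "ex:Aristotle")
```

At the scale of the LOD Cloud, the hard part is the identifier alignment assumed here, together with indexes that avoid pairwise comparisons of datasets.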

Survey of Works and Systems
In this section, we provide some details about the methodology that we followed for finding relevant papers and statistics about these papers (in Section 4.1), and we survey the existing works (in Sections 4.2-4.6) based on the categorization of Section 3.2.

Methodology and Statistics
For finding the related approaches, we used Google Scholar in the period June 2022-November 2022, without any restrictions on the publication date. We used the following queries: (i) "RDF analytics tool", (ii) "Interactive RDF analytics", (iii) "RDF Data cube analytics", (iv) "Efficiency of RDF data analytics", (v) "Knowledge graph analytics" and (vi) "LOD Cloud analytics". For each query, we manually analyzed papers (from the first pages of the Google Scholar results), i.e., by checking their title, abstract and body. Moreover, we found relevant papers through past surveys, e.g., for analytics over multiple datasets belonging to the LOD Cloud [28]. Concerning the selected papers, Figure 6 shows the number of surveyed papers for each category, and Figure 7 shows the year of publication of these papers. As we can see, the majority of the works that we survey concern the categories C1 and C2, and most of the papers were published between 2013 and 2017 (i.e., the most common case for these two categories). In addition, we survey some more recent approaches (i.e., between 2018 and 2022), which mainly concern domain-specific pipelines (i.e., category C3) and approaches over multiple datasets at LOD scale (i.e., category C5).

C1. Formulation of Analytic Queries Directly over RDF
Table 1 lists approaches for the formulation of analytic queries directly over RDF, which enable the execution of analytical queries of category A. Since both the size of the datasets and the need to process aggregate queries pose challenges for standard SPARQL query processing techniques, some of the works propose techniques to overcome these limitations. Below, we provide more details for each of the approaches of Table 1 (in chronological order).
• Ref. [38] proposes techniques for efficiently handling SPARQL queries with aggregate operators over dynamic RDF datasets. It stores RDF data as a large graph and represents a SPARQL query as a query graph. To achieve efficient and scalable query processing, it implements pattern-matching queries with the help of two index structures: a VS*-tree, which is a specialized B+-tree, and a trie-based T-index.
• Ref. [39] proposes a set of query processing strategies for executing aggregate SPARQL queries over federations of SPARQL endpoints by materializing the intermediate results of the queries. However, participating sources in a federation might be unavailable at some point. Moreover, the data and schemata of the sources might have evolved since the federation was created; thus, integration rules might no longer be valid, or the history of the data may be lost.
• Ref. [40] shows how to process aggregate queries by using materialized views, i.e., named queries whose results are stored in the system (since they are typically much smaller in size than the original data and can be processed faster). These results are then used for answering subsequent analytical queries.
• Ref. [41] describes a possible extension of SemFacet [46] to support numeric value ranges and aggregation. The focus is on theoretical query management aspects related to faceted search; however, it lacks an interface and an implementation. From the mockups of the GUI, it seems that no count information is provided, and explicit path expansion is not supported. Instead, the authors use the notion of "recursion" to capture reachability-based facet restrictions. Since this approach is not implemented, no evaluation results are available.
• Ref. [42] presents Spartex, a vertex-centric framework for complex RDF analytics that extends SPARQL to combine generic graph algorithms (e.g., PageRank, Shortest Paths, etc.) with SPARQL queries. It employs graph exploration and uses inter-vertex message passing during query evaluation.
• Ref. [43] observes that the existing federated RDF systems support only basic SPARQL 1.0 queries and do not adequately support complex SPARQL 1.1 queries, such as aggregate queries. For this reason, it proposes a query decomposition optimization method, which combines triple patterns that share the same set of sources into one subquery. By reducing the number of subqueries, this scheme reduces the number of remote requests and thus improves query efficiency.
• Ref. [44] proposes an approach for guided query building that supports analytical queries in natural language and can be applied over any RDF graph. The implementation is built on the SPARKLIS editor [47], and it has been adopted in a national French project (http://data.persee.fr/explore/sparklis/?lang=en, accessed on 1 January 2023). During query formulation, no count information is provided, which reduces the exploratory characteristics of the process. The authors report positive evaluation results as regards the expressive power of the interactive formulator, which works well on large datasets and is easier to use than writing SPARQL queries.
• Ref. [45] describes how a high-level functional query language, called HIFUN [48], can be exploited for applying analytics over RDF data. Rules for translating analytical HIFUN queries to SPARQL are presented. However, the interactive formulation of such queries and an evaluation are missing from that study.
To the best of our knowledge, there is limited work regarding analytics directly over RDF graphs in a user-friendly and interactive environment. We managed to find only two such works [37,44] that let users formulate analytical queries directly over such graphs, by specifying the attributes of analysis (i.e., dimensions, measures) and the operations using drop-down menus or natural language, and by defining their values via checkboxes. The rest of the works [35,36,38-43,45] propose methods entangled with lower-level technicalities, which prevents novice users from exploiting them and can be time-consuming and burdensome even for experts.

C2. Definition of Data Cubes over RDF
To bridge the mismatch between the relational data model and the graph data model, there are approaches that define a data cube over existing RDF graphs and then apply OLAP. According to [44], one weakness of this approach is that it requires someone with technical knowledge to define the required data cube(s). Table 2 lists such approaches, whose target is also to enable the execution of analytical queries of category A. Below, we describe them in chronological order.

Table 2 (excerpt). Approach | Visualization | Year
Jakobsen et al. [54] | - | 2015
CubeViz [55] | Various charts, e.g., pie, bar, column, line | 2015
Benetallah et al. [56] | - | 2016
Microsoft Power BI [57] | Various charts, e.g., bar, column, pie, area, treemap, etc. | 2016
Tableau [58] | Various charts, e.g., column, bar, pie, line, area, map, etc. | 2019

• Ref. [49] introduces Graph Cube to support OLAP queries effectively on large multidimensional networks. However, it usually ignores semantic information in heterogeneous networks. The experimental studies conducted show that this tool effectively supports decisions on large multidimensional networks.
• Ref. [50] introduces Linked Data Query Wizard, a Web-based tool for displaying, accessing, filtering, exploring, and navigating Linked Data that are expressed in data cube format and stored in SPARQL endpoints. The main innovation of the interface is that it turns the graph structure of Linked Data into a tabular interface and provides easy-to-use interaction possibilities. It supports filtering of the columns (e.g., by a keyword or a numeric value) and simple aggregations. However, the tables are limited to the presentation of the direct neighborhood of entities (columns are entity properties, and column values are the objects of those properties) rather than the results of arbitrary queries. Table cells can contain sets of values but not multicolumn tables. The results of the conducted user study showed that the tool had a few weak spots that could be improved, but that in general it is usable both by experts and by nonexperts in computer science.
• Ref. [51] presents Payola, a framework for Linked Data analysis and visualization. The goal is to provide end users with a tool that enables them to analyze Linked Data in a user-friendly way and without knowledge of the SPARQL query language. This goal can be achieved by populating the framework with a variety of domain-specific analysis and visualization plugins. Although it encourages collaboration between users, e.g., experts can edit visualizations and SPARQL queries and lay users can consume the result, it does not provide a complete representation of the dataset, which is necessary for expressing the queries. At the same time, the amount of manual configuration and the necessary transformation steps between different abstractions might be considered a shortcoming by nontechnical users. Regarding the evaluation of this tool, there is a concise report in which test users were asked a couple of questions about its usability; the report concludes that further work on usability is needed.
• Ref. [52] presents Vis-Wizard, a Web-based visualization system able to analyze multiple datasets using brushing and linking methods, i.e., combining different visualizations to overcome the shortcomings of single techniques. The tool was designed for two different tasks: (i) exploring endpoints like DBpedia and (ii) exploring datasets that contain statistical data. Vis-Wizard allows users to group data and aggregate values, providing multiple interactive widgets. According to [59], the online version reports a multitude of errors that prevented users from analyzing the different visualizations that the tool offers; in fact, console errors were raised and no charts appeared. Regarding endpoints like DBpedia, the tool works fine, but the tabular layout appears somewhat messy at first. The evaluation of the usability of Vis-Wizard shows that, while several usability issues still need to be fixed, the overall advantage is observable.
• Ref. [53] proposes algorithms that use the materialized result of an RDF analytical query to compute the answer to a subsequent query. The answer is computed based on the intermediate results of the original analytical query. However, the approach does not propose any algorithm for view selection, and it is applicable only to subsequent queries and not to an arbitrary set of queries [40]. In addition, no evaluation is reported.
• Ref. [54] studies the improvement of SPARQL queries over QB4OLAP [60] data cubes. QB4OLAP is an extension of the RDF Data Cube Vocabulary (https://www.w3.org/TR/vocab-data-cube/, accessed on 1 January 2023) that fully supports OLAP multidimensional models and operators. The idea behind the proposed approach is to directly link facts (observations) with attribute values of related level members. Although preliminary results of an evaluation study show an improvement in query performance, this approach prevents level members from being reused and referenced, breaking the Linked Data nature of QB4OLAP data instances.
• Ref. [55] proposes CubeViz, a user-friendly exploration and visualization platform for statistical data represented according to the RDF Data Cube vocabulary. For such data, CubeViz offers a faceted browsing widget that allows users to interactively filter the observations to be visualized in charts. However, it supports neither aggregate functions, such as SUM, AVG, MIN and MAX, nor blank nodes. According to [61], if the created RDF Data Cube is sparse, it is possible to receive an empty result set after using the data selection component of CubeViz. As a consequence, CubeViz is not able to process all kinds of valid Data Cubes. In a domain-agnostic tool such as CubeViz, it is not feasible to integrate static mappings between data items and their graphical representations. Most of the chart APIs have a limited number of predefined colors for coloring dimension elements, or select colors completely arbitrarily. Finally, the paper does not provide any information about the evaluation of this tool; it contains only a link to an online demonstrator that lets users evaluate it.
• Ref. [56] presents multidimensional and multiview graph data analysis using MapReduce-based graph processing. The goal is to facilitate analytics over the ER graph by summarizing the process graph and providing multiple views at different granularities. The technique, however, always materializes the result as paths with respect to a single entity identifier. The experiments conducted over real-world datasets showed that the proposed approach performs well.
• Ref. [57] introduces Microsoft Power BI, a business intelligence platform that provides nontechnical business users with tools for aggregating, analyzing, visualizing and sharing data. Power BI's user interface is intuitive mainly for users familiar with Excel. It assumes that the ingested data have been cleaned well in advance, and there is also a limit on their size (it cannot import large datasets). Once the data reach this limit, the user has to upgrade to the paid version of Power BI.
All of these systems follow common techniques in the formulation of the analytical queries. They let users specify the attributes of analysis (i.e., dimensions, measures) and the operations interactively using drop-down menus and define their values via check-boxes.

C3. Domain-Specific Pipelines over RDF
There are numerous works that focus on defining specific pipelines for constructing the desired KG from various structured and unstructured sources, and then offer particular analytic queries and visualizations to support domain-specific research purposes, e.g., for supporting analytical queries of category A. Since there is a large number of such cases (e.g., ref. [4] surveys more than 140 papers on KGs from seven different domains), below we present a small number of indicative works from the medical, publications and cultural domains (presented according to their domain):
• Medical Domain. PhLeGrA [66] has integrated data from several large-scale biomedical datasets for analyzing associations between drugs, i.e., for improving the accuracy of predictions of adverse drug reactions. Moreover, ref. [67] collects both structured and unstructured data for creating an aggregated KG about cancer data. The objective is to provide cancer data analytics through several services, such as treatment sequence analysis, data discrepancy analysis and others. Moreover, ref. [68] created a KG from over 50,000 articles related to coronaviruses by using linked data techniques. The produced RDF dataset can be used for producing analytics through several extraction and visualization tools, e.g., it is feasible to analyze the number of articles that co-mention cancer types and viruses of the corona family. Finally, ref. [69] describes a framework called Knowledge4COVID-19 that integrates several RDF sources of COVID-19-related data. The resulting KG is exploited by machine learning methods for providing analytics and visualizations that are used for discovering adverse drug effects and for evaluating the effectiveness and toxicity of COVID-19 treatments.
• Publications Domain. OpenAIRE [70] is a Research KG that aggregates a collection of metadata and links, which are offered within the OpenAIRE Open Science infrastructure, and provides several analytics and visualizations, such as for usage data (https://usagecounts.openaire.eu/analytics, accessed on 1 January 2023). Moreover, the Open Research Knowledge Graph (ORKG) [8] exploits manual and automated techniques for creating and processing a scholarly KG. This KG can be used for further analysis through visualizations that are produced by the offered data science environments (e.g., see https://orkg.org/visualizations, accessed on 1 January 2023).
• Cultural Domain. FAST CAT [71] is a collaborative system for data entry and curation in Digital Humanities, and it can be exploited for performing historical analysis over aggregated data. Moreover, ref. [72] describes BiographySampo, an approach that provides analytics for biographical and prosopographical research by first transforming textual resources (from the National Biography of Finland) to RDF data. Afterward, even users who are unfamiliar with SPARQL can perform custom-made complex data analysis through the offered tools.

C4. Publishing of Statistical Data in RDF
This category of works is not for formulating analytic queries but for exchanging statistical results, and these works mainly focus on providing analytical queries of category B. However, they can also be used for analytical queries of category A, i.e., for publishing domain-specific statistical data. In particular, we distinguish two subcategories, i.e., works that publish statistical data as linked data through either the RDF data cube vocabulary (https://www.w3.org/TR/vocab-data-cube/, accessed on 1 January 2023), or the "Vocabulary of Interlinked Datasets", i.e., VoID [73]. All the approaches are listed in Table 3 and are described below.
• Works with RDF data cube vocabulary. To foster the exchange and intelligibility of statistical results (expressed in csv and other formats), approaches such as [74,75] focus on publishing statistical data as linked data through the RDF data cube vocabulary. Such statistical data can be visualized and analyzed through the framework Payola [51] (which has been described in category C2).
• Works with VoID vocabulary. VoID can be exploited for expressing metadata about one or more RDF datasets, i.e., for representing and publishing several simple statistics, such as the number of triples, properties or classes of each dataset and the number of links between different datasets. Several tools have been published for measuring such statistics for RDF datasets through VoID, including Aether [76] for generating, browsing and visualizing statistics by using SPARQL queries. Furthermore, ref. [77] describes the tool Loupe, which provides summaries and an analysis of vocabulary information about each RDF dataset, e.g., the classes and properties used in each dataset. Extensions of VoID have also been proposed, such as [78], for publishing and analyzing connectivity analytics of semantic data warehouses. In contrast, approaches such as SPORTAL [79] and SPLENDID [80] compute and publish such statistics for aiding the process of source selection for federated queries. Finally, the application KartoGraphI [81] publishes statistical data through VoID (and extensions of VoID) for SPARQL endpoints and provides several types of visualizations for the results.

Table 4 introduces approaches that produce quality analytics, i.e., analytical queries of category B, over single and multiple RDF datasets (even at LOD scale). As we can observe from Table 4, most approaches of category C5 produce analytics either for measuring distributions (e.g., power-law cases) or for dataset discovery, and they are divided accordingly and described below.
• Works that measure distributions (e.g., power-law). Ref. [82] measured and analyzed the graph features of Semantic Web (SW) schemas with a focus on power-law degree distributions, and the main finding was that the majority of SW schemas (as of 2008) with a significant number of properties (resp. classes) approximate a power-law for the total-degree (resp. number of subsumed classes) distribution. Furthermore, LOD-a-lot [85] is an approach where 28 billion RDF triples from thousands of RDF documents have been collected, for enabling the analysis and the querying of combined data from multiple data sources, e.g., for analyzing the distribution of URIs and triples. Moreover, ref. [86] presents algorithms for computing analytical queries over Linked Open Data by aggregating the results of queries from running SPARQL endpoints, i.e., for producing analytics over multiple LOD datasets; e.g., they measure the property and class usage on the LOD cloud, and they estimate the number of the available triples in the LOD cloud. Finally, ref. [87] presents an empirical analysis of linkage among all the datasets of the LOD cloud, by focusing on automated methods for analyzing different link types at scale. The objective was to analyze the availability and discoverability of LOD datasets (i.e., the most commonly used ontologies, namespaces and classes, among others), e.g., for discovering power-law distributions, and to analyze the quality of URIs, e.g., broken links, dereferenceable URIs, etc.
• Works for Dataset Discovery. LODVader [83] is a system that produces LOD analytics over 491 RDF datasets, for supporting dataset exploration, analysis and dataset discovery. Moreover, LODStats [84] is a service including some basic metadata and statistics for over 9000 RDF datasets, e.g., for measuring the number of datasets that use specific properties and classes. Furthermore, LODsyndesis [16] is a suite of services that provides analytics for measuring the connectivity among hundreds of RDF datasets. The aim is for the produced connectivity analytics to be exploited for improving the discoverability and reusability of the underlying datasets, and for answering coverage queries. Finally, LODChain [88] is a research prototype that computes, in real time, connectivity analytics between a new RDF dataset and the rest of the LOD cloud through LODsyndesis, and produces several visualizations (including graph visualizations, bar and pie charts, etc.) and dataset discovery measurements. In particular, the aim is for the analytics to be used for enriching and verifying the content of the input dataset.
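As a rough illustration of what measuring a degree distribution involves, the following sketch counts the total degree of each node in a set of triples and estimates a power-law exponent with a least-squares fit on the log-log histogram. The function names and the choice of a least-squares estimator are assumptions made here for illustration; the surveyed works rely on their own datasets and fitting procedures.

```python
import math
from collections import Counter

def total_degrees(triples):
    """Count in-degree plus out-degree of every subject and object."""
    deg = Counter()
    for s, _p, o in triples:
        deg[s] += 1
        deg[o] += 1
    return deg

def degree_histogram(degrees):
    """Map each degree value k to the number of nodes with that degree."""
    return Counter(degrees.values())

def loglog_slope(hist):
    """Least-squares slope of log(count) against log(degree).

    If count ~ k^(-alpha), the slope approximates -alpha. This simple
    estimator is an illustrative assumption, not a method from the survey.
    """
    pts = [(math.log(k), math.log(c)) for k, c in hist.items() if k > 0 and c > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two distinct degree values")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx

# Hypothetical toy graph: one hub "a" linked to two other nodes.
toy = [("a", "p", "b"), ("a", "p", "c")]
hist = degree_histogram(total_degrees(toy))  # two nodes of degree 1, one of degree 2
```

On real data, a slope that stays roughly constant over several orders of magnitude of the degree is what would suggest a power-law; on a toy graph such as the one above the fit carries no statistical meaning.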

Efficiency and Visualization
This section discusses related aspects for the surveyed papers, i.e., efficiency (in Section 5.1) and visualization (in Section 5.2).

Efficiency
First, for category C1, the authors of [36] measure the efficiency of joining star patterns with grouping operators for executing aggregate queries. Other works indicate that, for complex analytical tasks that combine generic graph processing with SPARQL, vertex-centric graph processing frameworks are at least an order of magnitude faster than existing alternatives [42], and report significant performance improvements for analytical processing of RDF data over existing Map-Reduce based techniques [35]. It has also been shown that decomposing the analytical queries and materializing the intermediate results [39,40] improves the query response time by more than an order of magnitude, and that in these cases the average query time increases linearly with the dataset size [43].
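The idea of materializing intermediate results can be sketched, under simplifying assumptions, as follows: a star-pattern join is evaluated once over the triples, its result is kept in memory, and several grouping queries then reuse it. The predicate names, the toy data and the helper functions are hypothetical and do not reproduce the techniques of the cited works.

```python
from collections import defaultdict

# Hypothetical toy triples describing sales; "sale4" has no amount.
TRIPLES = [
    ("sale1", "product", "laptop"), ("sale1", "country", "GR"), ("sale1", "amount", 900),
    ("sale2", "product", "phone"),  ("sale2", "country", "GR"), ("sale2", "amount", 400),
    ("sale3", "product", "laptop"), ("sale3", "country", "FR"), ("sale3", "amount", 1000),
    ("sale4", "product", "phone"),  ("sale4", "country", "FR"),
]

def star_join(triples, predicates):
    """Join triples that share a subject (a star pattern) into one row per subject.

    Assumes each predicate is single-valued per subject. Subjects that lack
    any of the requested predicates are dropped, as in an inner join.
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in predicates:
            rows[s][p] = o
    return [row for row in rows.values() if len(row) == len(predicates)]

def group_sum(rows, key, measure):
    """Sum `measure` per distinct value of `key` over already joined rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

# The join is computed once (the materialized intermediate result) ...
MATERIALIZED = star_join(TRIPLES, {"product", "country", "amount"})
# ... and reused by several aggregate queries.
by_country = group_sum(MATERIALIZED, "country", "amount")  # {'GR': 1300, 'FR': 1000}
by_product = group_sum(MATERIALIZED, "product", "amount")  # {'laptop': 1900, 'phone': 400}
```

The saving comes from scanning the triples once instead of once per aggregate query; in this sketch the materialized result is a plain in-memory list.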
Concerning category C2, the authors of [56] show that both the size of the dataset and the number of function operations in an analytical query influence the execution time of the query. It is also reported that running queries on Virtuoso over data cubes in the star pattern is faster than over cubes in the snowflake pattern, which is particularly interesting since the snowflake pattern is the one in which most RDF data cubes are available [54].
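One structural reason why the two patterns may differ in cost can be sketched as follows: in a star pattern every dimension value hangs directly off the observation, whereas in a snowflake pattern a coarser dimension level is reached through an intermediate node, so the same aggregation follows a longer path. The dictionaries, property names and the lookup counter below are illustrative assumptions; the sketch does not measure Virtuoso or any triple store.

```python
from collections import defaultdict

# Star pattern: all dimension levels are attached directly to the observation.
STAR = {
    "obs1": {"country": "GR", "continent": "Europe", "value": 10},
    "obs2": {"country": "FR", "continent": "Europe", "value": 20},
    "obs3": {"country": "JP", "continent": "Asia", "value": 5},
}

# Snowflake pattern: the continent is reached through the country node.
SNOWFLAKE = {
    "obs1": {"country": "GR", "value": 10},
    "obs2": {"country": "FR", "value": 20},
    "obs3": {"country": "JP", "value": 5},
    "GR": {"continent": "Europe"},
    "FR": {"continent": "Europe"},
    "JP": {"continent": "Asia"},
}

def follow(graph, node, path):
    """Follow a property path from `node`; return (end value, number of lookups)."""
    hops = 0
    for prop in path:
        node = graph[node][prop]
        hops += 1
    return node, hops

def sum_by(graph, observations, path):
    """Sum observation values grouped by the value reached via `path`."""
    totals = defaultdict(int)
    lookups = 0
    for obs in observations:
        key, hops = follow(graph, obs, path)
        totals[key] += graph[obs]["value"]
        lookups += hops
    return dict(totals), lookups

OBS = ["obs1", "obs2", "obs3"]
star_result = sum_by(STAR, OBS, ["continent"])                    # 3 lookups
snowflake_result = sum_by(SNOWFLAKE, OBS, ["country", "continent"])  # 6 lookups
```

Both calls return the same totals, but the snowflake variant performs one extra lookup per observation, which in a SPARQL setting would correspond to an additional triple pattern to join.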
As regards category C3, in many cases, the authors measure the execution time of the SPARQL queries that produce the analytics [67,69], which are executed over the resulting KG. Generally, these queries are executed quite fast, even in a few milliseconds. On the contrary, the most time-consuming task of such domain-specific approaches is usually the creation of the KG, which requires huge human effort [89].
Regarding the approaches of category C4, which usually produce statistics through SPARQL queries [76,84], their performance highly depends on the underlying SPARQL endpoints and on the size of the datasets (number of triples, URIs, etc.).
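To make the kind of statistics concrete, the following sketch computes a few VoID-style counts over an in-memory list of triples. The surveyed tools typically obtain such numbers through SPARQL queries against endpoints; the in-memory representation, the prefixed-name strings and the function name used here are assumptions for illustration only.

```python
RDF_TYPE = "rdf:type"  # prefixed name used as a plain string in this sketch

def void_statistics(triples):
    """Compute simple dataset statistics keyed by VoID property names."""
    graph = set(triples)  # an RDF graph is a set, so duplicate triples count once
    return {
        "void:triples": len(graph),
        "void:distinctSubjects": len({s for s, _p, _o in graph}),
        "void:properties": len({p for _s, p, _o in graph}),
        "void:distinctObjects": len({o for _s, _p, o in graph}),
        # classes are approximated as the objects of rdf:type triples
        "void:classes": len({o for _s, p, o in graph if p == RDF_TYPE}),
    }

# Hypothetical toy dataset.
DATA = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:name", "Alice"),
]
stats = void_statistics(DATA)
```

Every count requires a full pass over the data, which is consistent with the observation that the cost of such statistics grows with the number of triples and URIs.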
Concerning category C5, specialized indexes are created in several cases for enabling the fast computation of analytics, e.g., see LODsyndesis [16] and LOD-a-lot [85]. Indicatively, the indexes of the LODsyndesis aggregated KG [16] (which contains more than 2 billion triples) are constructed once, in approximately 7 hours. In contrast, the connectivity analytics are produced quite fast, i.e., even in a few seconds, by accessing these indexes. Regarding LODChain, it can produce the analytics for hundreds of thousands of triples in a few minutes (indicatively, less than a minute for 50,000 triples), by also exploiting the indexes of LODsyndesis.
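The trade-off between a costly one-time index construction and fast subsequent analytics can be sketched with a simple inverted index from entity URIs to the datasets that contain them. This is a much simplified stand-in written for illustration; the data, the function names and the index layout are assumptions and do not describe the actual LODsyndesis indexes.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_entity_index(datasets):
    """Build once: map each entity URI to the set of dataset ids containing it.

    `datasets` maps a dataset id to an iterable of entity URIs.
    """
    index = defaultdict(set)
    for dataset_id, entities in datasets.items():
        for entity in entities:
            index[entity].add(dataset_id)
    return dict(index)

def common_entities(index):
    """Count, for every pair of datasets, the entities they share.

    A single pass over the index suffices; the datasets are not re-read.
    """
    counts = Counter()
    for dataset_ids in index.values():
        for pair in combinations(sorted(dataset_ids), 2):
            counts[pair] += 1
    return counts

# Hypothetical toy datasets.
DATASETS = {
    "D1": ["ex:a", "ex:b", "ex:c"],
    "D2": ["ex:b", "ex:c", "ex:d"],
    "D3": ["ex:c"],
}
INDEX = build_entity_index(DATASETS)
PAIRS = common_entities(INDEX)  # ('D1', 'D2') share 2 entities
```

Once the index exists, questions such as "which pair of datasets shares the most entities" reduce to reading the precomputed counts, e.g., `PAIRS.most_common(1)`.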
A complementary topic is that of ranking: if the KG or the results are big, then methods that can rank and reveal the more important elements are useful, also for visualization purposes. Such ranking methods can be leveraged at both schema and data level. Indicatively, ref. [90] proposed methods for ranking RDF Schema elements (and their applications in visualization), and ref. [91] described ranking-induced top-k diagrams for reducing information overload.
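As a minimal illustration of how ranking can reduce what has to be visualized, the sketch below keeps only the k most frequently used properties of a dataset. Usage frequency is a deliberately naive importance score chosen here as an assumption; it is not the ranking method of the cited works.

```python
from collections import Counter

def top_k_properties(triples, k):
    """Return the k most used properties as (property, count) pairs.

    Ties are broken alphabetically so that the output is deterministic.
    """
    usage = Counter(p for _s, p, _o in triples)
    ranked = sorted(usage.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical toy triples.
SAMPLE = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", "Alice"),
    ("ex:b", "ex:name", "Bob"),
    ("ex:a", "ex:age", 30),
]
top2 = top_k_properties(SAMPLE, 2)  # [('ex:knows', 3), ('ex:name', 2)]
```

A diagram restricted to `top2` would show two properties instead of three; on schemas with hundreds of properties the same cut-off is what keeps a visualization readable.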

Concluding Remarks
The analysis of big and complex KGs in RDF is challenging. In this brief survey, we reviewed the work that has been done in this area. In brief, we identified two main categories of analytic queries (domain-specific and quality-related), and five kinds of approaches for analytics over RDF. Then, we described the related works that fall in these categories. In total, we surveyed 45 papers (including more than 15 systems). In general, we observe an increasing trend for analytics over RDF KGs, for both domain-specific (e.g., for the medical and publications domains) and domain-independent tasks. In particular, we identified 11 works for applying domain-related analytic queries over general-purpose KGs, whereas we surveyed 10 works that first define data cubes over RDF and then use them for analysis. We have also described indicatively 8 works on domain-specific pipelines for analytics from various domains, including health (drugs, cancer and COVID-19), research publications, and digital humanities (historical analysis). Finally, we mentioned 8 works for publishing statistical data through RDF vocabularies and 8 works for quality-related analytics over single and multiple RDF datasets (or at LOD scale) for fostering connectivity. Figure 10 summarizes the categories identified, the number of works of each category and the main challenges. We hope this collection will be useful for researchers and engineers for advancing the capabilities and user-friendliness of methods for analytics over knowledge graphs.