1. Introduction
Active research is being conducted on how to discover intrinsic value in real-world application services, in which various types of entities form organic relationships with one another and interact meaningfully, and on how to turn the information acquired from them into knowledge. Various abstraction and modeling methods are being examined to understand the objects that constitute each service and their relationships, and the results are used to grasp phenomena and structures, search for information, discover knowledge, and predict the future.
The network structure is a modeling method widely used to represent the elements that constitute a service and their interactions. This data structure, also called a network or a graph, has long been used in mathematics to model pairwise relationships between objects. A network is composed of two elements, ‘nodes’ and ‘relationships’: a node represents an entity, such as a person, place, object, category, or concept, whereas a relationship represents the association between a pair of nodes. A network can be regarded as a visual expression technique in which several types of objects form relationships with each other. In particular, the visualization of the interaction of many individual objects with a specific object, or a group of objects, in a network model is called an information network [
1,
2].
There is a variety of data that can be represented by information networks, such as online communities, social networks, computer system network configurations, ontology, and knowledge graphs. Among them, an online bibliographic data indexing service [
3,
4] provides research publications in various fields, along with information such as title, author, publisher, and publication year, and is heavily used by researchers. From the data provided together with the research publications, a bibliographic database can be used to construct single-object information networks, such as a researcher network, a reference relationship network, and a conference network, as well as a multidimensional information network that connects these dimensions through publications.
Bibliographic databases contain an extensive amount of information related to papers written and published by authors. Beyond simple facts, such as which manuscript a certain author published, they provide the title and access path of each publication together with its authors, publisher, academic conference, research institute, and year of publication. By running analysis queries on these data, one can obtain not only simple aggregate values, such as the number of papers per author, but also useful information such as influential papers with many citations, groups of researchers who investigate similar fields, and the developmental history or topic classification of representative papers on a certain topic. Accordingly, bibliographic data are being actively researched in various fields, including subject classification and trend analysis [
5], relational research [
6,
7] such as collaboration between authors, co-author relationship prediction [
8], and information search [
3,
9] for research materials.
DBLP (
https://dblp.uni-trier.de/, (accessed on 16 April 2021)) is, by far, the most representative online bibliographic database service in the field of computer science. DBLP indexes more than 4.5 million publications by more than two million authors, categorized into tens of thousands of journals, conferences, and workshops, and offers free access to researchers in the field of computer science. Furthermore, the index catalog of publication data is available for download in XML format. In addition to DBLP, an integrated bibliographic database has been constructed and provided by combining data available from ACM Digital Library (
https://dl.acm.org/, (accessed on 16 April 2021)) and Microsoft Academic (
https://academic.microsoft.com/home, (accessed on 16 April 2021)).
The bibliographic database contains various types of information objects, such as title, author, affiliated institution, academic conference, and publisher for each research paper, and the data herein have a complex structure in which different types of relationships are defined between information objects [
3,
10]. To represent and analyze these various types of objects and their complex relationships, considerable research has treated bibliographic data as an information network structure [
4,
7,
11,
12,
13]. In bibliographic data, the objects being handled are of different types, and the relationships defined between objects also vary. When the data are represented as an information network, the entities denoted by nodes include papers, authors, publishers, academic conferences, and institutions, and the edges represent different types of relationships between different entity types. An information network that contains multiple types of nodes and edges is defined as a heterogeneous information network [
10,
11].
Research on the analysis of heterogeneous information networks has been actively performed in a wide variety of branches over recent decades. In particular, bibliographic databases are a representative example of data that can be characterized as a heterogeneous information network, and they pose many challenges [
12] in understanding the structure of information and analyzing behavior, because the data have a significantly large volume and complex structure.
Research that analyzes various types of data has long been performed in the field of online analytical processing (OLAP). Traditional OLAP has mainly been applied to structured data in the form of tables. A bibliographic database, by contrast, has a graph structure; that is, its data form an information network with interrelationships between various information objects, such as paper and author as well as paper and paper. Performing OLAP on this information network-type data requires OLAP technology that supports new models and operations. Hence, research on graph OLAP for information network analysis has been performed [
1,
2,
4,
12].
The bibliographic database has a complex structure that contains both different types of various information objects (such as title, author, affiliated institution, academic conference, and publisher) for each research paper and different types of relationships among these information objects. This paper has modeled bibliographic data as an information network structure and further investigated techniques and tools to analyze the data from the perspective of OLAP. Specifically, this paper has designed and developed a visualization tool that supports online analysis for practical purposes of researchers based on the information network OLAP of the bibliographic data.
To develop an online visualization tool, this paper defines a heterogeneous information network model for bibliographic data, and it further designs a storage structure that can hold and manage the data using a graph database. Moreover, this paper has developed an easy and efficient information network structure visualization tool that is equipped with a user-friendly interface that performs visual search and analysis on stored bibliographic data.
The main contributions of this paper are as follows:
modeling a bibliographic database with the concept of heterogeneous information networks and defining the Bibliographic Information Network in a formal way;
defining navigation and browsing operators for exploratory analysis of bibliographic database on this model; and,
designing and developing a visualization tool, OLGAVis, which provides visual exploration and analysis of bibliographic databases easily and conveniently.
The paper is organized as follows. In
Section 2, the heterogeneous information network and bibliographic information network analysis consisting of bibliographic data are explained. In
Section 3, a large volume of bibliographic data is designed as an information network, and, in
Section 4, the implementation results of the visualization tool and an example of the operation of bibliographic data analysis using this tool are demonstrated.
Section 5 presents the comparison results with other graph visualization tools,
Section 6 introduces the existing studies for bibliographic data analysis, and finally, in
Section 7, the direction of future research for improvement and expansion as well as conclusions are presented.
2. Background and Preliminaries
2.1. Heterogeneous Information Network
An information knowledge system can be represented by a number of information knowledge entities, their attributes, and the meaning given to related attributes, descriptions, interactions, and relations. Components that are interconnected and interact form a type of network, and a system based on these relationships and connections is called an information network [
1,
2]. Research has been actively performed over recent decades to understand the relevance by representing the interaction between elements constituting the system as relationships, or to analyze the latent patterns and meanings.
Information networks use graphs to model the objects that constitute a system and the interactions between them. More specifically, a network defines a graph G = (V,E) by treating each object as a vertex and each relationship as an edge, where V is the set of vertices and E is the set of edges. For example, in a bibliographic information service, a paper can be represented by a vertex and a reference relationship by an edge, whereas, in a social network service, a user can be represented by a node and a friend relationship by an edge. Individual instances of an object set can be connected to one another. A relationship that has a direction, such as that between a citing paper and a cited paper, is configured as a directed network, whereas a relationship that only records the presence or absence of a connection is configured as an undirected network.
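As a minimal sketch of the definition G = (V,E), the following Python fragment builds one directed and one undirected network as adjacency maps. The identifiers (p1, alice, and so on) and the helper name build_adjacency are hypothetical and serve only as illustration.

```python
def build_adjacency(vertices, edges, directed):
    """Return an adjacency map for a graph G = (V, E)."""
    adjacency = {v: set() for v in vertices}
    for source, target in edges:
        adjacency[source].add(target)
        if not directed:
            # An undirected edge is stored in both directions.
            adjacency[target].add(source)
    return adjacency

# Directed network: paper p1 references papers p2 and p3 (hypothetical IDs).
citation_vertices = {"p1", "p2", "p3"}
citation_edges = [("p1", "p2"), ("p1", "p3")]
citations = build_adjacency(citation_vertices, citation_edges, directed=True)

# Undirected network: a friend relationship only records presence or absence.
friend_vertices = {"alice", "bob"}
friend_edges = [("alice", "bob")]
friends = build_adjacency(friend_vertices, friend_edges, directed=False)
```

In the directed case only the citing paper lists its targets, whereas in the undirected case both endpoints list each other.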
If the objects and relationships constituting the information network have only a single type, then it is called a homogeneous information network [
12]. Some of the examples include friendship networks in social networks and author collaboration networks in bibliographic information (
Figure 1a). Such a network has a single object type, such as “user” or “author”, and a single edge type, such as “are friends” or “collaborate”.
However, networks with only a single type are rare in the real world. Even when data are represented and analyzed as a homogeneous information network, the analysis usually targets specific information or knowledge obtained by abstracting or reducing the real world. For example, a bibliographic information system comprises various types of objects, such as author, paper title, keyword, journal or conference, year, and volume, and even an author object can have various attributes, including the author’s name and affiliated institution. Because there are various types of objects, there are also several types of edge relationships. The edge types are highly diverse, depending on the relationship between individual instances in an object set, and include “write”, “reference”, “publish”, and “collaborate” (
Figure 1b).
Accordingly, a network that consists of multiple types of object sets and relationship sets is called a heterogeneous information network [
10,
11]. When the analysis focuses on the collaborative relationship between authors, the network can be simplified to only the “author” object and the “collaboration” relationship, excluding other objects and attributes, and analyzed as a homogeneous information network; when richer semantics across various objects and relationships are needed, a heterogeneous information network analysis is conducted.
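The relationship between the two kinds of network can be sketched in Python as follows. The fragment stores a small heterogeneous network as typed nodes and typed edges and then reduces it to a homogeneous co-author network. All identifiers and the function names are hypothetical, and the edge type names simply reuse the examples “write”, “reference”, and “publish” from the text.

```python
from itertools import combinations

# Node ID -> node type (hypothetical sample data).
node_types = {
    "a1": "author", "a2": "author", "a3": "author",
    "p1": "paper", "p2": "paper",
    "c1": "conference",
}

# (source, edge type, target) triples.
typed_edges = [
    ("a1", "write", "p1"),
    ("a2", "write", "p1"),
    ("a2", "write", "p2"),
    ("a3", "write", "p2"),
    ("p2", "reference", "p1"),
    ("c1", "publish", "p1"),
]

def is_heterogeneous(node_types, typed_edges):
    """A network is heterogeneous if it has more than one node or edge type."""
    edge_types = {edge_type for _, edge_type, _ in typed_edges}
    return len(set(node_types.values())) > 1 or len(edge_types) > 1

def project_coauthors(typed_edges):
    """Reduce the network to author pairs that wrote at least one paper together."""
    authors_by_paper = {}
    for source, edge_type, target in typed_edges:
        if edge_type == "write":
            authors_by_paper.setdefault(target, set()).add(source)
    collaborations = set()
    for authors in authors_by_paper.values():
        collaborations.update(combinations(sorted(authors), 2))
    return collaborations

coauthors = project_coauthors(typed_edges)
```

The projection keeps only the “author” object and a derived “collaborate” relationship, which is the kind of simplification a homogeneous analysis relies on.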
2.2. Bibliographic Information Network Analysis
Bibliographic data are used by numerous researchers for the purpose of retrieving information on related papers, such as authors, publishers, and research topics. As the volume of accumulated data of research results becomes vast, studies using bibliographic data for various purposes have continued in addition to simple search of information. Examples include a discovery of a pattern of collaboration between authors, an influential researcher in a group, or a relationship between universities or research institutes, as well as an analysis on an interaction of knowledge contained in a research product, research topics or trends, and a prediction of a new relationship or research topic [
8,
12].
The fields of research addressing the analysis of large-volume bibliographic data can be divided into four major branches. The most traditional is graph theory, which models the graph structure and characterizes its form statistically. In 1996, research was also conducted in data mining, a general term for the process of discovering hidden information and meaningful structures in large-scale databases, to discover and predict links or trends between data by applying supervised and unsupervised learning algorithms for description and prediction. In 1997, with the emergence of the data warehouse, a concept that refers to large-scale data storage, data cubes were built and OLAP technology was used to perform multidimensional data analysis.
The OLAP, which is used for efficient analysis of structured data, is being expanded to explore the analysis and visualization of more complex structured data. In particular, an analysis has been conducted using the OLAP method for heterogeneous information networks that represent the complex connection relationship of various information objects that are closely interrelated. There are numerous studies that discover and extract inherent knowledge by analyzing the relationship between various information objects that are related to publications in addition to various statistical analyses on publications.
When modeling heterogeneous information network data, each set of entities constituting the nodes and edges has a set of attributes. An attribute that forms a node is called a node attribute, an attribute that provides a viewpoint for analysis is called an informational attribute, and an attribute that is given to a relationship is called an edge attribute. Node attributes can include the author ID and paper ID; informational attributes can include the publisher, publication year, affiliated institution, and country; and edge attributes can include the collaboration frequency and connection strength.
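One possible way to hold the three kinds of attribute in code is sketched below in Python. The class names, field names, and sample values are assumptions made for illustration and are not the storage schema of this paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str                               # node attribute, e.g. an author ID or paper ID
    node_type: str                             # e.g. "author" or "paper"
    info: dict = field(default_factory=dict)   # informational attributes

@dataclass
class Edge:
    source: str
    target: str
    edge_type: str
    attrs: dict = field(default_factory=dict)  # edge attributes

# Hypothetical sample data.
authors = [
    Node("a1", "author", {"institution": "O1", "country": "KR"}),
    Node("a2", "author", {"institution": "O2", "country": "US"}),
]
collaboration = Edge("a1", "a2", "collaborate", {"frequency": 3})

def select_by_info(nodes, key, value):
    """Return the IDs of nodes whose informational attribute `key` equals `value`."""
    return [node.node_id for node in nodes if node.info.get(key) == value]

korean_authors = select_by_info(authors, "country", "KR")
```

Keeping informational attributes separate from the node identifier makes it straightforward to use them as analysis viewpoints, for example to select all authors from one country.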
5. Comparison of Graph Visualization Tools
Graph visualization tools addressing the bibliographic data include VOSViewer (
https://www.vosviewer.com/ (accessed on 16 April 2021)), and CitationNetworkExplorer (
https://www.citnetexplorer.nl/ (accessed on 16 April 2021)), which were developed by the same organization. These tools were developed to visualize and analyze publication statistics and citation networks, and they provide a clustering function based on the citation relationships and keywords of papers. However, they cannot be regarded as visualization tools for exploratory analysis based on a heterogeneous information network model that covers nodes, relationships, and attributes alike.
There have been studies on OLAP analysis and visualization tools of bibliographic data. A study [
3] investigated the design of a relational database schema for OLAP analysis of bibliographic data, SQL query for OLAP operation, and data warehouse construction. The MDX Query of MS-SQL was used and the result was visualized with tables and charts of MS Excel, which cannot be regarded as a GUI tool.
Another study [
15], which developed an OLAP cube-based graph visualization tool for bibliographic data, constructed sub-networks, such as co-author networks, citation networks, and topic networks from the DBLP data, and further developed a graph visualization tool that reinforced cube processing on nodes and edges. The same research investigated the graph cube generation and graph-oriented operation of OLAP operators, such as Roll-up and Drill-down. The graph database Neo4j was used for data storage, but there was no mention of storage schema modeling.
Although a number of studies have analyzed bibliographic data on the basis of graphs or information networks, only a few have produced tools as a result of the research or set out to develop a GUI tool.
Table 2 shows a comparison of the graph visualization studies on the bibliographic data.
The first study [
3] in the table is a bibliometric study that visualized its analysis results with MDX Query and MS Excel; it is not a study on GUI tool development. The second study [
16] concerns a bibliographic data visualization tool that is offered under the name VOSViewer. This tool belongs to the bibliometric network field and performs clustering analysis based on citation and co-author relationships. The third study [
15] presents a heterogeneous information network visualization tool that performs OLAP and cube-based operations; it is a GUI tool, but it has no keyword search or link navigation functions for the result network.
The visualization tool proposed in this research is significant in that it models a complete heterogeneous information network that considers all relationships and attributes among the major entities that compose complex bibliographic data, as well as the explanatory entities directly connected to the papers. It is also significant in that it uses a graph database to improve the development efficiency of relationship-oriented services and, unlike other GUI tools, supports keyword search, link navigation, exploratory browsing with node expansion, and graph aggregation.
6. Related Works
6.1. Graph OLAP
In 1993, E. F. Codd proposed OLAP [
17], where users directly conduct online multi-faceted analysis on multidimensional data. OLAP can be regarded as a process in which the end user directly accesses multidimensional information and analyzes and uses it interactively. The user analyzes data in support of decision-making and uses the analysis results as information, which demonstrated the possibility of using information beyond online transaction processing (OLTP), which had handled only simple transactions.
Graph OLAP is a term that refers to OLAP for information networks, and it was first introduced in 2008 in a study by the research team of Jiawei Han [
1]. Moreover, numerous studies have been conducted to perform OLAP operations on data that are represented by graph structures [
2,
4,
12,
18,
19,
20]. Graph OLAP is a process of representing the nodes and edges that constitute a partial network of a specific viewpoint to be analyzed, as well as a group of network representation results from various analytic viewpoints.
OLAP conventionally creates analysis viewpoints using core operators, such as Roll-up, Drill-down, Slice, and Dice, to analyze data from multi-dimensional and multi-level viewpoints, and produces summarized results using aggregate functions, such as COUNT and SUM. Several graph OLAP studies have generated various summary networks using these conventional OLAP operators.
Figure 11 shows a network in which each edge between authors who have written papers together at academic conferences is weighted by the number of those papers. Toward the higher level, this network provides a more summarized analysis by combining the numbers of papers, while toward the lower level, it provides a more detailed analysis by using the values aggregated for individual academic conferences.
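The kind of aggregation described for Figure 11 can be sketched in Python as follows. The sample weights, the conference labels (conf_A, conf_B), and the function names are hypothetical; the sketch only illustrates summing edge weights over the conference attribute (a roll-up) and restricting the network to one conference (a slice).

```python
from collections import Counter

# (author, author, conference) -> number of joint papers (hypothetical data).
papers_by_conference = {
    ("a1", "a2", "conf_A"): 2,
    ("a1", "a2", "conf_B"): 1,
    ("a2", "a3", "conf_A"): 3,
}

def roll_up_conferences(weights):
    """Summarize: add up joint papers over all conferences for each author pair."""
    totals = Counter()
    for (first, second, _conference), count in weights.items():
        totals[(first, second)] += count
    return dict(totals)

def slice_conference(weights, conference):
    """Detail view: keep only the co-author edges observed at one conference."""
    return {
        (first, second): count
        for (first, second, conf), count in weights.items()
        if conf == conference
    }

summary = roll_up_conferences(papers_by_conference)
detail = slice_conference(papers_by_conference, "conf_A")
```

The set of author nodes stays the same in both views; only the edge weights change with the level of aggregation.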
Figure 12 shows an example of analyzing the collaboration relationship between the authors affiliated with the research organization “O1” and the authors affiliated with “O2” by converting a lower-level author viewpoint network into a higher-level research organization viewpoint network. Unlike the previous example, in this case the topology of the network itself changes, creating a partial network of a different shape.
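A minimal Python sketch of this author-to-organization conversion is given below. The affiliation table and edge weights are hypothetical sample data, and the function name is illustrative; the organization labels O1 and O2 follow the example in the text.

```python
from collections import Counter

# Author -> research organization (hypothetical sample data).
affiliation = {"a1": "O1", "a2": "O1", "a3": "O2", "a4": "O2"}

# Co-author edge -> number of joint papers.
coauthor_weights = {
    ("a1", "a3"): 2,
    ("a2", "a4"): 1,
    ("a1", "a2"): 4,
}

def roll_up_to_organizations(weights, affiliation):
    """Merge author nodes into organization nodes and sum the edge weights."""
    rolled = Counter()
    for (first, second), weight in weights.items():
        # Sort the pair so that (O1, O2) and (O2, O1) share one undirected edge.
        key = tuple(sorted((affiliation[first], affiliation[second])))
        rolled[key] += weight
    return dict(rolled)

organization_network = roll_up_to_organizations(coauthor_weights, affiliation)
```

Four author nodes collapse into two organization nodes, so the node set itself changes; collaborations inside one organization appear as a self-loop such as ("O1", "O1").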
Graph OLAP studies have observed that OLAP operations produce networks with different characteristics depending on the dimensional attributes that serve as the viewpoints of analysis. Thus, dimensions are divided into “informational dimensions” and “topological dimensions”: OLAP performed on an informational dimension is defined as informational OLAP (I-OLAP), while OLAP performed on a topological dimension is defined as topological OLAP (T-OLAP), and the graphs aggregated and generated by each are defined as the I-aggregated Graph and the T-aggregated Graph, respectively.
In graph OLAP, the measure can be a numerical value, such as the number of works, a centrality indicator from graph theory, or the graph diameter, and the result is expressed as a graph. Thus, beyond the aggregate functions of traditional OLAP, operations are required that consider entities, attributes, and relationships together. The concepts of graph OLAP operators and the graph cube were introduced to address this need [
2,
18].
The graph cube computes the resulting network of aggregates created from all possible combinations of the individual attributes of the nodes that constitute the graph [
18]. An aggregate network is also referred to by other terms, such as cuboid, view, and query. The set of all cuboids that can be created from combinations of attributes is called a graph cube lattice.
The cost of creating a cuboid is significantly high when the size of the network is large and the attributes of the entities constituting the network are diverse. The cost of producing the cuboid is still a huge burden, even if other methods, such as pre-calculating and storing the cuboid, or using the previous results on the cuboid, are used to improve the query performance. Furthermore, because the attributes that the relationship itself can have, as well as the attributes of nodes can become a viewpoint of analysis, creating a graph cube considering all of these cases may not be a good method in terms of creation, storage, and maintenance under limited resources.
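To make the growth in cost concrete, the following Python sketch enumerates every attribute combination and builds one small aggregate network per combination. The sample nodes, the attribute names, and the choice of COUNT as the only measure are assumptions for illustration, not the definition used in the cited work.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author nodes with two attributes, and undirected co-author edges.
nodes = {
    "a1": {"institution": "O1", "country": "KR"},
    "a2": {"institution": "O1", "country": "KR"},
    "a3": {"institution": "O2", "country": "US"},
}
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3")]
attributes = ["institution", "country"]

def cuboid_dimensions(attributes):
    """Every subset of the attributes defines one cuboid of the lattice."""
    return [
        combo
        for size in range(len(attributes) + 1)
        for combo in combinations(attributes, size)
    ]

def aggregate(nodes, edges, dims):
    """Group nodes by their values on `dims` and count nodes and edges per group."""
    def group(node_id):
        return tuple(nodes[node_id][dim] for dim in dims)

    node_counts = Counter(group(node_id) for node_id in nodes)
    edge_counts = Counter(
        tuple(sorted((group(first), group(second)))) for first, second in edges
    )
    return node_counts, edge_counts

lattice = {dims: aggregate(nodes, edges, dims) for dims in cuboid_dimensions(attributes)}
```

With n attributes the lattice holds 2^n cuboids, so each additional attribute doubles the number of aggregate networks to compute, store, and maintain.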
6.2. Bibliometrics
Bibliometrics is a field of research [
10,
21] that conducts an analysis to measure the influence of authors or publications on large-scale bibliographic data. In addition to simple statistical analysis that measures and calculates the frequency, mean, and ratio of citations and collaborations for publications, the impact is determined by analyzing the relationship between authors or papers through citation indices of papers. Citation analysis is the measurement of the frequency in which a specific paper is cited to evaluate the influence or quality of the paper, author, or research institution.
For citation analysis, a citation network analysis research [
22,
23] that expressed citation relationships between papers in graphs was performed, which has been further expanded to develop into bibliometric network analysis studies that measured the frequency of citations and collaborations of publications by defining the relationships between the nodes, such as publications, journals, researchers, and relations of citations and co-authors from the constituent elements of the bibliographic data.
Bibliometric networks [
24,
25,
26,
27,
28] are composed of nodes and edges, where the nodes are items such as papers, journals, researchers, and keywords, and the relationship between two nodes is represented by an edge. Research mostly concerns the representation and analysis of citation relationships, common keyword relationships, and co-author relationships. In particular, the major pillars of research on citation relation analysis are ‘co-cited [
29,
30]’ and ‘bibliographic coupling [
31]’. In recent research [
32], analyses of ‘co-citation’ and ‘high-ranked terms’ using bibliometrics and information visualization were performed.
If there is a third publication citing two publications, these two publications are said to be ‘co-cited,’ and the more publications cite both of them, the stronger the relationship [
33,
34,
35]. Bibliographic coupling, in contrast, is the case where a third publication is cited by both of two publications, that is, where their reference lists overlap. The more references the two publications have in common, the stronger the bibliographic coupling between them [
36,
37,
38].
Figure 13 shows the difference between ‘co-cited’ and ‘bibliographic coupling’.
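The two measures can be contrasted with a short Python sketch. The reference lists and function names below are hypothetical; the strengths are plain counts, which is one common convention rather than the only one.

```python
# Citing paper -> set of papers it cites (hypothetical sample data).
references = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4", "p5"},
    "p6": {"p1", "p2"},
    "p7": {"p1", "p2"},
}

def bibliographic_coupling(first, second, references):
    """Number of references that the two publications have in common."""
    return len(references.get(first, set()) & references.get(second, set()))

def co_citation(first, second, references):
    """Number of publications that cite both of the two publications."""
    return sum(
        1 for cited in references.values() if first in cited and second in cited
    )

coupling_p1_p2 = bibliographic_coupling("p1", "p2", references)
cocited_p1_p2 = co_citation("p1", "p2", references)
```

Bibliographic coupling looks at what two papers cite and is fixed once both are published, whereas co-citation looks at who cites them and can keep growing as new citing papers appear.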
6.3. Information Network Analysis
The most basic task of information network analysis is to find inherent patterns by analyzing the connection relationships in the data that constitute the network. Similar research fields include social network analysis, graph mining, and web mining, in which results from fields such as graph theory, network science, and link analysis are widely used. Data mining and OLAP techniques are used to discover meaningful knowledge and patterns. In addition to the basic OLAP operations, research applies data mining techniques such as classification, clustering, ranking, relationship prediction, and entity similarity search.
The purpose of modeling and analyzing the information network structure is to estimate structural importance from the connections between the nodes constituting the network, and to infer and predict underlying relationships, rather than to analyze the individual attributes of each entity in depth. To this end, node centrality measurement, network structure measurement, and community detection are the primarily used network analysis techniques.
Node centrality (
https://en.wikipedia.org/wiki/Centrality, (accessed on 16 April 2021)) numerically expresses the importance of each node’s position in the network. The basic indicators include degree centrality, which counts the edges that directly connect a node to other nodes; betweenness centrality, which measures how often a node lies on the shortest paths between other nodes; and closeness centrality, which measures how short the shortest paths from a node to all other nodes are. In addition to these indicators, several techniques are used, including eigenvector centrality, which weights a node according to the importance of the nodes connected to it rather than by distance alone, and its applied variant, PageRank (
https://en.wikipedia.org/wiki/PageRank, (accessed on 16 April 2021)).
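Two of these indicators, degree centrality and closeness centrality, can be sketched in plain Python as follows. The sample network is hypothetical, the graph is assumed to be connected and unweighted, and closeness is normalized as (n - 1) divided by the sum of shortest-path distances, which is one common convention. Betweenness centrality and PageRank are omitted to keep the sketch short.

```python
from collections import deque

def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

def degree_centrality(adjacency):
    """Number of edges directly attached to each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}

def closeness_centrality(adjacency):
    """(reachable nodes - 1) / sum of shortest-path distances, per node."""
    scores = {}
    for node in adjacency:
        distances = shortest_path_lengths(adjacency, node)
        total = sum(distances.values())
        scores[node] = (len(distances) - 1) / total if total else 0.0
    return scores

# Hypothetical undirected network: b is a hub, e hangs off d.
adjacency = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b", "e"},
    "e": {"d"},
}
degree = degree_centrality(adjacency)
closeness = closeness_centrality(adjacency)
```

In this sample, node b has both the highest degree and the highest closeness, although the two rankings need not agree in general.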
In analyzing a network, the structure, or shape, of the generated network is quantified and measured. The most commonly used scales for this measurement include radius (
https://en.wikipedia.org/wiki/Distance_(graph_theory), (accessed on 16 April 2021)), and clustering coefficient (
https://en.wikipedia.org/wiki/Clustering_coefficient, (accessed on 16 April 2021)). In geometry, the radius is the straight-line distance from the center of a circle to its boundary; in a network, the radius is the shortest-path distance from a most central node to the node farthest from it, that is, the minimum over all nodes of the greatest shortest-path distance to any other node. The clustering coefficient measures the tendency of nodes to cluster, and corresponds to the probability that two neighbors of a given node are themselves connected. The basic unit for measuring this coefficient is the triplet (three nodes joined by at least two edges), and the clustering coefficient is the number of closed triplets that actually exist divided by the number of triplets, open or closed, in the entire network.
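Both measures can be computed on a small example with the Python sketch below. It takes the radius to be the minimum eccentricity over all nodes and the global clustering coefficient to be the share of connected triplets that are closed; the sample network is hypothetical and is assumed to be connected and undirected.

```python
from collections import deque
from itertools import combinations

def eccentricity(adjacency, source):
    """Greatest shortest-path distance from `source` to any other node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return max(distances.values())

def radius(adjacency):
    """Minimum eccentricity over all nodes (the graph is assumed connected)."""
    return min(eccentricity(adjacency, node) for node in adjacency)

def global_clustering(adjacency):
    """Closed triplets divided by all connected triplets, counted per center node."""
    closed = total = 0
    for neighbors in adjacency.values():
        for first, second in combinations(sorted(neighbors), 2):
            total += 1
            if second in adjacency[first]:
                closed += 1
    return closed / total if total else 0.0

# Hypothetical network: a triangle a-b-c with a pendant node d attached to c.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
network_radius = radius(adjacency)
clustering = global_clustering(adjacency)
```

Here node c reaches every other node in one step, so the radius is 1, and three of the five connected triplets are closed by the triangle.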
To understand complex object types and connection types, heterogeneous information network research defines a network schema [
39,
40] and calls a path of objects and relationships that follows the schema a meta path [
13].
Figure 14 shows an example of a heterogeneous information network meta path for bibliographic data.
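As an illustration of how a meta path constrains traversal, the following Python sketch enumerates the path instances that match a sequence of node types. The sample network, the two meta paths (author-paper-author and author-paper-conference-paper-author), and the function names are hypothetical, and links are treated as undirected for simplicity.

```python
# Node ID -> node type (hypothetical sample data).
node_types = {
    "a1": "author", "a2": "author", "a3": "author",
    "p1": "paper", "p2": "paper",
    "c1": "conference",
}

# Undirected links between objects (author-paper and paper-conference).
links = [("a1", "p1"), ("a2", "p1"), ("a3", "p2"), ("p1", "c1"), ("p2", "c1")]

def build_adjacency(links, node_types):
    adjacency = {node: set() for node in node_types}
    for first, second in links:
        adjacency[first].add(second)
        adjacency[second].add(first)
    return adjacency

def path_instances(start, meta_path, adjacency, node_types):
    """Enumerate paths from `start` whose node types follow `meta_path`."""
    if node_types[start] != meta_path[0]:
        return []
    paths = [[start]]
    for wanted_type in meta_path[1:]:
        paths = [
            path + [neighbor]
            for path in paths
            for neighbor in sorted(adjacency[path[-1]])
            if node_types[neighbor] == wanted_type
        ]
    return paths

adjacency = build_adjacency(links, node_types)
apa = path_instances("a1", ["author", "paper", "author"], adjacency, node_types)
apcpa = path_instances(
    "a1", ["author", "paper", "conference", "paper", "author"], adjacency, node_types
)
```

The two meta paths carry different semantics: the first reaches only co-authors of a1, while the second also reaches a3, who published at the same conference without writing a paper with a1.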
In addition to studies that calculate the similarity scores between authors using meta path [
13], or an author’s importance through different meta paths [
41], other studies have sought to extract the distinctive characteristics and meaning of bibliographic networks by introducing data mining methods such as similarity measurement [
13,
42], clustering [
43], and classification [
44].
In network theory, a group of nodes with high connection density is called a community, and finding small groups whose connection density is high relative to the entire network is called community detection. Various studies have been conducted in order to find communities of researchers or research institutes in bibliographic data. Community detection has attracted further study because analyzing the characteristics of the nodes that constitute a community helps reveal the structure of the network as well as undisclosed information and knowledge.
7. Conclusions and Future Works
Network visualization is the most suitable way to subdivide the various areas that make up the real world into small characteristic worlds, model them conceptually, abstract and store them in a form that a computer can process, and present them in an easy-to-understand shape. This is because objects with individual attributes that make up the real world form meaningful relationships with other, different objects, and these relationships are represented intuitively by the network model. Abstracting and visualizing a complex worldview in terms of information objects and relationships helps humans to grasp its structure and shape intuitively.
Network visualization and analysis have received great attention with the emergence and development of social network services, and various applications using them have been implemented. After it was confirmed that network analysis can be widely used to understand phenomena and predict the future, for example by discovering specific influential entities or groups of entities and by exploring patterns of information flow, research turned to techniques for more effective and accurate analysis. As networks came to contain two or more types of information objects, and the types of relationships between objects increased accordingly, conventional analytic techniques that assume a single node type and a single relationship type became inapplicable, and research diverged significantly.
Online bibliographic information services, which are among the data services most frequently used by researchers, provide information on papers in research fields of interest, past and current research trends, authors, and journals. DBLP, the leading online bibliographic information service, provides data on more than 4.5 million publications and more than two million authors, categorized into tens of thousands of journals, conferences, and workshops. Because these bibliographic information services are all provided in the form of an index list, simple keyword search is efficient, but it is impossible to grasp and analyze the relationships between information objects.
Numerous studies have been conducted in response to the need for research on the visualization and analysis of complex and large-scale bibliographic information services. However, there are few cases in which the results of excellent research have led to the development of practical tools and open services. Thus, this study established a model in which data are downloaded from aminer.org, which shares data files that integrate numerous online publications provided by DBLP, ACM, and Microsoft Academic, and are then stored in Neo4j, which is the most efficient graph database for network visualization.
This paper has designed an integrated storage schema centered on the paper object, which is the center of the network, by defining the information objects that make up bibliographic data as node types and the relationships between objects as edge types. On the defined BIN, operators have been defined that can be implemented directly as a user-friendly interactive interface so that anyone can easily access, search, and analyze the data. The operators are classified and defined according to the type of query on the information network, and examples of their output are shown. Various cases were presented, and the results were checked to examine which combinations and sequences of operators can be used for search and analysis. To develop and use the interactive visualization tool for the BIN, the integrated data schema and storage structure were designed, and operators were defined to match the user interface one-to-one so that operators can be added and reinforced according to the analysis case.
It is expected that many researchers can use the BIN visualization tool designed and implemented in this paper to follow the flow of research, obtain various research-related information, and perform analysis that supports decision making. The system has high potential for further development, such as improved performance and interface and an expanded set of operators. Further studies are expected to use this tool and to expand and develop it.