3.1. Background
Chen et al. [8,9] presented a comprehensive information-theoretic pipeline that describes the visualization process, as shown in Figure 1. This pipeline can be generally applied to graph visualization. It shows that a graph visualization first encodes raw data and then sends the encoded visual description, e.g., a drawing (image), via a visual channel. An observer receives the visual description and tries to decode it for final comprehension. This general visualization pipeline has been used in many existing works. Since characterizing the decoder (human perception and cognition) in Figure 1 may require a tremendous number of user studies, we do not attempt a comprehensive study that covers the full span of the visualization pipeline. Similar to recent work [71], we focus only on the encoder subsystem. We discuss how the raw network data are encoded and described as an image by edge bundling algorithms, and how to optimize the information transfer using Chen et al.’s pipeline and information theory.
In the encoder subsystem, the process of filtering, visual mapping, and rendering can be considered a transformation or encoding process, and a good graph visualization should use a visual description that conveys as much information about the raw data as possible. For simplicity, we define the raw data as a random variable U and consider the encoder to be an encoding process. To describe U, the encoding process may first use simplification methods such as filtering and clustering to preprocess U, and then present the preprocessed data visually as labels, points, lines, or areas. The output of this process is a visual description O. Information may be lost and noise may be introduced in this process. Many existing works conclude that a good graph visualization makes O tell the most about U. To do that, the mutual information $I(U;O)$ between the visual description and the raw data should be maximized. Formally,
$$I(U;O) = H(U) - H(U \mid O), \qquad (1)$$
where $H(U)$ is the total information of U, and the conditional entropy $H(U \mid O)$ is the amount of additional information needed to describe U given O. It can also be regarded as the information loss in the encoding process. To maximize the mutual information $I(U;O)$, we thus need to minimize $H(U \mid O)$, i.e., to minimize the information loss.
Figure 2 illustrates the entropy $H(U)$, the conditional entropy $H(U \mid O)$, and the mutual information $I(U;O)$, and their relationship. Many visualization studies [58,59,60,71,72,73,74,75,76,77] have proposed solutions that maximize the mutual information to improve their visualization results.
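The relationship between entropy, conditional entropy, and mutual information can be checked numerically. The following is a minimal sketch, not part of the original pipeline: the function names and the two toy joint distributions over (U, O) are hypothetical and chosen only to show a lossless encoder and an uninformative one.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Return I(U;O) = H(U) - H(U|O) for a joint table joint[u][o]."""
    p_u = [sum(row) for row in joint]        # marginal of the raw data U
    p_o = [sum(col) for col in zip(*joint)]  # marginal of the description O
    h_u = entropy(p_u)
    # H(U|O) = sum over o of p(o) * H(U | O = o)
    h_u_given_o = 0.0
    for o, po in enumerate(p_o):
        if po > 0:
            h_u_given_o += po * entropy([row[o] / po for row in joint])
    return h_u - h_u_given_o


# Hypothetical encoders for a uniform binary U.
lossless = [[0.5, 0.0], [0.0, 0.5]]     # O determines U exactly
useless = [[0.25, 0.25], [0.25, 0.25]]  # O is independent of U

print(mutual_information(lossless))  # all of H(U) is transferred
print(mutual_information(useless))   # no information is transferred
```

In the lossless case the conditional entropy vanishes and the mutual information equals the entropy of U; in the independent case the information loss equals the entropy of U.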
3.2. Uncertainty in Edge Bundling Visualizations
The pipeline of Figure 1 can also be applied to edge bundling visualizations. Many existing works use edge bundling algorithms to visualize large graphs because these algorithms help reduce visual clutter. We illustrate the advantage of edge bundling visualizations over traditional node–link diagrams for large graphs. A classic example is the visualization of U.S. airline routes, where domain experts want to see the routes between different cities together with their geo-locations, as shown in Figure 3a. The dataset of Figure 3a has 2100 edges and 235 vertices. The drawing in Figure 3a is a traditional node–link diagram that uses segment-based edges to encode the relations of the graph, resulting in severe visual clutter because of edge crossings and edge overlaps. The visual clutter mainly hinders observers from tracking the edge between a pair of vertices. For example, observers cannot easily tell from the drawing whether Miami and Chicago are connected.
Traditional node–link diagrams may fall short in visualizing large graphs since they do not meet some readability criteria. As mentioned in Section 2, there are several readability criteria for evaluating the quality of a graph drawing. Optimizing two or more of these criteria simultaneously is an NP-hard problem [21]. Among the criteria, edge crossing is widely acknowledged as the most important one [2]. In graph drawings, edge crossings cause visual ambiguity, which mainly hinders observers from identifying the relations between pairs of vertices. In Figure 3a, the area between Miami and Chicago is occupied by many edges, so observers can hardly tell whether Miami and Chicago are connected. To reduce the clutter, edge bundling techniques are often employed in visualizations of large graphs. Edge bundling techniques group similar edges into bundles, such that the area used and the number of edge crossings are significantly reduced.
Figure 3b shows a force-directed edge bundling (FDEB) drawing of the U.S. airline dataset. It becomes much easier to identify the edge between Miami and Chicago compared to Figure 3a. Additionally, color-encoded methods can help better identify paths and structural patterns. Such methods customize the transparency and color of edges in a drawing based on the attributes of the edges. Figure 3c shows a directional color-encoded method applied to the node–link diagram of the airline dataset. Comparing Figure 3a,c, the edges with different directions are more salient in Figure 3c, whereas the path between Miami and Chicago can still hardly be tracked. Using the same color-encoded method in the FDEB visualization of the same dataset, we can clearly see the blue path connecting Miami and Chicago, as shown in Figure 3d. The overall result of Figure 3d is even better than that of Figure 3b. Intuitively, we can conclude that color-encoded methods help track the relations between vertices in visualizations.
Although edge bundling techniques can improve the readability of graphs in terms of edge crossings and area used, they are not without disadvantages. Edge bundling methods visually create a bundle effect to reduce edge crossings and area used, but the relationship details are thereby hidden in the bundles. An example is shown in Figure 4. Figure 4c shows an edge bundling drawing. Figure 4b,e,f show three possible networks that can generate the same edge bundling drawing in Figure 4c. This disadvantage is also discussed in several papers. Wang et al. [14] provided a visual analytics tool that shows the ambiguous regions in graph drawings using a heat map. They considered edge lengths, vertex and edge aggregations, and community structures; however, they did not consider inter-bundle ambiguity, i.e., the uncertainty between two vertices from different bundles. Figure 4c illustrates an example of the inter-bundle condition. We can visually perceive that there are approximately two bundles in
Figure 4c. Two examples of intra-bundle ambiguity are that the relation between vertices
and
, and the relation between vertices
and
are unknown. Meanwhile, using the aforementioned method cannot identify the ambiguity between vertices
and
in the drawing, which corresponds to a case of inter-bundle ambiguity. Hence, we argue that this approach is not sufficient to evaluate edge bundling drawings. Nguyen et al. [6] defined the uncertain presentation in an edge bundling visualization as information loss by introducing the notion of information faithfulness. A visualization is information faithful if it can uniquely represent the original graph. They concluded that edge bundling visualization is inherently not information faithful and stated that it becomes increasingly difficult for users to perceive the original network from an edge bundling visualization as more edges are bundled together. They gave a model to illustrate this situation. Given a graph
, the edge bundling visualization partitions the edges
E into
K bundles
. Let
represent a subgraph of
G that consists of only
.
is essentially a bipartite graph whose set of vertices is
and whose set of links is
. For any two subgraphs
and
,
. According to the definition of bipartite graph, we have
,
, where
is the set of source vertices and
is the set of sink vertices in the bipartite graph
. Enumerating the bipartite graphs with all possibilities gives
combinations. The number of graphs that have the same link structure as the final edge bundling drawing of
G is
, which means there are
different original networks. However, we argue that the
different configurations that a bundle may have are only a loose upper bound; a tight bound requires further investigation. Moreover, it is hard to define the exact number of bundles in a drawing. Additionally, their work did not consider inter-bundle uncertainty, as they assumed
, which often does not hold in practice. Therefore, evaluating the quality of edge bundling visualizations is still an open problem. In Section 3.3, we introduce a formal information-theoretic metric to evaluate the drawing results of edge bundling techniques.
3.3. An Information-Theoretic Metric for Edge Bundling Visualizations
We introduce a general model to quantify the uncertainty delivered by graph visualizations. We define the objects of a graph as nodes or vertices, and the relationships among objects as links or edges. Take a graph
where there are
vertices and
edges. In our study, we only consider simple paths in graph structures, and represent
G as an adjacency matrix
A:
Let
denote a graph drawing of
G. The edges and vertices of G are encoded by visual symbols (such as segments, curves, polylines, labels, and points) with colors in
. Based on the pipeline of
Figure 1, a graph drawing method has a visual encoding process that transforms the underlying network relations and structure into visual symbols. Color mapping functions are also used in the process. The encoder process outputs a visual description, i.e.,
. Observers need to observe
in order to guess the value of
of
G. Visually,
presents an adjacency matrix
indicating the relations among the vertices in the drawing:
To understand the relations of the underlying network, observers need to observe , and guess the value of based on the value of .
Figure 4 shows an example.
Figure 4a shows an adjacency matrix
A of a graph
G. In
Figure 4b, a node–link drawing
correctly reveals the relations among vertices with the least ambiguity for this simple graph. In
Figure 4c, an edge bundling drawing
encodes the edges with curves. One observation is that there seems to be an edge between the vertices
and
. One possible reason is that there indeed is an edge between
and
. However, it is also possible that there is no edge between
and
, but an edge between
and
and an edge between
and
, and these two edges are bundled together, creating an illusory edge between
and
. Hence, the ambiguity arises that the relation between
and
is uncertain in
Figure 4c. Note that even if we only consider simple paths in the visualization of a graph
G,
may inadvertently have multiple edges between a pair of vertices. By using certain intuitive criteria, e.g., readable bendiness of curves (
Section 3.4), we can guess that there is no edge among
,
, and
in
. The same intuition also applies to
,
, and
. However, all other relations among the vertices remain uncertain in
. We can use Equation (
3) to construct an adjacency matrix
, as shown in
Figure 4d. If an entry
, we are not sure if there would be an edge between
i and
j in the original graph
G, which may lead to multiple interpretations, such as Figure 4e,f, that generate the same edge bundling drawing in Figure 4c.
To assess an edge bundling drawing
, based on
A and
, we first introduce a coverage rate
to measure the percentage of edges in
A are covered by
. The idea is intuitive: we want to know how many edges in an original network are presented in a corresponding drawing. In our definition, we only require the drawing to show at least an edge between two vertices (i.e.,
) if the two vertices do have an edge in the underlying network (i.e.,
). Equation (
4) expresses this idea as (the
n/a entries of
A and
do not enter the computation of the following equations):
where
m is the number of vertices in
G, and
is a simple Heaviside step function:
The higher the value of
is, the higher the coverage is. If
, we say the corresponding
is saturated. However, only using Equation (
4) cannot assess an edge bundling drawing effectively. For example, although
Figure 4b,c both cover the matrix
A in
Figure 4a, the degrees of uncertainty are significantly different in
Figure 4b,c. Hence, we need to introduce another metric to evaluate the uncertainty of edge bundling drawing results.
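The coverage idea can be sketched in code. This is one plausible reading of the description above, not a transcription of Equation (4): it assumes that the coverage rate is the fraction of original edges (entries equal to 1 in A) for which the drawing shows at least one path, that the diagonal holds the n/a entries, and that the step function returns 1 for any positive count. The names `coverage_rate` and `A_drawn` are hypothetical.

```python
def step(x):
    """Heaviside-style step: 1 if the drawing shows at least one path, else 0."""
    return 1 if x > 0 else 0


def coverage_rate(A, A_drawn):
    """Fraction of edges in A that are shown by at least one path in A_drawn.

    A is the 0/1 adjacency matrix of the original graph and A_drawn holds the
    number of paths perceived between each pair of vertices in the drawing.
    Diagonal (n/a) entries are skipped.
    """
    m = len(A)
    covered = 0
    total = 0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # n/a entries do not enter the computation
            if A[i][j] == 1:
                total += 1
                covered += step(A_drawn[i][j])
    # Assumed convention: a graph without edges is trivially covered.
    return covered / total if total else 1.0


# Hypothetical triangle graph whose drawing misses the edge between 0 and 2.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
A_drawn = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
print(coverage_rate(A, A_drawn))  # two of the three edges are covered
```

A drawing with a coverage rate of 1 corresponds to the saturated case; as the text notes, saturation alone says nothing about how ambiguous the covered edges are.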
We provide an information-theoretic model to quantify the uncertainty of the above situation. Our information-theoretic model first assumes that graph drawing and visualization algorithms do not intend to underdraw a graph, i.e., given a graph
G, graph drawing and visualization algorithms do not intend to show a wrong value of
in
. If
in
G, the encoder of visualization process should always intend to show
in
, such that the observers may guess that
in
G, and vice versa. In addition, if
in
, it becomes more uncertain to determine whether
since
i and
j may overlap some edges that are not between
i and
j in
G.
Figure 4c shows an example. We know that even if only simple paths are allowed between a pair of vertices, a drawing result may still have multiple paths between this pair of vertices, which makes the visual description uncertain. We also define that the edge in Equation (
2) is just a relation while the edge in Equation (
3) can be a segment, curve, or polyline in the drawing.
We denote the relation between two vertices as a random variable
X. A graph drawing or visualization algorithm fully understands the graph and tries to encode the relation with visual symbols and colors. After the encoding process, the algorithm outputs an image, i.e., a drawing
.
provides a result that indicates the original relation of the two vertices. The visual result can be represented by an adjacency matrix
based on Equation (
3). Given two vertices
i and
j, we can quantify the amount of uncertainty of the relation of vertex
i and
j given
using information theory. Generally, if
, there are
N paths between the vertices
i and
j in the drawing
. Since we only consider simple paths, one of the
N paths may be encoded as an edge to present
in
G. Another possibility is
, which means there is no edge between the vertices
i and
j in the original graph, but
i and
j overlap
N other edges in the drawing
. Therefore,
can have
different ways to present
(i.e.,
N possible paths or no path).
Let
Y denote the visual description of the relation between a pair of vertices. The amount of uncertainty of knowing the relation between the vertices
i and
j given
, i.e., the conditional entropy
, can be quantified as:
where
is the value of the entry in column i and row j of
. Equation (6) indicates the number of bits needed to visually describe the relation between the corresponding vertices i and j. The more bits the visual description uses, the more uncertain the description is.
We use the graph drawings of two simple graphs
and
in
Figure 5 to illustrate Equation (
6). As shown in
Figure 5a, visually, there are two paths,
and
, between the vertices
and
, and thus
. As we consider simple paths, there are three possible cases of connection between
and
in the original graph
:
,
, or no path. The probability of each case is
. Therefore, given the visual description
, we can compute the amount of uncertainty to describe the real relation
as
. Similarly, there are three paths between the vertices
and
of
in its drawing
, as shown in
Figure 5b. This leads to a higher amount of uncertainty,
, for us to tell the real relation
of
.
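The two examples can be worked through by hand. As a sketch, assume (as the three-case argument suggests) that when $N$ paths are drawn between a pair of vertices, the $N + 1$ candidate interpretations are equally likely. The notation $H(X \mid Y = N)$ is shorthand introduced here for the conditional entropy of the relation given $N$ drawn paths:

```latex
\begin{align*}
H(X \mid Y = N) &= -\sum_{k=1}^{N+1} \frac{1}{N+1}\,\log_2 \frac{1}{N+1} = \log_2 (N+1),\\
N = 2:&\quad \log_2 3 \approx 1.585 \text{ bits},\\
N = 3:&\quad \log_2 4 = 2 \text{ bits}.
\end{align*}
```

Under this reading, a pair with exactly one drawn path carries 1 bit of uncertainty, and a pair with no drawn path carries none.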
We denote the total uncertainty of
as
W.
can be formally written as:
where
represents the amount of uncertainty of knowing
based on
. It can also be interpreted as how much information about
is still uncertain after observing
. As discussed in
Section 3.1, the best description
Y should tell the most of
X (i.e., to maximize the mutual information
, we need to minimize the conditional entropy
). Hence, to holistically evaluate an edge bundling drawing, we argue that a good edge bundling visualization should minimize
, and, at the same time, keep the coverage rate
as high as possible. For an undirected graph
G,
A and
are symmetric, and thus only the upper right half of each matrix is used.
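The total uncertainty can be sketched as a sum of per-pair terms. The code below assumes an undirected graph, so only the entries above the diagonal are used, and it assumes that a pair with N drawn paths contributes log2(N + 1) bits, following the N + 1 equally likely cases discussed earlier. The function name and both matrices are hypothetical.

```python
import math


def total_uncertainty(A_drawn):
    """Sum log2(N + 1) over the upper triangle of the drawing's path-count matrix."""
    m = len(A_drawn)
    return sum(
        math.log2(A_drawn[i][j] + 1)
        for i in range(m)
        for j in range(i + 1, m)  # symmetric matrix: upper triangle only
    )


# Hypothetical node-link drawing: three edges, each shown by exactly one path.
node_link = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical bundled drawing: every pair appears to be joined by three paths.
bundled = [
    [0, 3, 3],
    [3, 0, 3],
    [3, 3, 0],
]
print(total_uncertainty(node_link))  # 1 bit per unambiguous edge
print(total_uncertainty(bundled))    # 2 bits per ambiguous pair
```

Three unambiguous edges yield 3 bits in this sketch, which is consistent with the value the text reports for the node–link drawing of Figure 4b.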
We use Equation (
7) to quantify the values of
of
Figure 4b,c. On one hand, as
Figure 4b shows, the relations among objects are clear and correct, and the amount of uncertainty of the corresponding
is 3.
Figure 4b can use the fewest bits to describe a network since each edge can be distinguished in this drawing. On the other hand, the amount of uncertainty of
of
Figure 4c is 8. This comparison matches the results of the drawings.
Generally, Equation (
6) can also be used to quantify the amount of uncertainty of relation between a pair of vertices in an edge bundling drawing
.
Figure 3a shows a more complex example. In the figure, we want to quantify the amount of uncertainty of relation between
Miami and
Chicago. Equation (
6) counts the number of edges (paths) between the cities in the drawing. Recall that the paths can be segments, curves, or polylines in our definition. We can find multiple such paths between
Miami and
Chicago. In
Figure 3b, the paths between
Miami and
Chicago are significantly reduced because of the bundle effect. The rationale behind Equation (
6) is that, if more paths can be detected between the two vertices, the area used between them becomes larger. This reflects a larger number of edge crossings and overlaps, which means the visual description of the relation between the two vertices is more uncertain. The greater the value
, the more uncertain the description is. In general, regardless of whether
or 1, if the number of paths between the two vertices is larger than one, i.e.,
, the visual description of the relation of the two vertices is uncertain. In
Section 3.4, we introduce a method to count the number of paths between a pair of vertices based on the drawing.
3.4. Algorithms and Implementation
We introduce an algorithm to approximate the number of edges or paths between a pair of vertices in a drawing. Our algorithm mimics an observer’s perception when tracking the edges or paths between two vertices. As it would be tedious for an observer to manually count all paths in a drawing of a large graph, we propose a computational method for this task. We also study the parameters of our algorithm heuristically in Section 4. As described in Section 3.3, a path between two vertices in a drawing can consist of segments, spline curves, polylines, or a hybrid of the three. Generally, a qualified path between two vertices should meet two criteria: (1) the bendiness of the path should be reasonable; and (2) the color along the path should be similar. The bendiness criterion ensures that a qualified path is reasonably smooth and does not contain loops or abrupt turns, while the color criterion ensures that the color along the path is similar. A path that complies with both criteria can be identified and tracked by observers. Although many studies in remote sensing and image processing have addressed path and road detection, approximating the number of paths between two vertices is a unique and non-trivial task in this study.
To approximate the number of qualified paths between two vertices, we need to find the region in the image that connects them. We first locate the pixel positions of the pair of vertices in an edge bundling drawing (image). Then, starting from one of the vertices, we apply a region growing method to find a region that connects the two vertices. We design two parameters in our region growing method to comply with the aforementioned criteria. Specifically, given a drawing result, which is an image I with a resolution of M × N, we first locate the pixel positions of the two vertices in the image, and then use Algorithm 1 to find the number of paths between the two vertices. Assume the start pixel is Ps and the target pixel is Pt. We specify a color threshold C and an angle threshold L. The target region R initially contains only Ps. We start from the current pixel Pc and search through all the neighboring pixels Pn in a window around Pc, where W1 is the size of the window. We need to find all the neighboring pixels that meet three conditions: (1) the angle between the vector and the vector is not greater than L; (2) the angle between the vector and the vector is not greater than L; and (3) the Euclidean distance between the color of Pn and the mean color Cm of the region is not greater than C. Conditions (1) and (2) ensure that the region grows from the start vertex towards the target vertex within a specified angle limit. L determines the sharpness of the paths in the region. Condition (3) enforces the color criterion. The qualified pixels that have not been visited are added to a candidate set. We then select from the candidate set the pixel whose color is closest to Cm (see Algorithm 2), set it as the new current pixel Pc, and add it to the region R. The process continues until Pc = Pt or the candidate set is empty. The region growing algorithm is given in Algorithm 2.
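The three growth conditions can be sketched as a single predicate. The source text does not preserve which vectors are compared in conditions (1) and (2), so the choice below is an assumption made only for illustration: condition (1) compares the direction travelled so far (start pixel to current pixel) with the candidate step, and condition (2) compares the candidate step with the direction to the target. All function and parameter names are hypothetical.

```python
import math


def angle_deg(v, w):
    """Angle in degrees between two 2-D vectors; 0 if either vector is zero."""
    nv, nw = math.hypot(*v), math.hypot(*w)
    if nv == 0 or nw == 0:
        return 0.0
    cos = (v[0] * w[0] + v[1] * w[1]) / (nv * nw)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def qualifies(ps, pc, pn, pt, color_n, mean_color, C, L):
    """Check a neighboring pixel pn against the two angle conditions and the color condition."""
    step = (pn[0] - pc[0], pn[1] - pc[1])
    from_start = (pc[0] - ps[0], pc[1] - ps[1])  # assumed vector for condition (1)
    to_target = (pt[0] - pc[0], pt[1] - pc[1])   # assumed vector for condition (2)
    return (
        angle_deg(from_start, step) <= L
        and angle_deg(step, to_target) <= L
        and math.dist(color_n, mean_color) <= C  # condition (3): Euclidean color distance
    )


# Hypothetical pixels on a horizontal path from (0, 0) to (5, 0).
ps, pc, pt = (0, 0), (1, 0), (5, 0)
blue = (0, 0, 255)
print(qualifies(ps, pc, (2, 0), pt, blue, blue, C=30, L=60))         # forward step, same color
print(qualifies(ps, pc, (0, 0), pt, blue, blue, C=30, L=60))         # backward step
print(qualifies(ps, pc, (2, 0), pt, (255, 0, 0), blue, C=30, L=60))  # very different color
```

With these assumed vectors, a small L rejects sharp turns and steps that move away from the target, while a small C rejects pixels that belong to a differently colored bundle.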
Figure 6a shows a magnified and highlighted area of an FFTEB visualization. We want to find the region connecting two vertices
a and
b.
Figure 6a.1 is the output of Algorithm 2. It shows that the region between
a and
b can be perfectly extracted. Another more complex example is shown in
Figure 6b.1,c.1 presenting the impact of the input parameters
C and
L. In the highlighted area, we want to find the region connecting the vertices
c and
d. Using
and
, we get the result of
Figure 6b.1. Using
and
, we have the resulting region of
Figure 6c.1. The difference is obvious. In
Figure 6b.1, we do not have the big hole in the middle of the region because the input color threshold
C and angle threshold
L are relatively small.
C controls the acceptable color difference between candidate pixels and the region, while
L determines the sharpness of a portion of the region. Both are important parameters in our algorithm. The window size
or 2 (1 or 2 pixel(s)) can generate very good results. However,
C and
L may strongly affect the grown region, as in
Figure 6b.1,c.1. We further discuss
C and
L in a heuristic study in
Section 4.
Algorithm 1 FindAllPaths.
// Initialization
Ps // The start pixel
Pt // The target pixel
W1 // The size of the sliding window for Algorithm 2
W2 // The size of the sliding window for Algorithm 3
C // The color threshold
L // The angle threshold
I // The M × N image
R // The growing region
K // The clusters
P ← 0 // The number of paths
N // The number of nodes in the graph
VISITED[N] // The flag array that indicates whether vertices are visited
Find the source pixel Ps and the target pixel Pt.
// Given I, W1, C, and L, use region growing to find the region R that connects Ps and Pt
R ← RegionGrowing(I, Ps, Pt, W1, C, L)
// Given the region R, use mean shift to calculate the clusters K
K ← MeanShift(R, W2)
Find the number of vertices N based on the separate components of K, and set every entry of VISITED to False.
Based on the clusters K, find the source region Rs and the target region Rt.
P ← Depth-firstSearch(P, K, Rs, Rt, VISITED)
Algorithm 2 RegionGrowing(I, Ps, Pt, W1, C, L).
R ← ∅ // The growing region
Cm ← the color of Ps // The mean color of the growing region
Pc ← Ps // Assign the source pixel to be the current pixel
S ← ∅ // Initialize the candidate set
Push Pc into R.
while Pc ≠ Pt do
    for each unvisited neighboring pixel Pn of Pc using the window size W1 do
        if the angles θ1 and θ2 of conditions (1) and (2) are both ≤ L and ‖color of Pn − Cm‖ ≤ C then
            Push Pn into S.
        end if
    end for
    if S = ∅ then break // No candidate is left, so the region cannot grow further
    end if
    // Compute the next Pc
    Compute the pixel in S whose color is closest to Cm, and assign the pixel to Pc.
    Compute the mean color of S, and assign the mean color to Cm.
    Pop Pc from S.
    Push Pc into R.
end while
return R.
After the region R is obtained, we consider how to find the number of paths between two pixels. This is essentially a graph problem: finding the number of paths between a source vertex and a target vertex in a graph in which every pixel in R is modeled as a vertex and the connectivity of pixels is modeled as edges. However, counting the number of source-to-target paths in a graph is #P-complete [78]. To approximate the number of source-to-target paths, we could use depth-first search to enumerate all the unique paths from a source to a target, or dynamic programming to calculate the total number of unique paths from a source to a target. However, in our experiments, we found that simply using these methods is problematic for edge bundling visualizations. For example, in Figure 6a.1, intuitively, there should be one path connecting a and b. Appendix A shows that a simple depth-first search generates an incorrect (far too large) number of paths for Figure 6a.1. Additionally, modeling every pixel as a vertex makes the computation time-consuming, since a resulting region can consist of a considerable number of pixels. Appendix A also shows that dynamic programming is not appropriate for this problem in edge bundling visualizations. Another problem is that neither simple depth-first search nor dynamic programming can handle small holes in the generated region, as illustrated in Figure 6c.2. For instance, in Figure 6c.2, intuitively, there is only one path from c to d, whereas the small holes can generate unnecessary loop paths, which should be addressed.
We propose to use a simple mean shift to cluster the resulting region, model the clusters as vertices, and conduct a modified depth-first search to approximate the number of paths between two vertices in a region R. After the region R is found, we check whether the two vertices are in the region. If so, we use mean shift to cluster the region R; otherwise, we conclude that there is no path between the two vertices. The basic idea is as follows. First, we use mean shift to cluster R into distinct regions. Second, we construct a transform graph T that captures the connectivity of the distinct regions. We define the source region Rs as the region containing the pixel Ps, and the target region Rt as the region containing the pixel Pt. Finally, we use a modified depth-first search to calculate the number of paths between Rs and Rt in T.
The mean shift algorithm takes R as input. For each pixel, we define a window of size W2 around it and compute the mean position of the pixels in the window whose color differs from the background color. Then, we shift the center of the window to the mean and assign the new position to the current pixel. We repeat this process until all pixels converge or the number of iterations exceeds a fixed limit (300 in Algorithm 3). The simple mean shift algorithm is given in Algorithm 3. Then, we consider the non-connected regions as distinct clusters. The distinct clusters form a graph T, where each cluster is a vertex in T. The connectivity of vertices in T is determined by the connectivity of the pixels. For example, if two pixels from two different clusters are neighbors, the two clusters share an edge. The output of Algorithm 3 is shown in Figure 6a.3–c.3. Figure 6a.4–c.4 show the corresponding transform graphs of Figure 6a.3–c.3, respectively. In the three graphs, different colors indicate distinct cluster labels. The results in Figure 6a.4–c.4 avoid the problem described in Appendix A. In Figure 6b.3,c.3, the small-hole problem is also addressed: a hole that is too small is not considered a branch or another path. However, if a hole is big enough, it is considered a branch or path, as shown in Figure 6c.3. The window size W2 determines the acceptable hole threshold. We find that it should be set to only 1 or 2 pixel(s), and Figure 6b.3,c.3 demonstrate the results.
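The construction of the transform graph from cluster labels can be sketched as follows. The sketch assumes that the clustering step yields a label image in which every pixel carries the number of its cluster and background pixels carry 0, and that two pixels are neighbors under 8-connectivity. Both the representation and the function name are hypothetical.

```python
def build_transform_graph(labels, background=0):
    """Map each cluster label to the set of cluster labels it touches.

    labels is a 2-D list of integers; two clusters share an edge if any pixel
    of one is an 8-connected neighbor of a pixel of the other.
    """
    rows, cols = len(labels), len(labels[0])
    graph = {}
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            if a == background:
                continue
            graph.setdefault(a, set())
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        b = labels[rr][cc]
                        if b != background and b != a:
                            graph[a].add(b)  # neighboring pixels from different clusters
    return graph


# Hypothetical label image: cluster 1 touches 2, and 2 touches 3.
labels = [
    [1, 1, 2, 2],
    [0, 0, 0, 3],
]
print(build_transform_graph(labels))
```

Because every pixel is examined, each adjacency is recorded in both directions, so the result can be used directly as an undirected graph.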
Algorithm 3 MeanShift(R, W2).
K // The cluster result
Pc // The position of the current pixel
S // The temporary set
ITR ← 0 // The iteration number
STOP ← False // The flag that indicates that no pixel moved in the last iteration
while ITR < 300 and STOP = False do
    for each pixel Pc of R do
        S ← ∅
        for each neighboring pixel Pn of Pc using the window size W2 do
            if the color of Pn does not equal the background color then
                Push Pn into S.
            end if
        end for
        Compute the new position for Pc based on S.
    end for
    // Check whether any pixel has a new position
    if none of the pixels in R moved then STOP ← True
    end if
    ITR ← ITR + 1
end while
Give every separate component a distinct number, and assign the result to K.
return K.
Finally, a modified depth-first search algorithm is used to count all possible paths between Rs and Rt in T. The algorithm first sets the flag of every vertex to unvisited. It starts from the source region Rs and explores the adjacent regions in a depth-first manner until it reaches the target region Rt. Every time the algorithm reaches a new region, it sets the flag of that region to visited, so that no loop path is formed in T. If Rt is reached, the counter of all possible paths is incremented by one. The modification from the traditional depth-first search is that, after the search from a region is finished, we reset the flag of that region to unvisited, making the region available to other paths. The algorithm ends once all other regions have been explored. The modified depth-first search is given in Algorithm 4.
Algorithm 4 Depth-firstSearch(P, K, Rc, Rt, VISITED).
// P is the number of paths between Rs and Rt found so far; Rc is the current region
VISITED[Rc] ← True
if Rc = Rt then
    P ← P + 1
else
    for each adjacent region Rn of Rc do
        if VISITED[Rn] = False then P ← Depth-firstSearch(P, K, Rn, Rt, VISITED)
        end if
    end for
end if
VISITED[Rc] ← False
return P.
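The backtracking idea behind the modified depth-first search can be sketched in a few lines. The sketch assumes that the transform graph is given as a dictionary that maps each region to the set of its adjacent regions; this representation and the function name are hypothetical. A region is marked while it lies on the path currently being explored and unmarked afterwards, so that other paths can pass through it.

```python
def count_paths(graph, source, target):
    """Count the loop-free paths from source to target in an undirected graph."""
    visited = set()

    def dfs(region):
        if region == target:
            return 1  # one complete path has been found
        visited.add(region)  # forbid loops along the current path
        total = 0
        for nxt in graph.get(region, ()):
            if nxt not in visited:
                total += dfs(nxt)
        visited.discard(region)  # backtrack: free the region for other paths
        return total

    return dfs(source)


# Hypothetical transform graphs: a chain of clusters and a region with one large hole.
chain = {1: {2}, 2: {1, 3}, 3: {2}}
ring = {"s": {"a", "b"}, "a": {"s", "t"}, "b": {"s", "t"}, "t": {"a", "b"}}
print(count_paths(chain, 1, 3))     # a single path through the chain
print(count_paths(ring, "s", "t"))  # one path on each side of the hole
```

Because the search runs on clusters rather than on individual pixels, the graph stays small, which keeps the enumeration tractable even though the number of paths can grow quickly in general graphs.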