The COVID-19 Infection Diffusion in the US and Japan: A Graph-Theoretical Approach

Simple Summary
In this study, we conducted a quantitative assessment and compared the COVID-19 pandemic spread in two countries based on selected methods from the graph theory domain. The results indicate that while the applied experimental procedures are useful, we could draw only limited conclusions about the dynamic nature of infection diffusion. We discuss the possible reasons for this outcome and use them to formulate research hypotheses that could serve the scientific community in future research efforts.

Abstract
Coronavirus disease 2019 (COVID-19) was first discovered in China; within several months, it spread worldwide and became a pandemic. Although the virus has spread throughout the globe, its effects have differed. The pandemic diffusion network dynamics (PDND) approach was proposed to better understand the spreading behavior of COVID-19 in the US and Japan. We used daily confirmed cases of COVID-19 from 5 January 2020 to 31 July 2021 for all states of the US and all prefectures of Japan. By applying the PDND approach to COVID-19 time series data, we developed diffusion graphs for the US and Japan. In these graphs, nodes represent states and prefectures (regions), and edges represent connections between regions based on the synchrony of COVID-19 time series data. To compare the pandemic spreading dynamics in the US and Japan, we used graph theory metrics, which targeted the characterization of COVID-19 behavior that could not be explained through linear methods. These metrics included path length, global and local efficiency, clustering coefficient, assortativity, modularity, network density, and degree centrality. Application of the proposed approach resulted in the discovery of mostly minor differences between the analyzed countries.
In light of these findings, we focused on analyzing the reasons and defining research hypotheses that, upon addressing, could shed more light on the complex phenomena of COVID-19 virus spread and the proposed PDND methodology.


Introduction
China officially reported the first case of a new coronavirus disease, COVID-19, on 8 December 2019 [1]. As China failed to control the outbreak, the virus responsible for this
Step 1. Defining the nodes of the graph. For the US graph, 54 nodes represented US states plus New York City, the District of Columbia, Puerto Rico, and Guam; for the Japan graph, 47 nodes represented prefectures.
Step 2. Collecting the time series COVID-19 datasets. To build two COVID-19 diffusion graphs for the US and Japan, we used the number of daily confirmed cases of COVID-19 from 5 January 2020 to 31 July 2021. For the US, we used daily records for all states plus other territories in the US from the Centers for Disease Control and Prevention [6]. For Japan, we used data for all prefectures from the Japan Ministry of Health, Labour, and Welfare [11].
Step 3. Defining the edges of the network. The edges represent connections between nodes. In the COVID-19 dataset, edges indicated a connection between two locations in terms of synchrony of their COVID-19 dynamics. Network edges can be classified as binary or weighted, and can show the directionality among regions (directed or undirected) [24]. We assumed that all geographical entities were connected with each other.
Step 4. Selecting a method to discover synchronized locations. A statistical method, correlation analysis, was used to identify whether there was a strong relationship between the COVID-19 time series datasets of different geographical entities. Following numerous other studies [21], we adopted a lag of 0 between the analyzed time series.
Step 5. Forming the connectivity matrix. Computed connectivity between nodes can be used to create a connectivity matrix, which is also known as an adjacency matrix. In this matrix, nodes are represented by rows (i) and columns (j), and edges are represented by matrix entries (aij), as presented in Figures 1 and 2.
[Figure caption: The US COVID-19 adjacency matrix.] In the adjacency matrix figures, the green color represents a strong correlation between the time series of the regions, the yellow color represents a moderate correlation, and the red color represents a weak correlation. The correlation of each region with itself is considered zero.
Step 6. Forming a binary matrix. The adjacency matrix can be used to create an unweighted, undirected matrix, called the binary matrix. To develop this matrix, a threshold value must be selected. The value of the edge between two nodes was set to 1 if the correlation between the nodes in the connectivity matrix exceeded the threshold; otherwise, the value was set to 0. In our study, a threshold value of 0.7 was selected to simplify the network and remove weak and insignificant edges from the matrix. The resulting binary matrices are shown in Figures 3 and 4.
Step 7. Constructing the final diffusion graphs.
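Steps 4 to 7 can be illustrated with a minimal sketch. It assumes Pearson correlation (the text says only "correlation analysis"), and the function names `build_binary_matrix` and `edge_list`, as well as the three synthetic regional series, are hypothetical stand-ins rather than the authors' code or the CDC and MHLW records.

```python
import numpy as np

def build_binary_matrix(cases, threshold=0.7):
    """cases: array of shape (n_regions, n_days) with daily confirmed cases."""
    # Step 4: lag-0 correlation between every pair of regional series
    # (Pearson correlation is an assumption of this sketch).
    corr = np.corrcoef(cases)
    # Step 5: connectivity (adjacency) matrix; self-correlation is set to zero.
    np.fill_diagonal(corr, 0.0)
    # Step 6: keep an edge only where the correlation exceeds the threshold.
    binary = (corr > threshold).astype(int)
    return corr, binary

def edge_list(binary, names):
    # Step 7: undirected edges of the diffusion graph (each pair listed once).
    rows, cols = np.nonzero(np.triu(binary, k=1))
    return [(names[i], names[j]) for i, j in zip(rows, cols)]

# Synthetic data: regions A and B share an early wave, region C peaks later.
days = np.arange(200)
early = np.exp(-((days - 60) / 15.0) ** 2)
late = np.exp(-((days - 150) / 15.0) ** 2)
cases = np.vstack([100 * early, 80 * early + 5 * late, 90 * late])

corr, binary = build_binary_matrix(cases)
edges = edge_list(binary, ["A", "B", "C"])
print(edges)
```

In this toy example only the pair of regions with synchronized waves is connected; the same two functions would apply unchanged to a 54-row or 47-row case matrix.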
Once the PDND graphs are created, it is possible to compute and analyze their topological properties. Global (graph) and local (nodal) graph theory metrics can be used to achieve this. In our study, the selected global measures were the clustering coefficient (CC), characteristic path length (PL), local efficiency (Elocal), network density, global efficiency (Eglobal), modularity (Q), and assortativity (r); the nodal measures included degree centrality (K) and nodal centrality [24]. Brief descriptions of the adopted network measures are provided in Table 1, along with detailed definitions.

Table 1. Metrics and descriptions.
Path length (PL). Average of the shortest path lengths over all pairs of nodes:
$$PL = \frac{1}{n(n-1)} \sum_{i \neq j} d(v_i, v_j)$$
In this formula, n is the number of nodes and d(v_i, v_j) indicates the length of the shortest path between two nodes. To calculate the average shortest path in a graph, the sum of the shortest paths between all nodes is divided by the number of all possible paths.
Global efficiency (Eglobal). Eglobal, the inverse of PL, is another metric used to quantify COVID-19 spread in a network.
Clustering coefficient (CC). The clustering coefficient is used to better understand the function-structure of the network and is associated with the number of triangles in a network [26]. The clustering coefficient of node i can be calculated with the following equation:
$$C_i = \frac{\text{number of triangles connected to node } i}{\text{number of triples centered on node } i}$$
In this formula, a set of two edges connected to node i is called a triple centered on node i. For the whole graph, the CC is the average of the local values C_i.
Network density. Network density is another metric used to evaluate the effectiveness of a network. This metric is the actual number of connections in the network divided by its maximum capacity [24].
Assortativity (r). Assortativity is used to determine whether high-degree nodes are primarily connected to low-degree nodes or whether nodes with the same magnitude of degree tend to connect to each other [27].
Modularity (Q). The modularity metric measures the structure of a network on the basis of the statistical arrangement of nodes [28]. The modularity can have values from −1 to 1, and a value close to zero indicates that the community (modularity) division is not better than that expected at random, whereas a value close to 1 or −1 indicates a strong community structure. The modularity of a graph can be calculated with the following equation:
$$Q = \sum_{i=1}^{k} \left( e_{ii} - a_i^2 \right)$$
where e_ii is the fraction of edges that have both ends in community i, k is the number of communities, and a_i is the fraction of edge ends attached to nodes in community i [28].
Local efficiency (Elocal). Efficiency in graph theory describes networks from the perspective of information flow [29]. The local efficiency of a graph is measured as follows:
$$E_{local} = \frac{1}{n} \sum_{i=1}^{n} E_{glob}(G_i)$$
where n is the number of nodes and E_glob(G_i) is the Eglobal of the subgraph G_i formed only by node i's immediate neighbors, but not node i itself [29].
Degree centrality. Nodal centrality quantifies the importance of a node in a network and can be measured by various metrics, such as nodal efficiency, degree centrality, closeness centrality, and betweenness centrality [29]. Among these metrics, degree centrality is one of the most commonly used and is defined by the number of edges of a node. The greater the number of edges, the more central the node.
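The measures in Table 1 can be computed with standard graph libraries. The sketch below uses networkx, which is an assumption: the text states only that Python was used. The helper `graph_metrics` and the toy graph are hypothetical, and the greedy modularity algorithm is one possible choice of community detection, not necessarily the one behind the reported Q values. Note also that `nx.global_efficiency` averages the inverse shortest-path lengths over node pairs, which is not identical to the inverse of PL described in Table 1.

```python
import networkx as nx
from networkx.algorithms import community

def graph_metrics(G):
    """Global and nodal measures for an undirected, unweighted graph G."""
    # Community partition used for modularity (algorithm choice is an assumption).
    parts = community.greedy_modularity_communities(G)
    return {
        "path_length": nx.average_shortest_path_length(G),  # needs a connected graph
        "global_efficiency": nx.global_efficiency(G),
        "clustering": nx.average_clustering(G),
        "density": nx.density(G),
        "assortativity": nx.degree_assortativity_coefficient(G),
        "modularity": community.modularity(G, parts),
        "local_efficiency": nx.local_efficiency(G),
        "degree": dict(G.degree()),  # number of edges per node
    }

# Toy graph: two triangles joined by a single bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
metrics = graph_metrics(G)
for name, value in metrics.items():
    print(name, value)
```

For a real PDND graph, `G` would be built from the binary matrix; a disconnected graph would require computing the path length per connected component.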
Apart from the discussed network measures, we also conducted a 0-1 test for chaos to verify the chaotic properties of both the US and Japan PDND networks. The test was first proposed by Gottwald et al. [30,31] and later improved by Gottwald et al. [32]. Test results close to 0 indicate a lack of chaos, while results close to 1 indicate the presence of chaotic system properties [32,33]. The whole analysis was carried out in the Python programming language on a single computing machine.
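As an illustration of how the 0-1 test behaves, the following sketch implements the commonly used correlation variant of the test for a scalar time series. It is not the authors' implementation: the text does not state which quantity derived from the PDND networks was fed to the test, and the names `zero_one_test` and `logistic_series`, the number of sampled frequencies `n_c`, and the random seed are assumptions of this example.

```python
import numpy as np

def zero_one_test(series, n_c=100, seed=0):
    """Median correlation statistic K of the 0-1 test for a scalar series."""
    phi = np.asarray(series, dtype=float)
    N = len(phi)
    n_cut = N // 10                       # displacement lags 1..N/10
    j = np.arange(1, N + 1)
    n = np.arange(1, n_cut + 1)
    rng = np.random.default_rng(seed)
    k_values = []
    for c in rng.uniform(np.pi / 5, 4 * np.pi / 5, n_c):
        # Translation variables driven by the series.
        p = np.cumsum(phi * np.cos(j * c))
        q = np.cumsum(phi * np.sin(j * c))
        # Mean square displacement for each lag.
        M = np.array([np.mean((p[k:N - n_cut + k] - p[:N - n_cut]) ** 2
                              + (q[k:N - n_cut + k] - q[:N - n_cut]) ** 2)
                      for k in n])
        # Remove the oscillatory term to get the modified displacement.
        D = M - phi.mean() ** 2 * (1 - np.cos(n * c)) / (1 - np.cos(c))
        # Growth rate measured as the correlation of D with the lag.
        k_values.append(np.corrcoef(n, D)[0, 1])
    return float(np.median(k_values))

def logistic_series(r, length, burn_in=200):
    """Logistic map x -> r x (1 - x), used here only as a reference signal."""
    x, out = 0.3, []
    for i in range(length + burn_in):
        x = r * x * (1 - x)
        if i >= burn_in:
            out.append(x)
    return np.array(out)

k_chaotic = zero_one_test(logistic_series(4.0, 1000))   # chaotic regime
k_regular = zero_one_test(logistic_series(3.5, 1000))   # periodic regime
print(k_chaotic, k_regular)
```

The chaotic reference signal yields a statistic near 1 and the periodic one a statistic near 0, which gives a scale against which intermediate values can be read.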

Results
Most of the metric values computed in the graph theory analysis of both COVID-19 PDND networks are shown in Table 2.

Discussion
The underlying processes of the pandemic are complex, and understanding them requires analyzing the available COVID-19 data on a global scale. The COVID-19 pandemic diffusion can be considered a nonlinear process that originated in China and spread worldwide [34,35]. Therefore, to identify the main patterns of COVID-19 behaviors, the nonlinearity of COVID-19 data must be taken into account. To discover the insights and implications hidden in COVID-19 data, the applied methods should be adaptive to the underlying nature of the data [35]. To our knowledge, this is the first study to apply synchronized connectivity to analyze the behavior of the COVID-19 pandemic.
In this study, we developed COVID-19 diffusion networks (graphs) by adopting the PDND approach and analyzed the graphs' properties, including path length, global and local efficiency, clustering coefficient, assortativity, modularity, network density, hubs, and degree centrality.
The path length metric shows the efficiency of information transport in a developed network. A low PL indicates greater integration among geographical regions and the ease of information flow [25]. In the COVID-19 network, the path length represents the diffusion integration of states or prefectures and ease of virus spreading. The average path length for the US COVID-19 network was 1.46, and that for Japan was 1.37. Based on these values, the COVID-19 pandemic spread slightly more easily between prefectures in Japan than between states in the US. A similar observation can be drawn from the global efficiency values of 0.68 for the US and 0.73 for Japan.
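The reported global efficiency values can be checked against the definition of Eglobal as the inverse of PL given in Table 1. The short calculation below uses only the four numbers quoted in this paragraph; the variable names are illustrative.

```python
# Values quoted in the text for the two PDND networks.
reported_pl = {"US": 1.46, "Japan": 1.37}
reported_eglobal = {"US": 0.68, "Japan": 0.73}

# Eglobal taken as the inverse of PL, rounded to two decimals.
inverse_pl = {country: round(1 / pl, 2) for country, pl in reported_pl.items()}
print(inverse_pl)
```

Both rounded inverses coincide with the reported Eglobal values, so the two metrics carry the same information about the networks in this analysis.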
The clustering coefficient (CC) is another metric for measuring the ease of information transport in a network, especially on a local scale (states or prefectures) [26]. In general, a higher CC value indicates faster flow of information in the network. As the CC values were 0.72 and 0.74 for the US and Japan, respectively, we conclude that the differences were marginal and do not allow us to draw strong conclusions regarding the speed of virus spread on a local scale.
Similar observations can be drawn based on the computed network density parameter. For the US COVID-19 network, the network density was 0.249, and for Japan 0.253.
Assortativity can be used to determine whether high-degree nodes are primarily connected to low-degree nodes [27]. To calculate assortativity, we used the method described in [30]. The assortativity for the US COVID-19 network was 0.0055, and that for Japan was 0.019. A higher assortativity indicates the preference of a node in a network to connect to others that are similar. Based on the obtained assortativity values, and thorough discussions and demonstrations of networks characterized by various assortativity values available in [36], we conclude that the nature of virus flow in Japan might be slightly more focused on the high-degree hub nodes (prefectures) when compared to the US. However, we note that 0.019 cannot be considered a high assortativity value for the discussed networks.
The modularity metric represents the structure of a network based on the arrangement of the nodes [28]. This metric can have values from −1 to 1, where a value close to 1 or −1 indicates a strong community structure and a value close to 0 indicates a weak and random community structure. The modularity for the US COVID-19 network was 0.32, and that for Japan was 0.0077. We observe that the analyzed US PDND network was more structured and more module-based than that of Japan.
Elocal measures the ability of a network to spread COVID-19 at the local level [29]. A higher Elocal value indicates superior integration and faster COVID-19 spread at the local scale. The Elocal for the US COVID-19 network was 0.83, and that for Japan was 0.84. This outcome does not allow drawing any strong conclusions regarding the analyzed PDND networks.
Degree centrality is defined by the number of edges of a node; the greater the number of edges, the more central the node. In the COVID-19 PDND networks, the regional node with the highest degree centrality was Kyoto for Japan and Kentucky for the US, as shown in Tables 3 and 4.
Following our previous publication [32], we evaluated the chaotic behavior of the PDND networks studied here for the US and Japan using the 0-1 test for chaos. The results were 0.183 and 0.269 for the US and Japan, respectively. From the tangible difference in this metric, we conclude that the spreading of the virus was more chaotic in Japan than in the US. However, in both countries, the absolute value of the test does not allow classifying the pandemic behavior as chaotic.
Based on each adopted measure, it is possible to formulate a general observation that the discovered differences between the analyzed PDND networks were vague and precluded the formulation of strong conclusions. We believe that such results may be caused by several factors that should be explored in future studies:
(1) The proposed graph metrics could not account for subtle differences between networks. In our study, we focused on traditional graph metrics. In contrast, recent progress in the field of graph theory offers a plethora of other metrics and node and graph representation techniques, such as graph node embeddings with DeepWalk [37] or whole-graph embeddings [38].
(2) The adopted threshold value of 0.7 used for simplification of the adjacency matrix might have caused excessive loss of important information. To verify this hypothesis, a separate search for an optimal threshold value should be carried out.
(3) The analysis period was too broad for an approach with a single network representation. The analyzed time series represents almost 18 months and covers several waves of the COVID-19 pandemic. It is possible that virus diffusion patterns evolved over the analyzed time and differed between the waves, as, for example, in the case of the influenza epidemic studied in [16]. Moreover, other studies addressing the COVID-19 pandemic distinguished and focused on its various phases [39,40,41], which may indicate that analyzing the whole pandemic in a single procedure introduces bias. If so, calculating a single correlation coefficient between states or prefectures over an overly long period could hinder information extraction.
(4) Influence of lag on the analysis. In our study, a lag of 0 was adopted, as in numerous studies focusing on COVID-19 spread [42,43]. However, it is possible that adopting other lag values will shed more light on the virus spread phenomenon.
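Hypotheses (2) and (4) lend themselves to simple sensitivity checks. The sketch below is one possible way to set them up, not a procedure taken from the study: `lagged_correlation`, `best_lag`, and `density_by_threshold` are hypothetical helpers, the 21-day maximum lag is an arbitrary choice, and the series are synthetic.

```python
import numpy as np

def lagged_correlation(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

def best_lag(x, y, max_lag=21):
    """Lag (in days) at which the two series are most strongly correlated."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: lagged_correlation(x, y, k))

def density_by_threshold(corr, thresholds):
    """Network density of the binary graph for each candidate threshold."""
    upper = corr[np.triu_indices(corr.shape[0], k=1)]
    return {float(t): float(np.mean(upper > t)) for t in thresholds}

# Synthetic example: region y follows region x with a 7-day delay.
days = np.arange(200)
x = np.exp(-((days - 60) / 15.0) ** 2)
y = np.exp(-((days - 67) / 15.0) ** 2)
lag = best_lag(x, y)

# Synthetic 3-region correlation matrix with a zero diagonal.
corr = np.array([[0.0, 0.9, 0.5], [0.9, 0.0, 0.2], [0.5, 0.2, 0.0]])
densities = density_by_threshold(corr, [0.3, 0.7])
print(lag, densities)
```

Repeating the graph metrics over a grid of thresholds or lags would show whether the small differences between the two countries are stable or an artifact of the 0.7 and lag-0 choices.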

Limitations of the Study
Certain limitations of this study should be acknowledged. First, we focused on only two countries: the US and Japan. Investigations of other countries in different parts of the world could produce different results. Second, although we used a correlation analysis to understand connectivity and develop COVID-19 networks, other methods, such as coherence analysis, should also be considered. Finally, the analyzed COVID-19 confirmed case dataset was subject to COVID-19 testing bias, in that the number of confirmed cases was a function of the number of tests conducted in different US states and prefectures in Japan.

Conclusions
This study adopted the pandemic diffusion network dynamics (PDND) approach to develop COVID-19 diffusion networks for the US and Japan. A graph-theoretical approach was used to understand the behavior of the pandemic in these two countries. The quantitative comparison and assessment of the two networks and the corresponding COVID-19 diffusion phenomena showed modest benefits of employing the proposed PDND approach. In most cases, the differences between countries measured with the adopted graph metrics did not lead to strong conclusions regarding the virus diffusion. Given such findings, we formulated several research hypotheses to be analyzed in future studies, which could determine the utility of the proposed PDND approach regarding the COVID-19 pandemic spread. We hope that other researchers will follow our direction and examine the described methodology further.