Finding Asymptomatic Spreaders in a COVID-19 Transmission Network by Graph Attention Networks

In the COVID-19 epidemic, mildly symptomatic and asymptomatic infections generate a substantial portion of virus spread; these undetected individuals make it difficult to assess the effectiveness of preventive measures, as most epidemic prevention strategies are based on detected data. Effectively identifying undetected infections in local transmission would be of great help in COVID-19 control. In this work, we propose an RNA virus transmission network representation model based on graph attention networks (RVTR); the model uses principles from natural language processing to learn the information in gene sequences and a graph attention network to capture the topological characteristics of COVID-19 transmission networks. Since SARS-CoV-2 mutates as it spreads, our approach trains the model with a graph context loss function, which reflects the fact that the genetic sequences of infections with a close spreading relation are more similar than those separated by a long distance. Our approach shows its ability to find asymptomatic spreaders on both simulated and real COVID-19 datasets and performs better than other network representation and feature extraction methods.


Introduction
The COVID-19 epidemic has caused the most serious threat to global health since the early twentieth century. In this pandemic, health care authorities relied on preventive measures to reduce the spread of SARS-CoV-2 [1]. However, assessing the effectiveness of these preventive measures was difficult due to the presence of mildly symptomatic and asymptomatic individuals. These undetected individuals generated a substantial portion of disease spread due to SARS-CoV-2 viral shedding and transmission before the onset of symptoms [2,3]; thus, effectively identifying undetected patients in COVID-19 transmission networks will be of great help in disease prevention and control, especially in China, in which a dynamic zero-COVID-19 strategy was adopted. The cost of achieving this goal will be very high without the help of advanced technology to find these asymptomatic spreaders in local transmission when facing the Omicron mutant [4], although the rest of the world has mostly adopted a strategy of living with SARS-CoV-2 [5].
Recent advances in next-generation sequencing (NGS) platforms emphasize their application value in tracking emerging infectious disease outbreaks [6]. The combined approach of using genomic sequencing data with epidemiological data has successfully revealed transmission events for various viral outbreaks [7,8]. In determining critical features in the transmission pattern, such as the origin and the emergence of variants, viral sequencing can infer closely related isolates in an outbreak and identify unsampled cases in ongoing outbreaks [9]. Rapid viral sequencing can therefore provide real-time surveillance of transmission events and circulating viral variants in the ongoing COVID-19 pandemic. Network modeling tools such as Bayesian phylogenetics [10] and TransPhylo [9] have been utilized to capture the evolutionary and infection dynamics of SARS-CoV-2. Research using these tools has established phylogenetic pipelines that use published SARS-CoV-2 genomic data to estimate reasonable transmission networks, including the inference of unsampled infection sources. However, the computational cost of Bayesian inference is high when dealing with large amounts of data.
Viral genomic approaches, including viral genomic sequencing and phylogenetic analyses, allow us to investigate fundamental characteristics in the transmission of an infectious disease. This is made possible by detecting the genetic variation in the viral genomes of infected individuals that results from high rates of mutation and replication in transmission events [11]. Since RNA viruses mutate at high rates as they spread [12], the genetic sequence of the virus carried by each patient will be different; SARS-CoV-2 is itself an RNA virus. Because the mutation rate of SARS-CoV-2 per transmission is low [13], the signatures of different subtypes are passed on to those they infect, and all of the individuals in a local module of a network share common signatures. Our approach makes use of this sequence variation between individuals. The genetic sequences of infections with close spreading relationships in a transmission network are expected to be more similar than those separated by long distances. Consequently, the similarity of two neighbor nodes that have asymptomatic or undiagnosed nodes between them should be lower than that of neighbor nodes with no undetected nodes between them. The main idea for finding asymptomatic spreaders in a COVID-19 transmission network is therefore based on the similarity of each pair of neighbor nodes in the network.
By using a long short-term memory (LSTM) network, which is a deep neural network for modeling sequential data [14,15], the proposed model can learn the sequential information contained in the subgraphs of each target node. This information is combined into a new embedding by an attention mechanism [16,17], and the embedding also captures information about the graph structure. As we expect nearby nodes to have similar representations and distant nodes to have dissimilar representations, a graph context loss function, which matches these characteristics well, is used to train the model. By using the trained model to measure the similarity (the distance in the embedding space) between pairs of nodes via their representations, we can discover which pairs are unusually different for their given location in the transmission network, indicating that there are undetected nodes in between.
We first test our model on simulated datasets. The simulated transmission network is generated based on the rules of virus spread, and the corresponding genetic sequences are simulated according to the characteristics of the SARS-CoV-2 genome and its mutation during spread. The transmission network and gene sequence datasets are used to train our model. We then randomly remove a certain proportion of nodes and reconnect the network to form a test network; the newly connected edges, which span the removed nodes, form the test label set. Across different kinds of experiments, the RVTR effectively finds undetected nodes in simulated transmission networks. We further show the model's performance in real situations by training and testing it on a COVID-19 dataset from Australia, where its predictions are better than those of the comparison algorithms. Of note, further experiments were performed on datasets from Canada, Alberta, New York State and New Zealand; all of these experimental results indicate the model's ability to find asymptomatic spreaders in SARS-CoV-2 transmission networks.

Background
The goal of finding asymptomatic spreaders is to infer undetected nodes in a COVID-19 transmission network, and our approach is based on the representation of the transmission network. In this paper, the goal of network representation is to correctly express the gene sequence information and network transmission characteristics of nodes in low dimensions, and then use the new representation information to discover undetected nodes.

COVID-19 Transmission Network Representation
In a COVID-19 transmission network, virus transmitters are regarded as the network nodes, and the genetic sequence of SARS-CoV-2 is regarded as the attribute of each node. A COVID-19 transmission network can be expressed as $G = (N, E, A)$, where $N$ is the set of nodes in the network, $E$ is the set of connected edges, and each edge $(n_i, n_j) \in E$ means that the virus is transmitted from node $n_i$ to node $n_j$. $A$ is the node attribute set, and $a_i \in A$ is the gene sequence of node $n_i$.

Undetected Nodes and Abnormal Edges
Because of asymptomatic or unsampled patients, the number of detected COVID-19 infections is smaller than the actual number in the transmission network. Figure 1 shows a complete transmission network, in which the unsampled or asymptomatic spreaders that usually go undetected are marked by red dotted circles. In this paper, we name these asymptomatic or unsampled infections undetected nodes for convenience. Undetected infections lead to abnormal connections in an observed transmission network. As shown in Figure 2, parent nodes and child nodes that should not have a direct relationship but appear connected are joined by red edges. Finding asymptomatic or unsampled infections can therefore be framed as finding abnormal edges in a transmission network. In our approach, the representation of an RNA virus transmission network is learned first, and then the similarity of each pair of connected nodes is calculated. Finally, we use these similarity scores to find abnormal edges and locate the undetected infections in the transmission network.

Model
In this section, we introduce the RNA virus transmission network representation model based on the graph attention network (RVTR). The framework of the RVTR is depicted in Figure 3, in which a red node serves as an example of how this model generates a new representation by learning the information of neighbor nodes.

[Figure 3. Framework of the RVTR. Panel labels: node attribute (high-dimensional gene sequence) projected to a low-dimensional space; forward representation, self representation and backward representation (LSTM, mean); information aggregation by attention mechanism, Node Vector = Atten_F × V_F + Atten_self × V_S + Atten_B × V_B, giving the new representation V; node representation; graph context loss function.]

Subgraph Extraction
Subgraph extraction aims to create, for each target node, the set of nodes most strongly correlated with it, so that the model can extract enough topological information from the transmission network. Previous studies [18,19] showed that the influence between nodes more than three steps apart in a network is small; for computational efficiency, we therefore select the nodes fewer than three steps from the target node $n_i$ to form its subgraph in the RVTR. As shown in Figure 3, the nodes other than the target node are divided into two categories, the forward set $FS_n$ and the backward set $BS_n$, based on the propagation relationship. The forward set $FS_n$ contains the afferent nodes of the target node up to second order, namely its parent node and grandparent node. The backward set $BS_n$ contains the efferent nodes of the target node up to second order, namely its child nodes and grandchild nodes.
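The extraction step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `extract_subgraph` and the dictionary-based tree layout (`parent`, `children`) are assumptions made for the example, and each node is assumed to have at most one infector.

```python
def extract_subgraph(target, parent, children):
    """Return (forward_set, backward_branches) for one target node.

    parent:   dict mapping a node to its infector (the root has no entry)
    children: dict mapping a node to the list of nodes it infected
    Both structures are hypothetical; the paper does not prescribe a layout.
    """
    # Forward set: walk up at most two steps (parent, then grandparent).
    forward = []
    node = target
    for _ in range(2):
        node = parent.get(node)
        if node is None:
            break
        forward.append(node)

    # Backward set: one branch per child, holding the child followed by
    # that child's own children (the target's grandchildren).
    backward = [[child] + list(children.get(child, []))
                for child in children.get(target, [])]
    return forward, backward


# Toy chain a -> b -> c, where c infects d and e, and d infects f.
parent = {"b": "a", "c": "b", "d": "c", "e": "c", "f": "d"}
children = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": ["f"]}

forward, backward = extract_subgraph("c", parent, children)
print(forward)   # ['b', 'a']
print(backward)  # [['d', 'f'], ['e']]
```

Keeping the backward set as one list per child mirrors the per-branch treatment of child nodes described in the next subsection.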

Subgraph Representation
As shown in Figure 3, subgraph representation aims to learn a lower-dimensional vector for the incoming node set and the outgoing node set. To calculate the forward representation, because of the propagation relationship between nodes, we used an LSTM to aggregate the forward information and selected the last hidden layer of the LSTM as the forward vector $V_F$. To calculate the backward representation, we applied a backward LSTM, analogous to the one used for the forward representation, to each child branch. The last hidden layer of the LSTM was selected as the representation vector of each branch, and we took the mean of all branch vectors as the backward representation vector $V_B$. For the target node itself, the self-representation $V_S$ was converted to the required dimension through a multilayer fully connected layer.
Here, BN denotes the batch normalization operation and FC denotes the fully connected layer. The three vectors $V_F$, $V_S$ and $V_B$ are collectively denoted as $V$ and have the same dimension. Through this operation, the model obtains, for each node, a representation of the structural information extracted from the network subgraph.

Information Aggregation Based on Graph Attention Mechanism
Information aggregation combines the representations of the individual parts of the subgraph. We used the self-attention mechanism [16] to learn the aggregation weights of the forward, self and backward representation vectors $V_k$, $k \in \{F, S, B\}$, and thereby obtain an aggregated vector $V_n$ for each node.
The corresponding attention values $\alpha_k$, $k \in \{F, S, B\}$, are computed from learnable parameters, where $\|$ is the concatenation operation, $a^T \in \mathbb{R}^{2d \times 1}$ and $a_l^T \in \mathbb{R}^{2d \times 1}$ are the learnable attention parameters, and $\sigma$ is the LeakyReLU function.
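To make the aggregation concrete, the sketch below computes attention weights over the three vectors and forms the weighted sum shown in Figure 3 (Node Vector = Atten_F × V_F + Atten_self × V_S + Atten_B × V_B). The scoring form is an assumption: it follows the common graph-attention pattern of a softmax over LeakyReLU($a^T [V_S \| V_k]$), with a single shared parameter vector `a` and the self vector placed first in the concatenation. The function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the paper does not state it.
    return x if x > 0 else slope * x


def aggregate(v_f, v_s, v_b, a):
    """Attention-weighted sum of forward, self and backward vectors.

    v_f, v_s, v_b: lists of length d
    a:             list of length 2d (attention parameters, assumed shared)
    """
    parts = {"F": v_f, "S": v_s, "B": v_b}
    # Score each part against the self vector via the concatenation [V_S || V_k].
    scores = {k: leaky_relu(sum(w * x for w, x in zip(a, v_s + v_k)))
              for k, v_k in parts.items()}
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    alpha = {k: e / total for k, e in exps.items()}
    v_n = [sum(alpha[k] * parts[k][i] for k in parts) for i in range(len(v_s))]
    return alpha, v_n


# With all-zero attention parameters every part gets weight 1/3,
# so the aggregated vector is the plain mean of the three inputs.
alpha, v_n = aggregate([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0] * 4)
print(alpha)
print(v_n)
```

In a trained model the vector `a` would be learned jointly with the LSTM and fully connected layers; here it is fixed only to show the arithmetic.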

Graph Context Loss Function
Because SARS-CoV-2 mutates as it spreads, the gene sequences of neighbor nodes are expected to be more similar than those of non-neighbor nodes. The graph context loss function [20] matches this characteristic of the virus transmission network well, so we use it to train the model. The loss is defined over triples $(n, i, j)$, where $V_n$ is the output node embedding produced by the RVTR, $n \in N$ is the target node, $i \in W_n$ is a context neighbor of $n$ in graph $G$, $j \in N$ is a negative sampling node, $\|w\|_2^2$ is the L2 regularization term and $\beta$ is its weight parameter. Specifically, we used a random walk with restart probability $p_r$ and walk length $L_w$ to obtain the set of context nodes $W_n$ for each node. Then a negative node $j$ ($j \notin W_n$) was sampled randomly from the network. To improve computational efficiency, we chose a specific proportion of nodes for sampling during each round of training.
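The sketch below shows one common form of a graph context loss over triples $(n, i, j)$, offered as an illustration under stated assumptions rather than as the exact objective used for the RVTR. It assumes the skip-gram style with negative sampling, $-\log \sigma(V_n \cdot V_i) - \log \sigma(-V_n \cdot V_j)$ summed over triples, plus $\beta \|w\|_2^2$; here $\sigma$ is the logistic sigmoid, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def graph_context_loss(emb, triples, weights=(), beta=0.0):
    """Assumed form: pull context neighbors together, push negatives apart.

    emb:     dict mapping node -> embedding vector V_n
    triples: iterable of (target n, context neighbor i, negative sample j)
    weights: flat iterable of model parameters for the L2 term
    """
    loss = 0.0
    for n, i, j in triples:
        loss -= log_sigmoid(dot(emb[n], emb[i]))    # neighbor should score high
        loss -= log_sigmoid(-dot(emb[n], emb[j]))   # negative should score low
    return loss + beta * sum(w * w for w in weights)


emb = {"n": [1.0, 0.0], "i": [1.0, 0.0], "j": [-1.0, 0.0]}
good = graph_context_loss(emb, [("n", "i", "j")])  # neighbor aligned with target
bad = graph_context_loss(emb, [("n", "j", "i")])   # roles swapped
print(good < bad)  # True
```

Under this form, an embedding in which context neighbors align with the target node yields a lower loss than one in which the negative sample aligns with it, which is the behavior the training procedure relies on.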

Similarity Calculation
After obtaining the new representation $V$ for each node, we can evaluate whether an edge is abnormal by calculating the similarity of its two end nodes. For nodes $i$ and $j$, the similarity is calculated from their representations $V_i$ and $V_j$. After the similarity scores have been obtained for all the edges in the transmission network, the edges with relatively low scores are the ones most likely to be abnormal.
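As an illustration of this scoring step, the sketch below ranks edges from least to most similar. Cosine similarity is an assumption made for the example; any similarity defined on the embedding space could be substituted. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    # Cosine similarity is an assumed choice of similarity measure.
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def rank_edges(edges, emb):
    """Return the edges sorted from least to most similar end nodes."""
    scored = [(cosine_similarity(emb[u], emb[v]), (u, v)) for u, v in edges]
    return [edge for _, edge in sorted(scored)]


emb = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
edges = [("a", "b"), ("b", "c")]
print(rank_edges(edges, emb))  # [('b', 'c'), ('a', 'b')]
```

The edges at the front of the ranking are the candidates for abnormal edges, that is, for connections that skip over one or more undetected nodes.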

Two Kinds of Test Experiments
To show the performance of the RVTR, we performed two kinds of test experiments. One was training and testing on the same network, and the other was training and testing on different networks. In the simulation experiment, we first generated a network and trained our model on it. Then, we randomly removed certain nodes from it and rebuilt the network. The reconstructed network was our test network. In the other experiment, several different networks were generated for testing; the steps to generate the test labels were the same as those in the first experiment. As in the simulation experiment, in the real data experiment we first trained and tested on the Australia dataset, and then we tested the RVTR on different datasets.

Comparison Algorithms
The RVTR is based on a virus transmission network and encodes both the graph structure and the features of nodes. It is one kind of network representation method; therefore, two network representation methods, the graph convolutional neural network (GCN) [21] and structural deep network embedding (SDNE) [22], were selected for comparison. In this work, we used a two-layer GCN model in which the dimensions of the hidden layer and the output layer were the same as those of the RVTR. We used a three-layer neural network in SDNE; the dimensions of the two hidden layers were 1024 and 512, and the dimension of the output layer was the same as that of the RVTR. The RVTR also reduces the dimension of the network attributes, so we additionally compared it with principal component analysis (PCA) [23] and an autoencoder (AE) [24], which reduce the dimension of high-dimensional gene sequences and extract features from them. In PCA, the principal components of the gene sequence were used as the input of the task of finding undetected nodes, and the output dimension of PCA was the same as that of the RVTR model. The AE was used to learn a representation of a gene sequence directly, without considering the network structure. We first trained the AE model in a manner similar to that used for our model, and then the middle layer of the trained AE model, namely the sequence representation after dimensionality reduction, was used as the input of the task of finding undetected nodes. In addition, we used a 'DIRECT' method, in which the high-dimensional gene sequence was used for the calculation directly, without any reduction or loss. The comparison algorithms are described in Appendix A.1, and the settings of the RVTR are described in Appendix A.2.

Evaluation Metrics
Because the RVTR is an unsupervised learning method, it cannot directly predict the number of abnormal edges. To calculate the prediction accuracy, we set the model to predict the same number of abnormal edges as the label set contains, and then we compared the predictions with the true labels. Precision is the proportion of correct predictions among all labels:
$$\text{Precision} = \frac{m}{k}$$
where $m$ is the number of correct predictions and $k$ is the number of abnormal edges in the label set. As the RVTR is a kind of network representation method, we also calculated the AUC value [25], which is commonly used to measure the effect of algorithms in link prediction, to evaluate the performance of the different models.
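Both metrics can be sketched directly from edge similarity scores. The precision function follows the definition above (the $k$ lowest-scoring edges are predicted as abnormal, and $m$ of them are correct). The AUC function uses the usual pairwise interpretation, the probability that a randomly chosen abnormal edge scores lower than a randomly chosen normal edge; treating that as the definition used in [25] is an assumption. The function names are hypothetical.

```python
def precision_at_k(scores, abnormal):
    """scores: dict edge -> similarity; abnormal: set of true abnormal edges."""
    k = len(abnormal)
    predicted = sorted(scores, key=scores.get)[:k]  # k least similar edges
    m = sum(1 for edge in predicted if edge in abnormal)
    return m / k


def auc(scores, abnormal):
    """Pairwise AUC: abnormal edges should have lower similarity than normal ones."""
    pos = [s for edge, s in scores.items() if edge in abnormal]
    neg = [s for edge, s in scores.items() if edge not in abnormal]
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = {("a", "b"): 0.1, ("b", "c"): 0.9, ("c", "d"): 0.2, ("d", "e"): 0.8}
abnormal = {("a", "b"), ("d", "e")}
print(precision_at_k(scores, abnormal))  # 0.5
print(auc(scores, abnormal))             # 0.75
```

The toy example also shows that the two metrics can disagree: three of the four abnormal-versus-normal pairs are ordered correctly, yet only one of the two top-ranked edges is a true abnormal edge.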

Simulation Data
First, we designed a simulation experiment to evaluate the performance of the RVTR model, as it was difficult to obtain a complete COVID-19 transmission network for training in real situations. The simulated dataset was generated based on the characteristics of the SARS-CoV-2 genome and of the spread of the virus.

Training Data Generation
• Sequence simulation
We set the length of the simulated SARS-CoV-2 gene to L. According to the in-house filter present in GISAID, complete sequences comprise genomes with lengths greater than 29,000 nucleotides [26]; here, we set L to 30,000. We generated the whole gene sequence by randomly choosing A, T, G or C at each position. Although real sampled sequences contain missing symbols ' ' or gap symbols '-' owing to the limitations of high-throughput genome sequencing, we did not consider this situation in the simulation experiments, for simplicity. Assuming that the variation at each position is independent, the mutation rate of the whole sequence at each transmission was p. As the mutation rate of SARS-CoV-2 is low when it spreads [13], we set p = 0.1%.

• Transmission network simulation
As the basic reproduction number (denoted R0) of SARS-CoV-2 is approximately three in the early stages, we set the range of R from 0 to 6, where R is the number of child nodes created by a node in one generation of the simulated transmission network. The value of R at each transmission would follow a Poisson distribution; to simplify, we used fixed probabilities: p(R=0) = p(R=6) = 5%, p(R=1) = p(R=5) = 10%, p(R=2) = p(R=4) = 20% and p(R=3) = 30%. Assuming that all nodes in the network descend from "patient 0", the pseudocode for generating a simulated COVID-19 transmission network and its corresponding gene sequences is given in Algorithm 1. A simulated transmission network with 1000 nodes is shown in Figure 4. We generated three training networks with 1000, 2000 and 3000 nodes.
Algorithm 1 Generating procedure of simulated data.
Input: Model parameters N, R_n, L, p
Output: Transmission network and the corresponding gene sequences
while i < L do
    randomly choose A, T, G or C
end while
Return gene sequence for node n_0
while j < N do
    generate child nodes by choosing an R value from the range of R values following the probability for each value
    while k < R_n do
        copy the gene sequence of the parent node (n_0 for the first generation)
        randomly change p × L genes to generate a gene sequence for the child node
    end while
    Return gene sequence of each node
end while
Return transmission network
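A runnable sketch of Algorithm 1 is given below. It is an interpretation under assumptions, not the authors' code: nodes are expanded in breadth-first order, each child copies its own parent's sequence, exactly round(p × L) distinct positions are changed per transmission, and a mutated position always receives a different base. The names `simulate`, `mutate` and `R_PROBS` are hypothetical.

```python
import random

BASES = "ATGC"
# Fixed offspring probabilities for R = 0..6, as stated in the text.
R_PROBS = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}


def mutate(seq, p, rng):
    """Change round(p * len(seq)) distinct positions to a different base."""
    seq = list(seq)
    for pos in rng.sample(range(len(seq)), round(p * len(seq))):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def simulate(n_nodes, length=30_000, p=0.001, seed=0):
    """Return (sequences, parent) for a simulated transmission tree.

    The outbreak can die out early if every active node draws R = 0, in
    which case fewer than n_nodes nodes are returned; retry with another seed.
    """
    rng = random.Random(seed)
    sequences = {0: "".join(rng.choice(BASES) for _ in range(length))}  # patient 0
    parent = {}
    queue = [0]
    values, weights = zip(*R_PROBS.items())
    while queue and len(sequences) < n_nodes:
        node = queue.pop(0)  # breadth-first expansion (an assumption)
        for _ in range(rng.choices(values, weights=weights)[0]):
            if len(sequences) >= n_nodes:
                break
            child = len(sequences)
            sequences[child] = mutate(sequences[node], p, rng)
            parent[child] = node
            queue.append(child)
    return sequences, parent


sequences, parent = simulate(200, length=1000, p=0.01, seed=42)
print(len(sequences), "nodes simulated")
```

Because every child differs from its parent at a fixed number of positions, the Hamming distance between two sampled nodes grows with the number of transmission steps between them, which is the signal that the similarity-based detection relies on.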

Test Data Generation
To show the performance of the RVTR model, we simulated a test network similar to the transmission network detected in a real situation. We randomly chose and removed some nodes from a transmission network, and these removed nodes were considered undetected nodes. Then, the network was rebuilt, and the reconnected edges were abnormal edges and marked as our true labels to calculate the prediction accuracy. As nodes at the margin of the network indicate the end of virus spread, it did not make sense to choose these nodes as undetected nodes. The pseudocode of test data generation is shown in Algorithm 2. The red nodes in Figure 5 are the nodes selected to be removed from the network in Figure 4. We removed 10% of the nodes from the network, and the number of red nodes is 100. Figure 6 shows the reconnected network after removing the red nodes. The red edges are abnormal edges, and they form a label set to test the accuracy of RVTR's predictions.

Algorithm 2 Test transmission network and label data generation.
Input: a transmission network, the proportion of removed nodes p_r
Output: a test network, label data set
randomly choose p_r × N nodes in the network
for each selected node do
    if the selected node is at the margin of the network then
        retain it in the network
    else
        remove it from the network
        connect the parent node of the removed node to its child nodes
        add the newly connected edges to the label set
    end if
end for
Return a test network and label data set
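Algorithm 2 can be sketched as follows. The sketch follows the pseudocode literally, so selected margin (leaf) nodes are retained rather than replaced, and fewer than p_r × N nodes may actually be removed. Two further assumptions are made: the root ("patient 0") is also retained, and when a removed node's parent was itself removed, the child is attached to the nearest retained ancestor. The function names are hypothetical.

```python
import random


def remove_nodes(parent, chosen):
    """Remove the chosen internal nodes and reconnect the tree.

    parent: dict mapping child -> parent (the root has no entry)
    chosen: iterable of candidate nodes; leaves and the root are retained
    Returns (test_parent, labels), where labels is the set of abnormal edges.
    """
    has_children = set(parent.values())
    removed = {n for n in chosen if n in parent and n in has_children}

    test_parent, labels = {}, set()
    for child, par in parent.items():
        if child in removed:
            continue
        skipped = False
        while par in removed:      # climb to the nearest retained ancestor
            par = parent[par]      # removed nodes are never the root
            skipped = True
        test_parent[child] = par
        if skipped:
            labels.add((par, child))  # edge that hides undetected node(s)
    return test_parent, labels


def make_test_network(parent, fraction, seed=0):
    """Randomly select a fraction of all nodes, then apply remove_nodes."""
    rng = random.Random(seed)
    nodes = sorted(set(parent) | set(parent.values()))
    chosen = rng.sample(nodes, round(fraction * len(nodes)))
    return remove_nodes(parent, chosen)


# Tree: 0 -> 1, 1 -> {2, 5}, 2 -> {3, 4}. Node 3 is a leaf, so only 2 is removed.
parent = {1: 0, 2: 1, 3: 2, 4: 2, 5: 1}
test_parent, labels = remove_nodes(parent, [2, 3])
print(test_parent)     # {1: 0, 3: 1, 4: 1, 5: 1}
print(sorted(labels))  # [(1, 3), (1, 4)]
```

Separating the deterministic reconnection step from the random selection keeps the labelling logic easy to check on small hand-built trees.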

Data Resource
As it is difficult to obtain a complete transmission network with undetected infections in real situations, we used the transmission networks inferred in Perera's work [27], which used published SARS-CoV-2 genomic data to estimate reasonable transmission networks with the inference of unsampled infection sources. For the gene sequence data, we used the same FASTA data that were used to infer the transmission networks. The selected FASTA sequence data of Canada, New Zealand, New York State and Australia were downloaded from GISAID (https://www.gisaid.org/, accessed on 1 September 2020) [26]. The Alberta dataset was obtained from the Provincial Laboratory of Alberta. The details of these datasets are given in Appendix A.3.

Real Data Processing
In the FASTA data, each sequence carries a label that contains its complete collection date and location. Before putting the FASTA sequence data into our model, we needed to map these labels to the corresponding nodes of the inferred transmission network. In addition, the symbols ',' between each gene needed to be cleaned before the sequences were put into the models. Moreover, TransPhylo, a dedicated Bayesian program designed to reconstruct transmission networks from timed phylogenetic data, assumes that all infections have a common ancestor [28] when it infers transmission trees. Therefore, the inferred spread started from patient "0". However, it is almost impossible to find patient "0" in an area during the pandemic, so we needed to create a gene sequence for it. Finally, as the length of the sequences in the FASTA data differs slightly between datasets, we needed to adjust the sequences of the different datasets for alignment. The details of the real data processing and the figures of these transmission networks are given in Appendix A.4.
The spread of SARS-CoV-2 varies from region to region because of the different control measures taken in the COVID-19 pandemic, and the distribution of transmission data, such as the reproductive number, differs accordingly. We initially evaluated the inferred transmission data of Canada, Australia, Alberta, New Zealand and New York State for testing the methodology. The details of these datasets are shown in Table 1. As the Australia dataset contains relatively few undetected infections compared with the other datasets, its transmission network is relatively complete, and we chose it as our training dataset. Figure 7 shows the inferred transmission network of Australia. Because the inferred transmission networks contain undetected nodes, we needed to remove the inferred unsampled infections before using the networks for testing. As in the simulation experiment, the reconstructed network was our test network, and the newly reconnected abnormal edges constituted our label set. Figure 8 shows the observed transmission network of Australia, which was used as the test network.

Simulation Experiment Results
We compared the performance of the RVTR model with that of other methods on networks with initial sizes of 1000, 2000 and 3000. Test networks were generated by removing a specific proportion of nodes from the initial networks. As the removed nodes were chosen randomly and the test network therefore differed each time it was generated, we repeated each test experiment 10 times to prevent an uneven distribution of test data. The prediction results are reported as the mean and variance of the 10 results. As the RVTR, GCN and SDNE are all network representation methods, to be fair, we trained and tested the models on the same network in this experiment. The test results on networks of 1000, 2000 and 3000 nodes are shown in Table 2, Table 3 and Table 4, respectively. The best results are highlighted in bold.
After comparing the RVTR model with the other methods by training and testing them on the same networks, we tested the ability of the RVTR model on different datasets. We trained two different RVTR models: RVTR-1K, which was trained on a network with a size of 1000, and RVTR-3K, which was trained on a network with a size of 3000. The trained models were then tested on networks of different sizes, and the results are shown in Table 5. The best results are highlighted in bold.

Real Experiment Results
Although the transmission networks with inferred undetected infections produced by TransPhylo had been shown to be reasonable in previous work on an HIV dataset [28], it is hard to prove that the inference is absolutely correct. To show the RVTR's ability to find abnormal edges and capture the location of undetected nodes, we first trained and tested our model on the same network, randomly removing some nodes to perform tests similar to those in the simulation experiments. Regarding the parameter settings for the real dataset experiment, we set the dimension of the output to 256, and the other parameters of the RVTR were the same as those used in the simulation experiment. The best trained RVTR model was obtained between 2400 and 3000 epochs. Table 6 shows the test results with 10%, 20% and 30% of the nodes removed from the Australia network; the best results are highlighted in bold. The results show that the RVTR performed the best in finding undetected nodes in a real COVID-19 transmission network. The DIRECT method, which performed the best in the simulation experiment, did not achieve a good performance in the real data experiment; this outcome may be related to the large proportion of invalid genes in the sequences, which makes it necessary to extract the key features from the high-dimensional gene sequences.
After training and testing on the Australia dataset to demonstrate the ability of the RVTR model to find undetected spreaders in a real transmission network, we tested the proposed model on the four datasets from other regions. The prediction results are shown in Table 7; the best results are highlighted in bold. The RVTR model achieves the best performance on the transmission networks of all regions except New Zealand. According to the analysis of the New Zealand dataset described in Appendix A.3, the sampled infections in New Zealand are quite different from those in the other datasets. SARS-CoV-2 barely spread there in May, June and July 2020, although the duration of spread was more than one year, so the transmission network inferred by TransPhylo may be questionable when compared with the other datasets. The results also show that the prediction is better when there are more undetected nodes in the transmission network, as in the New York and New Zealand datasets.

Discussion
The precision values in Tables 2-4 show that the RVTR performs better than the other network representation methods and feature extraction methods, although the best result overall was achieved by the DIRECT method. This may be related to the mechanism of sequence generation: embedding in lower dimensions discards some of the features of the simulated sequences. The AUC values in Table 2 show that a model's AUC value is also high when its precision value is high. However, the results of the RVTR on different test networks show that a good AUC value does not guarantee a good precision value.
The results in Table 5 show that the model trained on a larger network performs better when tested on different networks. These simulation experiments also show that all algorithms perform better when the removed proportion is larger, because the difference between sequences is larger when the transmission distance is longer. The more undetected nodes there are between two nodes, the lower the similarity score of that pair of nodes, and the easier it is for the RVTR to detect the difference. To show the performance of the RVTR in more detail, we also analyze its training details and the influence of different network structures in this section. Figure 9 shows how the loss of the RVTR changes during training. The loss value drops rapidly at the early training stage and becomes stable after 1000 epochs.

Training Details of Simulation Experiment
We also saved the trained parameters of the models every 200 epochs during training; these saved models were then used to test how the performance of the different models changes over the course of training. Figure 10 shows the test results of the saved models on test networks with an initial size of 1000 and 10% of the nodes removed. The prediction accuracy increases as the number of training epochs increases. The changes in the AUC value in Figure 11 also show a good performance once the RVTR is well trained.

The Influence of Network Structure
Based on [29], asymptomatic nodes are connected differently from symptomatic nodes. For example, asymptomatic infections often cause superspreading, as they do not show any symptoms and continue to contact others as usual, while symptomatic infections quarantine themselves and reduce the spread. Because the sensitivity of the results may be related both to the nodes selected for removal and to the network structure, we also simulated two transmission networks with different R values to analyze the influence of network structure. The first, named N1.7, has an R value of approximately 1.7 and is generated with the fixed probabilities p(R=1) = 50%, p(R=0) = p(R=2) = 15% and p(R=3) = p(R=4) = p(R=5) = p(R=6) = 5%. The second, named N4.8, has an R value of approximately 4.8 and is generated with the fixed probabilities p(R=6) = 40%, p(R=5) = 30%, p(R=4) = 15%, p(R=1) = p(R=2) = p(R=3) = 5% and p(R=0) = 0%. To be fair, the RVTR is trained on a network of size 3000 and then tested separately on the different networks. The results in Table 8 show that the RVTR achieves different performance on test networks with different transmission structures; the best results are highlighted in bold. Performance is poor when the model is tested on the network with the lower R value and good on the network with the higher R value.
The large fluctuation of the prediction accuracy in Table 7 in the real dataset experiments is related to the distribution of the abnormal edges. Figure 12 shows that different transmission patterns result in different kinds of abnormal edges. One kind is "fewer edges", in which there is more than one undetected node within a single abnormal edge. The other kind, "more edges", means that there is a "superspreader" in the detected transmission network: although there are many abnormal edges in the sampled transmission network, there is only one undetected node. From Table 1, we can see that the distributions of the data for New Zealand and New York were similar; both were of the "fewer edges" category. The distributions of the data for Alberta and Australia belonged to the "more edges" category, which means that there was a "superspreader" in the transmission network. For the first kind of abnormal edge, the difference in similarity between the two nodes is large, and the precision of locating undetected nodes in the network is high. For the second kind of abnormal edge, the similarity value is close to that of a normal edge, as only one undetected node lies within it.
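The R values quoted for the simulated networks in this section can be checked by computing the mean of each fixed offspring distribution. The short sketch below does this for the baseline distribution used in the simulation experiment and for N1.7 and N4.8; the function and variable names are hypothetical.

```python
def mean_offspring(probs):
    """Expected number of child nodes for a distribution {R: probability}."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(r * p for r, p in probs.items())


BASELINE = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}
N1_7 = {0: 0.15, 1: 0.50, 2: 0.15, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.05}
N4_8 = {0: 0.00, 1: 0.05, 2: 0.05, 3: 0.05, 4: 0.15, 5: 0.30, 6: 0.40}

for name, dist in [("baseline", BASELINE), ("N1.7", N1_7), ("N4.8", N4_8)]:
    print(name, round(mean_offspring(dist), 2))
# baseline 3.0
# N1.7 1.7
# N4.8 4.8
```

The three means match the stated values: approximately three for the baseline, 1.7 for N1.7 and 4.8 for N4.8.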

Conclusions
In this work, we propose a graph attention-based RNA virus transmission network representation model, RVTR, to find asymptomatic spreaders in a COVID-19 transmission network. The RVTR model achieves a good performance not only on simulated datasets but also on real COVID-19 transmission networks inferred by TransPhylo, which demonstrates its ability to locate undetected nodes in COVID-19 transmission networks. This means that, once a real COVID-19 transmission network has been constructed from detected infections, the RVTR model can be used to find where undetected infections exist in that network. In some areas and countries that take strict epidemic prevention and control measures, such as China, Singapore and Japan, the epidemic departments usually conduct extensive tracing and publish detailed records of thousands of anonymized patients. When new infections are detected in an area, finding the asymptomatic or undetected spreaders hidden in the transmission network is very costly. Our proposed method can reduce this cost: if it can tell where undetected infections exist in a transmission network, epidemic control workers can focus on looking for asymptomatic spreaders in specific transmission relationships instead of across the entire network. This not only reduces the cost of conducting nucleic acid testing for all people in a region but also saves time in finding the undetected spreader, thereby helping to control the spread of COVID-19. However, as the RVTR is trained with a graph context loss, which entails unsupervised learning, it can find the locations of undetected nodes in the network but cannot tell us how many of them are present.
In future work, we will change the loss function to use supervised learning by giving each edge a label with the number of undetected nodes in it; the proposed model will then provide more information to help control the spread of SARS-CoV-2.
Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: https://github.com/lzylyn/COVID-19-Transmission-Network.
In these figures, the red nodes represent inferred undetected spreaders in a transmission network. The red edges are abnormal edges with undetected nodes in them and are also used as test labels. Figures 7, A1, A3, A5 and A7 show the inferred transmission networks of Australia, Canada, Alberta, New York State and New Zealand, respectively. Figures 8, A2, A4, A6 and A8 show the observed transmission networks of Australia, Canada, Alberta, New York State and New Zealand, respectively, which were also used as our test networks. From these figures, we can see that the spread differs from region to region, as does the distribution of the number of abnormal edges.