Fine-Scale Space-Time Cluster Detection of COVID-19 in Mainland China Using Retrospective Analysis

Exploring spatio-temporal patterns of disease incidence can help to identify areas of significantly elevated or decreased risk, providing potential etiologic clues. This study uses the retrospective space-time scan statistic to detect clusters of COVID-19 in mainland China at the family level, based on case dates of onset and with different maximum clustering radii. The results show that the detected clusters vary with the clustering radius. From 2 December 2019 to 20 June 2020, 43 space-time clusters were detected with a maximum clustering radius of 100 km and 88 clusters with a maximum clustering radius of 10 km. Using a smaller clustering radius may identify finer clusters. Hubei has the most clusters regardless of scale. In addition, most of the clusters were generated in February. This indicates that China’s COVID-19 epidemic prevention and control strategy is effective and has successfully prevented the virus from spreading from Hubei to other provinces over time. Well-developed provinces or cities, which have larger populations and developed transportation networks, are more likely to generate space-time clusters. The analysis based on dates of onset may detect the start times of clusters seven days earlier than similar research based on diagnosis dates. Our family-level analysis of space-time clustering can be reproduced in other countries that are still seriously affected by the epidemic, such as the USA, India, and Brazil, thus providing them with more precise signals of clustering.


Introduction
A novel coronavirus was first reported in China in December 2019 and was named "COVID-19" by the World Health Organization (WHO) [1,2]. COVID-19 was discovered in Wuhan, the capital of Hubei province in Central China, where the transportation system is well developed, and spread rapidly to nearly every part of the world, causing a global pandemic [3]. Coronaviruses are a large family of viruses known to cause Middle East respiratory syndrome (MERS), severe acute respiratory syndrome (SARS), and other serious diseases [4]. The COVID-19 virus is a new strain of coronavirus that had not previously been found in humans. The virus is mainly transmitted by droplets from coughing or sneezing. Given this mode of transmission, most cases are related to direct, close contact [5][6][7].
Geographic Information System (GIS) is an effective means to visualize and analyze the spatial characteristics of epidemic data [8][9][10]. Combined with spatial statistics, GIS can be used to help mitigate the epidemic through scientific information, find spatial correlations with other variables, and identify transmission dynamics and clustering [11][12][13][14].
Chinese regions differ greatly from one another, especially in terms of population density, meteorology, transportation network, industry, and economy [15][16][17][18][19][20]. Exploring spatio-temporal patterns of disease incidence can help to identify areas of significantly elevated or decreased risk, providing potential etiologic clues [21,22]. Only by mastering the temporal and spatial distribution of the epidemic can we achieve accurate prevention and control. Quantitative research on the spatial and temporal characteristics of the clusters and the internal diffusion of the virus can not only help to understand the spatio-temporal patterns and internal mechanism of epidemic transmission comprehensively, but also reflect the impact of emergency prevention and control measures on the spread of the epidemic and serve as scientific evidence for policy adjustment.
However, most studies analyze spatial or temporal clustering alone. A purely spatial clustering analysis cannot reflect the dynamic change of the epidemic situation, and a purely temporal clustering analysis provides only coarse information, namely whether clusters exist at a certain time. In contrast, the analysis of space-time clustering can indicate not only whether clusters exist but also their spatial location and duration. Therefore, the detection results of space-time clustering are more conducive to disease control departments taking timely response measures. If an outbreak causes the incidence of the whole area to increase at the same time, the temporal clustering method can easily detect the existence of clustering. However, for an outbreak that starts in a local area and then gradually spreads to the whole area, the incidence curve of the whole area rises only as the number of cases in the local area increases, so a purely temporal method detects the clusters with a lag. The warnings produced by space-time clustering are more accurate and timelier because they make full use of the temporal and spatial information in the data. This is of great importance for early warning and prevention of future outbreaks [23][24][25].
Currently, epidemiologists pay more attention to identifying space-time clusters on a small scale, such as communities, counties, or provinces, and there are very few research works on large-scale analysis of space-time clustering, especially at the national scale [26]. Even where such studies exist, due to the availability of data, many are conducted with counties or provinces as the smallest unit [27,28]. In fact, it is very important to analyze the clustering of an epidemic on a national scale with the family as the unit, because it is of great significance for the development of accurate prevention, control, and work resumption policy at the national level.
Although we can roughly observe the clustering areas of COVID-19 according to the existing epidemic map, the number of clusters, cluster sizes, and durations obtained by different clustering methods at different spatial and temporal scales are very different. Currently, most national-scale epidemic cluster detection is based on the administrative boundaries of counties, cities, or provinces. However, administrative divisions are not a barrier to the transmission of infectious disease. If the detection is carried out in isolation with provinces, cities, and counties as the units, some important clusters may not be detected in a timely manner due to the lack of information from the surrounding areas. Therefore, it is necessary to take the family as the smallest unit of cluster detection.

Materials
The COVID-19 cases were collected from the database of diagnosed and suspected cases of COVID-19 in mainland China established by a special group of big data analysis, which is subordinate to the Joint Prevention and Control Mechanism of the State Council. The database is generated based on China's National Infectious Disease Information System (IDIS), which requires each COVID-19 case to be reported electronically by the responsible doctor as soon as the case has been diagnosed. It includes cases that are reported as asymptomatic, and data are updated in real time. The dataset includes the records of all confirmed COVID-19 cases from the onset of the outbreak on 2 December 2019 to 20 June 2020 in mainland China (excluding Hong Kong, Macao, and Taiwan regions). Each record contains the patient's name, gender, ID, date of onset, date of diagnosis, administrative code, home address, and so on. The administrative boundary map of China, in shapefile format, was acquired from the National Geomatics Center of China (http://ngcc.sbsm.gov.cn/, accessed date: 3 March 2018).
Figure 1 shows the daily number of new cases of COVID-19 in mainland China between 2 December 2019 and 20 June 2020. The total number of COVID-19 cases is 83,377 as of 20 June 2020. There are two peaks in the curve. One is on 24 January 2020, with 3756 confirmed cases. The other is on 1 February 2020, with 5089 cases. The second peak arises because the prevention and control plan of COVID-19 (5th edition), issued by the General Office of the National Health Commission of China, added "clinical diagnosis" to the case diagnosis classification for Hubei province on 13 February 2020 so that patients could be diagnosed as early as possible according to the epidemic characteristics of Hubei. This led to a surge in new cases whose dates of onset fall near 1 February 2020.

Spatialization of the Case Data
The traditional method of spatialization for epidemic data is based on the administrative code of the patient's home at the county, city, or province scale. However, this method cannot capture the exact location of the case. In this paper, we employ the software XGeocoding v2 to translate each patient's home address into coordinate information. All the recorded confirmed cases were entered into Microsoft Excel 2010 (Microsoft, Redmond, WA, USA) and geocoded according to their residential addresses, and their longitude and latitude coordinates were determined using ArcGIS 10.2 (ESRI Inc., Redlands, CA, USA). These coordinates were assumed to represent the location of the case outbreak.
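As a rough illustration of how geocoded records can be turned into the location-by-day counts that a scan statistic works on, the sketch below groups cases that share the same coordinates and date of onset. The records, the function name `location_day_counts`, and the rounding to five decimal places are all assumptions made for this example; they are not taken from the study's data or workflow.

```python
from collections import Counter
from datetime import date

# Hypothetical geocoded records: (longitude, latitude, date of onset).
cases = [
    (114.305, 30.593, date(2020, 1, 20)),
    (114.305, 30.593, date(2020, 1, 20)),
    (114.305, 30.593, date(2020, 1, 22)),
    (116.407, 39.904, date(2020, 6, 11)),
]

def location_day_counts(records, ndigits=5):
    """Count cases per (rounded longitude, rounded latitude, onset date)."""
    counts = Counter()
    for lon, lat, onset in records:
        # Rounding merges records geocoded to the same address (assumed precision).
        key = (round(lon, ndigits), round(lat, ndigits), onset)
        counts[key] += 1
    return counts

counts = location_day_counts(cases)
for (lon, lat, onset), n in sorted(counts.items()):
    print(lon, lat, onset.isoformat(), n)
```

Cases that share one coordinate pair behave as a single spatial unit, which is how a cluster with a radius of 0 km can arise.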

Space-Time Cluster Detection
The method employed for cluster detection is the space-time scan statistic. The scan statistic is widely used in epidemic clustering analysis. It can effectively detect a local increase of incidence in time and/or space, and test whether the increase is caused by random variation. It can not only detect whether there is clustering in a certain area, but also accurately locate the clustering.
The space-time scan statistic is an extension of the spatial scan statistic put forward in 1997 by Kulldorff, professor at Harvard Medical School [29][30][31]. It adds the time dimension to the original spatial scan statistic so that clustering can be detected in time and space at the same time. Accordingly, the circular window of the spatial scan statistic becomes a cylinder, whose base corresponds to the spatial range and whose height corresponds to a time segment of a certain length. Because the size and position of the cylindrical scanning window are constantly changing, the space-time scan statistic can be used to analyze in depth the time, place, and scale of an epidemic onset, so as to realize the early identification of an epidemic outbreak.
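The cylindrical window described above can be sketched as a simple membership test: a case lies inside the cylinder if it is within the base radius of the center and its day falls within the time span. This is a minimal sketch under stated assumptions: great-circle (haversine) distance with a mean Earth radius of 6371 km, integer day indices, and the hypothetical function names `haversine_km` and `in_cylinder`. It is not the implementation used by the scanning software.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_cylinder(case, center, radius_km, first_day, last_day):
    """Return True if case = (lat, lon, day) lies inside the space-time cylinder."""
    lat, lon, day = case
    center_lat, center_lon = center
    within_base = haversine_km(lat, lon, center_lat, center_lon) <= radius_km
    within_height = first_day <= day <= last_day
    return within_base and within_height

# Hypothetical example: a 10 km base and a window covering days 50 to 56.
center = (30.593, 114.305)
print(in_cylinder((30.60, 114.31, 52), center, 10.0, 50, 56))  # near and in time
print(in_cylinder((39.904, 116.407, 52), center, 10.0, 50, 56))  # far away
```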
The detection process of the space-time scan statistic can be divided into four steps. Firstly, a spatial point in the study area is selected as the center of the base of the cylindrical scanning window. Secondly, the radius of the base and the height of the cylinder are gradually increased. The continuous change of the base area corresponds to the change of the geographical area covered by the scanning window, and the continuous change of the height corresponds to the change of the time span, until the maximum spatial and temporal limits of the scan window are reached. The same scanning process is repeated for all positions of the cylinder in the study area, which produces a large number of scanning windows. Thirdly, for each scan window, the expected number of cases is calculated from the numbers of cases inside and outside the window, and a test statistic, the log likelihood ratio (LLR), is constructed from the observed and expected numbers to evaluate how abnormal the number of cases in the window is. The window with the largest LLR is selected as the window with the most abnormal number of cases. Finally, Monte Carlo simulation is used to evaluate the statistical significance of that window.
The retrospective space-time permutation analysis is employed as the probability model for cluster detection. Its principle is as follows.
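The first two steps above, choosing a center and growing the radius and height up to their limits, can be sketched as an enumeration of candidate cylinders. The sketch makes several simplifying assumptions that are not from the paper: centers are placed on the case locations themselves, coordinates are planar and already in kilometres, candidate radii are the distances from the center to the other points, and `candidate_cylinders` is a hypothetical name.

```python
import math

def candidate_cylinders(points, n_days, max_radius_km, max_days):
    """Yield (center_index, radius_km, first_day, last_day) for each candidate.

    points: list of (x_km, y_km) planar coordinates (a simplification).
    Days are integer indices 0 .. n_days - 1.
    """
    for i, (cx, cy) in enumerate(points):
        # Grow the base: one candidate radius per distinct distance to a point,
        # including 0 km for the center itself, capped at the maximum radius.
        distances = {math.hypot(x - cx, y - cy) for x, y in points}
        radii = sorted(r for r in distances if r <= max_radius_km)
        for radius in radii:
            # Grow the height: every time span of at most max_days days.
            for first_day in range(n_days):
                for last_day in range(first_day, min(n_days, first_day + max_days)):
                    yield i, radius, first_day, last_day

# Hypothetical toy input: three locations, three days, 10 km and 2-day limits.
points = [(0.0, 0.0), (3.0, 4.0), (50.0, 0.0)]
windows = list(candidate_cylinders(points, n_days=3, max_radius_km=10.0, max_days=2))
print(len(windows))
```

Even this toy input yields dozens of windows, which suggests why a national scan over individual addresses involves a very large number of cylinders.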
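First of all, assume that the number of infections in an area $z$ on day $d$ is $C_{zd}$. Then, the total number of infections $C$ in the whole study area during all time segments can be expressed as

$$C = \sum_{z} \sum_{d} C_{zd}.$$

Thus, the expected number of infections per day in each region, $\mu_{zd}$, can be described as

$$\mu_{zd} = \frac{1}{C} \left( \sum_{z} C_{zd} \right) \left( \sum_{d} C_{zd} \right).$$

Therefore, we can calculate the expected number of infections $\mu_A$ in a scanning cylinder $A$ from the expected numbers $\mu_{zd}$ of the units it contains:

$$\mu_A = \sum_{(z,d) \in A} \mu_{zd}.$$

If the observed number of infections in cylinder $A$ is $C_A$, then $C_A$ follows a hypergeometric distribution with mean $\mu_A$. Writing $\sum_{z \in A} C_{zd}$ for the number of infections in the areas of $A$ over all days and $\sum_{d \in A} C_{zd}$ for the number of infections on the days of $A$ over all areas, the probability function of $C_A$ can be calculated as

$$P(C_A) = \frac{\binom{\sum_{z \in A} C_{zd}}{C_A} \binom{C - \sum_{z \in A} C_{zd}}{\sum_{d \in A} C_{zd} - C_A}}{\binom{C}{\sum_{d \in A} C_{zd}}}.$$

When $\sum_{z \in A} C_{zd}$ and $\sum_{d \in A} C_{zd}$ are very small relative to the total number of infections $C$, $C_A$ approximately follows a Poisson distribution with mean $\mu_A$. Based on this approximation, the generalized likelihood ratio (GLR) is used to measure whether the number of infections in cylinder $A$ is abnormal.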
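The expected counts and the GLR described above can be made concrete with a small numeric sketch. The expected count for each area-day is the product of that area's total and that day's total divided by the overall total. The GLR form used here, $(C_A/\mu_A)^{C_A} \, ((C - C_A)/(C - \mu_A))^{C - C_A}$, is the one commonly given for the Poisson approximation in the space-time permutation literature; it is an assumption of this example, since the text above does not spell the formula out. The toy matrix and function names are hypothetical.

```python
def expected_counts(c):
    """c[z][d] = observed cases in area z on day d; returns mu[z][d]."""
    n_days = len(c[0])
    total = sum(sum(row) for row in c)
    area_totals = [sum(row) for row in c]
    day_totals = [sum(row[d] for row in c) for d in range(n_days)]
    return [[area_totals[z] * day_totals[d] / total for d in range(n_days)]
            for z in range(len(c))]

def cylinder_glr(c, areas, days):
    """Return (C_A, mu_A, GLR) for the cylinder covering `areas` and `days`."""
    total = sum(sum(row) for row in c)
    mu = expected_counts(c)
    c_a = sum(c[z][d] for z in areas for d in days)
    mu_a = sum(mu[z][d] for z in areas for d in days)
    if c_a <= mu_a:
        return c_a, mu_a, 1.0  # no excess of cases, so no evidence of a cluster
    inside = (c_a / mu_a) ** c_a
    outside = ((total - c_a) / (total - mu_a)) ** (total - c_a)
    return c_a, mu_a, inside * outside

# Hypothetical toy data: 2 areas x 2 days, 10 cases in total.
c = [[4, 0],
     [1, 5]]
c_a, mu_a, glr = cylinder_glr(c, areas=[0], days=[0])
print(c_a, mu_a, glr)  # 4 observed against 2.0 expected
```

In practice the logarithm of the GLR is usually compared across windows, which avoids overflow when case counts are large.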
In this paper, the location and spatial extent of clusters are estimated from the spatial and temporal array of all cases by using the space-time permutation scan statistic.
The probability that a cluster will form by chance is assigned using Monte Carlo hypothesis testing by employing the likelihood of the statistic in question. For this, the time stamps of data points are shuffled and the statistic is calculated again. The process is then repeated 999 times.
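The Monte Carlo procedure above can be sketched as follows: shuffle the day stamps among the cases, recompute the largest statistic, and rank the observed value among the replications. To stay short, this sketch searches only single-location cylinders and uses the Poisson-approximation log likelihood ratio; both choices, as well as the function names, the toy data, and the fixed random seed, are assumptions for illustration rather than the procedure of the scanning software.

```python
import math
import random

def max_log_glr(locations, days, n_locations, n_days, max_days):
    """Largest log GLR over single-location cylinders (a tiny search space)."""
    total = len(days)
    count = [[0] * n_days for _ in range(n_locations)]
    for z, d in zip(locations, days):
        count[z][d] += 1
    loc_totals = [sum(row) for row in count]
    day_totals = [sum(count[z][d] for z in range(n_locations)) for d in range(n_days)]
    best = 0.0
    for z in range(n_locations):
        for first in range(n_days):
            for last in range(first, min(n_days, first + max_days)):
                c_a = sum(count[z][first:last + 1])
                mu_a = loc_totals[z] * sum(day_totals[first:last + 1]) / total
                if mu_a < c_a < total:  # only windows with an excess of cases
                    llr = (c_a * math.log(c_a / mu_a)
                           + (total - c_a) * math.log((total - c_a) / (total - mu_a)))
                    best = max(best, llr)
    return best

def monte_carlo_p_value(locations, days, n_locations, n_days, max_days,
                        n_replications=999, seed=0):
    """Permute day stamps and rank the observed maximum among the replications."""
    rng = random.Random(seed)
    observed = max_log_glr(locations, days, n_locations, n_days, max_days)
    shuffled = list(days)
    at_least_as_extreme = 0
    for _ in range(n_replications):
        rng.shuffle(shuffled)  # breaks any space-time interaction
        simulated = max_log_glr(locations, shuffled, n_locations, n_days, max_days)
        if simulated >= observed:
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (n_replications + 1)

# Hypothetical toy data: six cases at location 0 on day 2, plus scattered cases.
locations = [0] * 6 + [1, 1, 2, 2, 3, 3]
days = [2] * 6 + [0, 5, 1, 7, 4, 9]
observed, p_value = monte_carlo_p_value(locations, days, 4, 10, max_days=7)
print(f"max log GLR = {observed:.3f}, p = {p_value:.3f}")
```

With 999 replications the smallest attainable p-value is 1/1000, which is why the number of replications is conventionally chosen as one less than a round number.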
Conventional applications of the space-time scan statistic commonly use administrative units such as provinces, cities, and counties as the minimum spatial unit of detection and use the coordinates of the regional center as the location of all the cases that occur in the region [27,28]. However, administrative divisions are not a barrier to disease transmission, and some important clusters may not be detected in a timely manner due to the lack of information from the surrounding areas if provinces, cities, and counties are used as the minimum unit of detection. Our method uses the locations of the patients' communities or families as the basic statistical unit, obtained by translating the patients' home addresses into coordinates, in order to detect space-time clusters at a finer scale.

Results
All 83,377 new COVID-19 cases in mainland China between 2 December 2019 and 20 June 2020 were geocoded and spatialized. The spatial distribution of the COVID-19 cases is shown in Figure 2. The cases are distributed throughout all the provinces of China; no province was spared. Most of the cases are located in Hubei province, whose capital is Wuhan. Tibet has only one case, the fewest of any province in China.
All the geocoded cases were input into the retrospective space-time permutation model using SaTScan v9.6 (Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA). The space-time permutation scan statistic employed in the study uses millions of overlapping cylinders as scanning windows, each being a possible candidate for an outbreak. Different settings of the maximum radius and height of the scanning cylinders can be used to account for the spatio-temporal independence of the clusters. When the distance between the locations of two cases exceeds the maximum scanning radius, or the time interval between them exceeds the scanning height, they are considered to be independent and will not be detected in the same cluster. In this study, we set the maximum cluster size of the spatial window to a circle with a 100 km radius. The maximum temporal cluster size is seven days. The time precision is one day.
Table 1 summarizes the characteristics of the statistically significant space-time clusters of COVID-19 at the family level with a maximum spatial scanning window size of 100 km in mainland China. Figure 3 illustrates the distribution of all the corresponding space-time clusters. In total, 43 clusters were detected with a p-value less than 0.05, of which 10 were identified in Hubei province and 4 in Hebei province. Inner Mongolia and Shandong province had three clusters each. Heilongjiang, Gansu, Sichuan, Liaoning, Guangdong, Fujian, and Zhejiang province had two clusters each. Tianjin, Jilin, Jiangxi, Shaanxi, Guizhou, Anhui, Chongqing, Shanghai, and Shanxi province had one cluster each. All the clusters were detected in the central or eastern provinces. No cluster was detected in the western provinces.
The earliest space-time cluster was identified in Zhejiang province from 15-21 January 2020 with a radius of 88.1 km. The earliest cluster in Hubei province was detected from 19-25 January with a radius of 97.1 km. The third province, autonomous region, or municipality directly under the central government in which a cluster appeared is Inner Mongolia, from 2-4 February 2020 with a radius of 7.9 km. Jiangxi and Anhui, which are adjacent to Hubei, each had a cluster detected from 2-3 February 2020. Nine clusters with a radius of 0 km were detected in Zhejiang, Chongqing, Hebei, Hubei, and Shandong. This means that the cases of each such cluster are located in the same community, and the cluster is likely to be within a single family. The last cluster was identified in Hebei province from 10-16 June 2020 with a radius of 68.2 km. Its observed number of cases is 171, while the expected number is 1.1. In fact, this last cluster was caused by the newly confirmed COVID-19 cases in Beijing related to the Xinfadi market; cases in Hebei province account for only a small proportion. With the maximum spatial scanning window size set to 100 km, this cluster covers areas in both Beijing and Hebei, and the cluster center is located in Hebei.
Table 1. Characteristics of the statistically significant space-time clusters of COVID-19 at the family-level with a maximum spatial scanning window size of 100 km in mainland China from 2 December 2019 to 20 June 2020 (clusters where all cases happen in the same geolocation are reported as having 0 km radii).
In order to discover finer-scale clusters, we set the maximum cluster size of the spatial window to a circle with a 10 km radius. The maximum temporal cluster size and time precision remain seven days and one day. Table 2 summarizes the characteristics of the statistically significant space-time clusters of COVID-19 at the family level with a maximum spatial scanning window size of 10 km in mainland China. Figure 4 illustrates the distribution of all the corresponding space-time clusters. In total, 88 clusters were detected with a p-value less than 0.05. The clusters are distributed in 19 provinces, autonomous regions, or municipalities directly under the central government, as shown in Table 3. The earliest clusters were identified in Hubei from 20-25 January, and the last one in Beijing from 14-20 June. There are 25 clusters located in Hubei province and 8 in Shanghai. Beijing, Guangdong, and Inner Mongolia each have seven clusters. Shanghai, Beijing, and Guangdong are China's top three well-developed provinces or cities. They have well-developed transportation links with Hubei province, especially Wuhan. The large flows of trade and migrant population led the virus to spread from Hubei to these areas more quickly than to other provinces.
Table 2. Characteristics of the statistically significant space-time clusters of COVID-19 at the family-level with a maximum spatial scanning window size of 10 km in mainland China from 2 December 2019 to 20 June 2020 (clusters where all cases happen in the same geolocation are reported as having 0 km radii).
In order to discover more detailed characteristics of the clusters, we enlarged the clustering maps of Wuhan and Beijing, which are reported as the original epidemic areas of the first and second waves of COVID-19 in China, respectively, as shown in Figures 5 and 6. There are 16 clusters identified in Wuhan, all detected in February 2020. These clusters are distributed fairly uniformly across Wuhan, in both urban and suburban areas. To help understand the detailed spatio-temporal structure of the clusters, we colored the case points of each cluster according to the time elapsed since the first case was reported in China. The early clusters formed in the areas east of the Yangtze River during 6-10 February 2020, about 10 weeks after the first case was reported, although the early cases were reported in Hankou district, which is west of the Yangtze River, as shown in Figure 5.

Seven clusters were identified in Beijing, with clustering times ranging from February 2020 to June 2020. The clusters in Beijing are mainly distributed in the urban area. Similarly, we colored the case points by cluster and found that the clusters can be divided into two stages. The first stage includes four clusters from February to March, located in the northern urban area of Beijing, 12 to 17 weeks after the first reported case. These clusters belong to the first wave of COVID-19 in China, which originated in Wuhan. The second stage includes three clusters from the outbreak in June 2020, located in the southern urban area of Beijing, 27 to 29 weeks after the first reported case. These clusters belong to the second wave of COVID-19 in China, which may have originated in the Xinfadi market, as shown in Figure 6. Xinfadi is a large wholesale market in Beijing's Fengtai District that sells fruits, vegetables, and meat, and it came into the spotlight after new COVID-19 clusters were linked to it in June 2020.

Discussion
The study employed the retrospective space-time permutation analysis to detect the space-time clusters of COVID-19 in mainland China on a fine scale. Based on the data obtained from China's National Infectious Disease Information System (IDIS), we geocoded each case and translated the patient's home address into coordinate information in order to capture the exact location of the case. Former studies on detecting space-time clusters of COVID-19 in a country are mainly at the county level [27,28]. This is the first study to identify the clusters at the family level in a large country.
Epidemic diagnosis time can be used to evaluate the emergency response level and treatment capacity of a national or local health department. Most studies perform the retrospective analysis with COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering GIS dashboard. Although these data are updated daily, the daily case statistics are based on the diagnosis date. In fact, the average time from onset to diagnosis for the COVID-19 outbreak is 7.35 days in mainland China [32]. This means that, with the same space-time scan model, the start and end times of clusters detected from diagnosis dates would lag those detected from dates of onset by about 7.35 days on average. Therefore, it is more effective to use the case dates of onset when detecting the space-time clusters of an epidemic.
In order to account for the characteristics of the disease in a small region and to improve the probability of detecting smaller clusters, we set the maximum spatial cluster size to two fine scales: 100 km and 10 km. With the maximum cluster size of 100 km, we detected 43 clusters during the study period, 10 of which were located in Hubei province. However, to our surprise, the earliest space-time cluster was identified in Zhejiang province from 15-21 January 2020 with a radius of 88.1 km. This may be explained by the large flow of population between Hubei and Zhejiang, especially in late January, with many infected students and migrant workers returning to Zhejiang from Wuhan before the closure of the city on 23 January. It is also consistent with a tabulated result of a former study, which identified a significant number of people who entered Wenzhou from Hubei province and which explains why this city was the first outside the epicenter where confinement was adopted. With the maximum cluster size of 10 km, 88 clusters were detected by the space-time scan statistic. The detected clusters are finer than those detected at 100 km. We compared the characteristics of the clusters in Wuhan with those in Beijing. The clusters identified in Wuhan were all detected in February and are distributed uniformly across Wuhan, in both urban and suburban areas. However, the seven clusters detected in Beijing are mainly distributed in the urban area. Four clusters ranging from February to March are located in the northern urban area. The other three clusters, from the outbreak in June, are located in the southern urban area of Beijing; they may have originated in the Xinfadi market and belong to the second wave of COVID-19 in China.
Regardless of whether the maximum clustering size is 100 km or 10 km, the province with the most clusters is Hubei, and the month with the most clusters is February, which indicates that China's COVID-19 epidemic prevention and control strategy is effective and has successfully prevented the virus from spreading from Hubei to other provinces and from lasting too long. When the first wave of the pandemic hit, virus-testing capabilities were not strong, as the virus was unknown at the time. However, China did a satisfactory job in data tracking, patient tracking, community quarantine, and early warning from front-line fever clinics, which helped to close loopholes. China was under great pressure when Wuhan was closed on 23 January 2020. That strategy reduced the risk that the epidemic would spread to other areas of China during the Spring Festival holiday and form more clusters.
There are also some limitations in the study. Firstly, we use the home coordinate as the unit of the space-time scan statistic, but no population data at the home-address level can be collected in China. Therefore, we can only select the space-time permutation model rather than the Poisson model among the discrete scan statistics. Secondly, the clusters detected are circular. In fact, variations in geography and cultural practices will in many cases invalidate this assumption. Non-circular clusters may help to improve the detection. Thirdly, it is difficult to present all the fine space-time clusters in such a large country as China, especially when there are 88 clusters with a maximum cluster size of 10 km to be shown on the map. Fourthly, the statistical standards changed during the study period. Previously, only people who had been diagnosed by laboratory (nucleic acid) testing were counted as confirmed cases. However, from 12 February 2020, clinically diagnosed cases recognized by doctors were also included in the statistics of confirmed cases, thus causing a sharp increase of cases and clusters in early February, given the roughly one-week interval from onset to diagnosis. Finally, using the conventional space-time scan statistic to detect fine-scale space-time clusters at the family level across the whole of China takes a lot of time. Combining it with optimization algorithms such as particle swarm optimization [33], the probabilistic cellular automata model [34], and the coupled spring forced bat algorithm [35] may help improve the efficiency of the method.

Conclusions
In this study, we use the retrospective space-time scan statistic to detect clusters of COVID-19 in mainland China on two fine scales, with maximum clustering radii of 100 km and 10 km. Unlike other studies, our analysis is based on case dates of onset, which were collected from the database of diagnosed and suspected cases of COVID-19 in mainland China established by the special group of big data analysis, which is subordinate to the Joint Prevention and Control Mechanism of the State Council. In addition, this is the first study to identify the space-time clusters of COVID-19 in a large country at the family level. The results show that the detected clusters vary with the maximum clustering radius. From 2 December 2019 to 20 June 2020, 43 space-time clusters were detected with a maximum clustering radius of 100 km and 88 clusters with a maximum clustering radius of 10 km. Using a smaller clustering radius may identify finer clusters. Hubei has the most clusters regardless of scale. Most of the clusters were generated in February. This indicates that China's COVID-19 epidemic prevention and control strategy is effective and has successfully prevented the virus from spreading from Hubei to other provinces and from lasting too long. Well-developed provinces or cities that have large populations and developed transportation networks are more likely to generate space-time clusters. The analysis based on dates of onset may detect the start times of clusters seven days earlier than the same analysis based on diagnosis dates. Our family-level analysis of space-time clustering based on dates of onset can be reproduced in other countries that are still seriously affected by the epidemic, such as the USA, India, and Brazil, providing them with more precise signals of clustering.