Epidemiological identification of a novel infectious disease in real time: Analysis of the atypical pneumonia outbreak in Wuhan, China, 2019-20

Objective: Virological tests indicate that a novel coronavirus is the most likely explanation for the 2019-20 pneumonia outbreak in Wuhan, China. We demonstrate that non-virological descriptive characteristics could have determined that the outbreak is caused by a novel pathogen in advance of virological testing. Methods: Characteristics of the ongoing outbreak were collected in real time from two medical social media sites. These were compared against characteristics of ten existing pathogens that can induce atypical pneumonia. The probability that the current outbreak is due to "Disease X" (i.e., previously unknown etiology) as opposed to one of the known pathogens was inferred, and this estimate was updated as the outbreak continued. Results: The probability that Disease X is driving the outbreak was assessed as over 32% on 31 December 2019, one week before virus identification. After some specific pathogens were ruled out by laboratory tests on 5 Jan 2020, the inferred probability of Disease X was over 59%. Conclusions: We showed quantitatively that the emerging outbreak of atypical pneumonia cases is consistent with causation by a novel pathogen. The proposed approach, that uses only routinely-observed non-virological data, can aid ongoing risk assessments even before virological test results become available. Keywords: Epidemic; Causation; Bayes' theorem; Diagnosis; Prediction; Statistical model

leading us to believe that the cluster of cases was due to "Disease X" (i.e., an infectious disease of previously unknown viral etiology). However, rigorous quantitative assessment of the chance that the disease is in fact Disease X has not previously been undertaken. The present study addresses this, demonstrating that non-virological information can lead to an objective classification of Disease X, using a simple statistical model that exploits the well-known Bayes' theorem.

medRxiv preprint doi: https://doi.org/10.1101/2020.01.26.20018887. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

METHODS

As the outbreak unfolded, we calculated in real time the probability that the pathogen responsible for the atypical pneumonia was novel (Disease X) rather than a previously known pathogen that can cause pneumonia. Our analysis began on 30 December 2019, when the Wuhan Municipal Health Commission announced that there had been a surprisingly large number of atypical pneumonia cases.

At that time, we assumed the causative agent could have been one of seven known viral or three known bacterial diseases, along with the chance that it was instead Disease X. We tracked two of the most active medical social media sites, i.e., ProMED (ProMED, 2020) and Flutracker (Flutracker, 2020), which reported the non-virological characteristics of the outbreak as it progressed, including atypical pneumonia, other clinical characteristics, and exposure factors. These characteristics do not necessarily represent the features that were causing disease, but are instead basic observations from the ongoing outbreak. Given these characteristics, we then calculated the probability that the ongoing outbreak was due to a known disease or to the unknown Disease X. On the first day of calculation (i.e., 30 December 2019), the only explanatory factor we included was atypical pneumonia, which was common to all enumerated diseases. Our analysis represents simple logical deductions, made in a quantitative manner, from the limited data that were available during the outbreak, and it was updated in real time to reflect new information about the outbreak as it became available.
Table 1 shows the information compiled about the current outbreak, and the date on which each of these characteristics was discovered. Each characteristic listed was assigned a value of zero or one, denoting whether or not that characteristic was likely at the level of the outbreak (not of individual cases) for the emerging outbreak, and the equivalent values for outbreaks of previously observed pathogens were also noted. We considered two alternative assumptions about which exposure characteristics to include: (i) the characteristics of previously known disease outbreaks are based only on empirical observation, so that the new exposure variable (i.e., exposure at a wet market), which is specific to the novel coronavirus in Wuhan and may be non-informative for other outbreaks, is excluded from the calculation; and (ii) all exposure characteristics are known for all previously known outbreaks, so that all enumerated factors are included. Also, once pathogens were ruled out as the causative agent of the current outbreak, they were removed from our analysis: for example, highly pathogenic avian influenza (HPAI) (H5N1) was confirmed not to be the causative agent by laboratory testing on 3 January 2020. Hence, we omitted this pathogen from our analysis from 3 January 2020 onwards.
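The bookkeeping of candidates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Candidate` class and the helpers `active_candidates` and `uniform_prior` are hypothetical names, and four of the ten known pathogens appear as placeholders because they are not named in this section. The exclusion dates are those reported in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    ruled_out_on: Optional[date] = None  # None: not excluded by testing


# Exclusion dates follow the text. The last four names are placeholders,
# because the remaining known pathogens are not named in this section.
CANDIDATES = [
    Candidate("Disease X"),
    Candidate("Adenoviruses", date(2020, 1, 3)),
    Candidate("HPAI (H5N1)", date(2020, 1, 3)),
    Candidate("HPAI (H7N9)", date(2020, 1, 3)),
    Candidate("Other influenza viruses", date(2020, 1, 3)),
    Candidate("SARS-associated coronavirus", date(2020, 1, 5)),
    Candidate("MERS-associated coronavirus", date(2020, 1, 5)),
    Candidate("Known viral pathogen (placeholder)"),
    Candidate("Known bacterial pathogen 1 (placeholder)"),
    Candidate("Known bacterial pathogen 2 (placeholder)"),
    Candidate("Known bacterial pathogen 3 (placeholder)"),
]


def active_candidates(on: date) -> List[str]:
    """Names of candidates still under consideration on a given date."""
    return [
        c.name
        for c in CANDIDATES
        if c.ruled_out_on is None or on < c.ruled_out_on
    ]


def uniform_prior(on: date) -> float:
    """Identical prior probability for every candidate still considered."""
    return 1.0 / len(active_candidates(on))
```

Under these assumptions the number of candidates falls from eleven to seven on 3 January 2020 and to five on 5 January 2020, so the identical prior moves from 1/11 to 1/7 to 1/5.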

To assess the probability that the emerging outbreak was caused by a variant of a known pathogen, we first calculated the distance between the set of characteristics of the ongoing outbreak and those of previously known pathogens. The distance between the characteristics of the ongoing outbreak and cases due to pathogen $j$ is denoted by $d_j$. We assumed that the probability that the outbreak is due to a variant of pathogen $j$ decreases exponentially with distance $d_j$. Then, by Bayes' theorem,

$$\Pr(\text{pathogen } i \mid \text{data}) = \frac{\Pr(\text{data} \mid \text{pathogen } i)\, q_i}{\sum_{j} \Pr(\text{data} \mid \text{pathogen } j)\, q_j}, \qquad (1)$$

where $q_i$ is the prior probability that pathogen $i$ is the causative agent. We took this prior to be identical across candidates, so that $q_i$ was simply the reciprocal of the number of pathogens being considered (including Disease X) on each date in our analysis. We initially estimated the distance between the observed characteristics of the outbreak and each known candidate pathogen using the Hamming distance (i.e., the sum of squared differences between the binary entries in the columns of Table 1 corresponding to Disease X and the candidate pathogen).
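For binary vectors, the sum of squared differences equals the number of positions at which the two vectors differ, which is the Hamming distance. The sketch below illustrates this equivalence; the function names and the two example vectors are hypothetical and are not taken from Table 1.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length 0/1 vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if any(v not in (0, 1) for v in (*a, *b)):
        raise ValueError("entries must be 0 or 1")
    return sum(x != y for x, y in zip(a, b))


def sum_squared_differences(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of squared entry-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical characteristic vectors (not the values in Table 1).
outbreak = [1, 1, 0, 1, 1]
candidate = [1, 0, 0, 0, 1]

d = hamming_distance(outbreak, candidate)
```

Here the two vectors differ in two positions, so both functions return 2.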

Then, we assumed that the probability that the outbreak is driven by disease $j$ is governed by a negative exponential function,

$$\Pr(\text{data} \mid \text{pathogen } j) \propto \exp(-d_j), \qquad (2)$$

where $d_j$ is the calculated Hamming distance.

We also repeated our analysis using an alternative measure of the distance between the observed characteristics of the outbreak and each known candidate pathogen, namely the Euclidean distance (i.e., the square root of the Hamming distance). In each case, we assumed that every characteristic had identical weight in our analysis, so that a simple quantitative assessment could be obtained in a probabilistic manner without the need for subjective judgement.
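The effect of switching metrics can be illustrated with a short sketch. It assumes that a candidate's weight takes the form exp(-d), as the negative exponential assumption in the text suggests; `compare_metrics` is a hypothetical helper, not part of the original analysis.

```python
import math
from typing import Tuple


def compare_metrics(hamming: int) -> Tuple[float, float]:
    """Return (weight under Hamming, weight under Euclidean) for one candidate.

    The Euclidean distance is taken as the square root of the Hamming
    distance, and the weight is assumed to be exp(-distance).
    """
    if hamming < 0:
        raise ValueError("distance must be non-negative")
    euclidean = math.sqrt(hamming)
    return math.exp(-hamming), math.exp(-euclidean)


for h in (0, 1, 4, 9):
    w_hamming, w_euclidean = compare_metrics(h)
    print(f"Hamming {h}: weight {w_hamming:.4f} vs Euclidean weight {w_euclidean:.4f}")
```

Because the square root compresses distances greater than one, known pathogens retain more weight under the Euclidean metric. Under this assumed form, that is consistent with the lower Disease X probabilities reported for the Euclidean metric.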

Combining equations (1) and (2), and assuming that $q_i$ is identical over $i$, we have:

$$\Pr(\text{pathogen } i \mid \text{data}) = \frac{\exp(-d_i)}{\sum_{j} \exp(-d_j)},$$

where the sum runs over all candidates, including Disease X. The probability that the outbreak is driven by Disease X corresponds to the distance $d_X = 0$, and represents a risk score taking values between the reciprocal of the number of candidate pathogens (including Disease X itself) and one:

$$\Pr(\text{Disease X} \mid \text{data}) = \frac{1}{1 + \sum_{j=1}^{n} \exp(-d_j)}, \qquad (3)$$

where the sum runs over the $n$ known candidate pathogens.

Supposing that there are $n$ known pathogens responsible for atypical pneumonia, the probability of observing Disease X without any information is identical to the probability of observing any other listed pathogen (i.e., $1/(1+n)$), and as pathogens are ruled out by laboratory testing, this common probability increases (i.e., 1/11 until 2 Jan 2020, 1/7 from 3 Jan 2020 and 1/5 from 5 Jan 2020 in the current outbreak). In addition, if the probability of observing Disease X according to equation (3) takes a value close to the probability of observing other candidate pathogens, the overall probability that the outbreak is due to a novel pathogen should be interpreted as being low. A result of significant practical importance, however, is when the probability of observing Disease X is close to one or much larger than the probability corresponding to each previously observed candidate pathogen. In that case, none of the candidate pathogens is similar to the causative agent of the ongoing outbreak, and so the outbreak is likely to be due to a novel pathogen.

We converted the probability of Disease X into the equivalent percentage value (so that, for example, a result of 0.8 in equation (1) is assumed to mean an 80% probability) and refer to the percentage value as the "probability of Disease X" hereafter.
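The risk score described above can be sketched in a few lines. The form used here, 1 / (1 + sum of exp(-d_j)) over the known candidates, is our reading of the description in the text (identical priors, a negative exponential in distance, and a distance of zero for Disease X), and `disease_x_probability` is a hypothetical name.

```python
import math
from typing import Iterable


def disease_x_probability(known_distances: Iterable[float]) -> float:
    """Posterior probability of Disease X under identical priors.

    Disease X has distance zero (weight 1); each known candidate j
    contributes the assumed weight exp(-d_j).
    """
    known_weight = sum(math.exp(-d) for d in known_distances)
    return 1.0 / (1.0 + known_weight)


# With no discriminating information every distance is zero, so the
# score equals the identical prior 1 / (1 + n).
baseline = disease_x_probability([0] * 10)

# When every known candidate is far from the outbreak, the score approaches one.
extreme = disease_x_probability([50] * 10)

print(f"baseline: {100 * baseline:.1f}%")
```

The two cases correspond to the lower and upper ends of the range stated in the text: the reciprocal of the number of candidates, and one.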

RESULTS

We show temporal changes in estimates of the probability that the ongoing outbreak is driven by each candidate pathogen in Figure 1. Because the only information on 30 December 2019 was that cases displayed symptoms of pneumonia, the distances between the ongoing outbreak and the ten known diseases were all zero, and thus all eleven candidate pathogens initially showed an identical probability of 9.1% (i.e., 1/11). Additional characteristics became known the following day (i.e., 31 December 2019), and consequently the inferred probability that the outbreak was driven by a novel pathogen increased substantially, to 58.6% and 36.9% for the Hamming and Euclidean distance metrics, respectively. When the exposure characteristic that is specific to the ongoing outbreak (i.e., exposure at a wet market) was excluded from the analyses, the probability of observing Disease X given the observed characteristics was as high as 48.7% and 32.6% for the Hamming and Euclidean distance metrics, respectively.
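As a rough arithmetic check on these figures, the sketch below inverts the risk score to recover the combined weight of the known candidates. It assumes the score has the form 1 / (1 + W), where W is the sum of exp(-d_j) over known pathogens; this form and the helper name `implied_known_weight` are assumptions made for illustration.

```python
def implied_known_weight(p_x: float) -> float:
    """Combined weight W of known candidates implied by p_x = 1 / (1 + W)."""
    if not 0.0 < p_x <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / p_x - 1.0


# 30 December 2019: all ten known pathogens at distance zero.
baseline_percent = round(100 * 1 / 11, 1)

# 31 December 2019, Hamming metric: reported probability of 58.6%.
w_31_dec = implied_known_weight(0.586)

print(baseline_percent, round(w_31_dec, 2))
```

Under this assumed form, the combined weight of the ten known pathogens falls from 10 on 30 December to roughly 0.71 on 31 December for the Hamming metric.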

Later in the outbreak, adenoviruses, HPAI (H5N1 and H7N9) and other influenza viruses were ruled out on 3 January 2020, leading to the probability of Disease X being assessed as 90.7% and 57.2% for the Hamming and Euclidean distance metrics, respectively, when all factors were considered as characteristics. Excluding the wet market exposure, the probability of Disease X was 78.2% and 50.6% for the Hamming and Euclidean distance metrics, respectively. SARS- and MERS-associated coronaviruses were ruled out as the causative agent on 5 January 2020, leading to a very high estimate for the probability that the outbreak is caused by a novel pathogen once all information had been collected.
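Why laboratory exclusions raise the probability of Disease X can be seen in a small sketch: each excluded pathogen removes a positive term from the denominator of the risk score. The score is assumed to be 1 / (1 + sum of exp(-d_j)) over the remaining known candidates, and the pathogen labels and distances below are hypothetical, not the values underlying the reported percentages.

```python
import math
from typing import Dict, FrozenSet


def disease_x_after_exclusion(
    distances: Dict[str, float],
    ruled_out: FrozenSet[str] = frozenset(),
) -> float:
    """Disease X probability using only known candidates not yet ruled out."""
    remaining = [d for name, d in distances.items() if name not in ruled_out]
    return 1.0 / (1.0 + sum(math.exp(-d) for d in remaining))


# Hypothetical Hamming distances for four illustrative known candidates.
distances = {"pathogen A": 1, "pathogen B": 2, "pathogen C": 3, "pathogen D": 4}

p_before = disease_x_after_exclusion(distances)
p_after = disease_x_after_exclusion(distances, frozenset({"pathogen A", "pathogen B"}))
```

Excluding the closest candidates has the largest effect, because they carry the largest weights.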

As of 12 January 2020, the probability of Disease X is estimated to be 92.5% and 65.5% [...]

[...] while awaiting the results of virological tests. We believe that the proposed approach [...]

It is critically important to discuss two issues involved in the definition of the variables in Table 1. First, a critical underlying assumption is that Table 1 [...] Table 1, and ours serves only for exposition, using a typical Table 1 that the authors devised. Second, as we have shown, there are multiple combinations of characteristic data that could be used. Namely, as exposure to a wet market for known disease outbreaks other than HPAI was not necessarily derived from empirical observation, the fairness of assuming that the majority of cases in those known disease outbreaks had not visited a wet market is open to debate.

In the past, descriptive outbreak information has been used to produce sensitive outbreak case definitions, and causative agents have been pinpointed without using statistical methods in combination with epidemiological observations. In the present study, we have shown that such assessments can be made quantitatively using a simple [...] (Table 1), these data can contribute to narrowing down the possible range of causative agents. In the case of the outbreak in Wuhan, our calculation of the probability that each pathogen is the causative agent indicates that virologically excluding the possibility of [...]

As important limitations, the precision and credibility of the input data, and the method for calculating the distance between candidate diseases and the observed outbreak, must be refined in the future. First, our proposed approach used the very limited data in Table 1 for logical quantification of the probability that each pathogen was the causative agent. However, with more clinical data, the dataset of characteristics could be replaced by continuous frequencies (e.g., the frequencies of cases experiencing coughing and difficulty in breathing) rather than binary variables, and then the proposed method could even be used for screening suspected cases. Second, with such data it would also be possible to model the likelihood of a pathogen in equation (1) not by arbitrarily measuring the distance but by using classification models based on regression or more sophisticated machine learning approaches. Third, the erroneous input of incorrect information may be a challenge in real-time analyses, although this did not appear to be an issue during the course of our analysis of the outbreak in Wuhan. However, it must be considered that the veracity of the source of information for such an analysis could have an impact on the resulting probability calculations. Fourth, the estimated probability that an outbreak is driven by a novel pathogen might be slightly over- or underestimated due to limited information about the mode of transmission and small numbers of observed cases. Of note, we believe that without 100% specificity for bacterial pathogens linked to the ongoing outbreak, the exclusion of bacterial pathogens as candidates cannot be guaranteed, although a bacterial cause of the current outbreak may become less suspected over time as partial clinical evidence accumulates. Nevertheless, the large number of characteristics that could be considered for the outbreak in Wuhan suggests that estimation was not compromised in this study. Finally, we had to restrict ourselves to assuming that the prior probability of every candidate ($q_i$) is identical. However, since the prior probability of observing an outbreak driven by a Disease X is completely unknown, we believe that this assumption is plausible in practice.
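The first limitation above points to replacing binary indicators with continuous frequencies. One possible way to do this, offered only as a sketch and not as the authors' method, is to keep the sum of squared differences but let each entry be the proportion of cases showing a characteristic. The function name and the example frequencies are hypothetical.

```python
from typing import Sequence


def squared_distance(outbreak_freq: Sequence[float], pathogen_freq: Sequence[float]) -> float:
    """Sum of squared differences between frequency vectors with entries in [0, 1].

    With 0/1 entries this reduces to the Hamming distance.
    """
    if len(outbreak_freq) != len(pathogen_freq):
        raise ValueError("vectors must have the same length")
    if any(not 0.0 <= v <= 1.0 for v in (*outbreak_freq, *pathogen_freq)):
        raise ValueError("frequencies must lie in [0, 1]")
    return sum((o - p) ** 2 for o, p in zip(outbreak_freq, pathogen_freq))


# Hypothetical proportions of cases with cough and with difficulty in breathing.
d_continuous = squared_distance([0.8, 0.6], [0.5, 0.2])

# Binary inputs recover the count of mismatched characteristics.
d_binary = squared_distance([1, 0, 1], [0, 0, 1])
```

A design consequence of this choice is that small differences in frequency contribute much less than a full mismatch, so the distance no longer takes integer values.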

Despite the future improvements to our statistical modelling framework that are required, this short study has demonstrated clearly that the ongoing outbreak of pneumonia cases in Wuhan is consistent with causation by a novel pathogen, "Disease X." Analyses of the type conducted in this study can greatly support virological and genetic efforts to characterize the causal agent of this and future outbreaks, with the benefit that such analyses can be carried out extremely quickly.

[...] model formulation, supervision, fund raising, validation, writing.


*Severe acute respiratory syndrome; **Middle East respiratory syndrome; ***Highly pathogenic avian influenza. Zeros represent characteristics that are unlikely for outbreaks of that pathogen, and ones represent characteristics that occur. Dates and characteristics for the ongoing outbreak were obtained from two online information systems [5,6], and information for other pathogens was summarised from the pathogen-specific pages on the WHO and CDC websites.