Prevalence in News Media of Two Competing Hypotheses about COVID-19 Origins

Abstract: The COVID-19 pandemic has been one of the most disruptive and painful phenomena of the last few decades. As of July 2021, the origins of the SARS-CoV-2 virus that caused the outbreak remain a mystery. This work analyzes the prevalence in news media articles of two popular hypotheses about SARS-CoV-2 virus origins: the natural emergence and the lab-leak hypotheses. Our results show that for most of 2020, the natural emergence hypothesis was favored in news media content while the lab-leak hypothesis was largely absent. However, something changed around May 2021 that caused the prevalence of the lab-leak hypothesis to increase substantially in news media discourse. This shift has not been uniform across media organizations but has instead manifested itself more acutely in some outlets than in others. Our structural break analysis of daily news media usage of terms related to the laboratory escape hypothesis provides hints about potential sources of this sudden shift in the prevalence of the lab-leak hypothesis in prestigious news media.


Introduction
The COVID-19 pandemic has caused enormous amounts of human suffering worldwide. As of July 2021, the origins of the SARS-CoV-2 virus that caused the outbreak remain unknown. There are two popular competing hypotheses about these origins. One asserts that the virus probably leaped naturally from wildlife to people (Calisher et al. 2020; Andersen et al. 2020). The other proposes that the virus might have accidentally leaked from a biolab (Bloom et al. 2021; Wade 2021). No conclusive evidence for either hypothesis has yet been uncovered. Determining which hypothesis is correct is critical to preventing a similar outbreak, with its associated catastrophic loss of human life, from occurring in the future.
The author of this work noticed a sudden increase in media mentions of the lab-leak hypothesis around mid-2021. Obviously, anecdotal evidence based on individual subjective perceptions could be the result of cognitive biases. Thus, the author carried out a quantitative analysis of news media content aimed at comparing the temporal dynamics of the two competing SARS-CoV-2 origin hypotheses. This work reports the results of that analysis and provides convenient visualizations of the chronological news media coverage of the two alternative conjectures.
Computational content analysis of large bodies of text can help elucidate the semantic associations embedded in the text (Rozado 2020b). Simply charting word frequencies in a diachronic corpus of news content tracks the time course of historical events and highlights the dynamics of social trends within the cultural context in which the texts were produced (Rozado 2020a). Figure 1 illustrates the validity of our method by tracking sociocultural phenomena related to the COVID-19 pandemic. The first row shows the temporal dynamics of the different terms used to refer to the virus that caused the pandemic. The second row displays the shifting attention paid by news media to several measures aimed at alleviating the havoc caused by the virus (Jorge 2021). The third row illustrates how the media reflected the climate of social fear, peaking around March 2020, as the severity of the virus became clear. The subsequent concern with the mounting death toll likely prompted the mandates for lockdowns, masks, and social distancing protocols that the media echoed. The last row of Figure 1 shows how in the critical time around February-March 2020, the theme of freedoms and civil liberties became less prominent in written news media articles, probably due to the urgency of containing the virus. This theme, however, rebounded in prevalence between May and July of the same year, perhaps prompted by concerns about overreaching mandates to stop the virus. Figure 1o illustrates how the media reflected the unemployment damage caused by the pandemic, and subplot p tracks the occurrence of the anti-vaxxers theme.
Soc. Sci. 2021, 10, x FOR PEER REVIEW 2 of 14
News media are supposed to provide their audiences with the facts and information that said audiences need to understand current events. In a recent UK survey, most people felt that news media organizations helped them respond to the COVID-19 crisis, but a third of respondents believed that news coverage made the crisis worse (Nielsen 2020). Previous work has investigated the role of news media in COVID-19 misperceptions (Bridgman et al. 2020). Other work has reported declining public trust in news media reporting about COVID-19 (Fletcher et al. 2020). The role of news media in terms of agenda-setting with respect to COVID-19 vaccination has also been studied (Medina et al. 2021).
Agenda-setting theory (McCombs and Shaw 1972) studies how news coverage of events shapes the formation of public opinion (McCombs 2018). Previous research has investigated whether news media influence their consumers by establishing a hierarchy of thematic prevalence that filters information streams and shapes audiences' perceptions of current events (Rogers and Dearing 1988). This work explores the prevalence in news media content of two competing hypotheses about COVID-19 origins. Interpreting the results through the lens of agenda-setting theory can provide insight into how the thematic prominence in news outlets of a given hypothesis about SARS-CoV-2 origins has the potential to shape audiences' perceptions about the causal roots of a major and dramatic event such as the COVID-19 pandemic.

Methods
The textual content of news and opinion articles from the outlets listed in Figure 1 is available in the outlets' online domains and/or public cache repositories such as Google cache, the Internet Wayback Machine (Notess 2002), and Common Crawl (Mehmood et al. 2017). Textual content included in our analysis is circumscribed to the articles' headlines and main text and does not include other article elements such as figure captions. This work has not analyzed video or audio content of news media organizations, except when an outlet explicitly provides a transcript of such content in article form. Targeted textual content was located in raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts.
Frequency usage of a target word or n-gram in an outlet for any given temporal interval (monthly, weekly or daily) was estimated by dividing the number of occurrences of the target word/n-gram in all articles within a given interval by the total number of all words in all articles within that interval. This method of estimating frequency accounts for variable volume of total article output over time.
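The frequency estimate described above can be sketched as follows. This is a minimal illustration rather than the analysis script used in this work: the regular-expression tokenizer and the function names `tokenize`, `count_ngram`, `relative_frequency`, and `frequency_by_interval` are assumptions introduced here for clarity.

```python
import re

# Assumed tokenizer: lowercase alphanumeric runs, keeping internal hyphens
# so that "lab-leak" stays a single token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_ngram(tokens, ngram):
    """Count occurrences of an n-gram (a tuple of tokens) in a token list."""
    n = len(ngram)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == ngram
    )


def relative_frequency(articles, target):
    """Occurrences of `target` in all articles divided by total words in them."""
    ngram = tuple(tokenize(target))
    hits = 0
    total = 0
    for text in articles:
        tokens = tokenize(text)
        hits += count_ngram(tokens, ngram)
        total += len(tokens)
    return hits / total if total else 0.0


def frequency_by_interval(dated_articles, target):
    """Group (interval_label, text) pairs by interval; one frequency per interval."""
    grouped = {}
    for label, text in dated_articles:
        grouped.setdefault(label, []).append(text)
    return {
        label: relative_frequency(texts, target)
        for label, texts in sorted(grouped.items())
    }
```

The interval label can be a month, an ISO week, or a calendar day, so the same function covers the monthly, weekly, and daily analyses. Because the denominator is the total word count in the interval, an outlet that publishes more articles in a given month does not mechanically receive a higher score.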
Latent associations in news media content were measured using embedding models. Reliable word embeddings require substantial amounts of textual data to produce robust results. Thus, we derived word embedding models for each month between October 2019 and June 2021 from the combined monthly content of all outlets. The gensim (Řehůřek and Sojka 2010) implementation of word2vec with the continuous bag of words (CBOW) architecture setting was employed to train the embedding models.
Prior to estimating word embedding models, tokens were lowercased. Markup language tags, URLs, non-alphanumeric characters, punctuation, and multiple spaces were removed before training the embedding models.
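A minimal sketch of the cleaning steps just listed is given below. The regular expressions, the order of the steps, and the choice to replace removed characters with a space (rather than deleting them) are assumptions made for illustration; the function names `clean_text` and `to_tokens` are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # markup language tags
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # URLs
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # punctuation and other symbols (ASCII-only assumption)
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace


def clean_text(raw):
    """Lowercase the text and strip tags, URLs, non-alphanumerics, and extra spaces."""
    text = raw.lower()
    text = TAG_RE.sub(" ", text)        # remove tags first so URLs end at whitespace
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def to_tokens(raw):
    """Return the cleaned text as a list of tokens, ready for embedding training."""
    return clean_text(raw).split()
```

Under these assumptions, hyphenated terms are split into separate tokens ("lab-leak" becomes "lab" and "leak"), which matters when choosing the target words to query in the resulting models.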
For training the word embedding models, the following parameters were used: vector dimensions = 300, window size = 10, negative sampling = 10, down sampling of frequent words = 0.0001, minimum frequency count = 5 (only terms that appear at least 5 times in the corpus were included in the word embedding model vocabulary), and number of training iterations (epochs) through the corpus = 5. The exponent used to shape the negative sampling distribution was the default 0.75.
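For readers who wish to reproduce a comparable setup, the parameters above map onto the gensim `Word2Vec` constructor roughly as sketched below. The keyword names follow gensim 4.x (earlier releases used `size` and `iter` instead of `vector_size` and `epochs`); the helper names `train_monthly_model` and `train_all_months` are hypothetical, and settings not reported in the text (for example the random seed and number of worker threads) are left at the library defaults.

```python
# Training parameters as reported in the text, using gensim 4.x keyword names.
W2V_PARAMS = {
    "vector_size": 300,   # vector dimensions
    "window": 10,         # context window size
    "negative": 10,       # number of negative samples
    "sample": 0.0001,     # down-sampling threshold for frequent words
    "min_count": 5,       # drop terms with fewer than 5 occurrences
    "epochs": 5,          # training iterations through the corpus
    "ns_exponent": 0.75,  # exponent shaping the negative sampling distribution
    "sg": 0,              # 0 selects the CBOW architecture
}


def train_monthly_model(sentences, params=None):
    """Train one word2vec model on an iterable of token lists (one month of content)."""
    # Imported lazily so the parameter mapping can be inspected without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **(params or W2V_PARAMS))


def train_all_months(sentences_by_month):
    """Train a separate model per month, given {"YYYY-MM": [token lists]}."""
    return {
        month: train_monthly_model(sentences)
        for month, sentences in sorted(sentences_by_month.items())
    }
```

Training a separate model per month, rather than a single model over the whole corpus, is what allows association strengths to be compared chronologically.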

Prevalence of Two Alternative COVID-19-Origins Hypotheses in News Media
We now focus on analyzing news media treatment of the two competing hypotheses regarding the origin of the pandemic. The first officially acknowledged signs of COVID-19 surfaced around December 2019 in Wuhan, China (Sohrabi et al. 2020). Chinese authorities initially reported that many early cases had been traced back to the Wuhan wet market. This was reminiscent of the 2003 SARS1 outbreak, in which a bat virus first jumped to civets, some of which were sold in wet markets, and from there leaped the species barrier again to infect people (LeDuc and Barry 2004). Many scientists and Chinese government officials proposed that a similar event could have happened again, perhaps with pangolins as the intermediate host this time, or through direct transmission from bats to humans (Zhou et al. 2020). The Wuhan wet market was singled out as the possible breeding ground of the outbreak (China Daily 2020). News media at the time echoed this first plausible explanation (see the first row of Figure 2) and largely ignored the possibility of a lab-leak, as evidenced in the second, third, and fourth rows of Figure 2. Over time, however, conclusive supporting evidence for the natural emergence hypothesis has not materialized, as no signs of prior intermediate host infection with COVID-19 have been found despite an intensive search (Wade 2021). Initial cases of COVID-19 not linked to the Wuhan wet market were also eventually reported (Chan et al. 2020). This perhaps explains the decreasing prevalence of the intermediate host hypothesis in news media content since its peak around February-April of 2020; see Figure 2a-d.
The possibility of a lab-leak was largely absent from news media discourse during most of 2020 and only gained prominence in mid-2021, as shown in the second row of Figure 2. A compelling reason not to rule out the lab-leak hypothesis is that Wuhan hosts a virology laboratory that conducted research work on coronaviruses, the Wuhan Institute of Virology (WIV). Media interest in the lab has, however, only peaked recently, as reflected in Figure 2i.
The WIV is also China's only maximum biosafety level-4 (BSL-4) laboratory, meaning it is authorized and equipped to work on the most dangerous viral pathogens; see Figure 2j. Critically, since at least 2015, this research lab had been working on gain-of-function experiments to make coronavirus strains more infectious to cells lining the human respiratory tract (Daszak 2014; Wade 2021), allegedly under inadequate safety conditions (Washington Post 2020). The rationale for such research is that the insights gained from it could be useful to prevent natural spillovers. Media mentions of the WIV engaging in this type of research only picked up in May-June of 2021; see Figure 2k.
The hypothesis of a lab-leak was dismissed early in 2020 by some prominent members of the scientific community as a conspiracy theory, and their opinions were published in prestigious scientific journals such as The Lancet (Calisher et al. 2020) and Nature Medicine (Andersen et al. 2020). This could perhaps explain why mainstream news media mostly echoed the natural emergence hypothesis and largely ignored the alternative lab-leak hypothesis during most of 2020, as the world suffered the brunt of the pandemic.
At least one signatory of The Lancet letter (Calisher et al. 2020) was a member and president of the EcoHealth Alliance, an organization that had funded coronavirus gain-of-function research at the Wuhan Institute of Virology with U.S. government grants from the National Institute of Allergy and Infectious Diseases (NIAID) (Daszak 2014; Wade 2021). The NIAID coronavirus gain-of-function research grant to the Wuhan Institute of Virology through EcoHealth has only recently attracted substantial media attention; see subplot l in Figure 2.
The most similar public genome to SARS-CoV-2 is that of a bat coronavirus known as RaTG13, with a genome similarity to SARS-CoV-2 of 96% (Zhou et al. 2020). This virus was retrieved from a cave in Yunnan province (1800 km away from Wuhan) and was sequenced and published by staff from the Wuhan Institute of Virology (Zhou et al. 2020). Media attention to it has also only recently become prominent; see Figure 2m.
Several relevant molecular features of SARS-CoV-2 were also largely underreported by mainstream news media during 2020. In the middle of the SARS2 spike protein, a motif called the furin cleavage site is critical for the subunits of the spike protein (S1 and S2) to be cut apart by a protein-cutting tool on the surface of human cells known as furin (Johnson et al. 2020). Such cleavage allows the virus to fuse with the target cell's membrane, inject its genetic material into the cell, and cause the cell to generate new copies of the virus. The human furin protein will cut any protein chain that carries the motif amino acid sequence proline-arginine-arginine-alanine (PRRA). SARS2 is the only SARS-related betacoronavirus with a furin cleavage site, making it particularly optimized to target human cells (Wade 2021; Peacock et al. 2021). Yet, news media outlets have largely overlooked this molecular feature of the virus until recently; see Figure 2n.
At the S1/S2 junction, the 12-nucleotide sequence encoding the PRRA motif, which renders the protein chain susceptible to cleavage by furin and allows viral particles to fuse with the human cell membrane, is T-CCT-CGG-CGG-GC. This sequence has the unusual feature that the double arginine codon pattern, CGG-CGG, has never been found in any other betacoronavirus (Wade 2021).
This molecular characteristic also appears to have been absent from news media discourse until recently; see Figure 2o. Perhaps as a result of the unusual molecular features of SARS-CoV-2 discussed above, some news media outlets have only recently started to mention the possibility that the virus might have been manipulated in a lab; see Figure 2p.

High Frequency Analysis of the COVID-19 Lab-Leak Hypothesis Prevalence in News Media
Figure 2 only allows us to observe that the prevalence of the lab-leak hypothesis in news media content markedly increased in May and June of 2021. To visualize higher-resolution dynamics around this period, we replicate the previous analysis using weekly frequency counts for a set of key target words denoting the lab-leak hypothesis theme. Figure 3 shows that the prevalence in news media of the lab-leak hypothesis theme increased during the month of May, spiked in the last week of that month, and then decreased gradually as the month of June progressed. There also appear to be milder peaks in the prevalence of this topic in mid-February and in the week at the end of March/beginning of April. Figure 3 also illustrates that not all news media outlets have manifested a spike in the prevalence of the lab-leak hypothesis theme in their textual content. Instead, the increased prevalence has been driven mainly by a few outlets such as Fox News, The New York Post, The Wall Street Journal, and The Washington Post.
To examine the thematic prevalence of the lab-leak hypothesis in news media at an even finer temporal granularity, we next analyze daily frequency counts of target words in news media content from 1 January 2021 to 30 June 2021; see Figure 4. We leverage the usage peaks identified in Figure 3 to guide a search for potentially relevant events around those dates that could have plausibly influenced media coverage of the lab-leak hypothesis.
We have identified and highlighted in Figure 4 six such potentially relevant events. The first three are the beginning of the World Health Organization's (WHO) field visit to China to investigate the origins of the pandemic, the WHO team's visit to the Wuhan Institute of Virology, and the end of the field visit to China. The next event corresponds to the publication of the WHO report on its Wuhan field visit investigation, which called for further studies and reiterated that all hypotheses about COVID-19 origins remain open (World Health Organization 2021).
The next relevant event concerns Nicholas Wade, a former science reporter at the New York Times, and his publication of "The origin of COVID: Did people or nature open Pandora's box at Wuhan?" on 5 May 2021 (Wade 2021). In his article, Wade enumerated what he considered substantial evidence pointing in the direction of the lab-leak hypothesis, although he acknowledged that no definite proof existed yet for either the natural emergence or the lab-leak hypotheses.
The final event that was highly likely to influence press coverage of the lab-leak hypothesis concerns the U.S. president, Joe Biden, ordering the U.S. intelligence community on 26 May 2021 to further investigate the origins of the COVID-19 virus and report back to him within 90 days (REUTERS 2021).
A structural break analysis using the Chow test (Chow 1960), Bonferroni-adjusted for multiple comparisons, was carried out to determine whether the regression coefficients prior to each event highlighted in Figure 4 differed from the regression coefficients after the event (window size = 14). The test reached statistical significance (p < 0.05) for the Biden event on 26 May 2021 for two sets of words (‡ markers in Figure 4). Paired t-tests (Bonferroni-adjusted for multiple comparisons) of overall prevalence prior to and after each highlighted event in Figure 4 (window size = 14) reached statistical significance (p < 0.05) for Nicholas Wade's article of 5 May 2021 for the three sets of words analyzed (* markers in Figure 4). The near absence of the lab-leak hypothesis theme in the days prior to Nicholas Wade's publication and the subsequent gradual pickup in media interest provide suggestive, but ultimately circumstantial, evidence that this particular event could have triggered increased media coverage of the lab-leak hypothesis.
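The two tests can be illustrated with a short, dependency-free sketch. Several details are assumptions made here, since the text does not specify them: the regression is taken to be a linear time trend (intercept and slope) fitted to the daily frequency series, the event day is assigned to the post-event window, and the paired t-test pairs the i-th day before the event with the i-th day after it. The function names are hypothetical. The sketch returns test statistics only; a p-value can be obtained from the corresponding F or t distribution, for example with `scipy.stats.f.sf(f_stat, k, dof)`.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line_rss(y):
    """Least-squares fit of y = a + b*t (t = 0, 1, ...); returns the residual sum of squares."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = mean(y)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return sum((v - (intercept + slope * t)) ** 2 for t, v in enumerate(y))


def chow_f_statistic(series, event_index, window=14, k=2):
    """Chow F statistic for a break at event_index, using `window` days on each side.

    k is the number of regression coefficients (intercept and slope here).
    Returns (F, (k, residual degrees of freedom)).
    """
    if event_index < window or event_index + window > len(series):
        raise ValueError("not enough observations around the event")
    pre = list(series[event_index - window:event_index])
    post = list(series[event_index:event_index + window])
    rss_pooled = fit_line_rss(pre + post)                # one regression over both windows
    rss_split = fit_line_rss(pre) + fit_line_rss(post)   # separate regressions
    dof = len(pre) + len(post) - 2 * k
    f_stat = ((rss_pooled - rss_split) / k) / (rss_split / dof)
    return f_stat, (k, dof)


def paired_t_statistic(before, after):
    """t statistic of the paired differences (after minus before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni adjustment."""
    return alpha / n_tests
```

With a window of 14 days on each side and two coefficients per regression, the F statistic has (2, 24) degrees of freedom. A large value indicates that fitting separate trends before and after the event explains the series substantially better than a single pooled trend.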

Latent Associations about COVID-19 Origins in News Media Content
While frequency analysis of a corpus of text can be informative about the thematic prevalence of certain topics, the technique is also limited in that it does not analyze the context in which words are being used. To overcome this limitation, we next performed an analysis of news media articles using word2vec embedding models (Mikolov et al. 2013) to measure the strength with which sets of words are associated. We built embedding models for each month between October 2019 and June 2021 using news media articles published in the corresponding month. This allows for chronological measurements of the strength with which sets of words are associated (i.e., appear in the vicinity of each other or in similar contexts) in news media articles. Figure 5 shows the results of our analysis. The first row of Figure 5 contains subplots drawn with a dashed orange line; it serves only to illustrate that the technique produces sensible results, including detecting the temporal occurrence of events such as Donald Trump's infection and subsequent positive test for COVID-19, or Joe Biden winning the Democratic Party nomination for the U.S. presidency around March/April 2020 and his subsequent victory in the U.S. presidential election of November 2020.
The second row of Figure 5 illustrates the decreasing prevalence of the intermediate host hypothesis in news media as shown by the declining association of coronavirus with potential intermediate hosts such as pangolins, civets, and bats, as well as peak association of the virus with wet markets between January and February of 2020.
The third row of Figure 5 shows how news media have recently started to associate terms such as covid or coronavirus more strongly with a lab-leak or the Wuhan Institute of Virology. Subplot k in the figure also shows that during 2020, the media mostly did not report on the gain-of-function research experiments conducted at the WIV since 2015 (Daszak 2014; Wade 2021). Similarly, associations about the dangerous nature of gain-of-function research are stronger in mid-2021 than at any time in 2020. What all these associations have in common is that their strength peaked around May and June of 2021.
Subplots m and n in Figure 5 show that research grants from NIAID/NIH and gain-of-function research at the WIV have become more strongly associated in the last few months. Subplot o also illustrates how in recent journalistic discourse, the lab-leak hypothesis is often associated with terms denoting racism. The embedding method used does not allow discerning whether such associations occur because news media content suggests that it is racist to propose the lab-leak hypothesis, or because some writers are arguing that the lab-leak hypothesis was not properly scrutinized previously owing to concerns about accusations of racism or in an attempt not to stir up racist sentiment. The pattern could also be the result of a combination of all the previous possibilities. Finally, associations between the WIV and a potential laboratory accident have become more prominent in 2021, although the relationship was also briefly common in April and May of 2020; see subplot p in Figure 5.
Figure 5. Chronological plots of monthly association strength between sets of terms in embedding models derived from news media content.
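One way to quantify the association strength between two sets of terms in a monthly embedding model is the cosine similarity between the mean vectors of each set. The sketch below illustrates that idea under stated assumptions: the exact metric behind Figure 5 is not specified in the text, the function names are hypothetical, and a plain dictionary mapping words to vectors stands in for a trained model (in gensim, `model.wv` supports the same `word in ...` and `...[word]` operations, and `KeyedVectors.n_similarity` computes a comparable quantity).

```python
from math import sqrt


def mean_vector(words, vectors):
    """Average the vectors of the words that are present in the vocabulary."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        raise ValueError("none of the words are in the vocabulary")
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def association_strength(set_a, set_b, vectors):
    """Cosine similarity between the mean vectors of two sets of terms."""
    return cosine(mean_vector(set_a, vectors), mean_vector(set_b, vectors))


def association_over_time(set_a, set_b, vectors_by_month):
    """Association strength per month, given {"YYYY-MM": word-to-vector mapping}."""
    return {
        month: association_strength(set_a, set_b, vectors)
        for month, vectors in sorted(vectors_by_month.items())
    }
```

Words missing from a given month's vocabulary are skipped, which matters here because terms that fall below the minimum frequency count in a month have no vector in that month's model.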

Discussion
As of July 2021, the origin of the SARS-CoV-2 virus that caused the COVID-19 pandemic remains a mystery. The results presented here suggest that for most of 2020, popular news media outlets mostly ignored or downplayed the possibility of a lab-leak as a reason for the virus outbreak. Perhaps the publication in prestigious scientific journals, such as The Lancet and Nature Medicine, of opinion pieces dismissing the lab-leak hypothesis (Andersen et al. 2020;Calisher et al. 2020) played a role in media attitudes towards this hypothesis.
Alternatively, the fact that early in the pandemic the U.S. president at the time, Donald Trump, advocated for the lab-leak hypothesis without providing explicit evidence (BBC News 2020) could have contributed to prominent news outlets avoiding this hypothesis, due in part to the notorious mutual animosity between news media organizations and Trump.
If mutual hostility between news media outlets and former U.S. President Donald Trump partly prompted prestigious outlets during 2020 to downplay a plausible hypothesis about COVID-19 origins, the ability of news media institutions to reliably investigate and report on politically loaded events in an unbiased manner could be called into question.
In May 2021, however, something caused the prevalence of the lab-leak hypothesis in news media discourse to spike in some, but not all, of the studied media outlets. Although our analysis cannot provide conclusive evidence about what caused the shift, it provides hints about potential sources for the structural break in the prevalence of the lab-leak hypothesis in news media discourse.
If Nicholas Wade's essay did indeed trigger the sudden increase in attention of at least some outlets to the lab-leak hypothesis, it is striking that news media scrutiny of the causal roots of a pandemic that has killed, as of July 2021, more than 4 million people worldwide (COVID-19 Data Repository CSSE-JHU 2021) could depend on the investigative reporting of a single individual. It is also noteworthy that most of the interest in the lab-leak hypothesis seems to have emerged in right-leaning news outlets (Fox News, the Wall Street Journal, and the New York Post), although the Washington Post, a prominent left-leaning newspaper (AllSides Media Bias Ratings 2019), has also manifested in its content an increasing prevalence of the lab-leak hypothesis.
Interpreting the results of this work through media agenda-setting theory suggests the potential of news media to shape public perceptions about important current events. A valid criticism of this interpretation is that with the growing influence of the Internet and social media, people can find information through sources other than traditional news media outlets, making it harder for news media to uniquely set agendas. Nonetheless, the majority of the population still trusts news media reporting on the COVID-19 pandemic (De Coninck et al. 2020). Such trust, however, appears to be eroding (Fletcher et al. 2020). Fair and honest reporting on current events is essential to maintain trust between the public and news media organizations. If additional supporting evidence for the lab-leak hypothesis eventually surfaces while supporting evidence for the natural emergence hypothesis fails to materialize, the downplaying of the lab-leak hypothesis in mainstream news outlets during the first 16 months of the pandemic could contribute to further erosion of public trust in news media.
Funding: This research received no external funding.

Informed Consent Statement: Not applicable.
Data Availability Statement: The analysis scripts, monthly word embedding models of news media content, list of written articles' URLs analyzed, and the counts of target words and total words per article are provided in electronic form at: https://zenodo.org/record/5108976 (accessed on 17 August 2021).

Conflicts of Interest: The author declares no conflict of interest.