Using Machine Learning for Pharmacovigilance: A Systematic Review

Pharmacovigilance is a science that involves the ongoing monitoring of adverse drug reactions to existing medicines. Traditional approaches in this field can be expensive and time-consuming. The application of natural language processing (NLP) to analyze user-generated content has been hypothesized to be an effective supplemental source of evidence. In this systematic review, a broad and multi-disciplinary literature search was conducted involving four databases. A total of 5318 publications were initially found. Studies were considered relevant if they reported on the application of NLP to understand user-generated text for pharmacovigilance. A total of 16 relevant publications were included in this systematic review. All studies were evaluated to have medium reliability and validity. Across all types of drugs, 14 publications reported positive findings with respect to the identification of adverse drug reactions, providing consistent evidence that natural language processing can be used effectively and accurately on user-generated textual content published to the Internet to identify adverse drug reactions for the purpose of pharmacovigilance. The evidence presented in this review suggests that the analysis of textual data has the potential to complement the traditional system of pharmacovigilance.


Introduction
In drug development, there exists a strong tension between accessibility and safety. While drugs can effectively cure diseases and improve life [1], the required process of research and development of drugs is expensive, and pharmaceutical companies have a high stake in yielding a profit on their investment [2]. This increases the urgency to make effective drugs available to the public. At the same time, medicines can also induce adverse drug reactions (ADRs) that may result in mortality, and the identification of such reactions demands thorough and time-consuming testing of the drug's safety, drastically increasing the time-to-market of new drugs [3]. In fact, the potential consequences of ADRs are significant. In the European Union (EU), five percent of hospital admissions and almost 200,000 deaths were caused by ADRs in 2008, and the associated societal cost totaled EUR 79 billion [4].
A system that applies tools and practices from the research field of pharmacovigilance was introduced to alleviate this tension [5]. This system performs ongoing monitoring of ADRs of existing drugs [5]. It also minimizes the time-to-market of effective drugs, and it allows their long-term safety post market authorization to be continuously examined [6]. Overall, pharmacovigilance is the cornerstone in the regulation of drugs [1]. The traditional system that applies pharmacovigilance is very expensive and often fails to monitor ADRs experienced by users if these are not reported to the authorities, pharmaceutical companies, or medical professionals [6,7]. The reporting of these ADRs is important because it may help to protect public health [1].
To address this research gap, the purpose of this study is to review the existing evidence on, and the effectiveness of, natural language processing to understand user-generated content for the purpose of pharmacovigilance. According to our review, it is worthwhile to analyze user-generated content that has already been published to the Internet, to proactively and automatically identify ADRs, without relying on users to actively report those cases to the authorities, pharmaceutical companies, or medical professionals.

Background
A severe limitation in the process of bringing new drugs to market is the potential of drugs to cause ADRs. While pre-clinical and clinical studies include testing drug safety and potential ADRs, only a few hundred to a few thousand participants are included in these studies [1]. In addition, these studies are performed under controlled clinical conditions that may not represent every real-world situation or circumstance [1]. Therefore, not all ADRs may have been identified prior to making the drug generally available [3]. As long as the benefits outweigh the potential costs, it is generally considered unethical to withhold an effective drug from the general public at this stage, thus accepting that some people may develop ADRs in the future.

Traditional Approaches
To counteract the limitations of pre-clinical and clinical testing, existing drugs on the market are constantly being monitored for safety and ADRs [3]. The long-term monitoring of existing drugs is crucial, because potential ADRs, interactions, and other risk factors, may only emerge many years or even decades after the drug initially received market authorization [3].
The long-term monitoring of drug safety beyond market authorization is named pharmacovigilance [5], which is defined by the WHO as "the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other medicine-related problem" [3]. As such, the application of tools and practices from pharmacovigilance by public health authorities results in a pro-active system that is intended to promote and protect public health [1]. It involves a wide array of activities, including data collection about drug safety, obligating pharmaceutical companies and medical professionals to report ADRs, inviting patients to report experiences with drugs, and the detection of signals that may indicate drug safety issues [6]. There are, however, significant costs associated with the processing and administration of the reported cases of ADRs [7]. In addition, the current system of collecting data to monitor drug safety is suboptimal because end-users are not obliged to report cases of ADRs [6].

Improving Pharmacovigilance Using Natural Language Processing
Over the past 15 years, many technological innovations have enabled the storage, processing, and analysis of big data [42][43][44]. In particular, with the emergence of Web 2.0 and social media platforms, there has been a significant increase in user-generated content published to the Internet [8][9][10]. Among others, vast amounts of textual data are generated on blogs, forums, and social media [45]. Similarly, there have been significant developments in artificial intelligence that resulted in powerful methods and algorithms for NLP [46], which enabled the processing and understanding of human-generated text [45,47]. This opened new opportunities for mining social media and analyzing texts [48]. In recent years, these fields experienced significant innovations [49].
Text mining is frequently defined as the analysis of textual data, such as unstructured or semi-structured text, with the purpose of extracting hidden patterns and information [45]. As such, it combines data mining with NLP [43]. Text mining emerged from a need to analyze large amounts of text containing human language, which can be mined for insights that facilitate data-driven decision-making [45]. However, many standard data mining techniques cannot be applied to unstructured textual data. Therefore, text mining is applied as pre-processing for unstructured data [50,51], e.g.:
• tokenization: the separation of text into smaller units, such as words, characters, or sub-words (n-grams);
• transformation of cases, such as uniform lowercasing or uppercasing;
• stop word removal: the removal of words carrying very little meaning, such as pronouns; and
• stemming: reducing inflected words to their word stem.
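The pre-processing steps listed above can be sketched in a few lines of Python. The stop word set and the suffix-stripping stemmer below are deliberately simplistic illustrations (real pipelines typically rely on curated resources such as NLTK or spaCy), not the implementations used by the reviewed studies.

```python
import re

# Illustrative resources only; production pipelines use curated lists.
STOP_WORDS = {"i", "the", "a", "an", "my", "me", "and", "of"}
SUFFIXES = ("ing", "ness", "ed", "s")

def preprocess(text: str) -> list[str]:
    # Tokenization + lowercasing in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix stripping as a stand-in for stemming.
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The dizziness and headaches started after my dosing"))
# → ['dizzi', 'headache', 'start', 'after', 'dos']
```

The output tokens are now structured enough for conventional data mining algorithms, at the cost of the irregularities a real stemmer would handle more gracefully.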
Once text mining has been applied to extract structured data from a semi-structured or unstructured source, conventional data mining algorithms can subsequently be used to process and analyze these structured data further to yield the valued insights [47]. The complexity that is involved with analyzing unstructured textual data and in particular its irregularities, makes the process of text mining a difficult area in artificial intelligence [52].
The applications of text mining are numerous, and include:
• assigning affective states to text (sentiment analysis) [43];
• the discovery of associations between words or other tokens [53];
• the summarization of documents [43];
• clustering texts according to some similarity measurement [54,55];
• classification of text into various categories [56,57];
• predicting words or other lexical units (as part of a word processor or chatbot) [58,59]; and
• the extraction of concepts, entities, and the relationships between them [43].

Materials and Methods
This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [60,61]. However, most of the reviewed papers do not contain controlled trials, comparable statistical analysis, or methodology, making it impossible to apply the complete PRISMA 2020 checklist to this review. Therefore, we only applied items on the checklist if they were applicable, and thus our review does not conform completely to the guideline. The quality of this systematic review was evaluated using the PRISMA Checklist in Appendix A.

Search Strategy
To cover all related disciplines, a broad selection of databases was made that included PubMed, Web of Science, IEEE Xplore, and ACM Digital Library. These databases were selected because they index studies in a wide range of fields. Specifically, PubMed was included because it predominantly indexes research in the fields of public health, healthcare, and medicine. IEEE Xplore and ACM Digital Library were searched because these databases index publications in information technology and information management. Web of Science was included because it is a very large database that indexes studies in various disciplines, and also because, owing to its multidisciplinary nature, there exists a consensus among researchers that it is good practice to include this database in systematic reviews. We recognize that Google Scholar is increasingly used as a source for systematic reviews, but that there exists a debate among scientists about its appropriateness [62]. A common argument against Google Scholar is that its algorithm for ranking the relevance of publications is updated frequently, thereby making the search results unreliable for reproduction [62]. Therefore, we have excluded Google Scholar as an information source in this review. Furthermore, because it is common in information technology and computational linguistics that materials are not published in peer-reviewed journals but are instead frequently published only in conference proceedings or conference papers, both journal articles and conference proceedings were included in this systematic review. It was not expected that this would have a significant effect on the reliability of studies, because conference proceedings and conference papers are also subject to a peer-review process.
For each of the included databases, an optimized search strategy was formulated (see Appendix B). The search query was constructed from two blocks. The first block addresses the concept of NLP, and the second block includes search terms related to health surveillance. The systematic literature search was performed on 25 March 2020 for all databases. All publications appearing up to this point were considered in the search. After the databases were searched, the method for de-duplication by Bramer et al. [63] was applied to identify and remove duplicate studies. Studies eligible for this systematic review were selected in three successive phases, as visualized in Figure 1.

Study Selection
First, the titles were screened for the presence of subjects related to public health monitoring or public health surveillance. The screening was deliberately broad to prevent the unnecessary exclusion of studies. Therefore, not only terms such as "adverse drug reactions" were considered relevant, but titles containing more indirect terms such as "medication outcomes" were also included. In addition, if it was ambiguous whether a study was relevant or not, it was still included for further screening in the next phase. Studies that were not relevant were omitted from the library.
Second, the abstracts were screened for information related to NLP, public health monitoring, public health surveillance, and pharmacovigilance. The keywords provided with the manuscript for indexing purposes were also screened for these concepts. This phase was also intended to be broad. For example, abstracts were considered relevant if they contained terms directly related to pharmacovigilance, such as "adverse effects of drug treatment", but also indirectly related terms such as "drug reviews". Drug reviews involve an extensive process where experimental drugs are assessed on safety (e.g., toxicity and side effects) and effectiveness using various clinical trials [1]. Drug reviews are mainly performed by pharmaceutical companies which document their tests for review by the European Medicines Agency (EMA) or the U.S. Food and Drug Administration (FDA) [1]. Post market authorization, existing drugs and their side effects are continuously being monitored by medical doctors, laboratories, pharmaceutical organizations, and health authorities [1]. Publications were still included if their relevance was considered ambiguous, for further screening in the next phase. Irrelevant manuscripts were removed.
Third, the full text was downloaded and read. Studies were considered relevant if they investigated the application of NLP to understand text with the purpose of public health monitoring or public health surveillance within the discipline of pharmacovigilance. Eligible studies reported on the application and results of using computational linguistics to identify adverse drug reactions from textual sources, such as forums, patient records, and social media.

Inclusion and Exclusion Criteria
Overall, studies were only eligible for inclusion in this systematic review if they aimed to identify adverse drug reactions using computational linguistics. Both journal articles and conference proceedings were included. In addition, we only included studies if written in the English language, irrespective of the language of the dataset of user-generated content that these studies utilized. There were no limitations regarding the publication date, institutional affiliation, or the journal that these studies were published in.
Publications were excluded if they only reported on a framework instead of an actual application. For example, authors may suggest a process to investigate adverse drug reactions using computational linguistics without actually applying it and evaluating the results. Likewise, studies were excluded if their manuscript was not written in English, irrespective of the language that their dataset was written in. Furthermore, if the same publication was published in different formats, for example as both a conference proceeding and a journal article, only the journal article was retained.

Reliability and Validity
The included publications were evaluated on quality by assessing their reliability and validity. This assessment was performed using the strategy of Kampmeijer et al. [64]. A publication was evaluated as reliable if it reported a thorough and repeatable description of the performed process, methods, data collection, and data analysis [64]. A reliable study provides a well-defined, transparent, and consistent protocol for the collection, processing, and analysis of data. It facilitates researchers to establish its consistency and identify potential flaws in the research design. In addition, a reliable study provides sufficient details such that it can be reproduced. Under the same conditions, if repeated, a reliable study will produce similar findings.
A publication was evaluated as valid if the reported findings are logically the result of the described process, methods, data, and analyses that were used to find that result [64]. The validity of a study refers to its accuracy; the study indeed measures what it intended to measure. This evaluation requires that researchers are transparent about their protocol. Assessing the validity of a study involves identifying that the reported results and conclusions in a study are consistent with the study hypotheses and research design. In addition, it involves the verification that the reported findings from one study are comparable to other studies utilizing a comparable research protocol. Consistency within one paper, or consistency in reported findings among comparable papers, is an indication that a paper should be evaluated as valid.
The reliability and validity of studies were assessed qualitatively and discussed among researchers until consensus was achieved. Studies with "low" reliability did not provide a well-defined, transparent, and consistent protocol, or provided this information in insufficient detail. Conversely, studies with "high" reliability provided this information thoroughly. In all intermediate cases, these studies were marked as "medium" reliability. Similarly, studies with "low" validity had either limited consistency between the hypotheses and research design on the one hand, and the results and conclusions on the other hand, or their findings were not consistent with studies that utilized a comparable research design. Conversely, if the consistency was high and their findings were comparable to similar studies, the validity of these studies was marked as "high". In all intermediate cases, the validity was considered "medium".
Although the quality assessment was rigorous and based on scientific standards, all identified publications were included in the systematic review.

Data Analysis
Thematic analysis was used to analyze the included publications [65]. The themes were defined by the objectives of the present systematic review. The following themes were extracted from the full text: authors, year of publication, type of drugs, data source, sample size, users, unique users, origin of users, average number of followers, years of data collection, horizon of data collection, software used, techniques and classifiers used, outcome, drugs studied, result, and a description of the result.
For each publication, the extracted themes were processed into an extraction matrix. This matrix was used to synthesize and narratively present the extracted information by theme. The results are summarized and presented using tables.

Results
The procedure that was followed for the selection of studies is presented in Figure 1. The 5318 initial records, which were identified through an inclusive search strategy, were assessed for the presence of duplicate publications. Consequently, 744 duplicate results were identified and omitted. Therefore, the literature search yielded 4574 unique studies. According to the thorough study selection strategy described in Section 3.2, the first selection phase identified 4347 irrelevant studies to be excluded. In the second phase, the remaining 227 results were screened by reading the abstract; 206 irrelevant studies were omitted. For example, studies were excluded when not mentioning "adverse effects of drug treatment" or other related but rather general terms such as "drug reviews" in the abstract. In the third phase, the full text of the remaining 21 publications was read. Five studies were considered irrelevant because they did not investigate the application of computational linguistics to understand text, with the purpose of public health monitoring or public health surveillance within the discipline of pharmacovigilance.
The reliability and validity of all studies were assessed as medium. While all studies were performed reasonably well, they failed to be entirely transparent about their process, methodology, used software, and the used technologies and classifiers. As is presented in the detailed overview of the characteristics of the included studies in Appendix C, all studies failed to disclose a complete overview of crucial information.

General Characteristics
A general description of the publications included in the analysis is provided in Table 2. To establish differences between them, various characteristics of these publications were compared and the observed differences are presented in Table 3.
Only one study by Adrover et al. [31] discussed the geographical location of users that published the included posts. They report that the users were from Canada, South Africa, the United Kingdom, or the United States [31]. The remaining 15 studies did not disclose the geographical location of users.
Studies that reported the date of publication of the included content (50%) were used to compute the time horizon of the collected data [31,32,36,40,41,66,68,69]. In 13% of studies, this horizon was one calendar year [36,66]. In 6% of the studies, this horizon was between two and five years [31]. In another 6%, the horizon ranged between 6 and 10 years [32]. In four studies (25%), the time horizon could not be computed because the data were published within the same calendar year [40,41,68,69]. The remaining studies (50%) did not present the date on which the included data were published [33][34][35][37][38][39]67,70]. Therefore, the horizon of data collection could not be computed.
The studies also differed with respect to the number of drugs for which ADRs were investigated. Most studies (31%) included posts concerning 20 or more drugs [36,38,39,67,68], followed by 25% that studied between five and nine drugs [32,41,66,69]. Two studies (13%) included between ten and fourteen drugs [40,70], while only one study (6%) addressed fewer than five drugs [31]. No studies included between 15 and 19 drugs. The remaining 25% of studies did not disclose the number of drugs that were investigated [33][34][35]37].

Table 2. General description of publications included in the analysis.


Employed Methods
The studies reported vast differences in the software that was used (see Table 3). In total, 27 software products were discussed. Different studies often used alternative products of the same type. For example, although some studies used Tweepy (3%) [41], Twitter REST API (3%) [37], or Twitter4J (3%) [36] to retrieve data from Twitter using different programming languages, these software products can all be aggregated under the type Twitter API. By frequency, the spelling checker Hunspell was used most often (9%) [33,34,37]. This tool can reduce the dimensionality of NLP tasks by mapping various spellings of a single word to one canonical form. For example, misspellings (e.g., "organiezation") and variant types of English such as "organisation" would then be transformed into the U.S. spelling "organization". Notably, three studies (9%) did not present the software that was used [39,67,70].
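The dimensionality reduction that dictionary-based spelling normalization provides can be illustrated with a minimal stand-in; the variant map below is a hypothetical example for illustration, not Hunspell's actual mechanism or API.

```python
# Hypothetical variant map: misspellings and regional spellings are
# collapsed onto one canonical (U.S.) form, so that downstream models
# see a single feature per word.
CANONICAL = {
    "organiezation": "organization",  # misspelling
    "organisation": "organization",   # British spelling
    "diarrhoea": "diarrhea",          # British spelling of a common ADR
}

def normalize(tokens):
    """Replace each token with its canonical form, if one is known."""
    return [CANONICAL.get(t, t) for t in tokens]

print(normalize(["organisation", "organiezation", "reported", "diarrhoea"]))
# → ['organization', 'organization', 'reported', 'diarrhea']
```

A real spelling checker additionally suggests corrections for unseen misspellings via edit distance and affix rules, which a static map cannot do.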
The studies reported consistent evidence that NLP can be successfully used to understand text for the purpose of pharmacovigilance. A vast majority of studies (88%) presented positive results [31][32][33][35][36][37][38][39][40][41][67][68][69][70]. These studies claimed that ADRs could indeed be extracted accurately and reliably from content published by patients. These studies often compared the accuracy of the adverse effects that were extracted from posts against a list of known ADRs, for example, from the medical package insert or from other reliable sources. Only 13% of the studies reported neutral findings [34,66]. No studies reported a negative result.
For example, Nikfarjam et al. [39] addressed the challenges of patients who use informal language and express medical concepts in lay terms, which may obstruct utilizing patients' digital content for public health monitoring for pharmacovigilance. They developed a system for extracting ADRs from highly informal and unstructured content from both Twitter and a website for drug reviews. They find that it is possible, with reasonably high performance, to extract complex medical concepts from these platforms. In addition, Sampathkumar et al. [68] aimed to identify mentions about ADRs from user-generated content that were published by drug users to healthcare-related forums and to use those mentions for the purpose of pharmacovigilance. They find that it is possible to extract those mentions about ADRs with good performance and that the mentions are consistent with known ADRs. Likewise, Wu et al. [41] developed a pipeline "for collecting, processing, and analyzing tweets to find signals" about ADRs. They were able to identify several well-known ADRs. Furthermore, Yang et al. [70] mined the associations between drugs and the ADRs that patients published to online healthcare communities. These identified associations were then compared to ADR alerts from the U.S. Food and Drug Administration. They find that association mining appears to be promising for the detection of ADRs.
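The association mining reported by Yang et al. can be illustrated with the classic support and confidence measures computed over drug and ADR co-occurrences. The posts below are hypothetical toy data, and the exact measures and thresholds used in [70] may differ.

```python
# Each post is the set of drug and symptom mentions extracted from one
# message (hypothetical data, for illustration only).
posts = [
    {"drugX", "nausea"},
    {"drugX", "nausea", "headache"},
    {"drugX", "headache"},
    {"drugY", "nausea"},
]

def support(itemset):
    """Fraction of posts containing every item in the itemset."""
    return sum(itemset <= p for p in posts) / len(posts)

def confidence(drug, adr):
    """Of the posts mentioning the drug, the fraction also mentioning the ADR."""
    return support({drug, adr}) / support({drug})

# The rule drugX -> nausea holds in 2 of the 3 posts mentioning drugX.
print(round(confidence("drugX", "nausea"), 2))  # → 0.67
```

Rules whose support and confidence exceed chosen thresholds would then be compared against known ADR lists, as the reviewed studies did against FDA alerts.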
For named diseases, only one study observed neutral effectiveness for oncological drugs [66]. Specifically, Bian et al. [66] developed an approach to identify drug users and extract ADRs concerning cancer from tweets. They used high-performance computing to analyze more than two billion tweets using NLP, and classified tweets using support vector machines. They, however, find that their classification model had limited performance.
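Linear classifiers such as the support vector machines used by Bian et al. separate posts by a weighted sum of word features. The sketch below substitutes a much simpler perceptron as a stand-in (not the authors' pipeline), trained on hypothetical toy posts, to show the shape of such a classifier.

```python
from collections import Counter

# Hypothetical posts labeled 1 (mentions an ADR) or 0 (does not).
train = [
    ("this drug gave me terrible nausea", 1),
    ("severe headache after the second dose", 1),
    ("works great no problems at all", 0),
    ("very happy with this medication", 0),
]

weights = Counter()  # one learned weight per word

def score(text):
    """Weighted sum of word features; positive means ADR mention."""
    return sum(weights[w] for w in text.split())

# Perceptron training: nudge weights toward misclassified examples.
for _ in range(10):
    for text, label in train:
        pred = 1 if score(text) > 0 else 0
        if pred != label:
            for w in text.split():
                weights[w] += 1 if label == 1 else -1

print(score("nausea and headache"))  # → 2 (positive: classified as ADR)
```

An SVM additionally maximizes the margin of the decision boundary and typically operates on weighted bag-of-words vectors, but the prediction rule has the same linear form.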
There were no significant inconsistencies in the effectiveness of NLP to identify ADRs with respect to the outcome that was under investigation in each study. For a vast majority of the outcomes, ADRs could indeed be established. For the outcome of reported ADRs for cancer, only neutral effectiveness was reported [66]. Although most of the studies that investigated the outcome of reported ADRs observed positive findings, one study found a neutral result [34]. There were no notable differences in the effectiveness of NLP with respect to the number of drugs that were considered in the publications.

Discussion
The purpose of this study was to review the existing evidence on the methods and effectiveness of natural language processing to understand user-generated textual content for the purpose of pharmacovigilance.
The first main finding of this systematic review is that the potential of applying NLP for pharmacovigilance looks very promising. Studies included in this systematic review consistently reported positive results on the effectiveness and accuracy of NLP applied to user-generated digital content to identify ADRs. For all diseases investigated, a vast majority of studies reported that the identified ADRs were consistent with the information provided on the medical package insert. For example, Ru et al. [40] analyzed and compared content about patient-reported medication outcomes concerning asthma, cystic fibrosis, rheumatoid arthritis, and type 2 diabetes, published to the social media sites PatientsLikeMe.com, WebMD.com, Twitter, and YouTube. They find that, although advising that more emphasis should be placed on developing more reliable methods for NLP and text mining, social media platforms are indeed suitable and complementary sources for investigating outcomes of medication. In addition, Mishra et al. [38] assessed pharmaceutical oncological drug reviews authored by patients and published to user forums, and they compared the reported drug-related issues with official drug labels. They used support vector machines to classify sentiments about ADRs with good performance. Furthermore, Akay et al. [32] investigated how user-generated content on a depression-related forum can be used for modeling the exchange of information between users about drug-related treatments for depression. They find that it is possible to use NLP on this content to identify the ADRs of these drugs in greater detail, and they confirmed the identified ADRs using medical literature about these drugs. In another study, Adrover et al. [31] studied the potential of using user-generated tweets to identify ADRs for HIV-related drugs and to gauge patient sentiments about these drug treatments. They find that the identified ADRs are consistent with well-recognized toxicities.
The second main finding of this systematic review is that some studies also correctly identified ADRs that were previously unknown. In [68], the authors identified increased suicide risk for the drug Singulair, and an increased risk of acute pancreatitis and altered kidney function for the drug Byetta. In both cases, the FDA required the manufacturing companies to conduct an investigation and update the labels of the products with a warning indicating these risks. This result suggests that NLP may also be used to identify novel ADRs, and it may serve as a suitable tool for pro-active and real-time health surveillance using remote sensing. As such, this automated system may identify trends and periodically report novel insights to policymakers and public health professionals, and it may support and enable these professionals to initiate timely interventions to protect public health and to maintain, and perhaps even increase, the quality of healthcare [32].
Although this systematic review finds that the application of computational linguistics may be effective for pharmacovigilance, it does not suggest that the traditional system is obsolete and should be replaced by computational linguistics. Instead, it may be worthwhile to apply computational linguistics as a complementary tool to retrieve and process adverse drug reactions that end-users share on the Internet. This information and the insights may be combined with the adverse drug reactions that are reported by medical professionals, with the purpose to achieve a more complete overview of adverse drug reactions. Similarly, computational linguistics may be a suitable tool for the real-time monitoring of adverse drug reactions.

Limitations
The systematic literature search and study selection were performed by only one researcher. Therefore, it was not possible to establish inter-rater reliability. However, the process of study selection and the included studies were discussed by the authors until consensus was achieved. Nevertheless, it may be possible that this has introduced selection bias, but this could not be verified.
All studies that were included in this systematic review were found to have a medium quality. Quality was operationalized using reliability and validity. The process and assessment of the quality of the included studies were discussed by the authors until consensus was achieved.
It was observed that studies often failed to report information on the themes that were used to extract relevant information (see Appendix C). Consequently, the absence of these data limited the analyses of the studies with respect to their methodology, sample characteristics, and the utilized techniques. In addition, various publications failed to disclose information on the diseases that the identified ADRs were related to. We highly recommend that authors disclose this information: it has great value, among other things, for establishing the quality of these studies and enabling replicability, and the research community would also benefit if methodologies and processes were explained in greater detail.
Because it is common in the fields of information technology and computational linguistics that findings are not published in peer-reviewed journals but are instead often published only in conference proceedings or conference papers, both types of publications were included in this systematic review. This is important, because it may be possible that the process of peer-review is more rigorous when performed by journals compared to conferences. It was also observed that a significant number of included publications were not journal articles.
A common and unavoidable limitation of studies using user-generated content from social media (including all data sources listed in this review) is the inherent noisiness and bias of these data sources. In the context of the reviewed studies, users are usually unqualified to assess their symptoms, they might exaggerate mild or unrelated symptoms, they might just follow popular trends in criticism, or be biased or even malicious (e.g., seeking to discredit competition). These factors have to be taken into account when judging the effectiveness of the proposed tools.
As we discussed in the methods section, due to the interdisciplinary nature of the reviewed studies and their limitations, it was impossible to apply the complete PRISMA 2020 checklist in this systematic review. Therefore no registration was made in PROSPERO either. We acknowledge this as a limitation of this work.

Conclusions and Future Outlook
Our findings suggest that the user-generated textual content that drug users share on the Internet may have the potential to augment or enhance the expensive and time-consuming traditional system of pharmacovigilance. NLP may thus be used to automate the monitoring of ADRs using content that users publish to social media and other digital platforms [40]. This novel tool may not only contribute to improving public health and the quality of healthcare, but it could potentially also reduce the costs and processing time associated with conducting pharmacovigilance. Therefore, this tool may be a viable solution that addresses two of the most prominent challenges of traditional pharmacovigilance, namely the reduction of the high associated costs [7] and the inclusion of ADRs as experienced by the end-users [6]. We strongly suggest that policymakers consider the automated analysis of user-generated textual content for the purpose of pharmacovigilance, and that they employ it ethically, responsibly, and with great respect for the privacy and anonymity of these drug users.
We acknowledge that the architecture of such a tool for pharmacovigilance can only be described in the abstract in the present paper. On an abstract level, the tool would subscribe to the Twitter API and filter Tweets based on keywords related to ADRs. Relevant Tweets would then be subjected to sentiment analysis and processed using NLP techniques. Based on the informational needs of researchers, further processing and analysis can be performed to extract key information on ADRs for medicines of interest and the related sentiment expressed by drug users.
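The keyword-filtering and sentiment-analysis steps described above can be illustrated with a minimal sketch. The keyword list, sentiment lexicons, and example posts below are illustrative assumptions on our part, not the implementation of any reviewed study; a production tool would use a curated ADR vocabulary and a trained sentiment model rather than a hand-written lexicon.

```python
# Hypothetical sketch of the abstract ADR-monitoring pipeline:
# (1) filter posts by ADR-related keywords, (2) score sentiment.
# Keywords and lexicons are illustrative assumptions only.

ADR_KEYWORDS = {"nausea", "headache", "rash", "dizziness", "side effect"}

NEGATIVE_LEXICON = {"awful", "terrible", "bad", "worse", "nausea", "pain"}
POSITIVE_LEXICON = {"great", "better", "relief", "improved"}

def filter_posts(posts):
    """Keep only posts mentioning at least one ADR-related keyword."""
    return [p for p in posts
            if any(k in p.lower() for k in ADR_KEYWORDS)]

def sentiment_score(post):
    """Naive lexicon-based sentiment: (#positive - #negative) tokens."""
    tokens = post.lower().split()
    pos = sum(t in POSITIVE_LEXICON for t in tokens)
    neg = sum(t in NEGATIVE_LEXICON for t in tokens)
    return pos - neg

posts = [
    "This drug gave me terrible nausea and headache",
    "Feeling great today, the weather is lovely",
    "Mild dizziness but otherwise improved a lot",
]

relevant = filter_posts(posts)          # second post is filtered out
scored = [(p, sentiment_score(p)) for p in relevant]
```

In a deployed system, the `posts` list would instead be fed by a streaming subscription (e.g., to the Twitter API), and the scored output would be aggregated per medicine of interest.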
Author Contributions: Study design: P.P. and M.L.; implementation: P.P.; analysis: P.P.; writing and editing: P.P. and A.B. All authors have read and agreed to the published version of the manuscript.
Funding: Parts of this work have been funded by the Applied AI Digital Innovation Hub North project, funded by the European Regional Development Fund.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The resources used in this systematic review are available from the first author.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. PRISMA 2020 Checklist
Table A1. PRISMA 2020 checklist (item descriptions and the pages on which they are addressed; n. a. = not applicable).

Selection process (item 8): Specify the methods used to decide whether a study met the inclusion criteria of the review, including how many reviewers screened each record and each report retrieved, whether they worked independently, and if applicable, details of automation tools used in the process. — p. 4

Data collection process (item 9): Specify the methods used to collect data from reports, including how many reviewers collected data from each report, whether they worked independently, any processes for obtaining or confirming data from study investigators, and if applicable, details of automation tools used in the process.

Data items (item 10a): List and define all outcomes for which data were sought. Specify whether all results that were compatible with each outcome domain in each study were sought (e.g., for all measures, time points, analyses), and if not, the methods used to decide which results to collect.

Data items (item 10b): List and define all other variables for which data were sought (e.g., participant and intervention characteristics, funding sources). Describe any assumptions made about any missing or unclear information. — pp. 4, 5

Study risk of bias assessment (item 11): Specify the methods used to assess risk of bias in the included studies, including details of the tool(s) used, how many reviewers assessed each study and whether they worked independently, and if applicable, details of automation tools used in the process. — p. 5

Effect measures (item 12): Specify for each outcome the effect measure(s) (e.g., risk ratio, mean difference) used in the synthesis or presentation of results. — n. a.

Synthesis methods (item 13a): Describe the processes used to decide which studies were eligible for each synthesis (e.g., tabulating the study intervention characteristics and comparing against the planned groups for each synthesis (item #5)).

Synthesis methods (item 13b): Describe any methods required to prepare the data for presentation or synthesis, such as handling of missing summary statistics, or data conversions. — p. 6

Synthesis methods (item 13c): Describe any methods used to tabulate or visually display results of individual studies and syntheses. — p. 6

Synthesis methods (item 13d): Describe any methods used to synthesize results and provide a rationale for the choice(s). If meta-analysis was performed, describe the model(s), method(s) to identify the presence and extent of statistical heterogeneity, and software package(s) used.

Synthesis methods (item 13e): Describe any methods used to explore possible causes of heterogeneity among study results (e.g., subgroup analysis, meta-regression). — n. a.

Synthesis methods (item 13f): Describe any sensitivity analyses conducted to assess robustness of the synthesized results. — n. a.

Reporting bias assessment (item 14): Describe any methods used to assess risk of bias due to missing results in a synthesis (arising from reporting biases). — n. a.

Certainty assessment (item 15): Describe any methods used to assess certainty (or confidence) in the body of evidence for an outcome. — p. 5

Study selection (item 16a): Describe the results of the search and selection process, from the number of records identified in the search to the number of studies included in the review, ideally using a flow diagram.

Study selection (item 16b): Cite studies that might appear to meet the inclusion criteria, but which were excluded, and explain why they were excluded. — n. a.

Study characteristics (item 17): Cite each included study and present its characteristics. — p. 7

Risk of bias within studies (item 18): Present assessments of risk of bias for each included study. — p. 20

Results of individual studies (item 19): For all outcomes, present, for each study: (a) summary statistics for each group (where appropriate) and (b) an effect estimate and its precision (e.g., confidence/credible interval), ideally using structured tables or plots.

Availability of data, code and other materials (item 27): Report which of the following are publicly available and where they can be found: template data collection forms; data extracted from included studies; data used for all analyses; analytic code; any other materials used in the review. — n. a.

Appendix B. Search Strategies
The systematic literature search was performed on 25 March 2020 for all databases. All publications appearing up to this point were considered in the search.

"All Metadata":artificial intelligence OR "All Metadata":machine learning OR "All Metadata":text mining OR "All Metadata":computational linguistics OR "All Metadata":natural language processing OR "All Metadata":nlp OR "All Metadata":sentiment analysis OR "All Metadata":word embedding* OR "All Metadata":natural language toolkit OR "All Metadata":nltk

Block 2: Health Surveillance
"All Metadata":public health surveillance OR "All Metadata":health surveillance OR "All Metadata":public health monitoring OR "All Metadata":health monitoring

Filters:
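The block structure of this search strategy — field-tagged terms joined with OR — can be assembled programmatically when adapting it to other databases. The field tag below mirrors the strategy above; the helper function and variable names are our own illustration, not part of any database API.

```python
# Sketch of assembling one OR-connected, field-tagged query block,
# as used in the search strategy above. Helper names are illustrative.

def build_block(terms, field='"All Metadata"'):
    """Join search terms into a single OR-connected, field-tagged clause."""
    return " OR ".join(f"{field}:{t}" for t in terms)

block2_terms = [
    "public health surveillance",
    "health surveillance",
    "public health monitoring",
    "health monitoring",
]

query = build_block(block2_terms)
# query begins: "All Metadata":public health surveillance OR "All Metadata":health surveillance OR ...
```

Blocks built this way can then be combined with AND to form the full strategy, matching the Block 1 / Block 2 structure shown above.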