Analyzing Public Reactions, Perceptions, and Attitudes during the MPox Outbreak: Findings from Topic Modeling of Tweets

The recent outbreak of the MPox virus has resulted in a tremendous increase in the usage of Twitter. Prior works in this area of research have primarily focused on the sentiment analysis and content analysis of these Tweets, and the few works that have focused on topic modeling have multiple limitations. This paper aims to address this research gap and makes two scientific contributions to this field. First, it presents the results of performing Topic Modeling on 601,432 Tweets about the 2022 Mpox outbreak that were posted on Twitter between 7 May 2022 and 3 March 2023. The results indicate that the conversations on Twitter related to Mpox during this time range may be broadly categorized into four distinct themes - Views and Perspectives about Mpox, Updates on Cases and Investigations about Mpox, Mpox and the LGBTQIA+ Community, and Mpox and COVID-19. Second, the paper presents the findings from the analysis of these Tweets. The results show that the theme that was most popular on Twitter (in terms of the number of Tweets posted) during this time range was Views and Perspectives about Mpox. This was followed by the theme of Mpox and the LGBTQIA+ Community, which was followed by the themes of Mpox and COVID-19 and Updates on Cases and Investigations about Mpox, respectively. Finally, a comparison with related studies in this area of research is also presented to highlight the novelty and significance of this research work.


Introduction
Monkeypox (Mpox), caused by the monkeypox virus, which belongs to the Poxviridae family, Chordopoxvirinae subfamily, and Orthopoxvirus genus [1], is a re-emerging zoonotic disease. The Mpox virion is brick-like in shape and ranges from 200 nm to 250 nm in length, and the virus has a large genome of about 200 kilobase pairs encoding approximately 190 proteins. Two clades of Mpox, clade 1 and clade 2, show a 0.5% genomic difference, with clade 1 having a 1-12% case fatality rate and clade 2 having a 0.1% case fatality rate [2,3]. The first case of human Mpox was recorded in a 9-month-old boy in the Democratic Republic of the Congo (DRC) in 1970 [1]. After the first case in 1970, 59 cases were reported in West and Central Africa in the next decade [4], with a 17% mortality rate in children under 10 [5,6]. The World Health Organization (WHO) monitored Mpox cases post-1980. Between 1981 and 2017, there were multiple outbreaks of Mpox in DRC due to clade 1, with the fatality rate being between 1 and 12%, primarily due to inadequate health systems [7]. From 2003 to 2022, a few travel-related cases were reported outside endemic countries, but the number of cases was not very high [8]. However, a global outbreak of the Mpox virus started on 7 May 2022 [9], and on 23 July 2022, the WHO declared Mpox a Global Public Health Emergency (GPHE) [10]. This outbreak is linked to a new lineage, B.1 (clade 2b), that has a higher mutation rate. This outbreak has resulted in 110 countries reporting about 87,000 cases and 112 deaths so far [11].
The Mpox virus can enter hosts via respiratory or dermal routes. As a result, infection may occur in airway epithelial cells, keratinocytes, fibroblasts, and endothelial cells [12,13]. Currently, no FDA-approved treatments for Mpox exist. At present, three vaccines (JYNNEOS, ACAM2000®, and APSV) are available in the United States. Out of these three vaccines, JYNNEOS has been approved by the FDA for smallpox and monkeypox in adults at high risk [14]. Tecovirimat is effective against Orthopoxvirus in animals but untested in human Mpox [15]. Other potential treatments for Mpox include VIGIV, cidofovir, and brincidofovir, which have proven in vitro and animal efficacy but limited availability [16]. Since the first case of this outbreak, various policy-making bodies of the world have taken measures to contain the spread of the Mpox virus. For instance, New York City Health + Hospitals (NYC H + H), an integrated healthcare system, has been pivotal in the fight against emerging pathogens and viruses, including Mpox, in the New York City region of the United States [17]. However, the uncertainty and challenges surrounding asymptomatic transmission of viruses such as Mpox and the absence or inadequacy of appropriate and effective transmission-based personal protective equipment (PPE), such as N95 masks, face shields, gowns, and extended cuff examination gloves, may result in a possible resurgence of Mpox [18]. The fifth meeting of the WHO's International Health Regulations (IHR) Emergency Committee on the Multi-Country Outbreak of Mpox took place on 10 May 2023. At this meeting, the committee acknowledged remaining uncertainties about the disease regarding modes of transmission in some countries, the poor quality of some reported data, and the continued lack of effective countermeasures in the African countries where Mpox occurs regularly. On 15 May 2023, the US Centers for Disease Control and Prevention (CDC) warned about the potential resurgence of Mpox cases in the US [19].
In today's Internet of Everything era, social media platforms provide a seamless and virtual means for users to connect, communicate, and collaborate with each other [20]. Among the numerous social media platforms that have been used by the public in the last decade and a half, Twitter has been highly popular amongst all age groups. Twitter has over 368 million active monthly users [21]. Twitter stands out as the social media site that journalists choose to use [22] and is among the sites with the highest global adoption rates [23]. On average, 500 million Tweets are published on Twitter each day [24]. Twitter is also the most popular social media platform for news and current events, and the usage of Twitter by Generation Z users is growing approximately 30% faster than for Instagram [25]. Furthermore, on average, users on Twitter who post more than five Tweets follow 405 accounts [26]. Therefore, Twitter has been highly popular amongst researchers from different disciplines for the investigation of a wide range of research problems. During the virus outbreaks that have taken place in the last few years, such as COVID-19, H1N1, flu, Ebola, Zika virus, Middle East Respiratory Syndrome (MERS), measles, and West Nile virus, just to name a few [27], topic modeling of Tweets helped to understand the perception, preparedness, response, views, and opinions of the general public during these virus outbreaks. In a generic manner, topic modeling is a methodology that comprises different algorithms that identify, comprehend, and annotate the thematic structure in a collection of documents [28]. In addition to investigating the perception, preparedness, response, views, and opinions of the general public during these virus outbreaks, the field of topic modeling has had multiple interdisciplinary applications in the last few years, as can be seen from the recent works in topic modeling in bioinformatics [29,30], software engineering [31,32], cryptocurrency [33,34], smart-home research [35,36], human behavior modeling [37,38], analysis of cognitive impairment [39,40], education research [41,42], biology [43,44], and medicine [45,46].
Prior works related to the mining and analysis of Tweets about the MPox outbreak have primarily focused on the sentiment analysis and content analysis of Tweets. Since the first case of this outbreak, only a couple of studies have been published that have focused on topic modeling of the Tweets. However, these studies have multiple limitations centered around (1) the limited time range of the Tweets that were analyzed, (2) the limited number of Tweets that were analyzed, (3) the elimination of a topic from the study, and (4) the lack of reporting of metrics to discuss the working or the accuracy of the topic modeling approaches. Addressing this research gap serves as the main motivation for this work. The rest of the paper is organized as follows. A comprehensive review of recent advances in this area of research is presented in Section 2. Section 3 describes the methodology that was followed for this work. It is followed by Section 4, which presents the results and highlights the novel findings of this work. Section 4 is followed by a conclusion section that summarizes the contributions of this paper and outlines the scope for future work in this area.

Literature Review
Mining and analysis of Tweets for the investigation and exploration of different research questions, with a specific focus on information diffusion [47], sentiment analysis [48,49], text categorization [50], spam detection [51], privacy issues [52], trust issues [53], user migration [54], and topic detection [55], has been of significant interest to researchers from a wide range of disciplines in the last few years. While misinformation presents some challenges [56], it is still crucial to understand web behavior on Twitter and its implications for real-world decision making. Therefore, this section is divided into three parts. Section 2.1 presents a brief review of mining and analysis of Tweets for interdisciplinary research. Section 2.2 discusses the recent advances related to the study and analysis of Tweets in the field of healthcare. Section 2.3 outlines the latest works that have focused on the mining and analysis of Tweets about the MPox outbreak.

A Brief Review of Recent Works Related to the Mining and Analysis of Tweets for Interdisciplinary Research
Abu Samah et al. [57] proposed a web-based dashboard to visualize customer sentiment towards Malaysian airline companies, as expressed on Twitter. Similar to the tourism industry, the entertainment industry and politics have also benefitted from the mining and analysis of Tweets. Bodaghi et al. [58] explored the differences between the web behaviors of actors who spread fake news and those who spread the truth on Twitter. This study showed that although fake news has much better modularity and intra- to inter-links ratio, truth tweeters generally have higher page rank centrality. Collins et al. [59] conducted a comparative analysis of over 2000 tweets from the first two U.S. presidents of the "Twitter era", Barack Obama and Donald Trump, to assess the impact of their online correspondence on America's image abroad and its soft power. An interesting finding of this study was that the tone of presidential tweets can have significant and divergent effects on the perception of the U.S., even internationally. Beyond the border of the United States, Berrocal-Gonzalo et al. [60] found that politainment, the phenomenon of trivializing political information for entertainment purposes, manifested on Twitter during the Spanish general elections in April 2019, which prevented the creation of meaningful debates or interactions surrounding the elections on the platform.
The Big Data of conversations on Twitter is also considerably informative regarding the unremitting controversies underlying human society and human rights. Following the United States Supreme Court's decision in Dobbs vs. Jackson Women's Health Organization that overturned abortion rights, Chang et al. [61] presented a large-scale Twitter dataset collected on the abortion rights debate in the U.S. Similarly, Peña-Fernández et al. [62] analyzed the polarization produced in social media debates regarding the rights of transgender people and the views of feminists, specifically the use of the term "TERF" (trans-exclusionary radical feminist) on Twitter. The findings of this work show that online debates are poorly inclusive, suggesting the prevalence of community isolation. Goetz et al. [63] analyzed sentiments in food-security-related Tweets in the U.S. during the early stages of the COVID-19 pandemic, from which they found that keywords of negative emotions were statistically correlated with the contemporaneous food insufficiency rates reported in the Household Pulse Survey. Tao et al. [64] conducted a comparative study of posts on Twitter and Weibo regarding the Russian-Ukrainian War to reveal the differences in the topics of posts between the two platforms and to call for humanitarianism and peace. Researchers have also utilized Twitter as well as studied Tweets in the context of various needs and challenges faced by different diversity groups. In the context of this focus area of research, the needs faced by the elderly population, such as falls [65,66], indoor localization issues [67,68], behavior-related problems [69,70], memory issues [71,72], inability to perform activities of daily living (ADLs) [73,74], and user experience with different technologies [75,76], have been widely investigated. Yavuz et al. [65] developed a system for detecting falls, which also comprised virtual support. The virtual support feature included a warning about the fall and specific location-related information of the person who experienced the fall being communicated to caregivers via Twitter messages. Tamplain et al. [69] developed a methodology that could analyze Tweets and identify self-reports of dyspraxia. An investigation of the evolution and changes in human behavior related to natural disasters in Japan, as expressed on Twitter, was conducted by Lu et al. [73]. Finally, analysis of Tweets has revealed several novel insights underlining the public discourse related to the education industry [77,78], alcohol industry [79,80], tobacco industry [81,82], sports industry [83,84], assisted living technologies [85,86], context-driven technologies [87,88], web behavior analysis [89,90], and emerging works in robotics [91,92]. As can be seen from this brief review, mining and analysis of Tweets holds the potential for the investigation of research questions across different disciplines. In the context of the different virus outbreaks that the world has witnessed in the last decade and a half, healthcare-based research using Tweets has emerged as a crucial utilization of this vast potential of mining and analyzing Tweets. Some recent works are briefly reviewed in Section 2.2, which is followed by a dedicated review of recent works related to the analysis of Tweets about the MPox outbreak.

A Brief Review of Recent Works Related to the Mining and Analysis of Tweets for Healthcare Research
The Big Data of conversations and information exchange from social media platforms, specifically Twitter, has the potential to improve the efficiency, accuracy, and coverage of healthcare systems in different geographic regions. Cevik et al. [93] analyzed the sentiments of Tweets about Parkinson's disease. Kesler et al. [94] applied topic modeling and qualitative content analysis to comments related to cancer-related cognitive impairment (CRCI), revealing the importance of coping mechanisms. Klein et al. [95] demonstrated the potential of using Twitter data to identify the start and end of the 40-week prenatal period, making it a valuable resource for observational studies on potential risk factors in pregnancy. Thakur et al. [96,97] developed a framework to address loneliness and social isolation in the elderly using Twitter data. Thackeray et al. [98] analyzed how Twitter is used during Breast Cancer Awareness Month (BCAM). The findings showed that although organizations and celebrities emphasized fundraisers, early detection, and diagnoses, the general public published the majority of the tweets that did not promote any specific preventive behavior.
Studies in this field have shown that Tweets provide a timely and authentic record of public perception and understanding of human health crises. To investigate how the Dengue epidemic was reflected on Twitter, Gomide et al. [99] proposed an active surveillance methodology based on volume, location, time, and public perception. The analysis found a high correlation between the number of cases reported by official statistics and the number of Tweets posted during the same period. The work performed by Radzikowski et al. [100] reported that news organizations had a higher impact than health organizations in communicating health-related information. By examining the use of Twitter data to track public sentiment and disease activity related to H1N1 or swine flu, Signorini et al. [101] found that estimates of influenza-like illness derived from Twitter chatter accurately tracked reported disease levels from governmental organizations. With the realization of Twitter's ability to provide real-time information, Sugumaran et al. [102] collected and analyzed tweets tagged with #WestNileVirus and #WNV and concluded that unusually higher temperatures and mosquito activities led to an increase in Tweet numbers about the West Nile virus. To inform health promotion efforts, Porat et al. [103] analyzed the content and sources of popular Tweets related to a diphtheria case in Spain. The most notable conclusion from their study was their suggestion for healthcare organizations to collaborate with popular journalists, news outlets, and science authors to address public concerns and misinformation through the outlet of social media platforms such as Twitter. As can be seen from this brief review, the study, analysis, and interpretation of multimodal characteristics of Tweets about different virus outbreaks has helped in the timely advancement of research in the field of healthcare. Section 2.3 specifically highlights the recent advances in this area of research that focused on the investigation of the public discourse on Twitter about MPox.

Review of Recent Works Related to the Mining and Analysis of Tweets about MPox
Knudsen et al. [104] studied 262 Tweets to describe the risk of Mpox to students. The results showed that credentialed Twitter users were 4.6 times more likely to Tweet inaccurate information about MPox. Zuhanda et al. [105] studied 5000 Tweets about MPox posted on 5 August 2022 to perform sentiment analysis. The results showed that 51.92% of the Tweets had a negative sentiment and 48.08% of the Tweets had a positive sentiment. Ortiz-Martínez et al. [106] performed a study of the top 100 Tweets about MPox posted on 24 May 2022. The findings showed that most of the Tweets were posted by informal individuals or groups (60%), followed by healthcare or public health (32%), and news outlets or journalists (8%). The work by Rahmanian et al. [107] involved studying 384,560 Tweets posted between 16 May 2022 and 22 May 2022. The results indicated that most of these Tweets were posted by individuals from the United States and Canada. Cooper et al. [108] studied Tweets containing the word "monkeypox" posted between 1 May 2022 and 23 July 2022. The results showed that a total of 48,330 Tweets were posted by individuals who identified as members or allies of the LGBTQ+ community. The work of Ng et al. [109] focused on studying 352,182 Tweets about MPox posted between 6 May 2022 and 23 July 2022. The authors performed topic modeling of these Tweets and derived three themes: concerns of safety, stigmatization of minority communities, and a general lack of faith in public institutions. Bengesi et al. [110] mined over 500,000 multilingual tweets related to MPox and performed sentiment analysis. Olusegun et al. [111] studied 800,000 Tweets about MPox and used NRCLexicon to predict and measure the emotional significance of each Tweet. Farahat et al. [112] performed opinion mining on a total of 8532 Tweets about MPox posted between 22 May 2022 and 5 August 2022. The results indicated that close to 50% of the Tweets were neutral. Sv et al. [113] studied Tweets about MPox posted between 1 June 2022 and 25 June 2022. The study was performed in two steps. In the first step, the authors analyzed a dataset of 556,402 Tweets about MPox and performed sentiment analysis of those Tweets. The results showed that 128,037 Tweets (23.01%) had a negative sentiment. Thereafter, only these 128,037 Tweets that had a negative sentiment were used for topic modeling in the second step of their study. The results of topic modeling showed that among the Tweets about MPox that had negative sentiments, there was a range of topics that were represented, such as deaths caused by the MPox virus, the severity of the virus, lesions caused by the virus, whether the virus is airborne, vaccines for the virus, and whether the virus will lead to the next pandemic. Mohbey et al. [114] developed a neural-network-based model to perform opinion mining of Tweets about MPox, and the accuracy of the model was found to be 94%. A dataset of Tweets about the MPox outbreak was developed by Nia et al. [115]. The work of Iparraguirre-Villanueva [116] focused on the detection of polarity in conversations on Twitter about MPox. The results showed that 45.42% of people expressed neither positive nor negative opinions, whereas 19.45% expressed negative and fearful feelings about MPox. AL-Ahdal [117] studied 15,936 Tweets about MPox posted by individuals from Germany. The results showed that the public displayed an impersonal feeling toward MPox.
As can be seen from this review and the research questions that were investigated in these papers, the major focus areas of these works have been sentiment analysis and content analysis. Only a couple of works have focused on the topic modeling of Tweets about MPox. However, those works have multiple limitations centered around (1) the limited time range of the Tweets that were analyzed, (2) the limited number of Tweets that were analyzed, (3) the elimination of a topic from the study, and (4) the lack of reporting of metrics to discuss the working or the accuracy of the topic modeling approaches (discussed in detail in Section 4). This study aims to address this research gap. Therefore, topic modeling of 601,432 Tweets about the 2022 MPox outbreak that were posted on Twitter between 7 May 2022 and 3 March 2023 was performed in this study. Section 3 outlines the step-by-step process that was followed for the system design and implementation. The results and novel contributions of this work are presented and discussed in Section 4.

Methodology
This section presents the methodology that was followed for the system design and implementation. This section is divided into three parts. In Section 3.1, a technical overview of RapidMiner [118] is presented, as RapidMiner was used for this work. Section 3.2 presents the description of the topic modeling architecture that was used in this work. Section 3.3 outlines the steps that were followed for the implementation of this topic modeling architecture, along with the specifics of the system design.

Technical Overview of RapidMiner
RapidMiner is a Data Science software platform that allows the development and implementation of different algorithms related to Machine Learning, Data Science, Artificial Intelligence, and Big Data. It enables its users to visually design data workflows and build predictive models using a graphical user interface (GUI). Applications developed in RapidMiner are referred to as "processes" that comprise one or more "operators". These "operators" may be built-in or user defined. The RapidMiner Studio consists of several built-in "operators" that may be directly used for the implementation of various tasks. There also exist certain "operators" in RapidMiner Studio that may be utilized to modify the functionality of other "operators". RapidMiner is developed on a client-server model with public and private cloud infrastructures. There is a free version and an enterprise version of RapidMiner Studio. The free version has a limit of 10,000 rows in a dataset for a "process". For this research work, the educational license of RapidMiner (available to researchers in academia upon request) was used in RapidMiner Studio 10.1.001. With the educational license, RapidMiner Studio can be used to process any number of rows in any dataset. The following represent a few notable characteristics of RapidMiner Studio [118,119]:
1. It supplies pre-built "operators" encompassing distinct functions that can be directly employed or customized for the creation and execution of algorithms and applications.
2. RapidMiner is developed using Java, which ensures that RapidMiner "workflows" retain the write once run anywhere (WORA) attribute of Java.
3. The platform permits the installation of various extensions to facilitate seamless connectivity and integration of RapidMiner "workflows" with other software and hardware environments.
4. Scripts developed in programming languages, such as Python and R, can also be imported into a RapidMiner "workflow" to supplement its functionalities.
5. The software enables the creation of new "operators" and effortless dissemination of the same within the RapidMiner community.
6. RapidMiner consists of "operators" that enable it to establish connections with social media platforms, such as Twitter and Facebook. Such connections facilitate the extraction of tweets, comments, posts, reactions, and other relevant social media interactions.
Similar to RapidMiner, there are other data science platforms, for example, WEKA [120] and MLC++ [121]. Both these platforms can be utilized for developing machine learning models. However, a major limitation of these platforms is that nesting of "operators" is not permitted. Such a limitation is not present in RapidMiner. In view of this and in view of the different features of RapidMiner related to the development and implementation of machine learning algorithms and models, RapidMiner was utilized for this research work.

Description of the Topic Modeling Architecture for System Design
Latent Dirichlet Allocation (LDA) [122] is a generative probabilistic model used in the field of Natural Language Processing and Machine Learning. It is commonly used for topic modeling, which is the task of identifying topics within a collection of documents. In LDA, the mixture of topics is derived from a consistent Dirichlet prior, which is the same across all documents. The procedure [123] for creating a corpus is outlined as follows (in this context, a smoothed LDA is considered):
1. Select a multinomial distribution $\phi_z$ for each topic $z$ from a Dirichlet distribution with parameter $\beta$.
2. For every document $d$, select a multinomial distribution $\theta_d$ from a Dirichlet distribution with parameter $\alpha$.
3. In document $d$, for each word $w$, select a topic $z$, such that $z \in \{1, \ldots, K\}$, from the multinomial distribution $\theta_d$.
4. Select $w$ from the multinomial distribution $\phi_z$.
Thereafter, the likelihood of generating a corpus can be represented, as shown in Equation (1).
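The four generative steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the toy vocabulary, the fixed document length, the symmetric hyperparameter values, and all function names are assumptions introduced only for the example.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index from a multinomial (categorical) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Step 1: a word distribution phi_z for every topic z, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus, thetas = [], []
    for _ in range(num_docs):
        # Step 2: a topic distribution theta_d for the document, drawn from Dirichlet(alpha).
        theta_d = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_length):
            z = sample_index(theta_d, rng)  # Step 3: pick a topic from theta_d
            w = sample_index(phi[z], rng)   # Step 4: pick a word from phi_z
            doc.append(vocab[w])
        corpus.append(doc)
        thetas.append(theta_d)
    return corpus, thetas, phi


# Hypothetical toy vocabulary and hyperparameters, chosen only for illustration.
VOCAB = ["mpox", "vaccine", "cases", "covid", "stigma", "outbreak"]
corpus, thetas, phi = generate_corpus(
    num_docs=3, doc_length=8, num_topics=2, vocab=VOCAB, alpha=0.5, beta=0.5
)
```

Topic modeling runs this process in reverse: given only the words of each document, it infers the distributions $\theta_d$ and $\phi_z$ that plausibly generated them.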
LDA considers the topic distribution as a k-parameter hidden random variable instead of a large set of features associated with the training data to address the problems of overfitting and generating new documents that were present in probabilistic latent semantic indexing (pLSI) [123]. To use language models for information retrieval in an LDA, an approach using the query likelihood model is used, where each document is scored by the likelihood of its model generating a query $Q$. This is shown in Equations (2) and (3). In Equation (2), $D$ represents a model for documents, $Q$ stands for the query, and $q$ denotes an individual term within the query $Q$. $P(Q|D)$ represents the probability of the document model generating the query terms, following the assumption of "bag-of-words". This assumption considers that terms are independent. $P(q|D)$ is specified by the document model with Dirichlet smoothing. In Equation (3), $P_{ML}(w|D)$ is the maximum likelihood estimate of word $w$ in the document $D$, $P_{ML}(w|coll)$ is the maximum likelihood estimate of the same word $w$ in the entire collection, and $\mu$ represents the Dirichlet prior. It is important to mention here that each topic in an LDA model represents a specific combination of words. However, a topic-level representation may not always be as accurate as non-topic models in Natural Language Processing, such as unigram or bigram analysis. As a result, directly implementing the LDA model may hurt the overall performance of information retrieval. So, in a prior work in this field, the original document model (Equation (3)) was combined with the LDA model to construct a new LDA-based document model, as shown in Equation (4). The LDA model introduces a novel document representation centered around topics. After obtaining the posterior estimates for $\theta$ and $\phi$, the word probability within a document can be computed using Equation (5), where $\hat{\theta}$ and $\hat{\phi}$ are the posterior estimates of $\theta$ and $\phi$, respectively [123].
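The scoring scheme described above can be illustrated with a short sketch. It assumes the usual form of Dirichlet smoothing, in which the document estimate is weighted by $N_d/(N_d + \mu)$ for a document of length $N_d$, and it assumes a linear mixing weight (called `lam` here) between the smoothed document model and the LDA word probability. The mixing weight, the toy documents, and the hand-set values of `theta` and `phi` are hypothetical and are not taken from this work.

```python
from collections import Counter


def p_ml(word, counts, total):
    """Maximum likelihood estimate: relative frequency of the word."""
    return counts[word] / total if total else 0.0


def p_dirichlet(word, doc_counts, doc_len, coll_counts, coll_len, mu):
    """Dirichlet-smoothed document model (assumed form of Equation (3))."""
    weight = doc_len / (doc_len + mu)
    return (weight * p_ml(word, doc_counts, doc_len)
            + (1.0 - weight) * p_ml(word, coll_counts, coll_len))


def p_lda(word, theta_d, phi):
    """Topic-based word probability: sum over topics of P(w|z) * P(z|d)."""
    return sum(theta_d[z] * phi[z].get(word, 0.0) for z in range(len(theta_d)))


def query_likelihood(query, doc, collection, theta_d, phi, mu=1000.0, lam=0.7):
    """Score a document by the product of per-term probabilities (bag-of-words)."""
    doc_counts = Counter(doc)
    coll_counts = Counter(w for d in collection for w in d)
    doc_len, coll_len = len(doc), sum(coll_counts.values())
    score = 1.0
    for q in query:
        smoothed = p_dirichlet(q, doc_counts, doc_len, coll_counts, coll_len, mu)
        # lam is a hypothetical interpolation weight between the two models.
        score *= lam * smoothed + (1.0 - lam) * p_lda(q, theta_d, phi)
    return score


# Toy collection with hand-set posterior estimates, for illustration only.
docs = [["mpox", "vaccine", "vaccine"], ["covid", "cases", "mpox"]]
phi = [{"mpox": 0.5, "vaccine": 0.5}, {"covid": 0.5, "cases": 0.5}]
theta = [[0.9, 0.1], [0.2, 0.8]]
scores = [query_likelihood(["vaccine"], d, docs, t, phi, mu=2.0)
          for d, t in zip(docs, theta)]
```

In this toy setup the second document never contains the query term, yet it still receives a non-zero score through the collection estimate and its topic mixture, which is the motivation for combining the two models.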
The LDA cannot be solved by direct inference. So, Gibbs sampling is utilized, which helps in the approximation of $\hat{\theta}$ and $\hat{\phi}$, with $\alpha$ and $\beta$ being the hyperparameters that determine the smoothness of the empirical distribution. Gibbs sampling involves performing an iteration over a set of variables $z_1, z_2, z_3, \ldots, z_n$, where, for each iteration, $z_i$ is sampled from $P(z_i \mid z_{\setminus i}, w)$. Every such iteration over all these variables is known as a Gibbs sweep. After a considerable number of iterations, the Gibbs sampling for an LDA produces samples from $P(z \mid w)$. This sampling may be performed by jointly resampling all the topics. In this approach, the scope of a Gibbs sweep is defined to be the hidden topic variables by taking into account both original and new documents. Initially, the sampling of the topic variables for the training set is performed such that they converge (without the new documents). Thereafter, the topic variables are randomly initialized and the sampling is performed again such that the model converges by taking into account all the documents. At this point, the topic distribution, $\theta_d$, can be estimated using a single Markov chain state, as shown in Equation (6). In this Equation, $n_{\cdot|d}$ represents the length of the document.
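A compact collapsed Gibbs sampler makes the notion of a sweep concrete. The sketch below is illustrative only: it uses the standard collapsed conditional for LDA with symmetric hyperparameters, a toy corpus of integer word ids, and a single chain, and it estimates $\theta_d$ from the final state as $(\alpha + n_{t|d}) / (K\alpha + n_{\cdot|d})$, which is the assumed form of the single-chain estimate. The function name and all parameter values are hypothetical.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    rng = random.Random(seed)
    n_td = [[0] * num_topics for _ in docs]               # topic counts per document
    n_wt = [[0] * num_topics for _ in range(vocab_size)]  # topic counts per word type
    n_t = [0] * num_topics                                # total tokens per topic
    z = []
    # Random initialization of every topic variable z_i.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_td[d][t] += 1
            n_wt[w][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(sweeps):  # each pass over all tokens is one Gibbs sweep
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # Remove the current token so that counts describe all other tokens.
                n_td[d][t_old] -= 1
                n_wt[w][t_old] -= 1
                n_t[t_old] -= 1
                # Unnormalized conditional weight of each topic for this token.
                weights = [
                    (alpha + n_td[d][t]) * (beta + n_wt[w][t]) / (beta * vocab_size + n_t[t])
                    for t in range(num_topics)
                ]
                t_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                z[d][i] = t_new
                n_td[d][t_new] += 1
                n_wt[w][t_new] += 1
                n_t[t_new] += 1
    # Single-chain estimate of theta_d from the final state of the sampler.
    return [
        [(alpha + n_td[d][t]) / (num_topics * alpha + len(doc)) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy corpus of word ids (vocabulary of size 4), for illustration only.
toy_docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta_hat = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
```

Every inner step evaluates a weight for all topics, which is the cost that the sparse formulation discussed next is designed to avoid.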
A higher accuracy may be obtained by computing the average of values generated by Equation (6) from multiple Markov chains [123]. In this research work, SparseLDA was implemented, as prior work has shown that it is 20 times faster than the traditional LDA [124]. In the SparseLDA framework, given an observed word type $w$, the probability of topic $z$ in document $d$ can be computed using Equation (7).

$$P(z = t \mid w) \propto (\alpha_t + n_{t|d}) \, \frac{\beta + n_{w|t}}{\beta V + n_{\cdot|t}} \tag{7}$$

Here, $n_{t|d}$ is the number of tokens in document $d$ assigned to topic $t$, $n_{w|t}$ is the number of tokens of word type $w$ assigned to topic $t$, $n_{\cdot|t}$ is the total number of tokens assigned to topic $t$, and $V$ is the size of the vocabulary. In this context, the process of sampling involves calculating the unnormalized weight, $q(z)$, for each topic, sampling a random variable $U \sim \mathcal{U}(0, \sum_z q(z))$, and finding the topic $t$ such that $\sum_{z=1}^{t-1} q(z) < U < \sum_{z=1}^{t} q(z)$. The procedure also necessitates the calculation of $q(z)$ for all the topics for the computation of the normalizing constant for the distribution, $\sum_z q(z)$, despite the fact that the probability mass is usually concentrated on a small set of topics. A simpler approach involves caching a significant portion of the computation needed to calculate the normalizing constant. Through the reorganization of terms within the numerator, Equation (7) can be partitioned into three distinct sections, as shown in Equations (9)-(11). Here, the first term is constant for all documents and the second term does not depend on the current word type. Moreover, $\sum_z q(z)$ corresponds to the sum across topics for each of the three components in Equation (8) [124].

$$P(z = t \mid w) \propto \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}} + \frac{n_{t|d}\, \beta}{\beta V + n_{\cdot|t}} + \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_{\cdot|t}} \tag{8}$$

$$s = \sum_t \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}} \tag{9}$$

$$r = \sum_t \frac{n_{t|d}\, \beta}{\beta V + n_{\cdot|t}} \tag{10}$$

$$q = \sum_t \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_{\cdot|t}} \tag{11}$$

This divides the full sampling mass into three "buckets". Now, $U \sim \mathcal{U}(0, s + r + q)$ can be sampled. If $U < s$, it would imply hitting the "smoothing-only" bucket. In that case, the process involves stepping through each topic and calculating and adding $\frac{\alpha_t \beta}{\beta V + n_{\cdot|t}}$ for that topic until the running sum is greater than $U$. For the document bucket, $s < U < (s + r)$, the process involves iterating through the set of topics that satisfies $n_{t|d} \neq 0$. The scenario of $U > (s + r)$ implies hitting the "topic word" bucket, and topics are considered such that $n_{w|t} \neq 0$. The calculation of the three parameters of the normalizing constant, $r$, $s$, and $q$, is not complicated. The constant $s$ only changes when the hyperparameter $\alpha$ is updated. In contrast, the constant $r$ is solely influenced by document topic counts. This permits the computation of $r$ once at the start of each document and its subsequent modification by subtracting and adding values related to the prior and current topics in each Gibbs update. This process takes constant time and is not dependent on the number of topics. The topic word constant, $q$, changes with the value of $w$, so old computations cannot be easily recycled. However, the performance can be significantly improved by splitting $q$ into two components, as shown in Equation (12). With this Equation, the coefficient $\frac{\alpha_t + n_{t|d}}{\beta V + n_{\cdot|t}}$ can be cached for every topic. Calculating $q$ for a specific $w$ involves performing a single multiplication operation for each topic where $n_{w|t} \neq 0$. Given that $n_{t|d} = 0$ for most topics within any document, the coefficients vector will predominantly comprise only $\frac{\alpha_t}{\beta V + n_{\cdot|t}}$. Therefore, this allows the optimization of the LDA model by caching these coefficients across documents, refreshing values only for topics with non-zero counts in the current document, and reverting these values to $\alpha$-only values upon finishing the sampling process for that document.
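The bucket decomposition can be checked numerically. The following sketch computes the per-topic terms of the three buckets for one word in one document, confirms that they add up to the dense weight of Equation (7), and maps a uniform draw to a topic by walking only the relevant bucket. The counts and hyperparameters are made-up toy values, the function names are hypothetical, and the caching of $s$, $r$, and the coefficients across updates is deliberately left out to keep the example short.

```python
def bucket_terms(alpha, beta, vocab_size, n_td, n_wt, n_t):
    """Per-topic terms of the smoothing-only (s), document (r), and topic-word (q) buckets."""
    denom = [beta * vocab_size + n for n in n_t]
    s_terms = [a * beta / d for a, d in zip(alpha, denom)]
    r_terms = [c * beta / d for c, d in zip(n_td, denom)]
    q_terms = [(a + c) * nw / d for a, c, nw, d in zip(alpha, n_td, n_wt, denom)]
    return s_terms, r_terms, q_terms


def sample_topic(u, s_terms, r_terms, q_terms):
    """Map a draw u from [0, s + r + q) to a topic using the three buckets."""
    s, r = sum(s_terms), sum(r_terms)
    if u < s:
        bucket, x = s_terms, u            # smoothing-only bucket
    elif u < s + r:
        bucket, x = r_terms, u - s        # document bucket: topics with n_td != 0
    else:
        bucket, x = q_terms, u - s - r    # topic-word bucket: topics with n_wt != 0
    running, last = 0.0, 0
    for t, mass in enumerate(bucket):
        if mass == 0.0:
            continue                      # zero-count topics carry no mass here
        running += mass
        last = t
        if x < running:
            return t
    return last                           # guards against floating-point round-off


# Toy counts for three topics; all values are illustrative assumptions.
ALPHA = [0.1, 0.1, 0.1]
BETA, V = 0.01, 100
N_TD = [3, 0, 1]     # tokens of the current document assigned to each topic
N_WT = [2, 0, 0]     # tokens of the current word type assigned to each topic
N_T = [50, 40, 30]   # total tokens assigned to each topic

s_terms, r_terms, q_terms = bucket_terms(ALPHA, BETA, V, N_TD, N_WT, N_T)
dense = [(a + c) * (BETA + nw) / (BETA * V + n)
         for a, c, nw, n in zip(ALPHA, N_TD, N_WT, N_T)]
```

With these toy counts, almost all of the mass sits in the topic-word bucket of the first topic, so most draws are resolved after inspecting a single term.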
If the values of α and β are small, most of the total mass is taken up by q. It has been found empirically that 90% of the samples belong to this "bucket". In a Dirichlet-multinomial distribution with small parameter magnitudes, the likelihood of the distribution is approximately proportional to the concentration of the counts on a small number of dimensions. It has also been found that the time per iteration is approximately proportional to the likelihood of the model. As the sampler approaches a region of high probability, the time per iteration decreases, leveling off as the sampler converges. The efficiency of this algorithm depends on its ability to rapidly identify the topics that satisfy the condition $n_{w|t} \neq 0$. In addition to this, as the expression in Equation (12) is approximately proportional to $n_{w|t}$, and as the evaluation of the terms can be stopped at the point where the sum of the terms exceeds U − (s + r), it is considered desirable to iterate over the non-zero topics in descending order of $n_{w|t}$ [124].
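The bucket-based sampling described above can be illustrated with a minimal Python sketch. It is not the RapidMiner or MALLET implementation; the function names (`bucket_masses`, `sample_topic`), the list-based count arrays, and the optional `u` argument (which makes the draw reproducible) are assumptions introduced here for illustration only.

```python
import random


def bucket_masses(alpha, beta, vocab_size, n_td, n_wt, n_t):
    """Return (s, r, q): the smoothing-only, document topic, and topic word masses."""
    topics = range(len(alpha))
    denom = [beta * vocab_size + n_t[t] for t in topics]
    s = sum(alpha[t] * beta / denom[t] for t in topics)
    r = sum(n_td[t] * beta / denom[t] for t in topics if n_td[t] != 0)
    q = sum((alpha[t] + n_td[t]) * n_wt[t] / denom[t] for t in topics if n_wt[t] != 0)
    return s, r, q


def _walk(u, topics, weight):
    """Accumulate weights over `topics` and return the first topic whose running sum exceeds u."""
    acc = 0.0
    chosen = None
    for t in topics:
        acc += weight(t)
        chosen = t
        if u < acc:
            break
    return chosen  # falls back to the last topic visited if rounding leaves u >= acc


def sample_topic(alpha, beta, vocab_size, n_td, n_wt, n_t, u=None):
    """Sample a topic for one token by first locating the bucket that U falls into."""
    topics = range(len(alpha))
    denom = [beta * vocab_size + n_t[t] for t in topics]
    s, r, q = bucket_masses(alpha, beta, vocab_size, n_td, n_wt, n_t)
    if u is None:
        u = random.uniform(0.0, s + r + q)
    if u < s:  # smoothing-only bucket: every topic contributes
        return _walk(u, topics, lambda t: alpha[t] * beta / denom[t])
    if u < s + r:  # document topic bucket: only topics present in the document
        doc_topics = [t for t in topics if n_td[t] != 0]
        return _walk(u - s, doc_topics, lambda t: n_td[t] * beta / denom[t])
    # topic word bucket: only topics with non-zero counts for this word,
    # visited in descending order of n_wt so the walk tends to stop early
    word_topics = sorted((t for t in topics if n_wt[t] != 0),
                         key=lambda t: n_wt[t], reverse=True)
    return _walk(u - (s + r), word_topics,
                 lambda t: (alpha[t] + n_td[t]) * n_wt[t] / denom[t])
```

In this sketch, `n_td`, `n_wt`, and `n_t` are per-topic lists holding the document topic counts, the counts of the current word in each topic, and the total token counts of each topic, respectively. The caching of coefficients across documents discussed above is omitted to keep the example short.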

Description of the System Design and Implementation
This section describes the system design and its implementation, as well as the dataset that was used for this research work. The dataset comprises 601,432 Tweet IDs of Tweets about MPox posted between 7 May 2022 and 3 March 2023 [125]. This dataset contains only Tweet IDs, and the standard procedure for working with such datasets is to hydrate the dataset to obtain the text of the Tweets and related information. However, this dataset was developed by the first author of this paper, so all the Tweets were already available and hydration of the Tweet IDs was not necessary. Figure 1 shows the system design in RapidMiner. This is a "process" that was developed in RapidMiner Studio 10.1.001 (with the Educational License) to set up this system, and this "process" comprises different "operators" with different functionalities. In this Figure, the "MPox_Tweets_Data" "operator" represents the Tweets from the dataset described earlier in this section. These Tweets were imported into RapidMiner Studio to develop this "process". All 601,432 Tweets about MPox posted between 7 May 2022 and 3 March 2023 were used for developing this LDA model. Thereafter, separate "operators" were developed in this RapidMiner "process", one for each step of the data preprocessing. For steps (a), (b), (c), and (d) of the data preprocessing, different regular expressions (RegEx) were developed and applied to define the functionalities of these "operators". After completion of the data preprocessing, an LDA model was developed and implemented in RapidMiner as per the architecture of the parallel topic model and SparseLDA (described in Section 3.2) by customizing and utilizing the "Extract Topics from Data (LDA)" operator in RapidMiner Studio 10.1.001.

The number of iterations for optimization was set to 1000, as this value has been used in related studies in this area of research [126,127], where 1000 iterations were suggested to reach a stable convergence. There are three hyperparameters associated with the LDA model that was implemented: the number of topics (k), the distribution of words per topic (β), and the distribution of topics per document (α). The distribution of topics per document (α) was set to 50/k and the distribution of words per topic (β) was set to 50/(number of words), which are standard considerations related to topic modeling using LDA [127]. The frequency of hyperparameter optimization was set to 10, which is the default value for this "operator" in RapidMiner Studio 10.1.001. As discussed in related studies in this area of research [128,129], the average coherence value of an LDA model serves as a key indicator for determining the optimal number of topics. So, this "process" (shown in Figure 1) was run repeatedly by varying the number of topics from 2 to 50, and the average coherence value of the model was computed and recorded for each of these runs. The results of running this "process" 49 times to deduce the optimal number of topics, as well as the specific topics that were identified in the Tweets, are discussed in Section 4. Figure 2 shows the order in which the different "operators" of this "process" were executed for each run of this "process".
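The model selection procedure described above (one run per candidate number of topics, followed by the selection of the run with the highest average coherence) can be sketched in Python as follows. This is only an illustration under stated assumptions: `fit_and_score` is a hypothetical callable standing in for one run of the RapidMiner "process", and the toy scoring function at the end is a placeholder, not the coherence values obtained in this work.

```python
def lda_hyperparameters(num_topics, vocabulary_size):
    """Return (alpha, beta) using the settings stated in the text: 50/k and 50/(number of words)."""
    alpha = 50 / num_topics
    beta = 50 / vocabulary_size
    return alpha, beta


def sweep_num_topics(candidate_ks, fit_and_score):
    """Run the (hypothetical) model once per candidate k and record its average coherence."""
    return {k: fit_and_score(k) for k in candidate_ks}


def select_num_topics(coherence_by_k):
    """Return the number of topics whose run produced the highest average coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)


# Candidate numbers of topics from 2 to 50 inclusive, i.e., 49 runs.
candidate_ks = list(range(2, 51))


def toy_score(k):
    """Placeholder score for demonstration only; NOT a value reported in this paper."""
    return -1.0 - abs(k - 4)


coherence_by_k = sweep_num_topics(candidate_ks, toy_score)
best_k = select_num_topics(coherence_by_k)
```

In practice, `fit_and_score` would train an LDA model with the given number of topics for 1000 iterations and return its average coherence value; here it is replaced by `toy_score` so that the sketch runs on its own.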

Results and Discussions
This section presents the results of this work. As stated in Section 3.3, the LDA model (shown in Figure 1) was run by varying the number of topics from 2 to 50 to determine the optimal number of topics based on the analysis of the average coherence value for each run. Table 1 presents the average coherence value of this LDA model from each run. This table was compiled from 49 runs of this LDA model, obtained by varying the number of topics from 2 to 50. An analysis of the same is presented in Figure 3. From Table 1 and Figure 3, the optimal number of topics was determined to be four, as the LDA model produced the highest coherence score for the same. It is worth mentioning here that negative values of coherence scores are not unusual for an LDA model that follows the system architecture described in this paper, as the formula for computing the coherence scores (for the LDA model used in this work) involves applying the logarithmic function to calculated probability values, as shown in Equation (13), where C stands for the average coherence score, N represents the number of top words of a topic, and wi and wj represent the ith and jth word, respectively. In addition to this, ε features in this Equation to prevent a scenario of the logarithmic function being applied to zero [130].
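To illustrate why such coherence scores are typically negative, the following minimal Python sketch computes a UMass-style average coherence from document co-occurrence. This particular formulation, the function name `average_coherence`, and the toy documents are assumptions made for illustration; the sketch may differ in its details from Equation (13) and from the computation performed internally by RapidMiner.

```python
import math
from itertools import combinations


def average_coherence(top_words, documents, eps=1e-12):
    """Average of log((P(w_i, w_j) + eps) / P(w_j)) over pairs of top words.

    Probabilities are estimated as document frequencies, and w_j is the
    higher-ranked word of each pair. Since P(w_i, w_j) <= P(w_j), each
    log term is (up to eps) at most zero, so the average is negative
    unless the words always co-occur.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_j, w_i in combinations(top_words, 2):
        p_j = prob(w_j)
        if p_j == 0:
            continue  # skip pairs whose conditioning word never occurs
        scores.append(math.log((prob(w_i, w_j) + eps) / p_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenized documents, invented for this example only.
toy_documents = [
    ["mpox", "vaccine", "case"],
    ["mpox", "case"],
    ["covid", "vaccine"],
    ["mpox", "covid"],
]
toy_score = average_coherence(["mpox", "case", "vaccine"], toy_documents)
```

The `eps` term plays the role described above: a pair of words that never co-occurs yields a large negative, but finite, contribution rather than an undefined logarithm.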
Furthermore, a recent study [131] followed a similar system architecture for the LDA model that was developed, and the authors of that work obtained negative values for the coherence scores for all the numbers of topics; the lowest value of the coherence score was reported to be about −11.5. After determining that the optimal number of topics for this LDA model was four, this model was run by setting the number of topics to four, and the characteristics from the output were observed and analyzed. For each Tweet, the RapidMiner "process" computed a confidence value for each topic and then predicted a topic for that Tweet based on the highest confidence value. This is shown in Figure 4. To avoid an output table comprising 601,432 rows, this Figure shows 17 rows (selected randomly) from this output table. In Figure 4, the attributes confidence(Topic_0), confidence(Topic_1), confidence(Topic_2), and confidence(Topic_3) represent the confidence values of each Tweet belonging to Topics 0, 1, 2, and 3, as computed by the LDA model. The last attribute in this Figure, Tweet_Text, shows the Tweet (after data preprocessing) that was analyzed. For each Tweet, the highest of these confidence values was used to predict the topic for the same. For instance, in the first row in Figure 4, it can be seen that confidence(Topic_0) has the highest value, so the predicted topic for this Tweet was Topic 0. In a similar manner, the LDA model predicted the topics for all the Tweets of this dataset. It is important to mention that every Tweet in this dataset was assigned a confidence value for every topic (Topic 0, Topic 1, Topic 2, and Topic 3), and the highest of these four confidence values was then utilized by the developed LDA model to determine the specific topic for a Tweet. The minimum, maximum, average, and standard deviation of these confidence values for each topic are presented in Table 2.
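The topic assignment rule and the summary statistics described above can be sketched in Python as follows. The confidence values in `toy_rows` are invented for illustration and are not taken from Figure 4 or Table 2; the function names are hypothetical, and the use of the sample standard deviation is an assumption, as the text does not state which variant was used.

```python
from collections import Counter
from statistics import mean, stdev


def predict_topic(confidences):
    """Return the index of the topic with the highest confidence value."""
    return max(range(len(confidences)), key=confidences.__getitem__)


def summarize_confidences(rows):
    """Return min, max, mean, and sample standard deviation of the confidences per topic."""
    summary = {}
    for topic in range(len(rows[0])):
        values = [row[topic] for row in rows]
        summary[topic] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stdev": stdev(values),  # assumption: sample (n - 1) standard deviation
        }
    return summary


# Each row holds confidence(Topic_0) ... confidence(Topic_3) for one Tweet (made-up values).
toy_rows = [
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.20, 0.30],
]
predicted = [predict_topic(row) for row in toy_rows]
tweets_per_topic = Counter(predicted)
summary = summarize_confidences(toy_rows)
```

Counting the predicted topics, as done with `tweets_per_topic`, gives the number of Tweets per topic, which is the quantity used to rank the themes by popularity.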
Figure 5 shows a histogram-based representation of the confidence values for Topic 0 for all the analyzed Tweets and the categorization of these Tweets into different topics. The X-axis represents the confidence values, and the Y-axis represents the number of Tweets. The color coding (as stated in this Figure) represents the topic predicted by the LDA model. As can be seen from Figure 5, all the Tweets that received a confidence value of around 0.5 or higher for Topic 0 were categorized as Topic 0. Figures 6-8 show similar histogram-based representations of the confidence values of Tweets for Topics 1, 2, and 3, respectively, and their categorization into different topics. The Tweets belonging to each of these topics (Topic 0, Topic 1, Topic 2, and Topic 3) were studied to understand the underlying themes of conversation that represented each of these topics. Based on this study, the broad themes that represented these topics were observed to be "Views and Perspectives about MPox", "Updates on Cases and Investigations about MPox", "MPox and the LGBTQIA+ Community", and "MPox and COVID-19".
Table 3 presents a random selection of five Tweets for each of these Topics. In Table 3, these Tweets are presented in "as is" form, i.e., in the manner in which they were originally posted on Twitter, to provide better context.

Table 3. Representation of five Tweets (selected randomly) for each Topic: Topic 0, Topic 1, Topic 2, and Topic 3.

Tweet # | Original Text of the Tweet

Topic 0, Theme: Views and Perspectives about MPox
Tweet #1 | @vancemurphy @pfizer @moderna_tx @US_FDA Well, you know the new thing is monkey pox, right? Vaccines are so yesterday.
Tweet #2 | Its annoys me how they use pictures of black peoples hands when they discuss monkey pox
Tweet #3 | The pics of monkey pox looks exactly like shingles
Tweet #4 | @masthahh1 Are there any stats on the people who have gotten monkey pox? Were they all vaccinated?
Tweet #5 | Looking at the state of the UK. I'd be more worried about Monkey Pox catching a dose of Englishman!

Topic 1, Theme: Updates on Cases and Investigations about MPox
Tweet #1 | BREAKING: Health department investigating possible monkey pox case in NYC
Tweet #2 | New York health officials are investigating a potential case of monkeypox after a patient tested positive for the family of viruses associated with the rare illness.
Tweet #3 | U.S. government officials are placing orders for millions of doses of monkeypox vaccines amid a worldwide outbreak and a possible case in New York City, the Independent reports.
Tweet #4 | WHO is convening an Emergency Committee meeting out of concern for international spread of monkeypox, a high consequence infection. They will likely discuss whether to declare monkeypox a Public Health Emergency of International Concern (PHEIC)
Tweet #5 | The UK Health Security Agency said the new cases of the rare monkeypox infection do not have known connections with the previous confirmed cases announced on 14 May and a case on 7 May

Topic 2, Theme: MPox and the LGBTQIA+ Community
Tweet #1 | @CraigbryCraig @BreezerGalway Moneypox has been known about since 1958. Majority of case are in gay males. No need to freak out
Tweet #2 | . Gay? Had "close" contact with someone whose in the hospital now in Montreal. Apparently majority in Montreal who contracted the Monkey Pox were gay 35-50 year old men. AIDS started in the gay community too. Something about monkeying around...
Tweet #3 | @jmcrookston Just to be SUPER CLEAR, what I mean by this, is that no, monkeypox isn't a "gay disease". I'm queer and super not okay with the way the media is framing this the same way HIV/AIDS was framed in the 70s/80s.
Tweet #4 | @jeffreyatucker @ezralevant Some knowledge about Monkey pox, it's mostly for gay. Not a threat.
Tweet #5 | @EnemyInAState @TimothyVollmer Absolutely agree only other events won't have the stigma attached which is happening with monkey pox so many people are convinced it's a gay disease because there's no context

Topic 3, Theme: MPox and COVID-19

Table 4 presents the characteristic features of this LDA model to outline its overall performance and working. The characteristics reported are average coherence scores, word lengths, exclusivity, document entropy, and tokens for each topic. Finally, a comparative study with related works in this area of research is presented. Table 5 outlines a summary of related studies in this area of research that focused on the mining and analysis of Tweets about MPox (reviewed in Section 2). As can be seen from Table 5, most of the recent studies in this area of research have focused on sentiment analysis and content analysis of Tweets, whereas only a couple of works [109,113] have performed topic modeling of Tweets about MPox. Topic modeling of Tweets about MPox is therefore a research area that remains less explored and less investigated. A comparison of the different characteristics of this work and these two works [109,113] is presented in Table 6. This comparison outlines the limitations of the prior works [109,113] in this field and highlights how the work of this paper addresses the same. The following is a comprehensive analysis of the limitations in these works [109,113]:

1. Limited time range of the analyzed Tweets: The time range of the Tweets that were analyzed in these works represents Tweets that were posted only during certain months of the 2022 MPox outbreak. One of the works [109] included Tweets that were posted on the day the first case of the 2022 MPox outbreak was recorded (7 May 2022), but the other work [113] did not. Furthermore, neither of these works analyzed Tweets posted after 23 July 2022.
2. Limited number of Tweets used for topic modeling: The number of Tweets that were used for topic modeling in these works is 352,182 and 128,037 Tweets, respectively. It is relevant to mention here that the work presented in [113] involved a two-step process. In the first step, the authors analyzed a dataset of 556,402 Tweets about MPox and performed sentiment analysis of those Tweets. The results showed that 128,037 Tweets (23.01%) had a negative sentiment. Thereafter, only these 128,037 Tweets that had a negative sentiment were used for topic modeling in the second step of that study. The number of Tweets used for topic modeling in the previous works represents a fraction of the total number of Tweets that have been posted since the first recorded case of the 2022 MPox outbreak on 7 May 2022.
3. Elimination of a topic from the study: The work of Ng et al. [109] reports that after performing topic modeling, one topic was categorized as "Miscellaneous", which accounted for 31.1% of the total number of Tweets. The work also reports that this topic was omitted from the results; in other words, the specific themes or focus areas of conversation reported in that study [109] are based on the analysis of only 68.9% of the Tweets.
4. Lack of reporting of metrics to discuss the working or the accuracy of the topic modeling approaches: The average coherence value of an LDA model serves as a key indicator for determining the optimal number of topics. At the same time, metrics such as exclusivity, document entropy, number of tokens, and average word length of each topic help to provide a better understanding of the working of the underlying topic modeling approach. The two prior works that exist in this field [109,113] do not report any of these metrics to discuss either the working or the accuracy of the topic modeling approaches that were used.
These limitations that exist in similar studies in this area of research are addressed in this paper. First, the dataset that was used for developing the LDA model comprises Tweets about MPox that were posted on Twitter between 7 May 2022 and 3 March 2023, a time range that is greater than the time ranges of the similar works shown in Table 6. Second, a total of 601,432 Tweets were used for topic modeling. This is much higher than the number of Tweets that were used for topic modeling in similar works (352,182 Tweets in [109] and 128,037 Tweets in [113]) in this field. Third, no topic was eliminated or omitted prior to the identification of the themes or focus areas of conversations on Twitter related to MPox; in other words, the results presented in this work are based on the analysis of 100% of the Tweets present in the dataset. Finally, this work reports multiple metrics to discuss the accuracy and working of the topic modeling approach. Specifically, the LDA model developed in this paper was run by varying the number of topics from 2 to 50, and the average coherence score was computed for each number of topics. Based on the analysis of these coherence scores, the optimal number of topics was determined to be four. Thereafter, the LDA model was developed by setting the number of topics to four, and the average coherence scores, word lengths, exclusivity, document entropy, and tokens for each topic were reported to further discuss the accuracy and working of the developed LDA model. Thus, to summarize, the time range of the Tweets, the number of Tweets that were analyzed in this study, the fact that no topics were eliminated prior to the identification of themes of conversation, and the reporting of the average coherence scores, word lengths, exclusivity, document entropy, and tokens for each topic in the developed LDA model further support the scientific contributions of this work.
The discourse surrounding the global outbreak of MPox on the Twitter platform has garnered worldwide attention. The findings of this research reveal that conversations on Twitter revolving around MPox are multifaceted, with the public actively engaged in seeking and disseminating information across a spectrum of topics connected to this virus. It is imperative for public health authorities to comprehend the primary concerns and interests of the populace concerning MPox. Such insights can serve as a foundation for the timely development of applicable policies and measures to bolster public awareness, preparedness, and response mechanisms in the face of the continued spread of the MPox virus and its potential resurgence. Researchers from diverse domains have analyzed the mechanisms governing the flow of information from social media platforms to mainstream news outlets [132]. In today's media landscape, the once-distinct boundary separating social and mainstream media has blurred. As presented in this study, one of the focal points of discussion on Twitter related to MPox pertains to updates on cases and investigations concerning the virus. This observation indicates that social media discourse mirrored news narratives (news regarding the latest developments and case reports of MPox) during this outbreak. Public health agencies could consider exploring this trend further by monitoring Twitter conversations, especially on days marked by significant news events, such as updates on vaccine developments or treatments, or reports of severe adverse reactions attributed to specific vaccines or treatments for MPox. The identification of immediate reactions on Twitter may help these agencies to develop timely responses and applicable policies.

In view of the fact that the outbreak of MPox started a few months ago, the scientific community is still in the process of studying this virus. Consequently, there exists a spectrum of uncertainties encompassing the virus's transmission dynamics, current treatment modalities, vaccination strategies, and public receptiveness toward these treatments and vaccinations. These uncertainties have catalyzed an outpouring of opinions and perspectives on MPox, concerning both the virus and its outbreak. Public health authorities could also consider an exploration of Tweets related to these specific themes to gauge whether the ongoing public discourse serves to bridge the information gap between the public's needs and the information disseminated by various healthcare and medical sectors. Moreover, such an investigation could be instrumental in identifying instances of misinformation or the propagation of conspiracy theories in connection with the MPox outbreak.
Even though the limitations of the previous and related works in this area of research have been addressed in this paper, this work also has limitations. The Tweets analyzed in this paper were available on Twitter at the time of data analysis. However, Twitter allows users to delete their Tweets as well as their accounts. Furthermore, as per Twitter's inactive account policy [133], accounts on Twitter that have been inactive for a very long time may be permanently removed, which results in the deletion of all the Tweets from those accounts. So, if this study is repeated a few months or a few years from now, the results could vary to a degree if any of the analyzed Tweets have been removed in the meantime, whether because users deleted those Tweets, users who posted one or more of the analyzed Tweets deleted their accounts, or Twitter permanently removed one or more of those accounts due to very long inactivity.

Conclusions
In the last decade and a half, the world has experienced outbreaks of a range of viruses, such as COVID-19, H1N1, flu, Ebola, Zika virus, Middle East Respiratory Syndrome (MERS), measles, and West Nile virus, just to name a few. In today's Internet of Everything era, the popularity of social media platforms has been growing exponentially. Social media platforms have served as virtual communities during the outbreaks of such viruses in the past, allowing people from different parts of the world to share and exchange information, news, perspectives, opinions, ideas, and comments related to the outbreaks. Researchers from different disciplines have analyzed this Big Data of conversations related to virus outbreaks on social media platforms such as Twitter, using concepts such as topic modeling to understand the underlying themes of conversations and information exchange that the general public participates in. The recent outbreak of the MPox virus has resulted in a tremendous increase in the utilization of social media platforms such as Twitter. Recent studies in this area of research have primarily focused on sentiment analysis and content analysis of Tweets about MPox, whereas the couple of studies that have focused on topic modeling have multiple limitations. This paper aims to address this research gap and makes two scientific contributions to this field. First, it presents the results of performing topic modeling of 601,432 Tweets about the 2022 MPox outbreak that were posted on Twitter between 7 May 2022 and 3 March 2023. These results indicate that the conversations related to MPox during this time range may be broadly categorized into four distinct themes: Views and Perspectives about MPox, Updates on Cases and Investigations about MPox, MPox and the LGBTQIA+ Community, and MPox and COVID-19. Second, the paper presents the findings from the analysis of the Tweets that focused on these topics. The results show that the theme that was most popular on Twitter (in terms of the number of Tweets posted) during this time range was Views and Perspectives about MPox. It was followed by the theme of MPox and the LGBTQIA+ Community, which was in turn followed by the themes of MPox and COVID-19 and Updates on Cases and Investigations about MPox, respectively. To the best of the authors' knowledge, no similar work has been conducted in this field thus far.

With the continuous advances in the fields of healthcare and medicine in the last few months, the general public now has access to different forms of treatment and vaccines for MPox. As more treatments and vaccines become available to the public, and as studies reporting the accuracy and any potential side effects of the same are published, the conversations on Twitter related to MPox are expected to see an increase in the number of Tweets related to vaccines and treatments for MPox. Therefore, future work in this area would involve collecting more Tweets and repeating this study a few months from now to identify any significant variations in the themes of conversations on Twitter related to MPox. If any new themes of conversations are identified, future work will also involve an exploration of the Tweets in those themes to understand the patterns of seeking and sharing information or misinformation related to those themes.

Figure 1. System Design in RapidMiner for Performing Topic Modeling.

Figure 2. Representation of the order of execution of the different "operators" of the developed "process" in RapidMiner.

Figure 3. Analysis of the average coherence values of the LDA model for different numbers of topics.

Figure 4. A selection of 17 rows (selected randomly) from the output table of the developed LDA model.

Figure 5. A histogram-based representation of the confidence values (for Topic 0) for all the analyzed Tweets and the categorization of these Tweets into different topics (the color coding represents the categorization of these Tweets into different topics).

Figure 6. A histogram-based representation of the confidence values (for Topic 1) for all the analyzed Tweets and the categorization of these Tweets into different topics (the color coding represents the categorization of these Tweets into different topics).

Figure 7. A histogram-based representation of the confidence values (for Topic 2) for all the analyzed Tweets and the categorization of these Tweets into different topics (the color coding represents the categorization of these Tweets into different topics).

Figure 8. A histogram-based representation of the confidence values (for Topic 3) for all the analyzed Tweets and the categorization of these Tweets into different topics (the color coding represents the categorization of these Tweets into different topics).

Figure 9 shows an analysis of the number of Tweets per topic. As can be seen from Figure 9, Topic 0, or the theme of Views and Perspectives about MPox, was most popular on Twitter (in terms of the number of Tweets posted) during the time range of 7 May 2022 to 3 March 2023. It was followed by Topic 2, or the theme of MPox and the LGBTQIA+ Community. This was followed by Topic 3 (or the theme of MPox and COVID-19) and Topic 1 (or the theme of Updates on Cases and Investigations about MPox), respectively.

Figure 9. Analysis of the number of Tweets posted per topic.

Table 1. Average coherence values of the LDA model shown in Figure 1 for different numbers of topics.

Table 2. The minimum, maximum, average, and standard deviation of the confidence values associated with each topic.

Table 3 (continued). Topic 3, Theme: MPox and COVID-19
Tweet #1 | @COVIDnewsfast Transmission of Monkey Pox is not the same as Covid! Monkey Pox is coming! Covid did not do the trick.

Table 4. Different characteristic features of the developed LDA model for each topic.

Table 5. Summary and categorization of the recent studies in this area of research.

Table 6. Comparison of specific characteristics of this work with two similar studies in this area of research that also focused on topic modeling of Tweets.