A New Social Media Analytics Method for Identifying Factors Contributing to COVID-19 Discussion Topics

Abstract: Since the onset of the COVID-19 crisis, scholarly investigations and policy formulation have harnessed artificial intelligence (AI)-driven social media analytics. Evidence-driven policymaking has been facilitated through the application of AI and natural language processing (NLP) methodologies to analyse the vast landscape of social media discussions. However, recent research works have not demonstrated a methodology to discern the underlying factors influencing COVID-19-related discussion topics. In this study, an AI- and NLP-based framework is deployed, incorporating translation, sentiment analysis, topic analysis, logistic regression, and clustering techniques to identify and explain the factors that are relevant to any discussion topics within the social media corpus. This methodology is tested and evaluated using a dataset comprising 152,070 COVID-19-related tweets, collected between 15 July 2021 and 20 April 2023, encompassing discourse in 58 distinct languages. The AI-driven regression analysis revealed 37 observations, with 20 of them demonstrating a higher level of significance. In parallel, clustering analysis identified 15 observations, including nine of substantial relevance. These 52 AI-facilitated observations collectively delineate the factors that are linked to five core discussion topics that are prevalent in COVID-19 discourse on Twitter. To the best of our knowledge, this research constitutes the first effort in autonomously identifying factors associated with COVID-19 discussion topics. The implementation of this method has the potential to significantly enhance evidence-based policymaking on matters concerning COVID-19.


Introduction
Social media analytics has been used in various ways to understand the impact of COVID-19. According to a study conducted by the World Health Organization (WHO), social media and other digital platforms created opportunities to keep people safe, informed, and connected during the pandemic [1].
In this paper, an innovative methodology is proposed that uses AI-based services (Microsoft Cognitive Services [34]-based language detection, translation, and sentiment analysis) and algorithms (topic analysis, regression, and clustering) to autonomously identify the factors influencing COVID-19-related discussion topics, as shown in Figure 1. Moreover, the presented methodology was evaluated with 152,070 multilingual tweets, collected between 15 July 2021 and 20 April 2023. In summary, the following are the core contributions of this paper:

• An inventive framework, rooted in AI and NLP, is systematically employed. This framework integrates a spectrum of methodologies, including translation, sentiment analysis, topic analysis, regression, and clustering techniques, with the purpose of methodically discerning and expounding upon the factors that are pertinent to the diverse discourse topics encompassing COVID-19.
• This innovative approach underwent a rigorous examination and assessment, utilizing a dataset encompassing 152,070 tweets that were gathered from 15 July 2021 to 20 April 2023. Notably, this dataset encapsulates discourse in 58 distinct languages.
• AI- and NLP-based regression identified and described 37 observations, of which 20 were found to be significant. Moreover, clustering techniques identified 15 observations, nine of which were significant.
• These 52 observations, generated through AI-driven methods, elucidated the relationships between the topic confidences (Topic 1 confidence, Topic 2 confidence, Topic 3 confidence, Topic 4 confidence, and Topic 5 confidence) and an extensive array of factors. These factors included tweet time, followers, friends, retweets, language name, sentiment, positive sentiment confidence, neutral sentiment confidence, negative sentiment confidence, and predicted topic.
• This methodology could be applied to identify factors related to any discussion topics within any micro-blogging social media platforms.
Information 2023, 14, x FOR PEER REVIEW 3 of 21

Within the rest of this paper, a background and literature review are provided (in Section 2), followed by the details of the proposed methodology (in Section 3). Section 4 describes how the proposed methodology was evaluated with COVID-19-related tweets. Finally, Section 5 provides concluding remarks, limitations of this study, and future endeavours.

Background Context and Literature
In the realm of contemporary data analysis, the integration of multilingual, global sentiment analysis and topic analysis holds paramount significance when scrutinizing COVID-19-related tweets. This methodological approach encompasses a comprehensive investigation into the multifaceted linguistic expressions of a diverse global population during the pandemic. Multilingual sentiment analysis not only elucidates the emotional undercurrents within the discourse but also allows for the nuanced interpretation of sentiments across linguistic boundaries. Simultaneously, the employment of topic analysis facilitates the identification and categorization of emergent themes and topics within the vast corpus of COVID-19 tweets, ensuring a systematic exploration of the evolving narrative.

Multilingual Analysis
The COVID-19 pandemic transcended linguistic barriers, impacting diverse populations worldwide. Multilingual analysis allows us to decipher sentiments and opinions expressed in various languages, providing a comprehensive view of global perceptions and concerns [13,14,21]. This inclusivity fosters a more accurate understanding of the pandemic's impact on different communities.

Topic Analysis
COVID-19 tweet analysis through topic analysis identifies emerging themes and discussions within the vast tweet corpus [10,11,16,22]. This aids in tracking the evolution of public discourse, from early outbreak concerns to vaccine distribution and beyond. Understanding topics informs public health strategies and crisis communication [4]. Table 1 summarizes the existing research works on COVID-19 Twitter analytics that applied sentiment analysis and topic analysis on multilingual and global tweets.
In summary, a comprehensive approach that integrates multilingual capabilities, global context, sentiment analysis, and topic analysis in COVID-19 tweet analysis is indispensable for capturing the nuanced dynamics of the pandemic's impact, sentiments, and evolving discourse on a global scale. This research-driven approach empowers decision-makers to make informed, data-driven choices in managing and mitigating the pandemic's effects. As seen in Table 1, none of the existing research works investigated the factors influencing COVID-19 discussion topics. This study reports the first academic work on identifying the factors behind COVID-19 discussion topics on Twitter by concurrently using sentiment analysis and topic analysis on multilingual and global tweets.

Materials and Methods
The proposed framework revolves around AI-driven processes of tweet acquisition, language detection, translation, sentiment analysis, topic analysis, and correlation analysis. The correlation analysis uses both regression and clustering techniques, and is demonstrated in Figure 2. Each of these steps is described in detail within this section.

Tweet Acquisition
The analysis begins with the acquisition of a corpus of multilingual tweets that are germane to the COVID-19 discourse. This foundational process entails the extraction of tweets that incorporate the keywords "COVID" or "CORONA". Notably, this step is not confined to the capture of textual content but extends to the cataloguing of contextual parameters that describe the temporal, audience-related, and propagation-related dimensions of each tweet. These dimensions include the tweet text, tweet time, followers, friends, and retweets, among other pertinent attributes. The result is a heterogeneous dataset on which all subsequent analysis is built.
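The acquisition step can be illustrated with a minimal sketch. It assumes tweets are already available as records and that keyword matching is case-insensitive; the `Tweet` class and the `extract_tweets` helper are hypothetical names introduced here, not part of the original implementation.

```python
from dataclasses import dataclass
from datetime import datetime

KEYWORDS = ("COVID", "CORONA")  # the two keywords named in the text


@dataclass
class Tweet:
    """Hypothetical record holding the attributes listed in the text."""
    text: str
    time: datetime
    followers: int
    friends: int
    retweets: int


def contains_keyword(text, keywords=KEYWORDS):
    """Case-insensitive substring match (case-insensitivity is an assumption)."""
    upper = text.upper()
    return any(keyword in upper for keyword in keywords)


def extract_tweets(stream):
    """Keep only the tweets whose text mentions at least one keyword."""
    return [tweet for tweet in stream if contains_keyword(tweet.text)]


sample = [
    Tweet("Corona cases are falling", datetime(2021, 7, 15, 9, 0), 120, 80, 3),
    Tweet("Nice weather today", datetime(2021, 7, 15, 9, 5), 40, 10, 0),
    Tweet("Got my covid booster", datetime(2022, 1, 3, 18, 30), 900, 300, 12),
]
kept = extract_tweets(sample)
```

Only the first and third sample tweets are retained, together with their time, follower, friend, and retweet attributes.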

Language Detection
Subsequently, a layer of linguistic scrutiny is introduced through language detection. The profusion of languages on Twitter makes this phase indispensable. Herein, we leverage cutting-edge APIs, notably those furnished by Microsoft Cognitive Services, to determine the language of each tweet. The detected language is recorded as the "Language Name". The goal of this phase is to align each tweet with its language, a foundational step for the subsequent translation and sentiment analyses.

Translation (for Non-English Tweets)
In recognition of the global diversity that is inherent in Twitter discourse, where linguistic heterogeneity is the norm, a translation step is applied to tweets that are not in English. This step homogenizes all tweets into the English language. Accordingly, those tweets that are identified as non-English in the preceding step are translated into English. This translation operation, facilitated by APIs such as those provided by Microsoft Cognitive Services, yields a single-language corpus, thereby ensuring linguistic consistency for the subsequent analyses.

Sentiment Analysis
The sentiment expressed within the tweets, an elemental facet of the analysis, is uncovered through sentiment analysis. Each tweet within the standardized English dataset is examined, and its emotional tenor in relation to the COVID-19 topic is gauged. This analysis categorizes each tweet into one of three classes: positive, negative, or neutral. Notably, this classification is accompanied by quantified confidence scores that indicate the robustness of the categorization. This phase uses sentiment analysis APIs, which, in this study, come from Microsoft Cognitive Services. Hence, the analysis yields a set of parameters for each tweet: "Sentiment", "Positive Sentiment Confidence", "Neutral Sentiment Confidence", and "Negative Sentiment Confidence".
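To make the relationship between the four sentiment attributes concrete, the sketch below derives a label from three confidence scores by taking the highest one. This is an assumption for illustration: the sentiment service returns its own label, and the `sentiment_label` helper and the sample scores are hypothetical.

```python
def sentiment_label(positive, neutral, negative):
    """Return the class with the highest confidence (assumed labelling rule)."""
    scores = {"positive": positive, "neutral": neutral, "negative": negative}
    return max(scores, key=scores.get)


# Hypothetical confidence scores for a single tweet.
record = {
    "Positive Sentiment Confidence": 0.72,
    "Neutral Sentiment Confidence": 0.25,
    "Negative Sentiment Confidence": 0.03,
}
record["Sentiment"] = sentiment_label(
    record["Positive Sentiment Confidence"],
    record["Neutral Sentiment Confidence"],
    record["Negative Sentiment Confidence"],
)
```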

Topic Analysis (LDA-Based)
A pivotal stage of the analysis is Latent Dirichlet Allocation (LDA)-based topic analysis. This probabilistic modelling paradigm uncovers latent topics within the corpus of tweets. Each tweet is treated as a document that carries topic-related information. By allocating tweets to one or more topics, LDA assigns them topic affiliations, accompanied by associated confidence scores. This produces a set of parameters for each tweet, most notably the "Predicted Topic" and the "Topic Confidence" scores. This discourse-level dissection provides insights into the salient themes in Twitter discussion of COVID-19.
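The sketch below shows one way the "Predicted Topic" and "Topic Confidence" attributes could be derived from the per-tweet topic weights that an LDA model produces. It assumes the confidences are the normalized weights and that the predicted topic is the one with the highest confidence; both helper functions and the sample weights are hypothetical.

```python
def topic_confidences(weights):
    """Normalize non-negative topic weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one topic weight must be positive")
    return [weight / total for weight in weights]


def predicted_topic(confidences):
    """Return the 1-based index of the topic with the highest confidence."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return best + 1


# Hypothetical weights of one tweet over five topics.
confidences = topic_confidences([1.0, 3.0, 4.0, 1.0, 1.0])
topic = predicted_topic(confidences)
```

For these weights the tweet is assigned to Topic 3, whose confidence is 0.4.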

Correlation Analysis
At this juncture, the focus shifts to the associations between various parameters and COVID-19 discussion topics. The central aim is to unearth correlations between the confidence levels assigned to each of the identified topics (e.g., Topic 1 confidence, Topic 2 confidence, and so forth) and a range of attributes. These attributes encompass temporal characteristics (e.g., tweet time), social dynamics (e.g., followers, friends, retweets), linguistic attributes (e.g., language name), sentiment attributes (e.g., sentiment, positive sentiment confidence, neutral sentiment confidence, negative sentiment confidence), and the topics produced by the LDA-based topic analysis. This inquiry uses AI-driven regression and clustering methods to reveal the relationships that underpin the COVID-19 discourse.
Regression analysis automatically prioritizes and assesses the importance of factors for both categorical and numeric metrics. For numerical features, Microsoft's ML.NET SDCA regression [45] was employed, using linear regression, a fundamental supervised learning technique for solving regression problems. Linear regression predicts a continuous dependent variable based on independent variables, aiming to determine the best-fit line that accurately forecasts the continuous output, thereby establishing a linear relationship, represented by Equation (1).
For categorical features, logistic regression was executed using L-BFGS logistic regression from ML.NET [46,47]. Logistic regression, a widely used supervised learning algorithm, is used in both classification and regression problems. It predicts categorical dependent variables based on independent variables, employing Equation (2). Logistic regression outputs values between zero and one, making it suitable for tasks where probability estimates between two classes are needed, such as binary decisions like rainy or not rainy, 0 or 1, true or false, and so on.
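Equations (1) and (2) do not appear in this text. The block below gives the standard textbook forms that the descriptions above most plausibly correspond to; the coefficient symbols a_0, a_1, b_0, ..., b_n and the error term are assumptions introduced for illustration, and the original notation may differ.

```latex
% Assumed standard forms; coefficient names are illustrative.
\begin{align}
y &= a_0 + a_1 x + \varepsilon \tag{1} \\
\log\!\left(\frac{y}{1 - y}\right) &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{2}
\end{align}
```

In the first form y is the continuous output of linear regression; in the second, y is the predicted probability of the positive class, so the left-hand side is the log-odds.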
Initially, logistic regression operates as a regression model. However, when a threshold is introduced, it transforms into an effective classifier. The process begins with the logistic (sigmoid) function and is described with Equations (3)-(9).
The sigmoid function of Equation (3) maps real numbers to the interval (0, 1). Then, a hypothesis function is defined with Equation (4).
The classification decision is y = 1 when h_θ(x) ≥ 0.5 and y = 0 otherwise. The decision boundary is θ^T x = 0. The cost function is shown with Equation (5),
where H(p, q) is the cross-entropy of distribution q relative to distribution p and is shown with Equation (6).
In this case, y^(i) ∈ {0, 1}, so p_1 = 1 and p_2 = 0, which leads to Equation (7). Similar to the selection of the quadratic cost function in linear regression, the selection of this cost function is mainly driven by the fact that it is efficient, as shown in Equation (8).
Hence, the gradient descent for logistic regression could be reflected with Equation (9).
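Equations (3)-(9) are not reproduced in this text. The following is a reconstruction from the standard derivation that the surrounding sentences describe, using the same symbols (h_θ, θ^T x, y^(i), H(p, q)). The assignment of equation numbers to individual steps, the learning rate α, and the sample count m are assumptions.

```latex
% Assumed reconstruction; p^{(i)} = (y^{(i)}, 1 - y^{(i)}) and
% q^{(i)} = (h_\theta(x^{(i)}), 1 - h_\theta(x^{(i)})).
\begin{align}
g(z) &= \frac{1}{1 + e^{-z}} \tag{3} \\
h_\theta(x) &= g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{4} \\
J(\theta) &= \frac{1}{m} \sum_{i=1}^{m} H\big(p^{(i)}, q^{(i)}\big) \tag{5} \\
H(p, q) &= -\sum_{k} p_k \log q_k \tag{6} \\
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta\big(x^{(i)}\big)
          + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta\big(x^{(i)}\big)\big) \Big] \tag{7} \\
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{1}{m} \sum_{i=1}^{m}
          \big(h_\theta\big(x^{(i)}\big) - y^{(i)}\big)\, x_j^{(i)} \tag{8} \\
\theta_j &:= \theta_j - \alpha\, \frac{1}{m} \sum_{i=1}^{m}
          \big(h_\theta\big(x^{(i)}\big) - y^{(i)}\big)\, x_j^{(i)} \tag{9}
\end{align}
```

For a positive example (y^(i) = 1) the target distribution is p = (1, 0), which matches the case p_1 = 1 and p_2 = 0 mentioned in the text; substituting p and q into Equations (5) and (6) gives Equation (7).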
Both linear and logistic regressions were automatically applied utilizing NLP [48].
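A minimal sketch of logistic regression trained with plain batch gradient descent is given below. It is not the ML.NET L-BFGS trainer used in the study; it only illustrates the sigmoid hypothesis, the cross-entropy cost, and the 0.5 decision threshold described above. The toy data, learning rate, and step count are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Mean cross-entropy cost; eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    return -total / len(X)


def fit(X, y, alpha=0.1, steps=5000):
    """Batch gradient descent on the cross-entropy cost."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(steps):
        errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data: the first feature is a constant 1 (intercept), the second is the input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
# Classify with the 0.5 threshold on the predicted probability.
labels = [1 if predict_proba(theta, xi) >= 0.5 else 0 for xi in X]
```

On this separable toy data the fitted model reproduces the training labels, and the cost falls below its starting value of log 2.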

Explanatory Analysis (NLP-Based)
The analysis culminates in a synthesis that bridges the gap between numerical correlations and human understanding. Natural language processing (NLP)-based explainable AI renders the correlations unearthed in the prior step intelligible through human-readable narratives. By employing sophisticated NLP algorithms, this phase aims to explain not only the "what" but also the "why" behind the identified correlations. The resulting explanations guide scholars and practitioners through the interconnected parameters, thereby fostering an enriched comprehension of the COVID-19 discussion dynamics on Twitter.
In summary, the processes of tweet acquisition, language detection, translation, sentiment analysis, and topic analysis created various attributes or factors, as shown in Table 2. These attributes are used in the correlation process (i.e., clustering, logistic regression, and explainable AI) for identifying the factors that influence COVID-19-related discussion topics. Figure 3 demonstrates how these attributes are created as well as how they are used. Algorithm 1 demonstrates our implementation of this methodology. Various notations used within Algorithm 1 are portrayed in Table 3.

Table 3. Notations used within Algorithm 1.
Notation    Description
T           Extracted tweets as the output of ExtractTweetsContainingKeywords("COVID", "CORONA")
m           Date and time of tweet as the output of ExtractTweetsContainingKeywords("COVID", "CORONA")
f           Follower count as the output of ExtractTweetsContainingKeywords("COVID", "CORONA")
d           Friend count as the output of ExtractTweetsContainingKeywords("COVID", "CORONA")
r           Retweet count as the output of ExtractTweetsContainingKeywords("COVID", "CORONA")
l           Tweet language as detected using DetectLanguage(Tweet)
s           Detected sentiment as the output of SentimentAnalysis(tweet)
p           Positive sentiment confidence as the output of SentimentAnalysis(tweet)
n           Negative sentiment confidence as the output of SentimentAnalysis(tweet)
u           Neutral sentiment confidence as the output of SentimentAnalysis(tweet)
(no symbol recovered)  Topic ID as the output of PerformLDATopicAnalysis(T_EN)
c1          Topic 1 confidence as the output of PerformLDATopicAnalysis(T_EN)
c2          Topic 2 confidence as the output of PerformLDATopicAnalysis(T_EN)
c3          Topic 3 confidence as the output of PerformLDATopicAnalysis(T_EN)
c4          Topic 4 confidence as the output of PerformLDATopicAnalysis(T_EN)
c5          Topic 5 confidence as the output of PerformLDATopicAnalysis(T_EN)

Algorithm 1. Implementation of the proposed methodology (only the fragments "for tweet in T:" and "if l is not "English":" are recoverable).

In summation, this work embodies a holistic and rigorous analytical framework for the in-depth examination of COVID-19 discourse within the Twitter ecosystem. This process encompasses data acquisition, linguistic analysis, sentiment assessment, thematic exploration, correlation identification, and linguistic elucidation, thereby affording a comprehensive view of the discourse surrounding the pandemic within the digital public sphere. Its integration of advanced AI and NLP techniques amplifies the depth and interpretability of the insights garnered, rendering it a valuable resource for scholars in data science, linguistics, and the social sciences.
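The per-tweet loop and the non-English branch of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the detection, translation, sentiment, and LDA steps are passed in as callables, and the stub functions below merely stand in for the Microsoft Cognitive Services calls and the LDA model. All function and key names are hypothetical.

```python
def run_pipeline(tweets, detect_language, translate, analyse_sentiment, lda_topics):
    """Enrich each tweet dict with language, English text, sentiment, and topics."""
    rows = []
    for tweet in tweets:
        language = detect_language(tweet["text"])
        if language != "English":
            text_en = translate(tweet["text"], language)
        else:
            text_en = tweet["text"]
        row = dict(tweet, language=language, text_en=text_en)
        row.update(analyse_sentiment(text_en))
        rows.append(row)
    # Topic analysis works on the whole English corpus, so it runs after the loop.
    for row, topics in zip(rows, lda_topics([r["text_en"] for r in rows])):
        row.update(topics)
    return rows


# Stub services for illustration only; they return fixed, made-up values.
def stub_detect(text):
    return "German" if text.startswith("Die") else "English"


def stub_translate(text, language):
    return "[en] " + text


def stub_sentiment(text):
    return {"sentiment": "neutral", "p": 0.1, "u": 0.8, "n": 0.1}


def stub_lda(texts):
    return [{"topic": 1, "c": [1.0, 0.0, 0.0, 0.0, 0.0]} for _ in texts]


rows = run_pipeline(
    [{"text": "Die Corona Zahlen sinken", "r": 3}, {"text": "covid update", "r": 0}],
    stub_detect, stub_translate, stub_sentiment, stub_lda,
)
```

Passing the services in as arguments keeps the control flow testable without network access; a real deployment would supply functions that call the actual services.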

Results and Discussion
The methodology was tested and critically evaluated with 152,070 tweets from 15 July 2021 to 20 April 2023. During these 645 days, tweets in 58 distinct languages were analysed with AI-based language detection, translation, sentiment analysis, and LDA-based topic analysis. LDA-based topic analysis identified five topics on COVID-19-related discussion. Finally, AI- and NLP-based clustering and regression algorithms were used to identify and describe the correlations between the topic confidences and each of the related variables.
Table 4 provides the details of the five topics. These topics were (1) broad discussion on corona, (2) COVID statistics and vaccination, (3) wordplay on corona, (4) COVID experiences or updates, and finally, (5) likely context of COVID in India. As seen in Table 4, each of these discerned topics demonstrated distinct patterns of word occurrences and weights. For example, within Topic 3, the word "crown" and its variations appear prominently, along with "Corona". "Corona" in Latin means "crown", and the name of the virus is derived from this due to its appearance under the microscope. Moreover, the virus resembles a football (soccer ball), and hence the word "Corona_Futbol" appears with a weight of 582. As seen in Table 5, about 60,855 tweets were in English, 30,212 tweets were in German, followed by 22,226 tweets in Spanish, 7419 in Dutch, and 5748 in French. Most interestingly, as shown in Table 5, the language distribution for each of the topics has distinctive patterns, suggesting possible correlations between topics and languages. We can see in Table 5 that Topic 2, Topic 4, and Topic 5 contain mostly English tweets. However, Topic 1 and Topic 3 demonstrate a dominance of German and Spanish tweets, respectively. Figure 4 shows the word cloud for each of the five topics. Figure 4a is mostly in the German language. Figure 4c is mostly in Spanish. Figure 4b, Figure 4d, Figure 4e, and Figure 4f are predominantly in English. It should be mentioned that default stop-words like "am", "is", and "at" have been removed from Figure 4. Moreover, common terminologies like "COVID", "https", "rt", and "corona" have also been discarded from the word clouds shown in Figure 4.
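The word-cloud preprocessing described above can be sketched as follows. The sketch assumes that a word's weight is its plain frequency after filtering, which may differ from how the weights in Table 4 were computed; the stop-word set contains only the three examples given in the text, and the `word_weights` helper and sample tweets are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"am", "is", "at"}                    # examples given in the text
COMMON_TERMS = {"covid", "https", "rt", "corona"}  # terms discarded from the clouds


def word_weights(texts):
    """Count word occurrences after dropping stop-words and common terms."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token not in STOP_WORDS and token not in COMMON_TERMS:
                counts[token] += 1
    return counts


weights = word_weights([
    "RT corona is a crown",
    "The crown at https://t.co/x",
    "covid crown",
])
```

Here "crown" receives a weight of 3, while "corona", "covid", "https", and "rt" are excluded.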
Finally, Table 6 depicts the distribution of sentiment confidences (i.e., the results of the sentiment analysis process), follower count, friend count, retweet count, and the number of distinct tweet languages for each of the topics. As seen in Table 6, each of these five topics appears to follow a distinct pattern, and AI-based clustering and regression in subsequent processes would confirm all possible correlations for each of these topics.

Analysing the Correlated Factors for Topic 1
For Topic 1, six correlations were discovered using the AI-based regression method. Out of these six correlations, three are significant (as the correlation factor is greater than or equal to 0.1). This is observed from the result of the AI-based regression analysis as depicted in Figure 5a. The significant factors that influence Topic 1 confidence (c1) were identified to be language (l) and retweet count (r). The AI-based regression analysis uses NLP to describe these relationships. The following are three NLP-based descriptions of significant correlations:

• When the tweet language is 'de', the average Topic 1 confidence increases by 0.51;
• When the tweet language is 'nl', the average Topic 1 confidence increases by 0.38;
• When the average retweet count is 308 or less, the average Topic 1 confidence increases by 0.13.
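Descriptions of this kind can be illustrated with a small sketch. It assumes one plausible reading of "increases by": the mean topic confidence of tweets in the given language minus the mean of all remaining tweets. The study's regression tooling may compute the value differently, and the helper names and sample rows below are hypothetical.

```python
from statistics import mean


def language_effect(rows, language, target="c1"):
    """Mean target for tweets in `language` minus the mean for all other tweets."""
    inside = [row[target] for row in rows if row["language"] == language]
    outside = [row[target] for row in rows if row["language"] != language]
    return mean(inside) - mean(outside)


def describe(language, effect, topic=1):
    """Render the effect as a sentence in the style of the bullets above."""
    direction = "increases" if effect >= 0 else "decreases"
    return (f"When the tweet language is '{language}', the average "
            f"Topic {topic} confidence {direction} by {abs(effect):.2f}")


# Made-up rows for illustration; not taken from the study's dataset.
rows = [
    {"language": "de", "c1": 0.9},
    {"language": "de", "c1": 0.7},
    {"language": "en", "c1": 0.2},
    {"language": "es", "c1": 0.4},
]
effect = language_effect(rows, "de")
sentence = describe("de", effect)
```

With these rows the German tweets average 0.8 and the others 0.3, so the sentence reports an increase of 0.50.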
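These three significant correlations to Topic 1 confidence (c1) are also portrayed in Equations (10)-(12). The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations (13)-(15).
c1 ← {r ≤ 308} (12)
c1 ← {u ≤ 0.01} (14)
The automated AI-based clustering technique also discovered four clusters, as shown in Figure 5b. All clusters were found to be significant, as the Topic 1 confidence (c1) was more than or equal to 0.4.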
Equations (16)-(19) depict the characteristics of these four significant clusters.

Analysing the Correlated Factors for Topic 2
For Topic 2, six correlations were discovered using the AI-based regression method. Out of these six correlations, four are significant (as the correlation factor is greater than or equal to 0.1). This is observed from the result of the AI-based regression analysis, as depicted in Figure 6a. The significant factors that influence the Topic 2 confidence (c2) were identified to be the language (l), retweet count (r), and positive sentiment confidence (p). The AI-based regression analysis uses NLP to describe these relationships. The following are four NLP-based descriptions of significant correlations for Topic 2 confidence (c2):

• When the tweet language is 'en', the average Topic 2 confidence increases by 0.21;
• When the tweet language is 'fr', the average Topic 2 confidence increases by 0.17;
• When the average retweet count is more than 302, the average Topic 2 confidence increases by 0.14;

These four significant correlations to the Topic 2 confidence (c2) are also portrayed in Equations (20)-(23). The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations (24)-(25).
The automated AI-based clustering technique also discovered four clusters, as shown in Figure 6b. Three out of the four clusters were found to be significant, as the Topic 2 confidence (c2) was more than or equal to 0.4.
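The two significance rules used throughout this section (a correlation factor of at least 0.1 for regression and a topic confidence of at least 0.4 for clusters) can be expressed as a small helper. The thresholds come from the text; the cluster values below are hypothetical and are not taken from Table 7.

```python
REGRESSION_THRESHOLD = 0.1  # correlation factor threshold stated in the text
CLUSTER_THRESHOLD = 0.4     # topic-confidence threshold stated in the text


def split_by_threshold(values, threshold):
    """Split named values into significant (>= threshold) and insignificant."""
    significant = {k: v for k, v in values.items() if v >= threshold}
    insignificant = {k: v for k, v in values.items() if v < threshold}
    return significant, insignificant


# Hypothetical cluster-level Topic 2 confidences, for illustration only.
clusters = {"cluster 1": 0.62, "cluster 2": 0.45, "cluster 3": 0.40, "cluster 4": 0.12}
significant, insignificant = split_by_threshold(clusters, CLUSTER_THRESHOLD)

# The regression effects reported for Topic 1 all clear the 0.1 threshold.
topic1_effects = {"l = 'de'": 0.51, "l = 'nl'": 0.38, "r <= 308": 0.13}
sig_effects, insig_effects = split_by_threshold(topic1_effects, REGRESSION_THRESHOLD)
```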

Analysing the Correlated Factors for Topic 3
For Topic 3, eight correlations were discovered using the AI-based regression method. Out of these eight correlations, three are significant (as the correlation factor is greater than or equal to 0.1). This is observed from the result of the AI-based regression analysis, as depicted in Figure 7a. The three significant factors that influence the Topic 3 confidence (c3) were identified to be the language (l), negative sentiment confidence (n), and positive sentiment confidence (p). The AI-based regression analysis uses NLP to describe these relationships. The following are three NLP-based descriptions of significant correlations for Topic 3 confidence (c3):

• When the tweet language is 'es', the average Topic 3 confidence increases by 0.33;
• When the average confidence-negative sentiment is 0.01 or less, the average Topic 3 confidence increases by 0.17;
• When the average confidence-positive sentiment is more than 0.69, the average Topic 3 confidence increases by 0.12.
confidence increases by 0.17;  When the average confidence-positive sentiment is more than 0.69, the average Topic 3 confidence increases by 0.12.
⎯ { ≤ 0.01} ⎯ {̅ > 0.69} Equation ( 38) depicts the characteristics of the significant cluster.Equation ( 39) represents the insignificant cluster (i.e., Topic 3 confidence, c3 ≤ 0.4).For Topic 4, eight correlations were discovered using the AI-based regression method.Out of these eight correlations, four of them are significant (as the correlation factor is greater than or equal to 0.1).This is observed from the result of the AI-based regression analysis, as depicted in Figure 8 (a).The two significant factors that influence the Topic 4 confidence (c4) were identified to be the retweet count (r) and language (l).The AI-based regression analysis uses NLP to describe these relationships.The following are four NLP-based descriptions of the significant correlations for Topic 4 confidence (c4):


When the average retweet count is more than 16,740, the average Topic 4 confidence increases by 0.31; These three significant correlations to Topic 3 confidence (c 3 ) are also portrayed in Equations ( 30)- (32).The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations ( 33)- (37).
The automated AI-based clustering technique also discovered two clusters, as shown in Figure 7b.One out of the two clusters were found to be significant, as the Topic 3 confidence (c 3 ) was more than or equal to 0.4.
 When the tweet language is 'pt', the average Topic 4 confidence increases by 0.13;  When the tweet language is 'en', the average Topic 4 confidence increases by 0.13;  When the average retweet count is 1284-16740, the average Topic 4 confidence increases by 0.12.
These four significant correlations to Topic 4 confidence (c4) are also portrayed in Equations ( 40)- (43).The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations ( 44)-(47).Equation ( 48) depicts the characteristics of the significant cluster.Equations ( 49)-( 51) represent the insignificant clusters (i.e., Topic 4 confidence, c4 ≤ 0.4).For Topic 5, nine correlations were discovered using the AI-based regression method.Out of these nine correlations, six of them are significant (as the correlation factor is greater than or equal to 0.1).This is observed from the result of the AI-based regression analysis, as depicted in Figure 9a.The three significant factors that influence the Topic 5 confidence (c5) were identified to be the language (l), neutral sentiment confidence (u), follower count (f), and friend count (d).The AI-based regression analysis uses NLP to describe these relationships.The following are six NLP-based descriptions of the significant correlations: These four significant correlations to Topic 4 confidence (c 4 ) are also portrayed in Equations ( 40)- (43).The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations ( 44)-(47).← {n > 0.03} (46) These six significant correlations to Topic 5 confidence (c 5 ) are also portrayed in Equations ( 52)-(57).The insignificant correlations (i.e., a correlation factor less than 0.1) are portrayed in Equations ( 58)-(60).
The automated AI-based clustering technique also discovered one cluster, as shown in Figure 9b. This cluster was found to be significant, as the Topic 5 confidence (c5) was greater than or equal to 0.4.
Finally, Table 7 summarizes the results of the cluster analysis for each of the topics (i.e., Topic 1 confidence, Topic 2 confidence, Topic 3 confidence, Topic 4 confidence, and Topic 5 confidence). Moreover, this table shows how many records (i.e., population count) were used to obtain the details of these clusters. As seen in Table 7, the significant clusters (i.e., a cluster confidence greater than or equal to 0.4) are highlighted in red. In essence, the methodology described within this paper autonomously generated 37 observations with AI-driven regression (six for Topic 1, six for Topic 2, eight for Topic 3, eight for Topic 4, and nine for Topic 5). In addition, AI-driven clustering automatically generated 15 observations (four for Topic 1, four for Topic 2, two for Topic 3, four for Topic 4, and one for Topic 5). These 52 AI-driven observations (represented by Equations (10)-(61)) identified the factors that were deemed to be correlated with discussion topics found in COVID-19-related Twitter discourse. In Figure 10, the AI-driven observations (broken down into total observations and significant observations) are portrayed with radar charts.
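The observation totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the per-topic counts and the equation range given in the text; the dictionary and variable names are illustrative.

```python
# Per-topic observation counts as reported in the text (Topics 1 to 5).
REGRESSION_COUNTS = {1: 6, 2: 6, 3: 8, 4: 8, 5: 9}
CLUSTERING_COUNTS = {1: 4, 2: 4, 3: 2, 4: 4, 5: 1}


def total(counts: dict) -> int:
    """Sum the observations over all topics."""
    return sum(counts.values())


regression_total = total(REGRESSION_COUNTS)
clustering_total = total(CLUSTERING_COUNTS)
all_observations = regression_total + clustering_total

# The observations are numbered as Equations (10) through (61), inclusive.
FIRST_EQUATION, LAST_EQUATION = 10, 61
equation_count = LAST_EQUATION - FIRST_EQUATION + 1

print(regression_total, clustering_total, all_observations, equation_count)
```

The regression and clustering totals come to 37 and 15, and their sum of 52 equals the number of equations in the range (10)-(61).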
Table 7. Fifteen observations found with AI-driven clustering (9 significant observations highlighted in red).

Since the proposed solution is designed to allow decision-makers to make evidence-based decisions on COVID-19-related issues based on Twitter analytics, it was deployed in mobile environments on both iOS and Android. With the solution deployed in mobile environments, a strategic decision-maker could make evidence-based strategic decisions from a remote location whilst being completely mobile. Figure 11 shows the deployed system in a mobile environment, displaying the correlation between retweets and the Topic 1 confidence (previously shown in Equation (12)).

Conclusions
Since the emergence of the COVID-19 crisis, scholars and policymakers have adeptly harnessed Twitter as a principal reservoir for the meticulous scrutiny of public sentiments [9-25]. The perspicacious analysis of public sentiment engenders empirically grounded policymaking across a spectrum of COVID-19-related strategic imperatives, including, but not limited to, the imposition of lockdown measures, travel restrictions, vaccination campaigns, and the mitigation of misinformation. Consequently, the utilization of Twitter-based critical analysis has yielded substantive triumphs in the realm of COVID-19-driven decision making across multifarious dimensions.
However, none of these existing research works investigated the factors that drive COVID-19-based Twitter discourse.The present paper elucidates a systematic and methodological framework, employing artificial intelligence (AI) to autonomously unearth 52 distinct observations.This process, characterized by the utilization of both regression and clustering techniques, systematically unravels the intricate interplay between diverse factors and the topics encapsulating COVID-19 discussions on Twitter.Within this compendium of observations, 37 were ascertained through the AI-driven regression technique, while the AI-based clustering technique yielded an additional 15 observations.Furthermore, 29 of these observations bear considerable significance, denoting their pivotal role in shaping specific discourse themes.
Furthermore, this research not only introduces an innovative methodological paradigm but also subjects this framework to rigorous evaluation, encompassing an extensive dataset spanning 645 days, commencing on 15 July 2021, and culminating on 20 April 2023.This dataset encompasses a multitude of multilingual tweets, spanning 58 distinct languages, thereby furnishing strategic decision-makers with a comprehensive toolkit for comprehending the manifold factors that govern the discourse surrounding COVID-19.
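The 645-day span can be reproduced from the two stated dates. The minimal sketch below assumes that both the first and the last collection day are counted (an inclusive span); the text does not say how the figure was computed, and `inclusive_days` is a hypothetical helper.

```python
from datetime import date

# Collection window stated in the text.
START = date(2021, 7, 15)
END = date(2023, 4, 20)


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


span = inclusive_days(START, END)
print(span)
```

Counting only the difference between the two dates gives one day fewer, so the inclusive convention is what matches the reported figure.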
There are technical, qualitative, and ethical limitations of Twitter-based social media analytics, as apparent from [2,4,14]. Twitter has garnered acknowledgment as a fertile environment for the proliferation of disinformation and the dissemination of deceptive content, as noted in the scholarly discourse [49,50]. Within the confines of this specific investigation, a foundational proposition was laid out, positing the veracity of the entire corpus of 152,070 COVID-19-related tweets subjected to scrutiny. Moreover, there are ethical issues pertaining to social media-based intelligence without the explicit permission of the social media users [51,52]. Research works in [51,52] portray ethical concerns in obtaining AI-driven intelligence from closed-network social media platforms like Facebook and LinkedIn. Users of Facebook, LinkedIn, and other closed-network platforms share their content only within their closed group and do not consent to the intelligence acquisition of their data. In contrast, users of open platforms (like Twitter) are already aware that their contents are publicly available and could be subjected to intelligence acquisition. Consequently, an inherent constraint manifests itself in the shape of an absence of stringent validation protocols systematically applied to the open-source data sourced from the Twitter platform. As shown in Figure 12, the limitations of this work would shape the scope of future research in Twitter-based COVID-19 discourse.

Figure 1. Conceptual diagram of the proposed system (factors 5 to 10 are NLP-based).


Figure 2. The process of analysing the correlated factors of COVID-19-related Twitter topics.

(tweet)
p: Positive sentiment confidence as the output of SentimentAnalysis(tweet)
n: Negative sentiment confidence as the output of SentimentAnalysis(tweet)
u: Neutral sentiment confidence as the output of SentimentAnalysis(tweet)
Topic: Topic ID as the output of PerformLDATopicAnalysis(T_EN)
c1: Topic 1 confidence as the output of PerformLDATopicAnalysis(T_EN)
c2: Topic 2 confidence as the output of PerformLDATopicAnalysis(T_EN)
c3: Topic 3 confidence as the output of PerformLDATopicAnalysis(T_EN)
c4: Topic 4 confidence as the output of PerformLDATopicAnalysis(T_EN)
c5: Topic 5 confidence as the output of PerformLDATopicAnalysis(T_EN)
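The NLP-derived notations above can be pictured as one record per tweet. The following is a minimal sketch of such a record, not the data model of the deployed system: the `TweetFeatures` class is hypothetical, and the rule that the Topic ID is the topic with the highest confidence is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TweetFeatures:
    """Hypothetical per-tweet record using the notations p, n, u, and c1..c5."""
    p: float  # positive sentiment confidence
    n: float  # negative sentiment confidence
    u: float  # neutral sentiment confidence
    topic_confidences: Tuple[float, float, float, float, float]  # c1..c5

    def dominant_topic(self) -> int:
        """Return the 1-based ID of the highest-confidence topic.

        Assumption: the Topic ID corresponds to the largest of c1..c5.
        """
        best_index = max(range(len(self.topic_confidences)),
                         key=lambda i: self.topic_confidences[i])
        return best_index + 1


# Illustrative values only; they are not taken from the dataset.
example = TweetFeatures(p=0.1, n=0.2, u=0.7,
                        topic_confidences=(0.05, 0.10, 0.15, 0.60, 0.10))
print(example.dominant_topic())
```

Keeping the five topic confidences in one tuple makes it straightforward to select a single confidence (for example c4) as the target when examining factors for one topic.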

Figure 3. Detailed process map of the proposed system.


Figure 5. Identifying the correlated factors for Topic 1 (broad discussion on corona). (a) Identifying 6 correlations with regression. (b) Identifying 4 correlations with clustering.


Figure 10. Total observations vs. significant observations for regression and cluster analysis. (a) Results of regression. (b) Results of clustering.

Information 2023, 14, x FOR PEER REVIEW

Figure 11. The proposed solution deployed on a Samsung Galaxy S23 Ultra Mobile running Android version 13.


Figure 12. Limitations of the Twitter-based analysis of COVID-19 discourse.

Funding: This research received no external funding.

Table 2. Lifecycle of attributes/factors (processes that create or use the attributes).

Table 3. Description of notations.


Table 4. Word weights across each of the five topics.

Table 5. Most used Tweet languages for each of the topics.

Table 6. Details of NLP analysis for each of the predicted topics.

