Information Diffusion Model in Twitter: A Systematic Literature Review

Information diffusion, information spread, and influencers are important concepts in many studies on social media, especially Twitter analytics. However, literature overviews on the information diffusion of Twitter analytics are sparse, especially on the use of continuous time Markov chain (CTMC). This paper examines the following topics: (1) the purposes of studies about information diffusion on Twitter, (2) the methods adopted to model information diffusion on Twitter, (3) the metrics applied, and (4) measures used to determine influencer rankings. We employed a systematic literature review (SLR) to explore the studies related to information diffusion on Twitter extracted from four digital libraries. In this paper, a two-stage analysis was conducted. First, we implemented a bibliometric analysis using VOSviewer and R-bibliometrix software. This analysis was applied to the 204 papers selected after conducting a duplication check and assessing the inclusion-exclusion criteria. At this stage, we mapped the authors' collaborative networks and the evolution of research themes. Second, we analyzed the gap in research themes on the application of CTMC to information diffusion on Twitter. Further filtering criteria were applied, and 34 papers were analyzed to identify the research objectives, methods, metrics, and measures used by each researcher. We found no study that used nonhomogeneous CTMC in Twitter information diffusion modeling. This finding motivates us to further study nonhomogeneous CTMC as a modeling approach for Twitter information diffusion.


Introduction
Social media such as blogs, forums, chat applications, and social networking are platforms for online interactions regardless of a user's physical location [1] and have become integrated into daily life. Twitter is a growing microblog that allows users to send text messages composed of up to 280 characters [2]. User activities on social media have been subjected to data collection and monitoring and are considered meaningful data sources for various public and private organizations, including in the industry and academia. The largest use of social media data comes from microblogs, as much as 46% [1]. Twitter data can be used in a remarkably diverse number of research studies, such as sentiment analyses [3,4], text analyses [5][6][7][8], opinion analyses [9,10], as well as analyses of influence or information diffusion [11][12][13][14].
Studies of information diffusion on Twitter are important because the topic continues to attract researchers' attention and lends itself to close scrutiny and various advanced analyses. Information diffusion is defined as the process of information travelling from a sender to a set of receivers through a carrier. In the case of Twitter, the sender is the user who posted the tweet, the carrier is the tweet that was posted, and the recipients are followers of the user who posted the tweet [15].
A user is influential if their tweets can spread to many other Twitter users. Users with such a strong influence in the Twitter network are called influencers [16].
Although some studies on information diffusion in social networks exist, such as [28][29][30][31][32][33][34][35], the following have not been studied in previous literature reviews: article selection reviewed according to the SLR procedure, a bibliometric analysis and theme evolution approach, the determination of research objectives related to the information diffusion model on Twitter, and the metrics and measures used by researchers.
This study applied a systematic literature review (SLR) as an approach to obtain an overview of existing studies and trends related to information diffusion and social media, especially on Twitter. Our work proposes a bibliometric analysis that allows us to know popular places and networks from which authors conducting research on information diffusion and social influencers gather their data. We also conduct an evolution analysis, allowing us to detect the changes in topics over time.
In other words, this paper gives a survey report and systematically presents existing studies on information diffusion and Twitter analysis from the perspective of the methods, metrics, and measures used. We added a bibliometric analysis to assess the evolution of themes year after year, which has not been covered in previous review articles.
To be precise, the SLR that is used in this paper is a method of identifying, evaluating, and interpreting all existing research relevant to a phenomenon of interest [36]. Referring to [36], the determination of a research question (RQ) is based on the research objectives. We studied the topic of information diffusion on Twitter in our research, so we read several articles related to information diffusion, such as [12,14,21]. Varshney et al. [12] aimed to predict the probability of information diffusion and used a Bayesian network method based on tweet/retweet metrics. Kumar et al. [14] used the susceptible-exposed-infected (SEI) model to model the diffusion of information on Twitter. Meanwhile, Oo and Lwin [21] used PageRank to measure the influence of users on Twitter. Through a discussion of the study results from all of these authors, we determined the RQs. Thus, in this paper, we focus on information diffusion on Twitter and answer the following research questions (RQs):
1. What are the purposes of information diffusion-related research on Twitter?
2. What methods have researchers used regarding the information diffusion model on Twitter?
3. What metrics from Twitter data do researchers use?
4. What measures do researchers use to determine influencer rankings?
Information on Twitter spreading from one user to another fulfills the nature of a Markov chain, that is, the information redistributed by the next user depends only on the current user's information and does not depend on the history of the previous dissemination of information [11]. This means that information diffusion on Twitter can be modeled with the continuous time Markov chain (CTMC). In this paper, we aim to find gaps in the research theme of CTMC in modeling information diffusion on Twitter.
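To make the Markov property above concrete, the following is a minimal sketch of simulating one path of a homogeneous CTMC. It is an illustration only and not a model taken from the reviewed papers: the states "A", "B", and "C" (standing for users currently holding a tweet) and all transition rates are hypothetical values chosen for the example.

```python
import random

def simulate_ctmc(rates, start, t_max, seed=0):
    """Simulate one path of a homogeneous CTMC up to time t_max.

    rates[i][j] is the transition rate from state i to state j
    (hypothetical numbers in this sketch). Returns (time, state) pairs.
    """
    rng = random.Random(seed)
    t, state = 0.0, start
    path = [(t, state)]
    while True:
        out = rates.get(state, {})
        total = sum(out.values())
        if total == 0:
            break  # absorbing state: nobody passes the tweet on
        # The holding time depends only on the current state (Markov property).
        t += rng.expovariate(total)
        if t > t_max:
            break
        # The next state is chosen with probability proportional to its rate.
        state = rng.choices(list(out), weights=list(out.values()))[0]
        path.append((t, state))
    return path

# Hypothetical example: user A's tweet may be passed on by B or C.
rates = {"A": {"B": 2.0, "C": 1.0}, "B": {"C": 0.5}, "C": {}}
for time, user in simulate_ctmc(rates, "A", t_max=10.0, seed=1):
    print(f"t={time:.3f}  holder={user}")
```

In a nonhomogeneous CTMC the rates would depend on time; this sketch keeps them constant for simplicity.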
To conduct this research, we applied bibliometric analysis software, namely VOSviewer and R-bibliometrix. VOSviewer was used to implement co-authorship-author, co-authorship-country, and co-occurrence-word analyses. Meanwhile, R-bibliometrix with the biblioshiny web interface was employed to assess the evolution of themes.
This paper is structured as follows: Section 2 provides a related literature review. Section 3 presents the research method, especially the method used to collect the articles for analysis. In Section 4, we describe the bibliometric analysis of the papers obtained using two software tools and the results of a study on the selected articles covering research about information diffusion on Twitter, the methods used in information diffusion modeling on Twitter, the metrics used in Twitter data, and the measures used to determine the rank of influencers. Section 5 comprises the discussion and further research agenda. Section 6 concludes this article.
Information 2022, 13, 13

Related Literature Review
Several surveys or literature reviews have been conducted regarding the study of information diffusion on social networks (SN) in general and not specifically on Twitter. No previous review articles performed a complete systematic literature review. Kakar and Mehrotra [28] conducted a review of 90 filtered papers from six databases: namely, Scopus, Science Direct, ACM Digital Library, Springer, IEEE, and Google Scholar. This work focuses on three research areas under the umbrella of information diffusion in social networks, namely influence modeling, influence maximization, and retweet prediction. However, it did not discuss the metrics and measures used by researchers.
Razaque et al. [29] conducted a review focused on the classification of information diffusion models and the vulnerability of each model. However, this work did not discuss network influencers and how to maximize influence within the network. Alamsyah and Rahardjo [30] investigated social networks (SNs) from the perspective of graphical representations to find the SN taxonomy. It included the SN topology, structural modeling, tie strength, community detection, as well as metrics. Graphical representation techniques, rather than information diffusion models, were the focus of this study. Hamzah [31] reviewed 49 articles filtered from six databases, namely the ACM Digital Library, Google Scholar, IEEE Xplore, Science Direct, Springer Link, and Taylor & Francis Online. This article discussed machine learning and visualization techniques used for Twitter analytics. In this work, neither network influencers nor information diffusion analysis techniques based on the perspective of user behaviors were discussed, and the focus was more toward identifying social vulnerabilities.
Meanwhile, a survey conducted by [32] divided the diffusion model into two categories, namely explanatory models and predictive models. This explanatory model included epidemic models and influence models, while the predictive model included the independent cascade model (ICM), the linear threshold model (LTM), and the game theory model (GTM).
Singh [33] analyzed how information dissemination is carried out through networks and how people influence each other on social networks. This survey focused on maximizing influence with LTM, ICM, and epidemic models but did not discuss the metrics and measures used by researchers.
Firdaus et al. [34] conducted a survey on the information diffusion mechanism on Twitter. The authors focused on predicting tweets: starting from how to retrieve Twitter data, users who post tweets, tweet content, and predicting whether a tweet will be retweeted. Machine learning techniques were used for the tweet prediction. Additionally, some of the measures used to evaluate the performance of the model were discussed. This work did not discuss influence maximization or influential users on Twitter.
Riquelme and González-Cantergiani [35] conducted a survey on the size of a user's influence on Twitter. This work collected and classified various measures of influence on Twitter. Some were based on simple metrics, and some were based on complex mathematical models. Various criteria were given to determine the most influential users on Twitter. However, an information diffusion model was not discussed in this work.
What is different about our work compared with that of previous researchers is the systematic literature search conducted on studies of information diffusion models, especially on Twitter. We also added a bibliometric analysis to describe the distribution of research topics regarding the diffusion of information and changes in themes that occurred during each time period. The aspects covered in our article and in existing reviews are summarized in Table 1.
The SLR procedure that we carried out follows [36]. SLR procedures include the search strategy, quality assessment, data extraction and monitoring, and data synthesis. Starting with database selection, we agreed to choose four databases, i.e., Scopus, Science Direct, Dimensions, and Google Scholar. These four databases provide many articles from various domains such as science, engineering, computer science, medicine, social science, and others. We hope that our research can be considered a contribution to the field. Conducting a bibliometric analysis on data from these four databases is demanding because the metadata of each database are different. Referring to [37], our keyword selection was performed through RQ analysis and by looking at related papers. We all agreed on the keywords applied to the four databases. The keywords included "information diffusion", "user influence", "influence maximization", "social network", and "Twitter". We assumed these keywords to be sufficient to capture all articles about information diffusion studies, especially on Twitter. In the inclusion process, we decided together that the selected articles had to be published in an English-language journal from 2000 to February 2021 (when the data were collected). In the manual article-exclusion stage, two authors read the abstract and the article content to mark the article as relevant or not. If opinions differed, another author read the abstract and the contents of the article and made the final decision. The complete research procedure can be seen in Section 3.

Methods
In this section, we describe how this research process was carried out: namely, the collection of article data and the selection method.
This study began with a systematic search for publications indexed in four selected databases: Scopus, Science Direct, Dimensions, and Google Scholar. The keywords used in this first search were ("information diffusion") OR ("influence analysis") OR ("influence maximization") OR ("user influence"). We set the earliest publication year to 2000, which precedes the founding of Twitter in 2006, and we ended the collection in February 2021. We limited our search to journal articles and excluded conference proceedings and books. We only included articles written in English and published in final form in peer-reviewed international journals.
Data retrieval in the Scopus, Science Direct, and Dimensions digital libraries was carried out with the keywords applied to the "title, abstract, and keywords" fields. Meanwhile, in the Google Scholar database, the keywords were only applied to the title, since the Google Scholar search engine does not support searching within abstracts. The first step was to find all papers related to information diffusion. From this search, 2675 papers were obtained from Scopus, 850 papers from Science Direct, 2950 papers from Dimensions, and 5950 papers from Google Scholar. Then, we applied the inclusion filter using two new keywords, namely ("social networks") OR ("social media"), aiming to acquire papers that included social media data or social networks when scrutinizing information diffusion. In this second round, several papers from the Scopus database were removed, reducing the number to 1211 papers. The number of papers from Science Direct decreased to 367 papers, while the numbers of papers from Dimensions and Google Scholar were brought down to 10… and 110 papers, respectively.
The filtering process continued by applying the keyword "Twitter" to capture papers examining information diffusion studies on Twitter. From this search, we obtained 199 papers from Scopus, 50 papers from Science Direct, 172 papers from Dimensions, and 3 papers from Google Scholar.
A summary of the search results from the three filtering processes on the four databases can be seen in Table 2. Note that the "Type" column in Table 2 represents the use of the following keywords: A. ("information diffusion") OR ("influence analysis") OR ("influence maximization") OR ("user influence"); B. ("social network") OR ("social media"); and C. "Twitter".
Furthermore, after semi-automatic selection of all articles with the three keywords in the four digital libraries, we removed 168 duplicate articles and 3 survey articles. Then, the selection of articles was carried out through the abstracts, obtaining 204 relevant articles. Next, we performed manual filtering by reading the full text and obtained 34 articles. Our general selection process is shown in Figure 1.
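The three successive keyword filters (types A, B, and C) can be sketched as follows. This is only an illustration of the logic: in practice the filtering is performed by each digital library's own search engine, and the record structure and the sample records below are hypothetical.

```python
# Keyword groups A, B, and C as listed in the text (lower-cased for matching).
STAGE_A = ["information diffusion", "influence analysis",
           "influence maximization", "user influence"]
STAGE_B = ["social network", "social media"]
STAGE_C = ["twitter"]

def matches_any(record, phrases, fields):
    """True if any phrase occurs in the selected fields of the record."""
    text = " ".join(record.get(f, "") for f in fields).lower()
    return any(p in text for p in phrases)

def staged_filter(records, fields=("title", "abstract", "keywords")):
    """Apply filters A, B, C in sequence; return survivors and stage counts."""
    counts = []
    current = list(records)
    for phrases in (STAGE_A, STAGE_B, STAGE_C):
        current = [r for r in current if matches_any(r, phrases, fields)]
        counts.append(len(current))
    return current, counts

# Hypothetical records, invented for this example only.
records = [
    {"title": "Information diffusion on Twitter", "abstract": "We study social networks."},
    {"title": "Influence maximization in social media", "abstract": "Facebook data."},
    {"title": "User influence in epidemics", "abstract": "A contact model."},
    {"title": "Deep learning for images", "abstract": "Convolutional models."},
]
selected, counts = staged_filter(records)
print(counts)  # number of records remaining after stages A, B, and C

# Title-only matching, mirroring the restriction described for Google Scholar.
_, title_only_counts = staged_filter(records, fields=("title",))
print(title_only_counts)
```

Restricting the match to the title field, as in the last call, illustrates why a title-only search can return far fewer hits at the later stages.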

Semi-Automatic Selection
We developed a simple script using Python to select duplicate documents. We used Scopus articles as a reference for viewing duplicates in Dimensions, Science Direct, and Google Scholar. From this process, 122 Dimensions articles, 45 Science Direct articles, and 1 Google Scholar article were found to be redundant. After removing the duplicate articles, we obtained a total of 256 unique articles.
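The authors' script is not reproduced in the paper, so the following is only a minimal sketch of how such a duplicate check might look. It assumes that duplicates are detected by comparing normalized titles against the Scopus records; the matching key, the function names, and the sample records are all assumptions made for illustration.

```python
import re

def normalize_title(title):
    """Lower-case a title and collapse punctuation/whitespace (assumed key)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def split_duplicates(reference, other):
    """Split `other` into records not in `reference` and records that are."""
    seen = {normalize_title(r["title"]) for r in reference}
    unique, duplicates = [], []
    for rec in other:
        if normalize_title(rec["title"]) in seen:
            duplicates.append(rec)
        else:
            unique.append(rec)
    return unique, duplicates

# Hypothetical records, invented for this example only.
scopus = [{"title": "Information Diffusion on Twitter: A Model"}]
dimensions = [
    {"title": "Information diffusion on Twitter - a model"},
    {"title": "Ranking influencers"},
]
unique, duplicates = split_duplicates(scopus, dimensions)
print(len(unique), "unique,", len(duplicates), "duplicate")
```

Matching on a DOI, where available, would be a stricter alternative to title matching.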

Manual Selection
The manual selection process was carried out in three stages:
• First: We examined the title, abstract, and full text from the filtered articles to find articles conducting a survey or literature review. We removed three articles in the form of surveys, namely two articles from Scopus and one article from Dimensions. Thus, in total, from this stage, we obtained 253 articles.
• Second: We examined the abstract to assess the relevance of the article to our research focus. Based on the abstracts, we discarded a total of 49 out of 253 articles, so we obtained 204 selected articles (hereinafter referred to as "Dataset 1"). Note that the original raw data returned from each digital library came in different formats. Hence, we adjusted the article data for Dimensions, Science Direct, and Google Scholar so that their formats matched the raw file from Scopus. After restructuring all datasets into a homogeneous structure, bibliometric analysis was carried out for Dataset 1 (see Section 4).
• Third: We thoroughly read the full text and the content and discussion of the articles to further evaluate their relevance. At this point, we obtained 34 articles (henceforth referred to as "Dataset 2"), which were used further for our systematic literature review analysis.
To sum up, we used Dataset 1 to conduct the bibliometric analysis as presented in Section 4.1 and Dataset 2 to discuss the results from the systematic literature review as presented in Section 4.2.
The results of this semi-automatic and manual selection process are shown in Table 3.
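As a consistency check, the counts reported in the selection steps above can be chained together. The short sketch below uses only figures stated in the text; the variable names are ours.

```python
# Papers remaining after the "Twitter" keyword filter, per database.
after_twitter_filter = {"Scopus": 199, "Science Direct": 50,
                        "Dimensions": 172, "Google Scholar": 3}
# Duplicates found relative to Scopus in the semi-automatic selection.
duplicates = {"Dimensions": 122, "Science Direct": 45, "Google Scholar": 1}

total = sum(after_twitter_filter.values())   # 424 papers before deduplication
unique = total - sum(duplicates.values())    # 424 - 168 = 256 unique articles
non_survey = unique - 3                      # 256 - 3 = 253 after removing surveys
dataset_1 = non_survey - 49                  # 253 - 49 = 204 after abstract screening
dataset_2 = 34                               # after full-text reading

print(total, unique, non_survey, dataset_1, dataset_2)
# prints: 424 256 253 204 34
```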

Bibliometric Analysis
We performed a bibliometric analysis for Dataset 1. This analysis technique is often used in literature analyses intended to obtain bibliographic overviews of highly cited scientific publications. It can recover a list of author productions, national or subject bibliographies, or other specialized subject patterns [38]. We performed the bibliometric analysis using VOSviewer and R-bibliometrix. VOSviewer is a computer program used for bibliometric mapping [39], while R-bibliometrix is a package for the open source R software with a Shiny web interface capable of conducting comprehensive analyses and scientific mapping of data with complete bibliographic information [40]. Both tools have their respective advantages in bibliometric analysis. For example, VOSviewer has better visualization and clearer links among different nodes in the network images compared with R-bibliometrix. In contrast, R-bibliometrix has a Sankey diagram feature that is particularly useful in conducting thematic evolution analyses.

Results from Bibliometric Analysis
In this section, we present the results of the analysis using the network visualization, grid matrix, and Sankey diagram techniques. This analysis was divided into three parts: (1) co-authorship-author and co-authorship-country, (2) co-occurrence-words, and (3) thematic evolution.

Visualization of the Co-Authorship-Author and Co-Authorship-Country Relations
In this section, the co-authorship analysis was conducted by examining the relationship between authors and their countries of origin. In VOSviewer, the co-authorship-author menu was selected by limiting each author to a minimum of one article. This means that all articles were analyzed. Based on this provision, VOSviewer obtained 559 authors, but only 70 authors were connected with other authors. The co-authorship-author relation was divided into nine clusters, namely red, yellow, green, blue, orange, pink, aqua, brown, and purple, as shown in Figure 2. In this case, the most productive author on the topic under study was Zhang, Y with five articles, followed by Zhang, C and Wang, Y with four articles each.
Furthermore, a bibliometric analysis was also carried out to assess the countries of origin of the authors involved in the network. The type of analysis used was the co-authorship-country relation, with the minimum number of documents from a country for a co-authorship being 1. VOSviewer detected 44 countries in our Dataset 1; however, only 36 countries had connections with other countries in the context of the co-authorship-country relation. The 36 countries were divided into nine clusters, as shown in the network visualization in Figure 3. The clusters are indicated with different colors. From these results, the US had the most, with 68 articles (27%); followed by China, with 29 articles (12%); and then, India, with 23 articles (9%).
As an example, the visualization also tells us that the authors in the US cooperated with authors in various countries such as Denmark, Poland, Slovenia, China, Hong Kong, Brazil, Vietnam, South Korea, Italy, India, the Netherlands, Germany, Canada, and the United Kingdom.

Visualization of Co-Occurrence-Word Relation
To conduct a co-occurrence analysis on Dataset 1, we searched for the most frequent words appearing across all documents. Dataset 1 contains data taken from the title, keywords, and abstract only. VOSviewer provides a menu option for assessing the co-occurrence-author keyword relation. We set the minimum number of occurrences of a word to two. VOSviewer returned 467 words, of which only 54 met the threshold. These words were divided into 12 clusters. The results show that the most frequent words appearing in Dataset 1 are "information diffusion", with 66 occurrences; followed by "Twitter", with 44 occurrences; and "social networks", with 37 occurrences. This co-occurrence-word network visualization is shown in Figure 4. Note that the co-occurrence network has extensively been used in social media analyses and text analyses for discovering the relationships among people, organizations, concepts, and other areas of interest. Here, we observe that, for example, the information diffusion concept is often linked to various concepts, especially Twitter, social influence, social networks, user influence, contagion, and popularity prediction, to name a few.
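The idea behind a keyword co-occurrence analysis with a minimum-occurrence threshold can be sketched in a few lines. This is not VOSviewer's implementation (which also performs normalization and clustering); the function name and the sample keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(docs, min_occurrences=2):
    """Count keyword occurrences and pairwise co-occurrences.

    docs is a list of keyword lists, one per document. Only keywords that
    occur in at least `min_occurrences` documents contribute to the links.
    """
    occurrences = Counter(kw for kws in docs for kw in set(kws))
    kept = {kw for kw, n in occurrences.items() if n >= min_occurrences}
    links = Counter()
    for kws in docs:
        present = sorted(set(kws) & kept)
        links.update(combinations(present, 2))  # one link per keyword pair
    return occurrences, links

# Hypothetical author keywords for three documents.
docs = [
    ["information diffusion", "twitter"],
    ["information diffusion", "twitter", "contagion"],
    ["social networks", "information diffusion"],
]
occurrences, links = keyword_cooccurrence(docs)
print(occurrences.most_common(2))
print(dict(links))
```

Here "contagion" and "social networks" occur only once, so they fall below the threshold and form no links.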

Thematic Evolution
Using R-bibliometrix, we also acquired an overview of the evolution of themes. Topics that were in a certain quadrant in one period could shift to another quadrant in the next period. This evolution was presented as a Sankey diagram. To determine the time periods (time slices) used for the thematic evolution analysis, the overall number of published articles in Dataset 1 was analyzed. The number of articles published per year in Dataset 1 for the four databases can be seen in Figure 5.
The thematic evolution parameters were set as follows: the field was author keywords, the number of words was 450, the minimum cluster frequency was 5, the number of labels was 2, and the number of cutting points was 2. A visualization of the thematic evolution is presented in Figure 6 based on three time slices, namely time slice 1 (2011-2016), time slice 2 (2017-2019), and time slice 3 (2020-2021).
For each time slice, this evolution in themes can be described more fully with the Callon centrality method [41]. The most frequently discussed themes in the literature are portrayed and mapped as clusters plotted in a grid diagram consisting of four quadrants. The clusters are depicted in the form of circles of diverse sizes and colors. The size of a cluster represents how frequently the word appears in the documents. The first quadrant includes the motor themes. In this quadrant, the cluster has a large centrality and density. This means that clusters have strong links with other clusters and strong internal links. The second quadrant includes the niche themes. In this quadrant, the links with other clusters are weak, but internally, the links are strong. Quadrant 3 includes the emerging or declining themes. In this quadrant, the centrality and density are small, which describes a topic that is newly developing or has declined. Quadrant 4 includes the basic themes. In this quadrant, the cluster is strongly connected to other clusters, but the internal link intensity is low. Using the R-bibliometrix tool for visualization, the thematic evolution in each time slice can be seen in Figure 7.
Figure 7 shows that the topics discussed in Dataset 1 are presented in certain quadrants and clusters that change in each period. In the 2011-2016 period, all topics were spread out into three quadrants and eight different clusters. The three biggest clusters, namely information diffusion and social networks; Twitter and social influence; and social media and continuous time Markov chain (CTMC), are in the basic themes, meaning that links with other clusters are strong, but internal link intensity is weak. Note that in some literature, CTMC is often called CTMP (continuous time Markov process). In this paper, we sometimes use these two terms interchangeably to refer to the same concept, especially if the literature explicitly uses the term CTMP instead of CTMC.
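The four-quadrant reading of centrality and density described above can be sketched as a small classification routine. This is an illustration under stated assumptions, not the R-bibliometrix implementation: the cut-off between "high" and "low" is assumed here to be the median across themes, and the theme names and scores are hypothetical.

```python
from statistics import median

def classify_themes(themes):
    """Assign each theme to a quadrant of the strategic diagram.

    themes maps a theme name to (centrality, density). The median of each
    measure is used as the high/low cut-off (an assumption of this sketch).
    """
    c_ref = median(c for c, _ in themes.values())
    d_ref = median(d for _, d in themes.values())
    result = {}
    for name, (c, d) in themes.items():
        high_c, high_d = c >= c_ref, d >= d_ref
        if high_c and high_d:
            result[name] = "motor"                  # strong external and internal links
        elif high_d:
            result[name] = "niche"                  # weak external, strong internal links
        elif high_c:
            result[name] = "basic"                  # strong external, weak internal links
        else:
            result[name] = "emerging or declining"  # both weak
    return result

# Hypothetical (centrality, density) scores for four themes.
themes = {
    "theme A": (0.9, 0.8),
    "theme B": (0.2, 0.7),
    "theme C": (0.1, 0.1),
    "theme D": (0.8, 0.2),
}
for name, quadrant in classify_themes(themes).items():
    print(name, "->", quadrant)
```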
In the 2017-2019 period, the topic of CTMP did not appear anymore, while information diffusion joined Twitter, and a new topic emerged, namely influence maximization. Furthermore, in the 2020-2021 period, the topic of information diffusion remained the most studied topic and occupied the motor themes quadrant, meaning that both its links with other clusters and its internal links are strong. Twitter, which is in the same cluster as social network analysis, is in the basic themes quadrant, meaning that links with other clusters are strong even though the internal links of the cluster are weak. The topic of influence maximization, which was originally in the basic themes quadrant, moved to the niche themes quadrant, meaning that its links with other clusters were weak.
The topic of CTMP is in the basic themes quadrant, in a cluster separate from those containing "information diffusion" and "Twitter". Because basic themes are strongly connected to other clusters, this suggests that CTMP is closely linked with "information diffusion" and "Twitter". In brief, the topic of CTMP has not been frequently studied and is open to further research in connection with information diffusion on Twitter. We intend to pursue this in our next study.
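The quadrant reading used throughout this subsection can be summarized in a short sketch. It assumes that centrality stands for the strength of links with other clusters and density for the strength of internal links; the cut-off values and the function name are hypothetical and are not taken from R-bibliometrix.

```python
def classify_theme(centrality, density, centrality_cut, density_cut):
    """Return the thematic-map quadrant for one cluster.

    The two cut-offs are hypothetical thresholds (for example, the median
    of each axis over all clusters); they are assumptions for illustration.
    """
    strong_external = centrality >= centrality_cut
    strong_internal = density >= density_cut
    if strong_external and strong_internal:
        return "motor theme"      # strong external and internal links
    if strong_internal:
        return "niche theme"      # weak external, strong internal links
    if strong_external:
        return "basic theme"      # strong external, weak internal links
    return "emerging or declining theme"  # both kinds of links are weak


# Illustrative values only: a cluster with high centrality and low density.
print(classify_theme(centrality=0.8, density=0.2,
                     centrality_cut=0.5, density_cut=0.5))
```

Under this reading, a topic that moves from the basic themes to the niche themes has lost external links while gaining internal ones.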

Results from Systematic Literature Review
In this section, we present the results of a study on Dataset 2, namely 34 selected articles discussing information diffusion on Twitter. Articles in Dataset 2 were published within the 2012-2020 timeframe (with 41% published in 2020).

The Purpose of Research on Information Diffusion on Twitter
We conducted an analysis of information diffusion on Twitter to answer RQ1: What are the purposes behind information diffusion-related research on Twitter?
After examining all of the selected articles, we sorted them into three categories of study areas based on the purpose of each article. The percentages of these three categories can be seen in Figure 8. The three categories are as follows:
1. Information Diffusion Model on Twitter: articles that focus on modeling how information diffuses or spreads on Twitter;
2. Influential User on Twitter: articles that discuss how to find the most influential users on Twitter or to rank Twitter users; and
3. Influence Maximization on Twitter: articles that discuss how to maximize the influence of users who share information on Twitter.

From Figure 8, we can see that the information diffusion model (47%) and then influential users (41%) are the two most important purposes behind why scholars study Twitter analytics.

Methods Used in Information Diffusion Modeling on Twitter
This section intends to answer RQ2: What methods have researchers used regarding information diffusion models on Twitter? Based on our review of the information diffusion model on Twitter, various methods have been used by researchers, such as epidemic models [14,42–44], the stochastic model [11,45,46], machine learning [47–49], and the independent cascade model (ICM) [50].
In the stochastic model, Foroozani [45] used discrete time random walk (DTRW) and continuous time random walk (CTRW), whereas Li et al. [11] used the continuous time Markov process (CTMP). The authors of [11] used homogeneous CTMP, meaning that the rate of information dissemination was assumed to be constant. The complete list of methods used in studies related to the information diffusion model on Twitter can be seen in Tables 4-6.
Table 4 (excerpt, final rows).
No. | Ref. | Method | Metrics
 |  | Modified forest-fire model based on mentioned, similarity score, user activity, topic significance | retweet
14 | [14] | Susceptible-exposed-infected (SEI) | tweet, retweet, reply, mention
15 | [53] | Textual-Homo-IC, Textual-Homo-PCM | follow
16 | [49] | Poisson regression model | retweet, quote
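To make the homogeneity assumption concrete, the following minimal sketch simulates one path of a homogeneous continuous time Markov chain, in which every transition rate is a constant and holding times are therefore exponentially distributed. The state names and rate values in `TOY_RATES` are hypothetical and chosen only for illustration; this is not the model of Li et al. [11].

```python
import random

# Hypothetical states and constant transition rates (assumptions).
TOY_RATES = {
    "unaware": {"aware": 0.5},
    "aware": {"retweeted": 0.2, "ignored": 0.8},
}


def simulate_ctmc(rates, start, t_end, rng):
    """Simulate one path of a homogeneous CTMC up to time t_end.

    rates[s] maps state s to {next_state: constant rate}. States missing
    from rates are absorbing. Returns a list of (time, state) pairs.
    """
    t, state = 0.0, start
    path = [(t, state)]
    while True:
        out = rates.get(state, {})
        total = sum(out.values())
        if total == 0:                 # absorbing state: no further jumps
            break
        t += rng.expovariate(total)    # holding time ~ Exponential(total)
        if t > t_end:
            break
        # Pick the next state with probability proportional to its rate.
        threshold = rng.random() * total
        cumulative = 0.0
        for next_state, rate in out.items():
            cumulative += rate
            if threshold < cumulative:
                break
        state = next_state
        path.append((t, state))
    return path


print(simulate_ctmc(TOY_RATES, "unaware", t_end=50.0, rng=random.Random(7)))
```

Because the rates do not depend on the clock time, the expected waiting time before a transition is the same whether the information is fresh or old.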

Twitter Metrics
Our third research question (RQ3) was as follows: What metrics do researchers use from Twitter data? Our study reveals that the most common metric used by researchers to process Twitter data is the number of "retweets"; however, some studies included replies, mentions, and follows.
The use of metrics from Twitter data can be seen in full in Tables 4-6.

Measures for Determining Influencers
This subsection answers the fourth research question (RQ4): What measures do researchers use to rank influencers? For articles that focus on influential users on Twitter, the measures used to rank users include traditional measures such as closeness, betweenness, and PageRank [54–57,61,63], the analytic hierarchy process [20], and buzz rank [62]. The types of measure used to assess influential users can be seen in Table 5.
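As an illustration of one of the traditional measures named above, the sketch below computes PageRank by power iteration on a toy directed graph. It assumes that an edge (u, v) means "u retweeted v", so that rank flows toward the retweeted user; the user names and edges in `TOY_EDGES` are hypothetical, and the reviewed studies may construct their graphs differently.

```python
# Hypothetical retweet edges: (retweeting user, retweeted user).
TOY_EDGES = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("alice", "carol"),
]


def pagerank(edges, damping=0.85, iterations=100):
    """Compute PageRank scores by power iteration on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


scores = pagerank(TOY_EDGES)
for user in sorted(scores, key=scores.get, reverse=True):
    print(user, round(scores[user], 4))
```

In this toy graph the most retweeted user receives the highest score, which is the sense in which such measures are used to rank influencers.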

Discussion
In this section, we discuss the results of the analysis that we obtained from the literature review (Dataset 2), totaling 34 articles.

The State-of-the-Art of Information Diffusion Application on Twitter
Our review of Dataset 2, which represents the state of the art for our research, is summarized in three tables: information diffusion model studies on Twitter in Table 4, influential user studies in Table 5, and influence maximization studies in Table 6. Each table lists the research objectives, the methods, the metrics, and the measures used by the researchers.

Research Gaps
From the perspective of modeling Twitter data, our analysis of the selected literature revealed three research gaps.
First, research on homogeneous CTMP for the information diffusion model on Twitter. Referring to Table 4, we observe that one of the methods used in the information diffusion model is the stochastic model. We found only one study, Li et al. [11], that applied homogeneous CTMP for the information diffusion model on Twitter. In such an approach, the rate at which information transitions from one Twitter user to another is considered constant. We notice in Figure 7 that CTMP only appeared in the 2011-2016 period, and we did not observe such an approach used afterwards. In contrast, information diffusion and Twitter are consistently the most studied topics in each period analyzed. Our analysis of time slice 1 of the thematic maps shows that CTMP lies in the basic themes quadrant, where links with other clusters are strong, so it is likely linked to the topic of Twitter information diffusion. The analysis in Section 4 likewise shows that very few studies apply homogeneous CTMP in Twitter analytics.
Second, research on nonhomogeneous CTMP for the information diffusion model on Twitter. Looking at the phenomenon of information sharing among Twitter users, the transition rate of information spreading on Twitter is not constant but depends on the time elapsed since the information was disseminated. In this case, the nonhomogeneous CTMP method can be considered an alternative for modeling the dissemination of information on Twitter.
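The effect of a time-dependent rate can be illustrated with the simplest nonhomogeneous case, a Poisson process whose rate varies with time, sampled here by the standard thinning method. This is a sketch of the general idea rather than a full nonhomogeneous CTMP, and the exponentially decaying rate function and its parameters are assumptions chosen only for illustration.

```python
import math
import random


def decaying_rate(t):
    """Hypothetical retweet rate that decays as the tweet ages (assumption)."""
    return 50.0 * math.exp(-0.5 * t)


def sample_event_times(rate_fn, rate_max, t_end, rng):
    """Sample event times on [0, t_end] for a time-varying rate by thinning.

    rate_fn(t) must satisfy 0 <= rate_fn(t) <= rate_max on the interval.
    """
    times = []
    t = 0.0
    while True:
        # Propose candidates from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t > t_end:
            return times
        # Keep each candidate with probability rate_fn(t) / rate_max.
        if rng.random() < rate_fn(t) / rate_max:
            times.append(t)


events = sample_event_times(decaying_rate, rate_max=50.0, t_end=10.0,
                            rng=random.Random(3))
early = sum(1 for t in events if t < 5.0)
print(len(events), "events,", early, "of them in the first half")
```

With a decaying rate, events concentrate shortly after the start, a pattern that a constant rate cannot reproduce.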
Third, research on maximizing influence. As seen in Table 6, the number of publications about maximizing influence is still small (only four studies). Likewise, as seen in Figure 7b,c, in 2017-2019, the topic of influence maximization is in the basic themes quadrant, which means that the connections with other clusters are strong but the internal links are weak. Then, in the 2020-2021 period, it moved to the niche themes quadrant, meaning that links with other clusters are weak but internal links are strong. Both observations suggest that ample opportunities remain to study influence maximization in the future.

Conclusions
In this paper, we presented a systematic literature review on information diffusion on Twitter. We screened 424 papers from four digital libraries, namely Scopus, Science Direct, Dimensions, and Google Scholar. After going through the selection of duplicates, titles, and abstracts, 204 articles were obtained (Dataset 1). We performed a bibliometric analysis for Dataset 1. We showed how the usage of the bibliographic mapping technique can reveal an overview of the existing themes as well as the changes over time. This study demonstrates that publications on the diffusion of information are continuously increasing every year. This description of themes can serve as a basis for deciding further studies.
Moreover, we conducted a manual selection of full texts and obtained 34 articles (Dataset 2). From the results of the SLR on Dataset 2, we found that 47% of the publications studied the information diffusion model, 41% studied influential users, and 12% studied influence maximization. We answered the research questions raised in the Introduction. To sum up, we found that publications on information diffusion models on Twitter have used various methods, such as epidemic models (e.g., SIR, EM-IPSI, and SEI), stochastic models (CTMC), machine learning, regression, Bayesian methods, EGT, and those based on independent cascade models and threshold models. Additionally, the metrics most commonly used by researchers are retweets, mentions, and replies.
Publications about influential users use methods such as AHP, ACRA, WACRA, WMMEAI, PageRank, influence factorization, T-HT, LDA, and the cluster-based fusion technique. The measures used are degree centrality, closeness, betweenness, eigenvector, PageRank, buzz rank, and the T and HT measures.
Our study shows three gaps that could be future directions for influence analysis and information diffusion models on Twitter. First, very few studies examine influence maximization, which therefore remains open as a future research direction. Second, studies on the information diffusion model using homogeneous continuous time Markov chain are limited; we found only one study on a homogeneous CTMC variant. Research on homogeneous CTMC for the information diffusion model on Twitter can still be conducted as a future study. Third, the study of information diffusion models with nonhomogeneous CTMC is very open to future research, considering that this model is more realistic: the transition rate of information dissemination on Twitter is not constant but depends on the time of information dissemination.
Our systematic literature review is not without limitations. First, we used four databases to mine data: Scopus, Science Direct, Dimensions, and Google Scholar. We hope that most of the articles from other databases are already contained in the databases we used. Second, we chose keywords related to our specific topic. Third, to minimize the subjectivity in article selection, we performed a standard procedure regarding the title and abstract.