3.1. Model Description
In this subsection, we consider the phenomenon in which a piece of information (such as an original tweet) spreads from an information holder on a directed network with N nodes where each such node is referred to as an agent; here, an agent corresponds to a user of an SNS who sends or receives the target information on the Internet. Each agent can be in one of the following three states, which are similar to the SIR model states:
- State 0: Target information has not yet been received, or the information has been sent from neighbors but overlooked.
- State 1: Target information has been received and will be spread in the subsequent time steps.
- State 2: Target information has been received and spread.
At time 0, all agents except for the initial information holders are in state 0, and the initial information holders broadcast the target information to all their adjacent agents. If agent $i$ ($i = 1, \ldots, N$) receives the target information while it is in state 0, it changes from state 0 to state 1 with probability $q_i$. In what follows, $q_i$ is referred to as the response probability. Note that a transition from state 0 to 1 corresponds to the situation where agent $i$ notices the target information and decides to spread it at a later time. If agent $i$ changes state from 0 to 1, it stays in state 1 for a random period of time whose length follows an exponential distribution with rate parameter $\mu_i$. After this period, agent $i$ broadcasts the target information to all adjacent agents, transitions from state 1 to state 2, and remains in that state without rebroadcasting the information. Note that, in this model, the reproduction of information by retweeting on Twitter is assumed to take place by broadcasting (from one to many). The state transition diagram is given in Figure 1.
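The transition rules above can be sketched as a small event-driven simulation. This is only an illustration under stated assumptions: the function name `simulate_spread`, the adjacency-list representation `adj`, and the symbol names `q` (response probability) and `mu` (rate of the holding time in state 1) are chosen here for readability and are not taken from the paper.

```python
import heapq
import random


def simulate_spread(adj, q, mu, seeds, rng=None):
    """Event-driven simulation of the three-state model (illustrative sketch).

    adj[i] : list of agents that receive the broadcasts of agent i
    q[i]   : response probability of agent i (assumed symbol)
    mu[i]  : rate of the exponential holding time in state 1 (assumed symbol)
    seeds  : initial information holders, which broadcast at time 0
    Returns (state, spread_time): final states and broadcast times.
    """
    rng = rng or random.Random()
    n = len(adj)
    state = [0] * n
    spread_time = [None] * n
    events = []  # min-heap of (scheduled broadcast time, agent)

    def broadcast(i, t):
        state[i] = 2
        spread_time[i] = t
        for j in adj[i]:
            # Only agents in state 0 can respond; every reception is an
            # independent trial that succeeds with probability q[j].
            if state[j] == 0 and rng.random() < q[j]:
                state[j] = 1
                heapq.heappush(events, (t + rng.expovariate(mu[j]), j))

    # Mark all seeds first so that one seed cannot schedule another seed.
    for s in seeds:
        state[s] = 2
    for s in seeds:
        broadcast(s, 0.0)
    while events:
        t, i = heapq.heappop(events)
        broadcast(i, t)
    return state, spread_time
```

An agent that overlooks a broadcast stays in state 0 and may still respond to a later broadcast from another neighbor, which matches the description of state 0.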
Although the above model is quite simple, it has the potential to explain real-world information spread on Twitter through retweets. Here, an example is presented to illustrate this claim. When Naomi Osaka, a tennis player with dual US and Japanese nationality, won the US Open in September 2018, a number of tweets were issued to mark her victory. Among those, a tweet issued by Shinzo Abe, Japanese Prime Minister, obtained over 20,000 retweets. The blue line in Figure 2 shows the actual number of retweets per 10-minute interval, while the red line shows the number predicted by the model. We assumed that $\mu_i = \mu$ and $q_i = q$ for all $i$, and that $\mu$ and $q$ were time dependent. The time dependencies of $\mu$ and $q$ are depicted in Figure 3; these were manually determined so that the simulation results roughly match the actual data. The simulation was conducted on a directed graph with 81,306 nodes and 1,786,149 links. The graph was constructed based on Twitter follower-followee data available in [22]. The figure shows that the results of the model accord well with the real data. The time dependence of parameters $\mu$ and $q$ is not explicitly considered in the analysis of this paper; rather, this example simply shows the potential applicability of the model to real-world phenomena.
3.2. Analysis
Let $A = (a_{ij})$ denote an adjacency matrix where $a_{ij} = 1$ if a directed link exists from agent $i$ to agent $j$; otherwise, $a_{ij} = 0$. We also let $y_{ij}$ denote a random variable which is equal to 1 if agent $j$ changes its state from 0 to 1 when it receives the target information from agent $i$; otherwise, $y_{ij} = 0$. We let $P(y_{ij} = 1) = q_j$ for all $i$, assuming that the probability that a user notices the received information (tweet) does not depend on the sender of the information. In what follows, $Y = (y_{ij})$ is referred to as the response matrix.
Hereinafter, we assume that the outcome of $Y$ is known. That is, whether $y_{ij} = 1$ or 0 is known for all $i, j$. This assumption is equivalent to cutting off the directed link from $i$ to $j$ if $y_{ij} = 0$ and letting the information spread on the resultant network without any users being non-responsive. This assumption is relaxed in Section 3.5.
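The state of the network is expressed by $X(t) = (X_1(t), \ldots, X_N(t))$, where $X_i(t) \in \{0, 1, 2\}$ denotes the state of agent $i$ at time $t \ge 0$. The transition of $X(t)$ is governed by a continuous-time Markov chain. We define
$$x_i^{(k)}(t) = \begin{cases} 1, & X_i(t) = k, \\ 0, & \text{otherwise.} \end{cases}$$
With this definition, the probability that agent $i$ is in state $k$ at time $t$, $p_i^{(k)}(t)$, is given by
$$p_i^{(k)}(t) = E\left[x_i^{(k)}(t)\right].$$
Agent $i$ changes its state from 0 to 1 only when it is in state 0 and (at least) one adjacent agent is in state 1. Agent $i$ changes its state from 1 to 2 at a constant rate $\mu_i$, but only when it is in state 1. This observation yields
$$\frac{d}{dt} E\left[x_i^{(1)}(t)\right] = \sum_{j} a_{ji}\, y_{ji}\, \mu_j\, E\left[x_j^{(1)}(t)\, x_i^{(0)}(t)\right] - \mu_i\, E\left[x_i^{(1)}(t)\right], \tag{1}$$
$$\frac{d}{dt} E\left[x_i^{(2)}(t)\right] = \mu_i\, E\left[x_i^{(1)}(t)\right]. \tag{2}$$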
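Let $z_i(t) = x_i^{(1)}(t) + x_i^{(2)}(t)$. Note that $E[z_i(t)]$ is the probability that agent $i$ is in state 1 or 2 at time $t$; that is, the probability that agent $i$ has noticed the target information sent from one of its adjacent agents by time $t$. For later use, we prove the following identity.
Lemma 1. For all $i$, $j$, and $t$,
$$a_{ji}\, y_{ji}\, x_j^{(2)}(t)\, x_i^{(0)}(t) = 0. \tag{3}$$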
Proof. Since (3) holds if $a_{ji} y_{ji} = 0$ or $x_j^{(2)}(t) = 0$, it is sufficient to prove (3) when $a_{ji} y_{ji} = 1$ and $x_j^{(2)}(t) = 1$. If $x_j^{(2)}(t) = 1$, agent $j$ has already sent the target information to all neighbor agents by time $t$, and if $a_{ji} y_{ji} = 1$, agent $i$ notices the target information sent from agent $j$. Thus, agent $i$ should be in state 1 or 2 at time $t$, that is, $x_i^{(0)}(t) = 0$, which completes the proof. □
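Summing (1) and (2) yields, with $z_i(t) = x_i^{(1)}(t) + x_i^{(2)}(t)$,
$$\frac{d}{dt} E[z_i(t)] = \sum_{j} a_{ji}\, y_{ji}\, \mu_j\, E\left[x_j^{(1)}(t)\, x_i^{(0)}(t)\right] = \sum_{j} a_{ji}\, y_{ji}\, \mu_j\, E\left[z_j(t)\, x_i^{(0)}(t)\right] = \sum_{j} a_{ji}\, y_{ji}\, \mu_j \left( E[z_j(t)] - E[z_j(t)\, z_i(t)] \right), \tag{4}$$
where the second equality comes from Lemma 1 and the third equality comes from $x_i^{(0)}(t) = 1 - z_i(t)$.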
Remark 1. In the original SIR model, the spread of infection from an infected agent to its adjacent agents occurs independently at different times, which is different from the model described in this paper. Note that the information spread (spread of an infectious disease) is faster in the original SIR model than in the model considered in this paper. To see the difference, consider an agent (agent A) having two adjacent agents (agents B and C). Assume that agent A is in state 1 at time $t$, and agents B and C are both in state 0 at time $t$. In the original SIR model, agent B receives the information at time $t + T_B$ and agent C receives the information at time $t + T_C$, where $T_B$ and $T_C$ are mutually independent and exponentially distributed random variables with mean $1/\mu$. In the model considered in this paper, agents B and C simultaneously receive the information at time $t + T$, where $T$ is an exponentially distributed random variable with mean $1/\mu$. Since $\min(T_B, T_C)$ is stochastically smaller than $T$, the information transfer in the original SIR model stochastically occurs faster than in the considered model.
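The stochastic ordering used in Remark 1 can be checked in one line. The symbols $T_B$, $T_C$, and $\mu$ below are assumed names for the two independent transfer times and the common rate; the computation itself is the standard result for the minimum of independent exponential random variables.

```latex
\begin{align*}
P\bigl(\min(T_B, T_C) > x\bigr)
  &= P(T_B > x)\, P(T_C > x)
   = e^{-\mu x} \cdot e^{-\mu x}
   = e^{-2\mu x} \\
  &\le e^{-\mu x} = P(T > x), \qquad x \ge 0 .
\end{align*}
```

Hence the first reception in the original SIR setting occurs after a time with mean $1/(2\mu)$, whereas the simultaneous reception in the broadcast model occurs after a time with mean $1/\mu$.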
3.3. Strong Correlation Assumption
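Here, the exact value of $E[z_i(t)]$, the probability that agent $i$ has noticed the target information by time $t$, cannot be obtained by solving (4) because the $E[z_j(t) z_i(t)]$ term is not known. However, the upper and lower bounds of $E[z_j(t) z_i(t)]$ can be obtained using the following lemma.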
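Lemma 2. If $z_i(t)$ and $z_j(t)$ are non-negatively correlated, that is, they satisfy $E[z_i(t) z_j(t)] \ge E[z_i(t)]\, E[z_j(t)]$, then
$$E[z_i(t)]\, E[z_j(t)] \le E[z_i(t)\, z_j(t)] \le \min\left(E[z_i(t)],\, E[z_j(t)]\right).$$
Proof. From the fact that $z_i(t)$ and $z_j(t)$ are non-negatively correlated, it follows that
$$E[z_i(t)\, z_j(t)] - E[z_i(t)]\, E[z_j(t)] \ge 0,$$
which means that $E[z_i(t)]\, E[z_j(t)] \le E[z_i(t)\, z_j(t)]$. Since $z_i(t)$ and $z_j(t)$ take values in $\{0, 1\}$, it is also seen that
$$E[z_i(t)\, z_j(t)] \le E[z_i(t)] \tag{5}$$
and
$$E[z_i(t)\, z_j(t)] \le E[z_j(t)]. \tag{6}$$
(5) and (6) yield
$$E[z_i(t)\, z_j(t)] \le \min\left(E[z_i(t)],\, E[z_j(t)]\right),$$
which completes the proof. □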
Most recent related studies have solved (4) by assuming $E[z_i(t) z_j(t)] = E[z_i(t)]\, E[z_j(t)]$ (or by using assumptions similar to this) [15,16,23,24,25], which is called the independence assumption (IA) in this paper. In contrast, here we assume $E[z_i(t) z_j(t)] = \min(E[z_i(t)], E[z_j(t)])$, which we refer to as the strong correlation assumption (SCA). To the best of the authors' knowledge, no previous studies have considered this assumption. Under this assumption, (4) is expressed as
$$\begin{aligned} \frac{d}{dt} E[z_i(t)] &= \sum_{j} a_{ji}\, y_{ji}\, \mu_j \left( E[z_j(t)] - \min\left(E[z_i(t)], E[z_j(t)]\right) \right) \\ &= \sum_{j} a_{ji}\, y_{ji}\, \mu_j \max\left( E[z_j(t)] - E[z_i(t)],\, 0 \right). \end{aligned} \tag{7}$$
The second line of the above equation means that the target information from agent $j$ is received and noticed by agent $i$ only when $E[z_j(t)] > E[z_i(t)]$. In other words, the SCA describes the target information spreading from the agents that are most likely to have the target information to the agents that are least likely to have it. By Lemma 2, it also gives the lower bound for the probability that an agent has the target information, while the IA provides the upper bound of the probability.
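The effect of the two closures can be explored numerically. The following is a minimal sketch, not the authors' implementation: it integrates the closed equations with a forward-Euler scheme, approximating the unknown cross term by the product of the marginals (IA) or by their minimum (SCA). The function name `integrate_closure`, the nested-list matrices, the step size, and the symbol names `z` and `mu` are assumptions made for this illustration.

```python
def integrate_closure(A, Y, mu, z0, t_end, dt=1e-3, closure="sca"):
    """Forward-Euler integration of dz_i/dt = sum_j a_ji y_ji mu_j (z_j - c_ij).

    z[i] stands for the probability that agent i has noticed the information.
    c_ij approximates E[z_i z_j]: z_i * z_j under the IA and min(z_i, z_j)
    under the SCA. A[j][i] = 1 means a directed link from agent j to agent i.
    """
    n = len(z0)
    z = [float(v) for v in z0]
    for _ in range(int(round(t_end / dt))):
        dz = [0.0] * n
        for j in range(n):
            for i in range(n):
                if A[j][i] and Y[j][i]:
                    if closure == "sca":
                        cross = min(z[i], z[j])
                    else:
                        cross = z[i] * z[j]
                    dz[i] += mu[j] * (z[j] - cross)
        z = [zi + dt * d for zi, d in zip(z, dz)]
    return z


# Toy example: agent 0 is the initial holder; agents 1 and 2 also follow each other.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 1, 0]]
Y = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # every reception is noticed
mu = [1.0, 1.0, 1.0]
z_sca = integrate_closure(A, Y, mu, [1.0, 0.0, 0.0], 2.0, closure="sca")
z_ia = integrate_closure(A, Y, mu, [1.0, 0.0, 0.0], 2.0, closure="ia")
```

In this toy example the SCA values stay below the IA values for every agent, in line with the bounds of Lemma 2.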
3.4. Upstream and Downstream Relationship
Here, we introduce the following order relationship between agents, which is referred to as the upstream and downstream relationship in this paper.
Definition 1. (upstream and downstream relationship). If $z_i(t) \le z_j(t)$ for all $t$, we say that agent $i$ is downstream of agent $j$ (agent $j$ is upstream of agent $i$) and denote this by $i \prec j$.
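If the upstream and downstream relationship exists for each pair of adjacent agents, the SCA gives the exact result. To show this, without loss of generality, we assume that agent $j$ is upstream of agent $i$ ($i \prec j$). Note that $z_j(t) = 1$ if $z_i(t) = 1$ because $z_i(t) \le z_j(t)$. We also see that
$$E[z_i(t)\, z_j(t)] = E[z_i(t)] = \min\left(E[z_i(t)],\, E[z_j(t)]\right),$$
which means that the SCA exactly holds. In addition to this, if the upstream and downstream relationship exists for all pairs of adjacent agents, (7) holds for all $t$ and thus we have
$$\frac{d}{dt} E[z_i(t)] = \sum_{j:\, i \prec j} a_{ji}\, y_{ji}\, \mu_j \left( E[z_j(t)] - E[z_i(t)] \right).$$
If $i$ is not the initial information holder (that is, $E[z_i(0)] = 0$), we obtain
$$\phi_i(s) = \frac{\sum_{j:\, i \prec j} a_{ji}\, y_{ji}\, \mu_j\, \phi_j(s)}{s + \sum_{j:\, i \prec j} a_{ji}\, y_{ji}\, \mu_j}, \tag{8}$$
where $\phi_i(s)$ is the Laplace-Stieltjes transform of $E[z_i(t)]$ defined as follows:
$$\phi_i(s) = \int_{0}^{\infty} e^{-st}\, dE[z_i(t)].$$
Note that $\phi_i(s) = 1$ if $i$ is the initial information holder because $E[z_i(t)] = 0$ for $t < 0$ and $E[z_i(t)] = 1$ for $t \ge 0$.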
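The upstream and downstream relationship exists for all pairs of adjacent agents when the target information is spread over a tree-topology network from its root (Figure 4). Assume that agent $i$ is located $n$ hops downstream from the root, and the path from the root to agent $i$ is given by
$$i_0 \to i_1 \to \cdots \to i_n,$$
where $i_0$ is the root and $i_n = i$. It follows from (8) that
$$\phi_{i_k}(s) = y_{i_{k-1} i_k}\, \frac{\mu_{i_{k-1}}}{s + \mu_{i_{k-1}}}\, \phi_{i_{k-1}}(s), \qquad k = 1, \ldots, n,$$
and thus we obtain
$$\phi_i(s) = \prod_{k=1}^{n} y_{i_{k-1} i_k}\, \frac{\mu_{i_{k-1}}}{s + \mu_{i_{k-1}}}.$$
In particular, if $\mu_{i_k} = \mu$ for all $k$, it is possible to analytically invert $\phi_i(s)$ to $E[z_i(t)]$, and the closed-form representation of $E[z_i(t)]$ is obtained as follows:
$$E[z_i(t)] = \left( \prod_{k=1}^{n} y_{i_{k-1} i_k} \right) \left( 1 - \sum_{k=0}^{n-1} \frac{(\mu t)^k}{k!}\, e^{-\mu t} \right).$$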
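As a sanity check of the tree case, the closed form for a common rate can be compared with a direct integration along the path. The sketch below assumes that every agent on the path responds (all $y = 1$), so that the probability of having noticed the information at hop $n$ is the distribution function of a sum of $n$ independent exponential holding times (an Erlang distribution). The function names and the Euler scheme are illustrative assumptions.

```python
import math


def erlang_cdf(n, mu, t):
    """P(sum of n i.i.d. exponential(mu) holding times <= t)."""
    tail = sum((mu * t) ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-mu * t) * tail


def chain_by_euler(n, mu, t_end, dt=1e-4):
    """Integrate dz_k/dt = mu * (z_{k-1} - z_k) along a path with root z_0 = 1."""
    z = [1.0] + [0.0] * n
    for _ in range(int(round(t_end / dt))):
        # The right-hand side is built from the previous z before reassignment.
        z = [1.0] + [z[k] + dt * mu * (z[k - 1] - z[k]) for k in range(1, n + 1)]
    return z[n]


closed_form = erlang_cdf(3, 1.5, 2.0)
numerical = chain_by_euler(3, 1.5, 2.0)
```

The two values agree up to the discretization error of the Euler scheme, which supports reading the product of transforms as a convolution of exponential holding times.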
3.5. Taking the Average on the Response Matrix
Let $R(t)$ be the number of agents that have spread the target information by time $t$, which corresponds to the number of retweets posted by time $t$ after the original tweet was posted at time 0. The information cascade can be described by $R(t)$. Since $R(t) = \sum_{i} x_i^{(2)}(t)$, the conditional expectation of $R(t)$ on the response matrix $Y$, $E[R(t) \mid Y]$, is calculated as
$$E[R(t) \mid Y] = \sum_{i} p_i^{(2)}(t \mid Y),$$
where $p_i^{(2)}(t \mid Y)$ is the conditional probability that agent $i$ is in state 2 given the outcome of $Y$. Note that the results in the previous subsections also depend on the outcome of $Y$ and thus, for example, $E[z_i(t)]$ in (4) should be written as $E[z_i(t) \mid Y]$ in a rigorous sense. We did not, however, use this rigorous notation in the previous subsections for simplicity of expression. Let $\tau_i$ be the time at which agent $i$ received the target information and let $\sigma_i$ be the period between the time when agent $i$ receives and spreads the target information. We have
$$p_i^{(2)}(t \mid Y) = P(\tau_i + \sigma_i \le t \mid Y).$$
Thus,
$$p_i^{(2)}(t \mid Y) = \int_{0}^{t} \left( 1 - e^{-\mu_i (t - u)} \right) dE[z_i(u) \mid Y],$$
where, as previously noted, $E[z_i(t) \mid Y]$ is the same as $E[z_i(t)]$ in the previous sections because the outcome of $Y$ is assumed to be given in the previous sections. Note that $E[z_i(t) \mid Y]$ can be obtained by numerically solving (7) if $Y$ is given.
To obtain $E[R(t)]$, we need to calculate $E[R(t) \mid Y]$ for all possible outcomes of $Y$ and sum the results using weights equal to the probability of each outcome. Calculating $E[R(t) \mid Y]$ for all possible outcomes of $Y$ is, however, computationally infeasible because the number of possible outcomes of $Y$ increases exponentially with the number of agents in the network. One possible approach for approximately calculating $E[R(t)]$ is to obtain $E[z_i(t)]$ through (4) by assuming that $y_{ji}$ is equal to the response probability, that is, $y_{ji} = q_i$. Under this assumption, $E[z_i(t)]$ is approximately given as the solutions to the following differential equations:
$$\frac{d}{dt} E[z_i(t)] = \sum_{j} a_{ji}\, q_i\, \mu_j \left( E[z_j(t)] - E[z_j(t)\, z_i(t)] \right). \tag{10}$$
By applying the SCA (or IA) to the above, we can numerically obtain $E[z_i(t)]$. Once $E[z_i(t)]$ is obtained, we can calculate $E[R(t)]$ by using the following relationship:
$$E[R(t)] = \sum_{i} \int_{0}^{t} \left( 1 - e^{-\mu_i (t - u)} \right) dE[z_i(u)].$$
Most previous studies on epidemic spreading using the SIS model or SIR model [4,5,6,7,8] adopted the assumption $y_{ji} = q_i$. The response of each individual is first averaged and represented as a single parameter, which is often called the rate of infection, and then the spread of the epidemic is analyzed by solving differential equations similar to (10). We call this approach the “mean-response-based analysis”. Unfortunately, as shown in the next section, this analytical approach greatly overestimates $E[R(t)]$, especially when the response probability $q_i$ is equal to or less than 0.1. The mean-response-based analysis implicitly assumes that $y_{ji}$ and $z_j(t)\,(1 - z_i(t))$ are statistically independent, but this assumption does not hold, resulting in a discrepancy between the simulation results and the mean-response-based analysis.
Another possible approach is to calculate $E[R(t) \mid Y]$ for some (randomly selected) outcomes of $Y$ and take their average. This approach, called the “representative-response-based analysis” in this paper, corresponds to the notion that some information spreading patterns are first obtained assuming representative response patterns of each user and then the results are averaged. Surprisingly, as shown in the next section, the dependence of $E[R(t) \mid Y]$ on the outcome of $Y$ is very small for large networks and the representative-response-based analysis yields a much better estimate of $E[R(t)]$ than the mean-response-based analysis.
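The contrast between the two approaches can be illustrated on a toy network. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: it solves the SCA form of the equations with a forward-Euler scheme, once with the response indicators replaced by the response probability (mean-response-based) and once averaged over randomly sampled response matrices (representative-response-based). All function names, the step size, and the number of samples are hypothetical choices.

```python
import random


def solve_sca(A, Y, mu, z0, t_end, dt=0.02):
    """Euler steps of dz_i/dt = sum_j a_ji y_ji mu_j max(z_j - z_i, 0)."""
    n = len(z0)
    z = [float(v) for v in z0]
    for _ in range(int(round(t_end / dt))):
        dz = [sum(A[j][i] * Y[j][i] * mu[j] * max(z[j] - z[i], 0.0)
                  for j in range(n)) for i in range(n)]
        z = [zi + dt * d for zi, d in zip(z, dz)]
    return z


def sample_response(n, q, rng):
    """One outcome of Y with P(y_ji = 1) = q[i], independent of the sender j."""
    return [[1 if rng.random() < q[i] else 0 for i in range(n)] for _ in range(n)]


def representative_average(A, q, mu, z0, t_end, samples, rng):
    """Average the SCA solution over randomly sampled response matrices."""
    n = len(z0)
    total = [0.0] * n
    for _ in range(samples):
        z = solve_sca(A, sample_response(n, q, rng), mu, z0, t_end)
        total = [a + b for a, b in zip(total, z)]
    return [a / samples for a in total]


def mean_response(A, q, mu, z0, t_end):
    """Replace every y_ji by the response probability q[i] before solving."""
    n = len(z0)
    Q = [[q[i] for i in range(n)] for _ in range(n)]
    return solve_sca(A, Q, mu, z0, t_end)


# Path 0 -> 1 -> 2 with response probability 0.5 at every agent.
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
q, mu, z0 = [0.5] * 3, [1.0] * 3, [1.0, 0.0, 0.0]
rep = representative_average(A, q, mu, z0, 10.0, 200, random.Random(1))
mean = mean_response(A, q, mu, z0, 10.0)
```

On this path, the averaged value for agent 2 cannot exceed the probability that both hops respond (0.25), whereas the mean-response value approaches 1 as time increases, which shows how averaging the response first can overestimate the spread.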
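Note that the exact result is available when the target information is spread over a tree-topology network from its root (Figure 4). For example, if the path from the root to agent $i$ is $i_0 \to i_1 \to \cdots \to i_n$ (with $i_n = i$), and $\mu_{i_k} = \mu$ and $q_{i_k} = q$ for all $k$, then
$$E[z_i(t)] = q^{n} \left( 1 - \sum_{k=0}^{n-1} \frac{(\mu t)^k}{k!}\, e^{-\mu t} \right).$$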
3.6. Mean-Field Approximation
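Under the IA, $E[z_i(t) z_j(t)] = E[z_i(t)]\, E[z_j(t)]$, and under the SCA, $E[z_i(t) z_j(t)] = \min(E[z_i(t)], E[z_j(t)])$. Interpolating linearly between both approximations using the parameter $\alpha$ ($0 \le \alpha \le 1$) yields
$$E[z_i(t)\, z_j(t)] \approx (1 - \alpha)\, E[z_i(t)]\, E[z_j(t)] + \alpha \min\left(E[z_i(t)],\, E[z_j(t)]\right). \tag{11}$$
Setting $\alpha = 0$ in the above equation gives the IA, while setting $\alpha = 1$ gives the SCA. Of the two approaches mentioned in Section 3.5, we use the first one and apply (11) to get
$$\frac{d}{dt} E[z_i(t)] = \sum_{j} a_{ji}\, q_i\, \mu_j \left\{ E[z_j(t)] - (1 - \alpha)\, E[z_i(t)]\, E[z_j(t)] - \alpha \min\left(E[z_i(t)],\, E[z_j(t)]\right) \right\}. \tag{12}$$
Applying the mean-field approximation, where $E[z_i(t)] = z(t)$, $\mu_i = \mu$, and $q_i = q$ for all $i$, to (12), yields
$$\frac{dz(t)}{dt} = (1 - \alpha)\, q \mu d\, z(t) \left( 1 - z(t) \right),$$
in which $d$ is the mean degree of the agents. The above differential equation has the following solution:
$$z(t) = \frac{z(0)\, e^{(1 - \alpha) q \mu d\, t}}{1 - z(0) + z(0)\, e^{(1 - \alpha) q \mu d\, t}}.$$
This result shows that the probability that an agent will have received the target information by time $t$ follows a logistic curve, as appears in population growth models. In addition to this, $1 - \alpha$ acts as a scale parameter of time $t$, and a larger $\alpha$ (which means the approximation is closer to the SCA) has the effect of advancing the time more slowly. The expectation for the total number of agents that have spread the target information up to time $t$ can then be calculated by the following equation:
$$E[R(t)] \approx N \int_{0}^{t} \left( 1 - e^{-\mu (t - u)} \right) dz(u).$$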
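The logistic behavior can be evaluated directly. The sketch below assumes that the mean-field equation takes the form $dz/dt = c\, z (1 - z)$ with growth rate $c = (1 - \alpha) q \mu d$; this form of the rate, as well as the function and parameter names, is an assumption made for illustration.

```python
import math


def mean_field_rate(q, mu, d, alpha):
    """Assumed growth rate of the mean-field equation: (1 - alpha) * q * mu * d."""
    return (1.0 - alpha) * q * mu * d


def logistic_solution(t, z0, rate):
    """Solution of dz/dt = rate * z * (1 - z) with initial value z(0) = z0."""
    e = math.exp(rate * t)
    return z0 * e / (1.0 - z0 + z0 * e)


# Example: 1% initial holders, q = 0.1, mu = 1, mean degree 20.
z_mid = logistic_solution(5.0, 0.01, mean_field_rate(0.1, 1.0, 20, 0.5))
z_ia = logistic_solution(5.0, 0.01, mean_field_rate(0.1, 1.0, 20, 0.0))
```

Under these assumptions, moving the interpolation parameter toward the SCA only rescales time: the curve for `alpha = 0.5` at time 5 takes the value that the IA curve (`alpha = 0`) already reaches at time 2.5.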