Infodemic Source Detection with Information Flow: Foundations and Scalable Computation

Wang, Zimeng; Zhao, Chao; Zhou, Qiaoqiao; Tan, Chee Wei; Chan, Chung

doi:10.3390/e27090936

Open AccessArticle

Infodemic Source Detection with Information Flow: Foundations and Scalable Computation^†

by

Zimeng Wang

¹

,

Chao Zhao

¹

,

Qiaoqiao Zhou

²

,

Chee Wei Tan

^3,*

and

Chung Chan

^1,*

¹

Department of Computer Science, City University of Hong Kong, Hong Kong, China

²

Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen 518055, China

³

College of Computing and Data Science, Nanyang Technological University, Singapore 639978, Singapore

^*

Authors to whom correspondence should be addressed.

^†

This paper is an extended version of our paper published in Zhao, C.; Wang, Z.; Zhou, Q.; Tan, C.W.; Chan, C. Infodemic Source Detection: Enhanced Formulations with Information Flow. In Proceedings of the 2024 IEEE International Symposium on Information Theory (ISIT), Athens, Greece, 7–12 July 2024.

Entropy 2025, 27(9), 936; https://doi.org/10.3390/e27090936

Submission received: 9 July 2025 / Revised: 11 August 2025 / Accepted: 5 September 2025 / Published: 6 September 2025

(This article belongs to the Special Issue Applications of Information Theory to Machine Learning)

Download

Browse Figures

Versions Notes

Abstract

We consider the problem of identifying the source of a rumor in a network, given only a snapshot observation of infected nodes after the rumor has spread. Classical approaches, such as the maximum likelihood (ML) and joint maximum likelihood (JML) estimators based on the conventional Susceptible–Infectious (SI) model, exhibit degeneracy, failing to uniquely identify the source even in simple network structures. To address these limitations, we propose a generalized estimator that incorporates independent random observation times. To capture the structure of information flow beyond graphs, our formulations consider rate constraints on the rumor and the multicast capacities for cyclic polylinking networks. Furthermore, we develop forward elimination and backward search algorithms for rate-constrained source detection and validate their effectiveness and scalability through comprehensive simulations. Our study establishes a rigorous and scalable foundation on the infodemic source detection.

Keywords:

infodemic source detection; information flow; submodular optimization

1. Introduction

Identifying the source of a contagion is a critical task across multiple fields, including cybersecurity [1,2], public health [3,4], and social networks [5,6,7,8,9]. The rapid dissemination of misinformation and malicious content can have severe consequences, such as distorting public perception [10], inciting panic [11], and compromising digital infrastructure [12]. Therefore, the ability to accurately and efficiently pinpoint the original source of a contagion within a network becomes essential. This task typically involves analyzing a static snapshot of the network, referred to as the contagion graph, which illustrates the relationships among nodes classified as “infected” or “susceptible” according to their exposure status. The conceptual foundation for modeling contagion dynamics in networks is frequently built upon the classical Susceptible–Infectious (SI) model put forth by Bailey [13].

The task of detecting the source of the contagion was pioneered by [14] as a maximum likelihood (ML) estimation within the framework of the SI model with exponential spreading time for regular trees. Their method maximizes the likelihood of observing the state of infection by an initial source node. Then, the rumor center can be determined using message-passing algorithm [15]. They extended the framework by introducing the rumor centrality as the estimator, where nodes are ranked according to the likelihood that they were original sources of the contagion. Shah and Zaman [16] broadened the graph structure to irregular trees and overcame the limitation of the exponential spreading time by multi-type continuous time branching processes/generalized Polya’s urn models [17]. The rumor source detection problem over the SI model was then extended to degree-regular graph with multiple cycles by a backward contact tracing method [18,19]. Fan and Wang [20] extended the problem to continuous spreading time with probabilistic detection algorithms and proposed the joint maximum likelihood (JML) estimator, which incorporates both infected nodes and spreading time. Moreover, they demonstrated that the JML estimator can be computed with a computational complexity of

O (n L log L)

by utilizing the message-passing algorithm, where n is the number of infected nodes, and L is the effective range of the spreading time.

Although these graph-based approaches provide computationally efficient estimators, we observe that they may fail to correctly identify the source, even in a fundamentally simple graph structure, as explained in the next section. To address this limitation, it is essential to move beyond graphs and leverage the theory of information flow. Network coding theory established that mixing information at intermediate nodes achieves the multicast rate [21], and linear coding was shown to be sufficient for this purpose [22]. An algebraic formulation provided a systematic framework for constructing such codes [23], and polynomial-time deterministic algorithms were later developed for constructing capacity-achieving multicast codes [24]. Efficient algorithms were developed to compute coding solutions while maintaining the optimality guaranteed by classical max-flow bounds [25], whereas random linear network coding enabled capacity-achieving multicast in a fully distributed and probabilistic manner [26]. In parallel, submodular optimization emerged as a powerful tool for characterizing network information flow. The theory of submodular flows [27] and information-theoretic formulations of entropy and information inequalities [28,29] established a unified view of rate-constrained communication. These results motivate our generalization of source detection beyond graph-based contagion models, leading to a rate-constrained formulation grounded in information flow.

This work extends our previous conference paper [30]. The journal version introduces a new lazy–greedy forward search algorithm that improves the computational efficiency of rate-constrained source detection. Furthermore, we provide extensive new experiments on synthetic network models to demonstrate the scalability and the achieved multicast rate. For the sake of clarity, completeness, and readability, we explicitly restate the fundamental definitions, lemmas, propositions, and selected examples from our earlier paper throughout this article, including full proofs that were omitted in the concise five-page conference version.

The structure of this article is organized as follows: In Section 2, we re-define the SI model and the ML/JML estimators and reveal the limitation of conventional ML estimation methods. Section 3 generalizes the source detection problem with network flow and introduces the rate-constrained formulation. Section 4 proposes our main results on algorithmic solutions based on backward elimination and greedy forward search. We also present experimental results validating the effectiveness and scalability of the proposed methods. We discuss implications and future directions in Section 5 and conclude the paper in Section 6.

2. Preliminary

We first provide a brief overview of the two estimation methods, ML and JML, along with their formulations. Given a finite undirected graph

G = (V, E)

with vertex set V and edge set

E \subseteq (\binom{V}{2})

, at time

t = 0

, an initial set of infected nodes

W \subseteq V

is specified. For every ordered pair

(u, v)

with

{u, v} \in E

, there is an independent random time

τ_{u v} \sim Exp (λ)

, where

Exp (λ)

is the exponential distribution with mean

1 / λ

. Let

T_{u}

denote the time at which node u first becomes infectious. For

u \in W

, set

T_{u} = 0

. Recursively, for each susceptible v,

T_{v} = \min_{\begin{matrix} u : {u, v} \in E \\ T_{u} < \infty \end{matrix}} [T_{u} + τ_{u v}] .

For any

t \geq 0

, the infected set is

W (t) : = {u \in V : T_{u} \leq t} .

Now, given a non-empty subset

S \subseteq (\binom{V}{k})

of

k > 0

nodes which are currently infected, the ML estimator selects a source node as the solution to

\max_{s \in S} P [W^{(k)} = S | W^{(1)} = {s}],

where

W^{(k)}

is a random set denoting the set of first k infected nodes including the source node. The JML estimator selects a source node as the solution to

\max_{s \in S} sup_{t \geq 0} P [W (t) = S | W (0) = {s}],

where

W (t)

is a random set denoting the set of all infected nodes by time

t \geq 0

, assuming the source node is infected at time 0. Different from the ML estimator, the JML estimator further optimizes over the time t when S is observed.

We show that, in Figure 1, the ML estimation score is 1 when 0 is the source, which corresponds to the infection sequence

(0, 1, 2)

. If the infection sequence is

(1, 2, 0)

and

(1, 0, 2)

, and node 1 is the source, the ML estimation score is

\frac{1}{2} \times 1 + \frac{1}{2} \times 1 = 1

, which is exactly the same as the case for node 0. By symmetry, node 2 has the same chance of being the source as node 0 and, consequently, node 1. Such equal chances cause the source detection problem to become degenerate. Similarly for JML, the method also suffers from the same degeneracy because, for connected graphs, all nodes will eventually be infected, i.e., the JML estimation score increases to 1 as

t \to \infty

regardless of the choice of the source node. The likelihood probability when the set of infected nodes at time 0 is

{1}

is

\begin{matrix} P [W (t) = {0, 1, 2} | W (0) = {1}] = P [T_{0} ({1}) \leq t, T_{2} ({0, 1}) \leq t], \end{matrix}

where

T_{u} (B)

is the time set B infects node u. By symmetry, we consider the case when

W (0) = {0}

and achieve the following

\begin{matrix} P [W (t) = {0, 1, 2} | W (0) = {0}] = P [T_{0} ({1}) \leq t, T_{2} ({0, 1}) \leq t - T_{0} ({1})] . \end{matrix}

When considering the observation time t that achieves the supremum in the JML estimator, i.e.,

t \to \infty

, the JML estimation scores for both nodes 0 and 1 converge to 1, leading to a degeneracy.

Beyond the previously discussed degeneracy issues in ML and JML estimators, conventional graphical models also exhibit fundamental limitations in capturing the actual structure of information flow in networks. Specifically, they fail to account for the rate at which information spreads: if a set B is currently infected, the infection rate of a node

u \in V ∖ B

should be proportional to the capacity of a channel from B to u. Moreover, since a rumor can be viewed as a piece of information that spreads through the network, it is natural to characterize its spreading by a transmission rate. Suppose that the rumor is observed to reach a subset of nodes

A \subseteq V

at a rate of at least

r \geq 0

. Any candidate source set

S \subseteq V

must then be able to deliver information to A at no less than this rate. Or, the network multicast capacity must be no less than this rate. Such rate constraints are not considered in the existing model.

2.1. SI Model

To overcome these limitations, we re-define the SI model and improve the formulation of the estimator. To capture information flow in a network consisting of a discrete set V of nodes, we generalize the SI model using the following infection rate tuple

λ_{V} : = (λ_{u} | u \in V)

, where

λ_{u} : 2^{V} \to R_{+}

, where

\begin{matrix} λ_{u} (\emptyset) & = 0 and λ_{u} (B^{'}) \leq λ_{u} (B) & \forall B^{'} \subseteq B \subseteq V . \end{matrix}

(1)

A more elaborated model will be given in Definition 5.

To define the SI model for infodemic, define the time for

u \in V

to be infected by

B \subseteq V ∖ {u}

by the exponentially distributed random variables

T_{u} (B) \sim Exp (λ_{u} (B))

(2)

mutually independent over u and B. Let

U (t)

be the sequence of infected nodes at time

t \geq 0

, i.e.,

U (t) = (u_{1}, \dots, u_{k})

means

u_{i}

is the i-th infected node by time t for

i \in [k] : = {1, \dots, k}

. For notational convenience, let

u^{k} = (u_{1}, \dots, u_{k})

and

u_{i}^{j} = (u_{i}, \dots, u_{j})

for

i \leq j

. Then, the infection times determine the sequence of infected node as follows:

Definition 1

(SI). Given

U (τ) = u^{k}

, for all

t > τ

,

U (t) = u^{k + 1} \Leftrightarrow u_{k + 1} = \arg \min_{w \in V ∖ set (u^{k})} T_{w} (set (u^{k}))

(3)

where the set function

set (u^{k}) : = {u_{i}}_{i = 1}^{k}

(4)

turns the input sequence

u^{k}

into an unordered set.

A unique choice of

u_{k + 1}

is possible because, almost surely,

T_{w} (set (u^{k}))

s are distinct.

The source detection problem is to find

set (U (0))

given

set (U (T^{(k)})) = S

, where

T^{(k)} = inf {t \geq 0 ∣ | set (U (t)) | = k},

(5)

namely, the time to infect the first k nodes.

We propose the following source detection problem for the SI model:

Definition 2

(ML for independent observation time). The likelihood probability of observing the set S of infected nodes at time

t \geq 0

, given that the set of infected nodes at time 0 is W, can be expressed as

P [set (U (t)) = S | set (U (0)) = W] .

(6)

The maximum likelihood estimate of

set (U (0))

is a solution to

\max_{W \in W_{S}} P [set (U (T)) = S | set (U (0)) = W],

(7)

where

S \subseteq V

is a given set of infected nodes observed at some independently chosen random time

T

, and

W_{S} \subseteq 2^{S}

is a set of hypotheses.

The single-source detection problem corresponds to the case

W_{S} = {{s} | s \in S}

. In this case, we will show that our proposal (7) can be more meaningful than the existing formulations that optimize

\max_{s \in S} P [set (U (T^{(k)})) = S | set (U (0)) = {s}],

(8)

\max_{s \in S} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}],

(9)

where

T^{(k)}

is defined in (5). Equation (8) is the ML estimator considered in [14], and Equation (9) is the ML estimator considered in [20].

2.2. Fundamental Limitations of Conventional Estimators

We propose our computation of the likelihood probability (6) by summing the probability of all possible infection sequences:

\begin{matrix} P [set (U (t)) = S | set (U (0)) = W] = \sum_{u^{k} \in Π_{S | W}} P [U (t) = u^{k} | set (U (0)) = W], \end{matrix}

(10)

where

Π_{S | W}

is the set of permitted infection sequence

u^{k}

from W to S satisfying

\begin{matrix} u_{l} \in S ∖ (W \cup set (u^{l - 1})) and \\ λ_{u_{l}} (W \cup set (u^{l - 1})) > 0, \forall l \in [k] . \end{matrix}

(11)

To compute the probability of each permitted infection sequence, define

\begin{matrix} f_{W, u^{k}}^{(l)} (t) & : = P [U (t) = u_{l}^{k} | set (U (0)) = W \cup set (u^{l - 1})] . \end{matrix}

(12)

f_{W, u^{k}}^{(l)} (t)

can be computed recursively due to the following recurrence formula:

Proposition 1.

The likelihood probability in (6) can be computed by

\begin{matrix} P [set (U (t)) = S | set (U (0)) = W] = \sum_{u^{k} \in Π_{S | W}} f_{W, u^{k}}^{(1)} (t), \end{matrix}

where

\begin{matrix} f_{W, u^{k}}^{(l)} (t) & = \{\begin{matrix} (f_{W, u^{k}}^{(l + 1)} * g_{W, u^{l}}) (t) & l \in [k] \\ 1 & l = k + 1 \end{matrix} \end{matrix}

(13)

\begin{matrix} g_{W, u^{l}} (t) & : = p_{T_{u_{l}} (B)} (t) \prod_{v \in (V ∖ B) ∖ {u_{l}}} P [T_{v} (B) > t] \end{matrix}

(14)

\begin{matrix} B & : = W \cup set (u^{l - 1}) . \end{matrix}

(15)

Proof.

See Appendix A. □

Through a concrete example, we show that the existing ML estimator (8) and JML estimator (9) considered in [14,20], respectively, both fail to identify the rumor source. In contrast, our proposed estimator (7), which incorporates prior knowledge of the spreading time, can effectively identify the rumor source.

Example 1.

Consider the single-source detection problem in the network in Figure 1, where

λ_{01} = λ_{12} = 1

. Suppose node 1 is the unknown rumor source, and at time t, we observe that all other nodes are infected, i.e.,

S = V = {0, 1, 2}

.

Proposition 2.

For Example 1, we have

(i): Both estimators in (8) and (9) are degenerate, since

$\begin{matrix} \arg \max_{s \in S} P [set (U (T^{(k)})) = S | set (U (0)) = {s}] \\ = \arg \max_{s \in S} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}] \\ = S; \end{matrix}$
(ii): For all $t \in (0, \infty)$ , there is a unique s maximizing

$\begin{matrix} P [set (U (t)) = S | set (U (0)) = {s}]; \end{matrix}$
(iii): For all probability distributions on the spreading time $T$ , we have

$\begin{matrix} \arg \max_{s \in S} P [set (U (T)) = S | set (U (0)) = {s}] \\ = \arg \max_{s \in S} P [set (U (t)) = S | set (U (0)) = {s}] . \end{matrix}$

Proof.

See Appendix B. □

Regarding the maximum likelihood estimator (8) and joint maximum likelihood estimator (9), we observe that they satisfy the following relationship.

Proposition 3.

For

S \subseteq V

and

s \in S

, we have

\begin{matrix} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}] \\ = P [set (U (T^{(k)})) = S | set (U (0)) = {s}] sup_{t \geq 0} α^{(s)} (t, S), \end{matrix}

where

\begin{matrix} α^{(s)} (t, S) : = P [T^{(k)} \leq t < T^{(k + 1)} | set (U (0)) = {s}, set (U (T^{(k)})) = S] . \end{matrix}

Proof.

See Appendix C. □

Corollary 1.

For

S \subseteq V

, we have

\begin{matrix} \arg \max_{s \in S} P [set (U (T^{(k)})) = S | set (U (0)) = {s}] \\ = \arg \max_{s \in S} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}], \end{matrix}

if

α^{(s)} (t, S)

is independent of s.

Proof.

See Appendix D. □

3. Problem Formulation

In the following subsections, we give the alternative rate-constrained formulation and the network link model for general both directed and undirected information flow.

3.1. Rate-Constrained Model

Consider a communication network on a discrete set V of nodes, where

ρ (S, A)

denotes the multicast capacity from a set

S \subseteq V

of source nodes to a set

A \subseteq V

of sink nodes. The problem of interest is as follows:

Definition 3

(Rate-constrained feasible sources). Given some rumor with rate

r > 0

is multicast to a sink

A \subseteq V

, the set of inclusionwise minimal feasible sources is defined as

\begin{matrix} S_{r} (A) : = minimal {S \subseteq V | ρ (S, A) \geq r} . \end{matrix}

(16)

for some multicast capacity

ρ (S, A)

satisfying

\begin{matrix} 0 & \overset{(a)}{\leq} ρ (S, A^{'}) \leq ρ (S^{'}, A) \overset{(b)}{\leq} \infty, \end{matrix}

(17)

for all

S \subseteq S^{'} \subseteq V

and

A \subseteq A^{'} \subseteq V

. Equality holds for (a) if

S = \emptyset

and for (b) if

A = \emptyset

.

This definition identifies the smallest sets of nodes that can collectively support the required rumor spreading rate r. Minimality is imposed without loss of generality because

ρ (S, A)

is non-decreasing in S. More explicit formulations in terms of the network links model will be given in Definitions 6 and 7. If the rumor rate r is not available, but the size of the source is bounded, one may further consider the following problem instead:

Definition 4

(Size-constrained optimal sources). Given a sink

A \subseteq V

, the maximum multicast capacity from a feasible source of size at most

k \in [| V |]

is defined as

\begin{matrix} ρ^{(k)} (A) & : = \max {r > 0 | S \in S_{r} (A), | S | \leq k} \end{matrix}

(18)

\begin{matrix} = \max_{S \subseteq V : | S | \leq k} ρ (S, A), \end{matrix}

(19)

and

S^{(k)} (A)

is defined as the corresponding set of inclusion-wise minimal solution.

This definition is used to determine the highest possible multicast rate under the constraint that the size of the source set does not exceed k. The formulation can be illustrated as follows:

Example 2

(Directed butterfly network). Consider the butterfly network in Figure 2a on

V : = [6]

with independent noiseless channels from node

u \in V

to

w \in V

with capacity

λ_{u w} = 1

bit for all directed edges

(u, w)

.

For

A : = {5, 6}

, the usual multi-source multicast rate from

S \in 2^{V}

can be shown to be

\begin{matrix} ρ (S, {5, 6}) & = \{\begin{matrix} 0 & S \subseteq {5, 6} \\ 2 & {1, 2} \subseteq S \\ 1 & otherwise . \end{matrix} \end{matrix}

In particular, the source

{1, 2}

achieves a rate of 2 by network coding as shown in Figure 2a, while other sources in

{1, 2, 3, 4}

not including

{1, 2}

achieve a rate of 1 by routing. We have

\begin{matrix} S_{r} ({5, 6}) = \{\begin{matrix} {{u} | 1 \leq u \leq 4} & r \in (0, 1] \\ {{1, 2}} & r \in (1, 2] \\ \emptyset & otherwise, \end{matrix} \end{matrix}

(20)

\begin{matrix} S^{(k)} ({5, 6}) = \{\begin{matrix} {{u} | 1 \leq u \leq 4} & k = 1 \\ {{1, 2}} & otherwise, \end{matrix} \end{matrix}

(21)

and

ρ^{(k)} ({5, 6}) = \min {k, 2}

. In particular, the source

{1, 2}

can be identified with no ambiguity if the total rate of the rumor exceeds 1.

3.2. Network Links Model

In Example 2, we illustrated how the multiple multicast rate

ρ (S, A)

can be computed using a network links model, such as a weighted directed graph. To capture general information flow that goes beyond traditional graphs, we now introduce the following enhanced formulation based on the concept of a polylinking system. This generalization allows us to capture intricate communication structures and multicast capacities, accommodating broader classes of network scenarios beyond simple directed/undirected graphs.

Definition 5.

A directed network links model is a polylinking function

λ : {(2^{V})}^{2} \mapsto R_{+}

satisfying [31,32].

\begin{matrix} \begin{matrix} λ (\emptyset, B) & = λ (B, \emptyset) = 0 \\ λ (B^{'}, C^{'}) & \leq λ (B, C) \end{matrix} & \begin{matrix} \forall B^{'} \subseteq B \subseteq V, \\ C^{'} \subseteq C \subseteq V \end{matrix} \end{matrix}

(22)

\begin{matrix} \begin{matrix} \sum_{i = 1}^{2} λ (B_{i}, C_{i}) \geq \end{matrix} & \begin{matrix} λ (B_{1} \cap B_{2}, C_{1} \cup C_{2}) \\ + λ (B_{1} \cup B_{2}, C_{1} \cap C_{2}) \end{matrix} & \begin{matrix} \forall B_{i}, C_{i} \subseteq V, \\ i \in [2] \end{matrix} \end{matrix}

(23)

(22) means λ is normalized and non-decreasing, and (23) is called bisubmodularity. For simplicity, we write

λ_{u} (B) : = λ (B, {u})

.

This property generalizes the notion of submodularity from single-set functions to functions involving two sets. For submodular functions, adding an element to a larger set yields a smaller marginal gain compared to adding it to a smaller set, a property known as diminishing returns. Bisubmodularity extends this concept to linking functions

λ (B, C)

by jointly considering the source set B and the sink set C.

In particular,

λ

generalizes the following property of graph cuts that gives rise to polynomial-time max-flow min-cut algorithms.

Proposition 4.

f (C) : = λ (V ∖ C, C)

, called the incut function, is a normalized submodular function, i.e.,

f (\emptyset) = 0

, and

\begin{matrix} f (C_{1}) + f (C_{2}) & \geq f (C_{1} \cap C_{2}) + f (C_{1} \cup C_{2}) . \end{matrix}

(24)

Proof.

See Appendix E. □

The generality of the linking function is illustrated below:

Example 3.

Consider a discrete memoryless deterministic channel

P_{Y_{V} | X_{V}}

, where

X_{u}

and

Y_{u}

for

u \in V

are the discrete channel input and output of

u \in V

, respectively, and

Y_{V} : = (Y_{u} | u \in V)

and

X_{V} : = (X_{u} | u \in V)

satisfy

\begin{matrix} H (Y_{V} | X_{V}) & = 0 and H (X_{V}) = \sum_{i \in V} H (X_{i}) . \end{matrix}

(25)

The linking function can be chosen as

\begin{matrix} λ (B, C) & : = I (X_{B} \land Y_{C} | X_{V ∖ B}) for B, C \subseteq V \\ = H (Y_{C} | X_{V ∖ B}) = \underset{(a)}{\underset{︸}{H (Y_{C}, X_{V ∖ B})}} - \sum_{u \in V ∖ B} H (X_{u}) . \end{matrix}

Bisubmodularity in (23) follows immediately from the submodularity of the entropy in (a): (polylinking system is equivalent to based polymatroid [32], i.e.,

λ (B, C) = f (C | g (B))

for some bijection

g : V \to V^{'}

with

V^{'} \cap V = \emptyset

and a normalized non-decreasing submodular function

f : 2^{V \cup V^{'}} \to R_{+}

).

\begin{matrix} \begin{matrix} \sum_{i = 1}^{2} H (Y_{C_{i}}, X_{V ∖ B_{i}}) \geq \end{matrix} & \begin{matrix} H (Y_{C_{1} \cup C_{2}}, X_{V ∖ (B_{1} \cap B_{2})}) \\ + H (Y_{C_{1} \cap C_{2}}, X_{V ∖ (B_{1} \cup B_{2})}) \end{matrix} \end{matrix}

The directed network links model is the special case when

\begin{matrix} λ (B, C) & : = \sum_{u \in B, w \in C} λ_{u w}, \end{matrix}

(26)

where

λ_{u w}

is the capacity of an independent noiseless channel from

u \in V

to

w \in V

.

We consider the rumor to be a vector of independent messages, with one from a different source node. The multicast rate

ρ

can be efficiently computed from

λ

using the max-flow min-cut results (see for instance [21]), which were extended to the cyclic linking network model in [33], as follows:

Definition 6.

For directed networks, the maximum multi-source multicast rate is

\begin{matrix} \begin{matrix} ρ_{ms} (S, A) & : = \max {r (S) | r_{V} \in R_{+}^{V}, \\ r (B) \leq λ (B, V ∖ B) \forall B \subseteq V : B ⊉ A}, \end{matrix} \end{matrix}

(27)

where

r_{V} : = (r_{i} | i \in V)

, and

r (B) : = \sum_{u \in B} r_{u}

.

If source nodes can collude to encode the rumor, the capacity becomes the maximum single-source multicast rate

\begin{matrix} ρ_{ss} (S, A) & : = \min_{B \subseteq V : S \subseteq B ⊉ A} λ (B, V ∖ B) . \end{matrix}

(28)

If the sink nodes can collude to recover the rumor, the capacity becomes the maximum multiple access rate

\begin{matrix} \begin{matrix} ρ_{ma} (S, A) & : = \max {r (S) | r_{V} \in R_{+}^{V}, \\ r (B) \leq λ (B, V ∖ B) \forall B \subseteq V ∖ A} . \end{matrix} \end{matrix}

(29)

If the source nodes and sink nodes can collude respectively, the capacity becomes the maximum unicast rate

\begin{matrix} ρ_{uc} (S, A) & : = \min_{B \subseteq V ∖ A : S \subseteq B} λ (B, V ∖ B) . \end{matrix}

(30)

For undirected networks, the maximum multicast rates can be computed using the result of the matroid undirected network in [34]. More precisely, the directed network rates can be modified by an additional optimization over choice of the direction, which can be modeled as a choice of the base

x_{V} \in B (h)

of a normalized non-decreasing submodular function

h : 2^{V} \mapsto R_{+}

with the cut function replaced as follows

\begin{matrix} λ (B, V ∖ B) = h (V ∖ B) - x (V ∖ B) . \end{matrix}

Definition 7

(Undirected communication capacities). The capacities in Definition 6 for undirected networks are

\begin{matrix} \begin{matrix} {\tilde{ρ}}_{ms} (S, A) & : = \max {r (S) | r_{V} \in R_{+}^{V}, \\ r (B) \leq λ (B, V ∖ B) \forall B \subseteq V : B ⊉ A} \end{matrix} \end{matrix}

(31)

\begin{matrix} {\tilde{ρ}}_{ss} (S, A) & : = \min_{B \subseteq V : S \subseteq B ⊉ A} λ (B, V ∖ B) . \end{matrix}

(32)

\begin{matrix} \begin{matrix} {\tilde{ρ}}_{ma} (S, A) & : = \max {r (S) | r_{V} \in R_{+}^{V}, \\ r (B) \leq λ (B, V ∖ B) \forall B \subseteq V ∖ A} . \end{matrix} \end{matrix}

(33)

\begin{matrix} {\tilde{ρ}}_{uc} (S, A) & : = \min_{B \subseteq V ∖ A : S \subseteq B} λ (B, V ∖ B) . \end{matrix}

(34)

The source detection problem for information flow is very rich, as one may consider different possibilities of collusion in the calculations of the capacities. Furthermore, the different capacities are polynomial-time computable and can be viewed as bounds on each other as follows:

Proposition 5.

The multicast rates are related as follows:

\begin{matrix} ρ_{ss} (S, A), ρ_{ma} (S, A) & \in [ρ_{ms} (S, A), ρ_{uc} (S, A)], \end{matrix}

(35)

\begin{matrix} {\tilde{ρ}}_{ss} (S, A), {\tilde{ρ}}_{ma} (S, A) & \in [{\tilde{ρ}}_{ms} (S, A), {\tilde{ρ}}_{uc} (S, A)] . \end{matrix}

(36)

Furthermore, these rates are computable in polynomial time.

Proof.

See Appendix F. □

The rate-constrained source detection problem for undirected networks is illustrated below:

Example 4

(Undirected butterfly network). Consider an undirected butterfly network where each link can be redirected arbitrarily and fractionally. Figure 2 shows two different ways to direct the network. For

A : = {5, 6}

, the multi-source multicast rate from

S \in 2^{V}

is

\begin{matrix} ρ (S, {5, 6}) & = \{\begin{matrix} 0 & S = \emptyset, \\ 1 & S \in {{1}, {2}}, \\ 2 & otherwise . \end{matrix} \end{matrix}

(37)

In addition to the coding solution for the source

{1, 2}

shown in Figure 2a, Figure 2b shows how a rate of 2 can be achieved by routing from source

{3}

. It follows that

\begin{matrix} S_{r} ({5, 6}) = \{\begin{matrix} {{u} | 1 \leq u \leq 6} & r \in (0, 1] \\ {{1, 2}, {u} | 3 \leq u \leq 6} & r \in (1, 2] \\ \emptyset & otherwise . \end{matrix} \end{matrix}

(38)

ρ^{(k)} ({5, 6}) = 2

for

k \in [6]

with

S^{(1)} ({5, 6}) = {{u} | 3 \leq u \leq 6}

and

S^{(k)} ({5, 6}) = {{1, 2}, {u} | 3 \leq u \leq 6}, 2 \leq k \leq 6

. In particular, when the rumor rate exceeds 1, the set

{1, 2}

is not the only possible source, and there are simpler feasible sources such as

{3}

.

4. Main Results

4.1. Rate-Constrained Source Selection Algorithms

We first consider a backward elimination algorithm to find a feasible source in

S_{r} (A)

in (16) for the rate-constrained model.

Proposition 6.

The algorithm in Listing 1 gives a feasible solution in

S_{r} (A)

, if any, in time polynomial in

| V |

.

Proof.

Correctness follows from the fact that

ρ (S, A)

is non-decreasing in S (17). Since

ρ (S, A)

is computable in polynomial time by Proposition 5, and there are at most

| V |

such computations, the overall complexity is still polynomial. □

Listing 1. Backward elimination algorithm.

1 def feasible_source (A):

2 if ρ(V,A) < r: return None

3 S = V

4 for u in V:

5 if ρ(S\{u}, A) ≥ r: S = S\{u}

6 return S

When the rate r is not available, but the size of the source set is constrained by k, a forward search algorithm similar to the one for submodular function maximization can be used to approximate

ρ^{(k)} (A)

in (4).

The above greedy forward search algorithm has been shown to be a good approximation for the multi-source unicast scenario since

ρ_{u c} (S, A)

is submodular in S, and

ρ^{(k)} (S, A)

becomes a size-constrained submodular function maximization.

Proposition 7.

The algorithm in Listing 2 gives a lower bound of

ρ^{(k)}

in time polynomial in

| V |

. Furthermore, if

ρ (S, A)

is submodular in S, the lower bound is at least

(1 - 1 / e) ρ^{(k)} (S, A)

.

Proof.

Since

ρ (S, A)

is computable in polynomial time by Proposition 5, and there are at most

{| V |}^{2}

such computations, the overall complexity is still polynomial. If

ρ^{(k)} (S, A)

is submodular, the problem becomes the size-constrained submodular function maximization. The approximation factor was given by [35]. □

Listing 2. Forward search algorithm.

1 def approximate_rho (k):

2 S, r, u = ∅, 0, None

3 while |S| < k:

4 for w in V\S:

5 if ρ(S∪{w}, A) ≥ r:

6 u, r = w, ρ(S∪{w}, A)

7 S = S∪{u}

8 return r

4.2. Source Feasibility Under Multicast Rate Constraints

To simplify the calculation of (31), we use the max-flow min-cut result from network coding [21] and apply the node contraction to find the minimum cut value

λ (B, V ∖ B)

from source nodes to sink nodes.

\begin{matrix} \min_{B \subseteq V} λ (B, V ∖ B) & = \min_{j \in A} \min_{B \subseteq V ∖ {j}} λ (B, {j}) \end{matrix}

(39)

\begin{matrix} = \min_{j \in A} mincut (s, j), \end{matrix}

(40)

where s is the node by node contraction in B. Note that the simplification can be used in all the cases by adding the corresponding constraints.

Then, we reproduced (The source code of our experiments can be found at https://github.com/ZimengInfo/InfodemicSource, accessed on 11 August 2025) the result from Example 2 using (31) and the backward elimination in Listing 1. In this setup, the sink set was fixed to

A = {5, 6}

. By enumerating the multicast rate values, the feasible sources were the minimal set achieving the target rate. This exactly matches the optimal solution given in Example 2, thereby verifying the correctness of our implementation and the soundness of the backward elimination approach. For further verification, (31) is also treated as a linear programming problem with submodular constraints to show the precise solutions.

We extend our feasibility experiment beyond the butterfly network to synthetic graphs commonly used in network modeling: Barabási–Albert (BA) [36], Watts–Strogatz (WS) [37], and Erdős–Rényi (ER) [38,39], each with 20 nodes. These graphs were first generated as undirected, cleaned for weak connectivity, and then converted to directed graphs with consistent flow direction (from lower- to higher-indexed nodes). Unit capacities were assigned to all edges. For each graph, we selected a sink set A appropriate to its structure. Specifically, for the BA graph, which has hubs forming early in the index sequence, we selected the first two nodes, i.e.,

{0, 1}

, as sinks. For the other two, which have more uniform edge direction distributions, we selected the last two nodes, i.e.,

{18, 19}

. This ensures that the sinks are actually reachable from potential source nodes.

We then applied the backward elimination algorithm using the contraction-based multicast rate to test for feasible source sets under target rate thresholds

r \in {0.5, 1.0, 2.0, 3.0, 6.0}

. The goal was to find a minimal source set S such that

ρ_{ms} (S, A) \geq r

. From Table 1, we observe that all graphs admit small feasible sets for low target rates. As the rate increases, the feasible sets grow in size and, for the ER graph, become infeasible at higher thresholds. The BA graph, benefiting from its scale-free structure, typically requires fewer sources to reach the sinks than WS and ER graphs. These results demonstrate the practical behavior of our source selection algorithm and confirm the influence of graph topology on multicast feasibility.

4.3. Computational and Structural Behavior of Lazy–Greedy Forward Search

In the greedy forward search, the objective is to find a set S of source nodes that maximizes the multicast rate

ρ (S, A)

, where A is the sink set. At each step, we select the node w that provides the maximum marginal gain to the current set S, where the marginal gain

Δ_{w} (S)

is defined as [40]:

Δ_{w} (S) = ρ (S \cup {w}, A) - ρ (S, A) .

(41)

The marginal gain represents the additional benefit gained by adding node w to the set S. For each iteration, the greedy algorithm evaluates all nodes in

V ∖ S

and selects the one with the highest marginal gain. This process continues until the desired number of source nodes k are selected, see Listing 2.

We also propose the lazy–greedy forward search algorithm, which reduces the number of redundant marginal gain evaluations by caching and reusing previously computed values. The core idea is that once a node w has been identified as the best candidate in one iteration, it does not need to be re-evaluated unless it rises to the top of the priority queue again due to changes in the current source set S.

The lazy–greedy algorithm uses a priority queue to store nodes along with their cached marginal gains. At each step, the node with the largest cached marginal gain is selected, and the marginal gain is recomputed only if the cached value is outdated. This reduces the number of recalculations, especially for nodes that are not frequently selected, see Listing 3.

Listing 3. Lazy–greedy forward search algorithm.

1 def approximate_rho(k):

2 S, r = ∅, 0

3 PQ = [(− ρ({W}, A), w) for w in V]

4 heapify(PQ)

5 while |S| < k and PQ ≠ ∅:

6 g, w = heappop(PQ)

7 if w ∈ S: continue

8 Δ = ρ(S∪{w}, A)− r

9 if Δ < −g:

10 heappush(PQ, (−Δ, w))

11 continue

12 S = S∪{w}

13 r = r + Δ

14 return r

where PQ is the priority queue,

Δ

is the marginal gain, and g is the cached gain. The lazy–greedy approach introduces several improvements over the original greedy algorithm:

Efficiency: The primary advantage of the lazy–greedy algorithm is its efficiency in terms of the computational cost. By caching the marginal gains and only recomputing them when necessary, the lazy–greedy algorithm reduces the number of oracle calls and heap operations compared to the original greedy approach. While the greedy algorithm evaluates every node in $V ∖ S$ in each iteration, the lazy–greedy algorithm evaluates only those nodes that are most likely to provide the largest marginal gain, reducing unnecessary computations.
Time Complexity: The original greedy algorithm has a time complexity of $O (k \cdot | V |)$ because it must evaluate all nodes in $V ∖ S$ during each iteration. In contrast, the lazy–greedy version has $O (k \cdot log | V | + \sum_{w} τ (w) log | V |)$ , where $τ$ is the number of times node w is evaluated.
Approximation Guarantee: Both the greedy and lazy–greedy algorithms maintain the same approximation factor of $1 - 1 / e$ when $ρ$ is submodular, as demonstrated by Proposition 5.

Now let us evaluate the practical scalability and rate behavior with Listing 3, which we use throughout this work to approximate the maximum multicast rate

ρ (S^{(k)} A, A)

under a source budget constraint. This algorithm is designed to exploit the diminishing returns property of monotonic rate functions via a priority queue and marginal gain caching mechanism.

In the first experiment, we explore how the achieved multicast rate

ρ (S^{(k)} A, A)

evolves with increasing source budget k. We generate the Barabási–Albert graphs, which is a good representation of social networks [36], fix the graph size to

n = 200

, set a fixed sink set

A = {0, 1}

, and run the lazy–greedy algorithm for

k \in {1, 2, \dots, 20}

. The results, shown in Figure 3a, indicate a steady increase in the achieved rate up to approximately 11.0, after which the curve saturates. This saturation point reflects the inherent capacity limit (min-cut) between the selected source set and the sinks. The experiment confirms that the algorithm prioritizes high-impact nodes early and naturally exposes network bottlenecks, making it well-suited for capacity-aware source detection, i.e., the size-constrained optimal source problem.

We then assess how the number of oracle calls scales with the graph size. We generate BA graphs of increasing size

n \in {50, 100, 200, 400, 800, 1600}

, with unit edge capacities and maintain the sink set

A = {1, 0}

. For each graph, we fix the source budget

k = 10

and count the number of calls to

ρ_{uc}

during the execution of both the lazy–greedy and plain greedy algorithms. As shown in Figure 3b, both methods exhibit linear growth in the number of oracle calls as the number of nodes increases. This is consistent with the theoretical bounds, since the number of calls is

O (k \cdot | V |)

, and k is held constant. Importantly, the lazy–greedy algorithm consistency performs fewer oracle calls due to its ability to reuse cached marginal gains, confirming its practical scalability benefits.

These experiments collectively demonstrate that the lazy–greedy forward search algorithm is both computationally efficient and behaviorally robust. It scales linearly with the graph size under fixed budget constraints and achieves near-optimal rates without exhaustive evaluation. These properties make it a practical choice for source selection tasks in large-scale networks.

5. Discussion

This study revisits the problem of source detection and underscores the significant limitations of the conventional likelihood-based estimators. Theoretical analysis reveals that both maximum likelihood (ML) and joint maximum likelihood (JML) estimators exhibit degeneracy even in straightforward simple topologies, such as a three-node line graph. To address this issue, we have reformulated the source detection problem through the framework of rate-constrained multicast capacity, drawing upon network flow theory and polylinking systems. By employing multicast rate functions

ρ (S, A)

, we are able to precisely identify the feasible sources in a more general graphical structure.

The experimental results substantiate both the theoretical correctness and practical applicability of this framework. The backward elimination algorithm effectively identifies minimal feasible sources within a specified rate threshold, as shown in the example of the butterfly network. On the computational front, the lazy–greedy forward search algorithm achieves near-optimal rate performance while adhering to cardinality constraints. Notably, our empirical analysis indicates that the number of oracle calls increases linearly with the size of the graph when the source budget k remains fixed, thereby confirming the scalability of the proposed approach. Furthermore, the rate attained through greedy source selection quickly approaches saturation as k increases, aligning with the network’s min-cut bound and demonstrating the model’s capacity to reveal fundamental flow bottlenecks.

Our framework assumes access to either the rumor spreading rate r or the size of the source set k. If neither is available, the method cannot be directly applied. However, in practical cases, such as epidemic tracking, the set of infected nodes is observed incrementally, rather than all at once. As new infected nodes are discovered, the underlying graph and infection snapshot can be updated, allowing the source detection algorithms to be applied recursively. Thus, developing a systematic version of this recursive strategy presents a promising direction for future work.

Another avenue for future research may involve applying this rate-constrained model to large-scale real-world networks. This would involve challenges such as determining optimal guarantee for the submodular function maximization with Listing 2 in the most general case of

ρ_{m s} (S, A)

, as well as exploring the GPU-accelerated max-flow min-cut algorithm to improve the computational efficiency.

6. Conclusions

We introduced a rate-constrained framework for infodemic source detection, motivated by the failure of maximum likelihood-based estimators in even simple networks. Our approach leverages the structure of multicast capacity and information flow to define source feasibility and optimize source selection. We proposed polynomial-time algorithms based on backward elimination and forward greedy approximation.

Our results show that the proposed methods are effective and scalable, achieving near-optimal rates and revealing structural bottlenecks in both theoretical examples and synthetic graphs. The modeling flexibility and algorithmic efficiency of our framework provide a principled foundation for future extensions.

Author Contributions

Conceptualization, C.W.T. and C.C.; Formal analysis, C.C. and Q.Z.; Project administration, C.W.T. and C.C.; Software, Z.W.; Supervision, C.W.T. and C.C.; Writing—original draft, Z.W.; Writing—review and editing, C.W.T., C.C., Q.Z. and C.Z.; Funding acquisition, C.W.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2025A1515012130 and in part by the Innovation and Technology Fund (ITF) Project ITS/188/20 from the Innovation and Technology Commission (ITC) of the Hong Kong Special Administrative Region, China.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Proposition 1

Proof.

We first contract all nodes in W into a single node

u_{0}

without loss of generality so that the infection process can be analyzed from a single source.

Consider any infection sequence

u^{k} \in Π_{S | W}

. For such a sequence, we expand the probability of observing

U (t) = u_{l}^{k}

step by step. Using the chain rule, we obtain

\begin{matrix} P [U (t) = u_{l}^{k} | set (U (0)) = set (u_{0}^{l - 1})] \\ = \int_{τ = 0}^{t} P [U (t) = u_{l}^{k} | U (0) = u_{0}^{l - 1}, T_{u_{l}} (set (u_{0}^{l - 1})) = τ, \\ T_{v} (set (u_{0}^{l - 1})) > τ, v \in V ∖ set (u_{0}^{l})] \\ \times P [T_{v} (set (u_{0}^{l - 1})) > τ, v \in V ∖ set (u_{0}^{l})] d P_{T_{u_{l}} (set (u_{0}^{l - 1}))} (τ) . \end{matrix}

Here, we condition on the event that node

u_{l}

becomes infected at time

τ

while all other nodes remain uninfected until

τ

.

By the memoryless property of exponential random variables, the process after time

τ

is equivalent to a fresh infection process starting from

u_{l}

. Thus, the above equals

\begin{matrix} \int_{τ = 0}^{t} P [U (t - τ) = u_{l + 1}^{k} | U (0) = u_{0}^{l}] \\ \times P [T_{v} (set (u_{0}^{l - 1})) > τ, v \in V ∖ set (u_{0}^{l})] d P_{T_{u_{l}} (set (u_{0}^{l - 1}))} (τ) . \end{matrix}

Because infection times are mutually independent across nodes, the second term factorizes into a product, leading to

\begin{matrix} \int_{τ = 0}^{t} P [U (t - τ) = u_{l + 1}^{k} | U (0) = u_{0}^{l}] \\ \times \prod_{v \in (V ∖ B) ∖ {u_{l}}} P [T_{v} (B) > τ] p_{T_{u_{l}} (B)} (τ), \end{matrix}

where

B : = W \cup set (u^{l - 1})

.

This expression exactly matches the convolution form

\begin{matrix} f_{W, u^{k}}^{(1)} = (f_{W, u^{k}}^{(l + 1)} * g_{W, u^{l}}) (t), \end{matrix}

where

g_{W, u^{l}} (t) : = p_{T_{u_{l}} (B)} (t) \prod_{v \in (V ∖ B) ∖ {u_{l}}} P [T_{v} (B) > t]

.

For the base case

l = k + 1

, all nodes in S are infected, so the recursion terminates with

\begin{matrix} f_{W, u^{k}}^{(k + 1)} (t) = 1 . \end{matrix}

This proves the desired recurrence relation. □

Appendix B. Proof of Proposition 2

Proof.

For the ML estimator (8), we have from (6) that

\begin{matrix} P [set (U (T^{(3)})) = {0, 1, 2} | set (U (0)) = {0}] \\ = 1, \end{matrix}

since

(0, 1, 2)

is the only possible spreading sequence of size 3 if we assume node 0 is the rumor source.

\begin{matrix} P [set (U (T^{(3)})) = {0, 1, 2} | set (U (0)) = {1}] \\ = 1, \end{matrix}

since

{0, 1, 2}

is the only possible set of size 3 in the network assuming node 1 is the rumor source. By symmetry, we have

\begin{matrix} P [set (U (T^{(3)})) = {0, 1, 2} | set (U (0)) = {2}] \\ = P [set (U (T^{(3)})) = {0, 1, 2} | set (U (0)) = {0}] \\ = 1 . \end{matrix}

It then follows that each node in

{0, 1, 2}

has the same probability of being the rumor source, i.e.,

\arg \max_{s \in S} P [set (U (T^{(k)})) = S | set (U (0)) = {s}] = S

.

For the JML estimator (9), we have from (10) that

\begin{matrix} P [set (U (t)) = S | set (U (0)) = {0}] \\ = \sum_{u^{2} \in Π_{{0, 1, 2} | {0}}} P [U (t) = u^{2} | set (U (0)) = {0}] \\ = P [U (t) = (1, 2) | set (U (0)) = {0}] \\ = (1 * e^{- t}) * e^{- t} \\ = \int_{τ = 0}^{t} [1 - e^{- (t - τ)}] e^{- τ} d τ \\ = {- e^{- τ} - e^{- t} τ|}_{τ = 0}^{t} \\ = - t e^{- t} - e^{- t} + 1 . \end{matrix}

\begin{matrix} P [set (U (t)) = S ∣ set (U (0)) = {1}] \\ = \sum_{u^{2} \in Π_{{0, 1, 2} | {1}}} P [U (t) = u^{2} | set (U (0)) = {1}] \\ = P [U (t) = (0, 2) ∣ set (U (0)) = {1}] \\ + P [U (t) = (2, 0) | set (U (0)) = {1}] \\ = (1 * e^{- t}) * (e^{- t} \times e^{- t}) + (1 * e^{- t}) * (e^{- t} \times e^{- t}) \\ = 2 \int_{τ = 0}^{t} [1 - e^{- (t - τ)}] e^{- 2 τ} d τ \\ = {- e^{- 2 τ} + 2 e^{- t - τ}|}_{τ = 0}^{t} \\ = e^{- 2 t} - 2 e^{- t} + 1 . \end{matrix}

By symmetry, we have

\begin{matrix} P [set (U (t)) = S | set (U (0)) = {2}] \\ = P [set (U (t)) = S | set (U (0)) = {0}] \\ = - t e^{- t} - e^{- t} + 1 . \end{matrix}

Altogether, the supremums over

t \geq 0

of the above three estimators are all 1.

Hence, each node in

{0, 1, 2}

has the equal probability to be the rumor source, i.e.,

\arg \max_{s \in S} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}] = S .

It follows from the above that

\begin{matrix} P [set (U (t)) = S ∣ set (U (0)) = {1}] \\ - P [set (U (t)) = S ∣ set (U (0)) = {0}] \\ = e^{- 2 t} - e^{- t} + t e^{- t} \\ = e^{- t} [t - (1 - e^{- t})] . \end{matrix}

It can be easily verified that

t \geq 1 - e^{- t}

. Thus, we obtain

\begin{matrix} P [set (U (t)) = S | set (U (0)) = {1}] \\ > P [set (U (t)) = S | set (U (0)) = {0}] \\ = P [set (U (t)) = S | set (U (0)) = {2}], \forall t > 0, \end{matrix}

which implies node 1 is the unique s maximizing

P [set (U (t)) = S | set (U (0)) = {s}]

for all

t \in (0, \infty)

. Furthermore, for any probability distribution on the spreading time, the above inequality implies

\begin{matrix} P [set (U (T)) = S | set (U (0)) = {1}] \\ > [set (U (T)) = S | set (U (0)) = {0}] \\ = [set (U (T)) = S | set (U (0)) = {2}] . \end{matrix}

Thus, we have

\begin{matrix} \arg \max_{s \in S} P [set (U (T)) = S ∣ set (U (0)) = {s}] \\ = \arg \max_{s \in S} P [set (U (t)) = S ∣ set (U (0)) = {s}] \\ = {1}, \end{matrix}

for all probability distributions on the spreading time. □

Appendix C. Proof of Proposition 3

Proof.

We begin with the definition of the likelihood probability in (6):

\begin{matrix} P [set (U (t)) = S | set (U (0)) = {s}] . \end{matrix}

Step 1: Decomposition using event equivalence. The event $set (U (t)) = S$ is equivalent to the joint occurrence of the event in which the first k infected nodes from the set S and the observation time t fall between the infection times of the k-th and $(k + 1)$ -th nodes. Therefore, we can rewrite the probability as

$\begin{matrix} P [set (U (T^{(k)})) = S, T^{(k)} \leq t, T^{(k + 1)} > t | set (U (0)) = {s}] . \end{matrix}$
Step 2: Chain rule factorization. Applying the chain rule, we separate the probability into two terms:

$\begin{matrix} P [set (U (T^{(k)})) = S | set (U (0)) = {s}] \\ \times P [T^{(k)} \leq t < T^{(k + 1)} | set (U (0)) = {s}, set (U (T^{(k)})) = S] . \end{matrix}$

Here, the first term depends only on the infection sequence up to size k, while the second term captures the conditional timing information.
Step 3: Taking the supremum over t. We now take the supremum over all observation times $t \geq 0$ . Since the first term does not depend on t, the supremum only affects the second term:

$\begin{matrix} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}] \\ = P [set (U (T^{(k)})) = S | set (U (0)) = {s}] \times sup_{t \geq 0} α (s) (t, s), \end{matrix}$

where

$\begin{matrix} α (s) (t, S) : = P [T^{(k)} \leq t < T^{(k + 1)} | set (U (0)) = {s}, set (U (T^{(k)})) = S] . \end{matrix}$

This shows that the supremum of the JML estimator factorizes into the ML term and a timing factor $α (s) (t, S)$ . □

Appendix D. Proof of Corollary 1

Proof.

We wish to compare the two

\arg \max

expressions in (8) and (9). Starting by recalling Appendix C, which rewrites the joint maximum likelihood (JML) (9) objective as the product of two factors:

\begin{matrix} sup_{t \geq 0} P [set (U (t)) = S | set (U (0)) = {s}] \\ = P [set (U (T^{(k)})) = S | set (U (0)) = {s}] \times sup_{t \geq 0} α (s) (t, s), \end{matrix}

where

\begin{matrix} α (s) (t, S) : = P [T^{(k)} \leq t < T^{(k + 1)} | set (U (0)) = {s}, set (U (T^{(k)})) = S], \end{matrix}

and the first term is exactly the likelihood appearing in (8).

When

α^{(s)} (t, S)

is independent of the candidate source node s, the supremum

{sup}_{t \geq 0} α^{(s)} (t, S)

is a constant with respect to s and hence cancels out when taking the

\arg \max

over

s \in S

. □

Appendix E. Proof of Proposition 4

Proof.

We firstly prove the normalization. By the definition of f and (22),

f (\emptyset) = λ (V ∖ \emptyset, \emptyset) = λ (V, \emptyset) = 0 .

We then continue with the proof of submodularity. Since

\begin{matrix} V ∖ (C_{1} \cap C_{2}) = (V ∖ C_{1}) \cup (V ∖ C_{2}), \\ V ∖ (C_{1} \cup C_{2}) = (V ∖ C_{1}) \cap (V ∖ C_{2}), \end{matrix}

the inequality can be reformulated by

\begin{matrix} λ (V ∖ C_{1}, C_{1}) + λ (V ∖ C_{2}, C_{2}) \\ \geq λ ((V ∖ C_{1}) \cup (V ∖ C_{2}), C_{1} \cap C_{2}) + λ ((V ∖ C_{1}) \cap (V ∖ C_{2}), C_{1} \cup C_{2}) . \end{matrix}

Set

B_{1} = V ∖ C_{1}

and

B_{2} = V ∖ C_{2}

. The above inequality can be expressed as

λ (B_{1}, C_{1}) + λ (B_{2}, C_{2}) \geq λ (B_{1} \cup B_{2}, C_{1} \cap C_{2}) + λ (B_{1} \cap B_{2}, C_{1} \cup C_{2}),

which precisely matches the bisubmodularity condition from (23). □

Appendix F. Proof of Proposition 5

Proof.

Equation (31) gives (32) as an upper bound by removing the constraints for

B ⊉ S

. Equation (31) gives (33) as an upper bound by removing the constraints for

B \cap A \neq \emptyset

. Removing both sets of constraints give (34). The polynomial time complexity follow from the submodularity of the cut in Proposition 4. In particular, the optimal direction can also be computed in polynomial-time as explained in [34]. □

References

Khan, M.S.S. A computer virus propagation model using delay differential equations with probabilistic contagion and immunity. Int. J. Comput. Netw. Commun. 2014, 6, 111. [Google Scholar] [CrossRef]
Ai, J.; Wang, T. Exploring Cyber Risk Contagion-A Boundless Threat. Variance 2023, 16, 1. [Google Scholar]
Tiwary, B.K. Computational medicine: Quantitative modeling of complex diseases. Briefings Bioinform. 2019, 21, 429–440. [Google Scholar] [CrossRef]
Zhao, J.; Cheong, K.H. Enhanced epidemic control: Community-based observer placement and source tracing. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 2747–2758. [Google Scholar] [CrossRef]
Chierichetti, F.; Lattanzi, S.; Panconesi, A. Rumor spreading in social networks. Theor. Comput. Sci. 2011, 412, 2602–2610. [Google Scholar] [CrossRef]
Jin, R.; Wu, W. Schemes of propagation models and source estimators for rumor source detection in online social networks: A short survey of a decade of research. Discret. Math. Algorithms Appl. 2021, 13, 2130002. [Google Scholar] [CrossRef]
He, Q.; Zhang, S.; Cai, Y.; Yuan, W.; Ma, L.; Yu, K. A survey on exploring real and virtual social network rumors: State-of-the-art and research challenges. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
Shelke, S.; Attar, V. Source detection of rumor in social network–a review. Online Soc. Netw. Media 2019, 9, 30–42. [Google Scholar] [CrossRef]
Ahmad, T.; Faisal, M.S.; Rizwan, A.; Alkanhel, R.; Khan, P.W.; Muthanna, A. Efficient fake news detection mechanism using enhanced deep learning model. Appl. Sci. 2022, 12, 1743. [Google Scholar] [CrossRef]
Shao, C.; Ciampaglia, G.L.; Varol, O.; Yang, K.C.; Flammini, A.; Menczer, F. The spread of low-credibility content by social bots. Nat. Commun. 2018, 9, 4787. [Google Scholar] [CrossRef]
Do Nascimento, I.J.B.; Pizarro, A.B.; Almeida, J.M.; Azzopardi-Muscat, N.; Gonçalves, M.A.; Björklund, M.; Novillo-Ortiz, D. Infodemics and health misinformation: A systematic review of reviews. Bull. World Health Organ. 2022, 100, 544. [Google Scholar] [CrossRef]
Cremer, F.; Sheehan, B.; Fortmann, M.; Kia, A.N.; Mullins, M.; Murphy, F.; Materne, S. Cyber risk and cybersecurity: A systematic review of data availability. Geneva Pap. Risk Insurance. Issues Pract. 2022, 47, 698. [Google Scholar] [CrossRef]
Bailey, N.T.J. The Mathematical Theory of Infectious Diseases and Its Applications, 2nd ed.; Charles Griffin & Company Ltd.: Bucks, UK, 1975; pp. xvi + 413. [Google Scholar]
Shah, D.; Zaman, T. Rumors in a Network: Who’s the Culprit? IEEE Trans. Inf. Theory 2011, 57, 5163–5181. [Google Scholar] [CrossRef]
MacKay, D.J.C. Information Theory, Inference & Learning Algorithms; Cambridge University Press: New York, NY, USA, 2003. [Google Scholar]
Shah, D.; Zaman, T. Rumor centrality: A universal source detector. SIGMETRICS Perform. Eval. Rev. 2012, 40, 199–210. [Google Scholar] [CrossRef]
Athreya, K.B.; Ney, P.E. Multi-Type Branching Processes. In Branching Processes; Springer: Berlin/Heidelberg, Germany, 1972; pp. 181–228. [Google Scholar]
Yu, P.D.; Tan, C.W.; Fu, H.L. Epidemic Source Detection in Contact Tracing Networks: Epidemic Centrality in Graphs and Message-Passing Algorithms. IEEE J. Sel. Top. Signal Process. 2022, 16, 234–249. [Google Scholar] [CrossRef]
Tan, C.W.; Yu, P.D. Contagion Source Detection in Epidemic and Infodemic Outbreaks: Mathematical Analysis and Network Algorithms. Found. Trends Netw. 2023, 13, 107–251. [Google Scholar] [CrossRef]
Fan, T.H.; Wang, I.H. Rumor Source Detection: A Probabilistic Perspective. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4159–4163. [Google Scholar] [CrossRef]
Ahlswede, R.; Cai, N.; Li, S.Y.; Yeung, R.W. Network information flow. IEEE Trans. Inf. Theory 2000, 46, 1204–1216. [Google Scholar] [CrossRef]
Li, S.Y.; Yeung, R.; Cai, N. Linear network coding. IEEE Trans. Inf. Theory 2003, 49, 371–381. [Google Scholar] [CrossRef]
Koetter, R.; Medard, M. An algebraic approach to network coding. IEEE/ACM Trans. Netw. 2003, 11, 782–795. [Google Scholar] [CrossRef]
Jaggi, S.; Sanders, P.; Chou, P.; Effros, M.; Egner, S.; Jain, K.; Tolhuizen, L. Polynomial time algorithms for multicast network code construction. IEEE Trans. Inf. Theory 2005, 51, 1973–1982. [Google Scholar] [CrossRef]
Harvey, N.; Kleinberg, R.; Lehman, A. On the capacity of information networks. IEEE Trans. Inf. Theory 2006, 52, 2345–2364. [Google Scholar] [CrossRef]
Ho, T.; Medard, M.; Koetter, R.; Karger, D.; Effros, M.; Shi, J.; Leong, B. A Random Linear Network Coding Approach to Multicast. IEEE Trans. Inf. Theory 2006, 52, 4413–4430. [Google Scholar] [CrossRef]
Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency; Springer: Berlin/Heidelberg, Germany, 2003; Volume 24. [Google Scholar]
Yeung, R.W. Information Theory and Network Coding; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
Zhang, Z.; Yeung, R. On characterization of entropy function via information inequalities. IEEE Trans. Inf. Theory 1998, 44, 1440–1452. [Google Scholar] [CrossRef]
Zhao, C.; Wang, Z.; Zhou, Q.; Tan, C.W.; Chan, C. Infodemic Source Detection: Enhanced Formulations with Information Flow. In Proceedings of the 2024 IEEE International Symposium on Information Theory (ISIT), Athens, Greece, 7–12 July 2024; pp. 3291–3296. [Google Scholar] [CrossRef]
Goemans, M.X.; Iwata, S.; Zenklusen, R. A flow model based on polylinking system. Math. Program. 2012, 135, 1–23. [Google Scholar] [CrossRef]
Schrijver, A. Matroids and linking systems. J. Comb. Theory, Ser. B 1979, 26, 349–369. [Google Scholar] [CrossRef][Green Version]
Chan, C.; Shum, K.W.; Sun, Q.T. Combinatorial flow over cyclic linear networks. In Proceedings of the 2013 IEEE Information Theory Workshop (ITW), Sevilla, Spain, 9–13 September 2013; pp. 1–5. [Google Scholar]
Chan, C. Matroidal undirected network. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings, Cambridge, MA, USA, 1–6 July 2012; pp. 1498–1502. [Google Scholar]
Nemhauser, G.L.; Wolsey, L.A.; Fisher, M.L. An analysis of approximations for maximizing submodular set functions—I. Math. Program. 1978, 14, 265–294. [Google Scholar] [CrossRef]
Barabási, A.L.; Albert, R. Emergence of scaling in random networks. Science 1999, 286, 509–512. [Google Scholar] [CrossRef]
Watts, D.J.; Strogatz, S.H. Collective dynamics of ‘small-world’networks. Nature 1998, 393, 440–442. [Google Scholar] [CrossRef]
Renyi, E. On random graph. Publ. Math. 1959, 6, 290–297. [Google Scholar]
Erdos, P.; Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 1960, 5, 17–60. [Google Scholar]
Krause, A.; Golovin, D. Submodular function maximization. Tractability 2014, 3, 3. [Google Scholar]

Figure 1. Example network where node 1 is the rumor source node and both nodes 0 and 2 are infected. Source node is in cyan. Orange dots with dashed red border represent infected nodes.

Figure 2. Multicasting two bits

M_{1}

and

M_{2}

of rumor from different sources to the sink

A : = {5, 6}

over butterfly networks with different link directions. Source nodes are represented by cyan dots, while sinks are in orange. (a) Coding for S = {1, 2}. (b) Routing for S = {3}.

Figure 2. Multicasting two bits

M_{1}

and

M_{2}

of rumor from different sources to the sink

A : = {5, 6}

over butterfly networks with different link directions. Source nodes are represented by cyan dots, while sinks are in orange. (a) Coding for S = {1, 2}. (b) Routing for S = {3}.

Figure 3. (a) Achieved multicast rate

ρ (S^{(k)} A, A)

using lazy–greedy forward search on a 200-node Barabási–Albert graph. (b) Number of oracle calls made by lazy–greedy and plain greedy forward search algorithms as a function of graph size n, with fixed source budget

k = 10

, using the unicast rate model (rate_uc).

Figure 3. (a) Achieved multicast rate

ρ (S^{(k)} A, A)

using lazy–greedy forward search on a 200-node Barabási–Albert graph. (b) Number of oracle calls made by lazy–greedy and plain greedy forward search algorithms as a function of graph size n, with fixed source budget

k = 10

, using the unicast rate model (rate_uc).

Table 1. Feasible source sets found by backward elimination on synthetic graphs using the contraction-based rate function

ρ_{ms}

. Each row shows whether a feasible source set exists for a given rate threshold, along with the size and node indices of the selected source set.

Table 1. Feasible source sets found by backward elimination on synthetic graphs using the contraction-based rate function

ρ_{ms}

. Each row shows whether a feasible source set exists for a given rate threshold, along with the size and node indices of the selected source set.

Graph	Target Rate	Feasible	Set Size	Source Set
BA	0.5	Yes	1	{19}
BA	1.0	Yes	1	{19}
BA	2.0	Yes	1	{18}
BA	3.0	Yes	3	{17, 18, 19}
BA	6.0	Yes	5	{11, 12, 15, 17, 18}
WS	0.5	Yes	1	{17}
WS	1.0	Yes	1	{17}
WS	2.0	Yes	2	{11, 16}
WS	3.0	Yes	3	{10, 11, 13}
WS	6.0	No	–	–
ER	0.5	Yes	1	{17}
ER	1.0	Yes	1	{17}
ER	2.0	Yes	1	{17}
ER	3.0	Yes	3	{16, 18, 19}
ER	6.0	No	–	–

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Z.; Zhao, C.; Zhou, Q.; Tan, C.W.; Chan, C. Infodemic Source Detection with Information Flow: Foundations and Scalable Computation. Entropy 2025, 27, 936. https://doi.org/10.3390/e27090936

AMA Style

Wang Z, Zhao C, Zhou Q, Tan CW, Chan C. Infodemic Source Detection with Information Flow: Foundations and Scalable Computation. Entropy. 2025; 27(9):936. https://doi.org/10.3390/e27090936

Chicago/Turabian Style

Wang, Zimeng, Chao Zhao, Qiaoqiao Zhou, Chee Wei Tan, and Chung Chan. 2025. "Infodemic Source Detection with Information Flow: Foundations and Scalable Computation" Entropy 27, no. 9: 936. https://doi.org/10.3390/e27090936

APA Style

Wang, Z., Zhao, C., Zhou, Q., Tan, C. W., & Chan, C. (2025). Infodemic Source Detection with Information Flow: Foundations and Scalable Computation. Entropy, 27(9), 936. https://doi.org/10.3390/e27090936

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Infodemic Source Detection with Information Flow: Foundations and Scalable Computation^†

Abstract

1. Introduction

2. Preliminary

2.1. SI Model

2.2. Fundamental Limitations of Conventional Estimators

3. Problem Formulation

3.1. Rate-Constrained Model

3.2. Network Links Model

4. Main Results

4.1. Rate-Constrained Source Selection Algorithms

4.2. Source Feasibility Under Multicast Rate Constraints

4.3. Computational and Structural Behavior of Lazy–Greedy Forward Search

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of Proposition 1

Appendix B. Proof of Proposition 2

Appendix C. Proof of Proposition 3

Appendix D. Proof of Corollary 1

Appendix E. Proof of Proposition 4

Appendix F. Proof of Proposition 5

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Infodemic Source Detection with Information Flow: Foundations and Scalable Computation †

Abstract

1. Introduction

2. Preliminary

2.1. SI Model

2.2. Fundamental Limitations of Conventional Estimators

3. Problem Formulation

3.1. Rate-Constrained Model

3.2. Network Links Model

4. Main Results

4.1. Rate-Constrained Source Selection Algorithms

4.2. Source Feasibility Under Multicast Rate Constraints

4.3. Computational and Structural Behavior of Lazy–Greedy Forward Search

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of Proposition 1

Appendix B. Proof of Proposition 2

Appendix C. Proof of Proposition 3

Appendix D. Proof of Corollary 1

Appendix E. Proof of Proposition 4

Appendix F. Proof of Proposition 5

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Infodemic Source Detection with Information Flow: Foundations and Scalable Computation^†