Peer Reporting: Sampling Design and Unbiased Estimates

Wen, Kang; Mou, Jianhong; Lu, Xin

doi:10.3390/e28010116


by Kang Wen, Jianhong Mou and Xin Lu ^*

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

^* Author to whom correspondence should be addressed.

Entropy 2026, 28(1), 116; https://doi.org/10.3390/e28010116

Submission received: 3 December 2025 / Revised: 4 January 2026 / Accepted: 7 January 2026 / Published: 18 January 2026

(This article belongs to the Special Issue Complexity of Social Networks)


Abstract

The Ego-Centric Sampling Method (ECM) leverages individual-level reports about peers to estimate population proportions within social networks, offering strong privacy protection without requiring full network data. However, the conventional ECM estimator is unbiased only under the restrictive assumption of a homogeneous network, where node degrees are uniform and uncorrelated with attributes. To overcome this limitation, we introduce the Activity Ratio Corrected ECM estimator (${ECM}_{ac}$), which exploits network reciprocity to recast the population–proportion problem into an equivalent formulation in edge space. This reformulation relies solely on ego–peer data and explicitly corrects for degree–attribute dependencies, yielding unbiased and stable estimates even in highly heterogeneous networks. Simulations and analyses on real-world networks show that ${ECM}_{ac}$ reduces estimation error by up to 70% compared with the conventional ECM. Our results establish a theoretically grounded and practically scalable framework for unbiased inference in network-based sampling designs.

Keywords:

network sampling; ego-network; statistical inference; activity ratio; complex networks

1. Introduction

Sample surveys are fundamental to quantitative research in social, behavioral, and health sciences, forming the empirical basis for understanding population characteristics and health reasoning [1]. More broadly, recent work in informatics emphasizes that reliable inference often relies on extracting essential structure from incomplete and noisy observations rather than from fully observed data [2,3]. However, the validity of survey data is severely challenged when surveys involve sensitive topics, such as illicit drug use, sexual behaviors, or political dissent [4]. When faced with such questions, respondents may exhibit protective behaviors due to fears of disclosure and social judgment. These behaviors include direct refusal to participate (unit nonresponse), skipping specific questions (item nonresponse), or providing socially desirable answers. For example, populations at high risk for sexually transmitted infections, such as sex workers, injecting drug users, or men who have sex with men (MSM), often avoid responding to questions about highly sensitive or illegal behaviors, thereby concealing their health conditions [5,6,7]. In these settings, supplementing ego-level responses with information about social connections provides an opportunity to access broader population signals without relying solely on direct self-disclosure. Such measurement issues can cause systematic errors in estimates, which may misinform policy, distort scientific understanding, and ultimately undermine evidence-based decision-making [8,9].

To address these challenges, several indirect questioning methods have been developed. Classical examples include the Randomized Response Technique (RRT) [10] and the Item Count Technique (ICT) [11,12]. In RRT, respondents follow a simple randomization rule so that their answers remain confidential, allowing them to deny revealing sensitive information while still contributing to accurate group estimates. Such designs can also be implemented anonymously to further protect privacy [13,14]. Building on these ideas, the Network Scale-Up Method (NSUM) [15,16] asks respondents to report how many of their peers (alters) belong to a target group, thereby enabling indirect estimation of hidden populations. Many studies have extended NSUM to account for degree heterogeneity, social visibility, and non-random mixing, giving rise to generalized scale-up estimators [17,18,19]. While effective for population size estimation, these methods typically rely on additional assumptions, external information, or auxiliary samples, and often involve complex survey designs with limited statistical efficiency [20,21]. These trade-offs motivate the development of simpler yet robust alternatives for network-based inference.

An alternative approach utilizes the structure of social networks, building on the observation that respondents are often more willing to report on their peers than to disclose their own sensitive attributes [22,23]. Unlike scale-up methods that target population size, this line of work focuses on estimating population proportions from ego-centric samples. This innovative idea was first developed and implemented in the context of Respondent-Driven Sampling (RDS) [24,25]. By exploiting ego-centric network information, the inclusion probability of each ego being reported can be analytically derived, and the asymptotically unbiased estimator

P_{l u}

[26] was proposed to estimate population proportions in hidden populations [27]. Simulation studies on real-world networks as well as field applications among hard-to-reach populations have shown that

P_{l u}

substantially outperforms other RDS estimators in terms of both bias and efficiency [28,29,30].

Building on this idea, the Ego-Centric Sampling Method (ECM) [31] collects information through respondents’ ego-centric networks to infer sensitive attributes indirectly. In this design, a representative sample of egos is drawn from the population. Each ego reports the size of their personal network (degree) and the number of peers who possess a specific sensitive attribute (for example, attribute A). Instead of asking egos to disclose their own attributes, ECM makes use of their knowledge about peers to estimate population-level proportions. This indirect approach protects respondents’ privacy while retaining the statistical rigor of standard sampling, making it a simple and effective framework for studying sensitive attributes in social and behavioral research.

However, the conventional ECM estimator is unbiased only under the assumption of a homogeneous network, where node degrees are uniform and uncorrelated with attributes. This assumption often fails in heterogeneous networks that exhibit strong degree–attribute correlation. Specifically, if members of one group (for example, attribute A) are more active, meaning they have a higher average degree, they are more likely to be reported and thus included in the ego-centric sample once any of their peers is selected. The conventional ECM estimator does not adequately correct for this overrepresentation caused by differences in node activity, which is measured by the activity ratio (AR). As a result, it produces a structural bias that does not diminish with increasing sample size. To date, no general correction for this bias has been established.

To address this limitation, we propose the Activity Ratio Corrected ECM estimator (${ECM}_{ac}$). The key idea of ${ECM}_{ac}$ is to use the reciprocity property of undirected networks to correct for the imbalance in node activity. Instead of directly estimating the overall population proportion of nodes with attribute A, the method focuses on the probabilities of connections between groups (for example, links between A and B nodes). These edge-based quantities can be measured directly from the ego-centric sample, allowing ${ECM}_{ac}$ to obtain an unbiased estimate of $P (A)$ even when node degree and attribute are correlated. This correction substantially improves estimation accuracy in heterogeneous networks where the conventional ECM tends to be biased.

The remainder of this paper is organized as follows. Section 2 details the theoretical framework of the

{ECM}_{ac}

method. We then describe our experimental design, present the simulation and empirical results, and conduct a systematic sensitivity analysis. The paper concludes with a summary of our findings and discusses future research directions.

2. Estimation Framework

This section develops the theoretical foundation for our work. We first analyze the statistical properties of the conventional ECM estimator to derive the source of its bias, and then introduce our adjusted estimator,

{ECM}_{ac}

, as a direct correction.

2.1. Notation and Model Specification

Let $G = (V, E)$ be an undirected simple graph, where $V = \{v_{1}, v_{2}, \dots, v_{N}\}$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. Undirectedness implies $e_{i j} = e_{j i} \in \{0, 1\}$, where $e_{i j} = 1$ indicates that an edge exists between nodes i and j. Each node i carries a binary attribute $a_{i} \in \{0, 1\}$, where $a_{i} = 1$ denotes nodes belonging to attribute class A, and $a_{i} = 0$ denotes nodes belonging to attribute class B. The degree of node i and its decomposition by neighbor attributes are defined as

k_{i} = \sum_{j = 1}^{N} e_{i j}, k_{i}^{A} = \sum_{j = 1}^{N} e_{i j} a_{j}, k_{i}^{B} = k_{i} - k_{i}^{A} .

(1)

The quantity of interest is the population proportion of nodes with attribute A, which is the main target of estimation:

P (A) = \frac{1}{N} \sum_{i = 1}^{N} a_{i} .

(2)

Let $A = \{i : a_{i} = 1\}$ and $B = \{i : a_{i} = 0\}$ with sizes $N_{A} = | A |$ and $N_{B} = | B |$. The group mean degrees and the activity ratio are defined as

{\bar{k}}_{A} = \frac{1}{N_{A}} \sum_{i \in A} k_{i}, {\bar{k}}_{B} = \frac{1}{N_{B}} \sum_{i \in B} k_{i}, A R = \frac{{\bar{k}}_{A}}{{\bar{k}}_{B}} .

(3)
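As a concrete illustration of Equations (1)–(3), the following minimal Python sketch computes the per-node counts, $P (A)$, and AR on a toy graph. The adjacency-dictionary representation, the function name `degree_stats`, and the five-node example are assumptions introduced here for illustration only; they are not taken from the study.

```python
from typing import Dict, Set, Tuple


def degree_stats(
    adj: Dict[int, Set[int]], attr: Dict[int, int]
) -> Tuple[Dict[int, Tuple[int, int, int]], float, float]:
    """Return per-node (k_i, k_i^A, k_i^B), P(A), and AR for an undirected graph.

    Assumes both attribute classes are non-empty and class B has positive mean degree.
    """
    per_node = {}
    for i, nbrs in adj.items():
        k = len(nbrs)                      # Eq. (1): degree
        k_a = sum(attr[j] for j in nbrs)   # Eq. (1): neighbors in class A
        per_node[i] = (k, k_a, k - k_a)
    nodes_a = [i for i in adj if attr[i] == 1]
    nodes_b = [i for i in adj if attr[i] == 0]
    p_a = len(nodes_a) / len(adj)          # Eq. (2)
    mean_k_a = sum(per_node[i][0] for i in nodes_a) / len(nodes_a)
    mean_k_b = sum(per_node[i][0] for i in nodes_b) / len(nodes_b)
    return per_node, p_a, mean_k_a / mean_k_b  # Eq. (3): AR


# Hypothetical toy graph: edges 0-1, 0-2, 0-3, 1-2, 3-4; nodes 0 and 1 are in class A.
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
toy_attr = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}
per_node, p_a, ar = degree_stats(toy_adj, toy_attr)
```

On this toy graph the two A-nodes have mean degree 2.5 against 5/3 for the three B-nodes, so AR = 1.5 while $P (A) = 0.4$.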

2.2. Observables from Ego-Centric Sampling

Ego-centric sampling collects local network information in two stages. First, a set $S$ of respondents (egos) is drawn from the population, typically by simple random sampling. Second, each sampled ego $i \in S$ reports their degree and the number of peers in each attribute class, recorded as $(k_{i}, k_{i}^{A}, k_{i}^{B})$, as illustrated in Figure 1.

We partition the sample according to the respondent’s attribute:

S_{A} = \{i \in S : a_{i} = 1\}, S_{B} = \{i \in S : a_{i} = 0\},

(4)

and define the total degrees within each group as

s_{A} = \sum_{i \in S_{A}} k_{i}, s_{B} = \sum_{i \in S_{B}} k_{i} .

(5)

The cross-group neighbor counts, aggregated from the respondents’ side, are defined as

m_{A B} = \sum_{i \in S_{A}} k_{i}^{B}, m_{B A} = \sum_{i \in S_{B}} k_{i}^{A} .

(6)

Based on these observations, we define ${\hat{P}}_{A B}^{(S)}$ as the sample estimate of the probability that a neighbor of an A-node belongs to B, and ${\hat{P}}_{B A}^{(S)}$ as its reverse counterpart. The sample activity ratio is denoted by $\hat{A R}$:

{\hat{P}}_{A B}^{(S)} = \frac{m_{A B}}{s_{A}}, {\hat{P}}_{B A}^{(S)} = \frac{m_{B A}}{s_{B}}, \hat{A R} = \frac{s_{A} / | S_{A} |}{s_{B} / | S_{B} |} .

(7)
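The quantities in Equations (4)–(7) can be assembled directly from the ego reports. The sketch below assumes, for illustration, that each report is stored as a tuple `(a_i, k_i, k_i_A)`; this layout, the function name `sample_observables`, and the five example reports are hypothetical.

```python
def sample_observables(reports):
    """Compute s_A, s_B, m_AB, m_BA and the ratios of Eq. (7).

    `reports` is a list of (a_i, k_i, k_i^A) tuples, one per sampled ego.
    Assumes both groups appear in the sample with positive total degree.
    """
    a_egos = [(k, k_a) for a, k, k_a in reports if a == 1]  # S_A, Eq. (4)
    b_egos = [(k, k_a) for a, k, k_a in reports if a == 0]  # S_B, Eq. (4)
    s_a = sum(k for k, _ in a_egos)                         # Eq. (5)
    s_b = sum(k for k, _ in b_egos)
    m_ab = sum(k - k_a for k, k_a in a_egos)                # Eq. (6): k_i^B = k_i - k_i^A
    m_ba = sum(k_a for _, k_a in b_egos)
    return {
        "s_A": s_a,
        "s_B": s_b,
        "m_AB": m_ab,
        "m_BA": m_ba,
        "P_AB": m_ab / s_a,                                  # Eq. (7)
        "P_BA": m_ba / s_b,
        "AR": (s_a / len(a_egos)) / (s_b / len(b_egos)),
    }


# Hypothetical sample: two A-egos followed by three B-egos.
example_reports = [(1, 4, 1), (1, 6, 3), (0, 2, 1), (0, 3, 0), (0, 5, 2)]
obs = sample_observables(example_reports)
```

For these reports, $s_{A} = s_{B} = 10$, $m_{A B} = 6$, and $m_{B A} = 3$, giving ${\hat{P}}_{A B}^{(S)} = 0.6$, ${\hat{P}}_{B A}^{(S)} = 0.3$, and $\hat{A R} = 1.5$.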

2.3. ECM Estimator

The traditional ECM estimator [31] begins with a decomposition by degree strata. Let

n_{k}

be the number of nodes with degree k, and

p (A ∣ k)

be the conditional probability that a node of degree k belongs to class A. The total number of edges emanating from A-nodes of degree k is given by

\sum_{i : k_{i} = k} k_{i}^{A} = n_{k} \cdot p (A ∣ k) \cdot k .

(8)

The expected number of nodes with attribute A in the population can therefore be written as

E (N_{A}) = \sum_{k = 1}^{k_{max}} p (A ∣ k) n_{k},

(9)

which implies that the population proportion is

P (A) = N^{- 1} \sum_{k} n_{k} p (A ∣ k)

. Combining this with (8), an idealized (theoretical) form of the estimator can be expressed as

{\hat{P}}_{ideal} (A) = \frac{1}{N} \sum_{k = 1}^{k_{max}} \sum_{i : k_{i} = k} \frac{k_{i}^{A}}{k} .

(10)

In studies involving hidden or hard-to-reach populations, the full network is typically unobservable. Therefore, the sample mean is used as a practical substitute for the corresponding population quantity, yielding the ECM estimator:

{\hat{P}}_{ECM} (A) = \frac{1}{| S |} \sum_{i \in S} \frac{k_{i}^{A}}{k_{i}} .

(11)

ECM assumes that node degree is independent of attribute type. Under this condition, the unweighted average across sampled egos’ proportions of A-type neighbors equals the population proportion

P (A)

. When groups differ in their mean degree, the more active group becomes overrepresented and the ECM estimator is biased, which motivates the adjusted estimator

{ECM}_{ac}

.
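To make the source of the bias concrete, the sketch below evaluates Equation (11) on a hypothetical five-node graph (edges 0-1, 0-2, 0-3, 1-2, 3-4, with nodes 0 and 1 in class A), using every node as an ego. The tuple layout `(a_i, k_i, k_i_A)` and the decision to drop egos with zero degree are assumptions of this illustration, since the text does not specify how isolated egos are handled.

```python
def ecm_estimate(reports):
    """Conventional ECM, Eq. (11): unweighted mean of each ego's share of A-peers.

    `reports` is a list of (a_i, k_i, k_i^A) tuples. Egos with k_i = 0 are
    dropped because their share is undefined (an assumption of this sketch).
    """
    shares = [k_a / k for _, k, k_a in reports if k > 0]
    if not shares:
        raise ValueError("no ego with at least one peer")
    return sum(shares) / len(shares)


# Census of the toy graph: P(A) = 2/5 and AR = 1.5.
census_reports = [(1, 3, 1), (1, 2, 1), (0, 2, 2), (0, 2, 1), (0, 1, 0)]
ecm_value = ecm_estimate(census_reports)
```

Even with every node sampled, the estimate is 7/15 (about 0.467) rather than the true 0.4, an overestimate in the direction expected when the A-group is the more active one (AR > 1).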

2.4. ${ECM}_{ac}$ Based on Reciprocity

We derive

{ECM}_{ac}

from two basic equalities that hold in the undirected graph G. Because the graph is undirected, the number of cross-group edges counted from A to B must equal the number counted from B to A. This structural equality is referred to as reciprocity:

E_{A B} = E_{B A} .

(12)

Let

P_{A B}

and

P_{B A}

denote the probabilities that an edge attached to an A-node or a B-node, respectively, connects to the other group. These are defined as

P_{A B} = \frac{\sum_{i \in A} k_{i}^{B}}{\sum_{i \in A} k_{i}}, P_{B A} = \frac{\sum_{j \in B} k_{j}^{A}}{\sum_{j \in B} k_{j}} .

(13)

By definition, the number of cross-group edges can be expressed in two equivalent forms:

E_{A B} = N_{A} {\bar{k}}_{A} P_{A B}, E_{B A} = N_{B} {\bar{k}}_{B} P_{B A} .

(14)

Starting from this relation,

N_{A} {\bar{k}}_{A} P_{A B} = N_{B} {\bar{k}}_{B} P_{B A}

, we divide both sides by

N {\bar{k}}_{B}

and use

P (A) = N_{A} / N

,

1 - P (A) = N_{B} / N

, and

A R = {\bar{k}}_{A} / {\bar{k}}_{B}

to obtain:

P (A) A R P_{A B} = (1 - P (A)) P_{B A} .

(15)

Expanding the right-hand side of Equation (15) gives $P (A) \cdot A R \cdot P_{A B} = P_{B A} - P (A) \cdot P_{B A}$. Collecting the terms in $P (A)$ yields $P (A) (A R \cdot P_{A B} + P_{B A}) = P_{B A}$.

From this, we derive the population-level relationship that forms the theoretical basis for the

{ECM}_{ac}

estimator:

P (A) = \frac{P_{B A}}{P_{B A} + A R \cdot P_{A B}} .

(16)

The corresponding sample-based estimator is obtained by the plug-in principle, substituting the population quantities in Equation (16) with their empirical counterparts:

{\hat{P}}_{ac} (A) = \frac{{\hat{P}}_{B A}^{(S)}}{{\hat{P}}_{B A}^{(S)} + \hat{A R} \cdot {\hat{P}}_{A B}^{(S)}} .

(17)

Using Equation (7), the estimator can be written directly in terms of the observed sample counts:

{\hat{P}}_{ac} (A) = \frac{m_{B A} s_{A}}{m_{B A} s_{A} + \hat{A R} m_{A B} s_{B}} .

(18)
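A minimal sketch of Equation (18) follows, again assuming reports stored as hypothetical `(a_i, k_i, k_i_A)` tuples. Substituting the definition of $\hat{A R}$ from Equation (7) into Equation (18) cancels $s_{A}$ and $s_{B}$, leaving $m_{B A} | S_{A} | / (m_{B A} | S_{A} | + m_{A B} | S_{B} |)$; the code checks this identity numerically on a census of a five-node toy graph (edges 0-1, 0-2, 0-3, 1-2, 3-4, nodes 0 and 1 in class A). Raising an error for degenerate samples is a choice made for this sketch, not a documented part of the method.

```python
def ecm_ac(reports):
    """Activity-ratio-corrected estimator, Eq. (18).

    `reports` is a list of (a_i, k_i, k_i^A) tuples, one per sampled ego.
    """
    a_egos = [(k, k_a) for a, k, k_a in reports if a == 1]
    b_egos = [(k, k_a) for a, k, k_a in reports if a == 0]
    if not a_egos or not b_egos:
        raise ValueError("sample must contain egos from both groups")
    s_a = sum(k for k, _ in a_egos)
    s_b = sum(k for k, _ in b_egos)
    if s_b == 0:
        raise ValueError("B-egos report no peers, so AR is undefined")
    m_ab = sum(k - k_a for k, k_a in a_egos)
    m_ba = sum(k_a for _, k_a in b_egos)
    ar_hat = (s_a / len(a_egos)) / (s_b / len(b_egos))   # Eq. (7)
    denom = m_ba * s_a + ar_hat * m_ab * s_b
    if denom == 0:
        raise ValueError("no cross-group reports in the sample")
    return m_ba * s_a / denom                            # Eq. (18)


# Census of the toy graph: true P(A) = 0.4, AR = 1.5.
census_reports = [(1, 3, 1), (1, 2, 1), (0, 2, 2), (0, 2, 1), (0, 1, 0)]
p_ac = ecm_ac(census_reports)

# Simplified form after cancelling s_A and s_B: here m_BA = m_AB = 3, |S_A| = 2, |S_B| = 3.
p_simplified = (3 * 2) / (3 * 2 + 3 * 3)
```

On the census, reciprocity holds exactly ($m_{A B} = m_{B A} = 3$) and the estimator returns the true proportion 0.4.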

2.5. Variance Estimation

The analytical variance of network-based estimators is typically intractable because it depends on complex network features such as topology, degree heterogeneity, and inter-node dependence. To obtain empirical variance estimates and construct confidence intervals for ${ECM}_{ac}$, we adopt a nonparametric bootstrap approach.

The bootstrap procedure proceeds as follows:

(1): Draw a bootstrap replicate by resampling ego-centric networks (each ego together with its reported peers) with replacement, and denote the resulting bootstrap sample as $B_{b}$ ;
(2): Based on $B_{b}$ , compute the corresponding estimator ${\hat{P}}^{* (b)} (A)$ using ${ECM}_{ac}$ ;
(3): Repeat steps (1)–(2) for $b = 1, 2, \dots, B$ , obtaining a set of bootstrap estimates:

$\{{\hat{P}}^{* (1)} (A), {\hat{P}}^{* (2)} (A), \dots, {\hat{P}}^{* (B)} (A)\} .$

(19)
(4): Sort these estimates in ascending order, and construct the $(1 - α)$ percentile confidence interval as

${CI}_{1 - α} = [{\hat{P}}^{* (⌈ α B / 2 ⌉)} (A), {\hat{P}}^{* (⌈ (1 - α / 2) B ⌉)} (A)] .$

(20)

Since different resampling schemes correspond to different assumptions about dependence in the data, we consider three bootstrap designs to reflect multiple sources of uncertainty:

BS-Ego assumes that ego-centric networks are independent sampling units. It captures between-ego variation and reflects uncertainty due to the limited number of egos.

BS-Tree adds within-ego resampling, treating each ego’s reported peers as nested observations. This design accounts for additional variability introduced by the hierarchical (ego–peer) structure of ego-centric data [32].

BS-Pool treats all ego–peer pairs as exchangeable edges and resamples them directly. It reflects uncertainty arising from the random formation of connections rather than from the selection of egos.

These three schemes differ only in how resampled ego-centric datasets are constructed. Together, they provide complementary perspectives on estimator variability under distinct dependence assumptions. BS-Ego assumes independent egos, BS-Tree allows dependent peers within the same ego, and BS-Pool assumes independence only at the edge level. A workflow overview is shown in Figure 2, and detailed algorithmic descriptions for each scheme are provided in Appendix A.
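The BS-Ego scheme and the percentile interval of Equation (20) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Appendix A: reports are hypothetical `(a_i, k_i, k_i_A)` tuples, the function names are invented here, and degenerate replicates (for example, a resample with no A-egos) are simply redrawn, which is a simplification chosen for this sketch.

```python
import math
import random


def ecm_ac(reports):
    """Eq. (17): plug-in estimator from (a_i, k_i, k_i^A) ego reports."""
    a_egos = [(k, k_a) for a, k, k_a in reports if a == 1]
    b_egos = [(k, k_a) for a, k, k_a in reports if a == 0]
    if not a_egos or not b_egos:
        raise ValueError("both groups must be present")
    s_a = sum(k for k, _ in a_egos)
    s_b = sum(k for k, _ in b_egos)
    if s_a == 0 or s_b == 0:
        raise ValueError("a group reports no peers")
    p_ab = sum(k - k_a for k, k_a in a_egos) / s_a
    p_ba = sum(k_a for _, k_a in b_egos) / s_b
    ar_hat = (s_a / len(a_egos)) / (s_b / len(b_egos))
    denom = p_ba + ar_hat * p_ab
    if denom == 0:
        raise ValueError("no cross-group reports")
    return p_ba / denom


def bs_ego_ci(reports, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI, Eq. (20), resampling whole ego reports with replacement."""
    rng = random.Random(seed)
    estimates, attempts = [], 0
    while len(estimates) < n_boot:
        attempts += 1
        if attempts > 20 * n_boot:
            raise RuntimeError("too many degenerate bootstrap replicates")
        replicate = rng.choices(reports, k=len(reports))  # step (1)
        try:
            estimates.append(ecm_ac(replicate))           # step (2)
        except ValueError:
            continue  # assumption of this sketch: redraw degenerate replicates
    estimates.sort()                                      # step (4)
    # Eq. (20) uses 1-based order statistics; convert to 0-based indices.
    lo_idx = max(math.ceil(alpha * n_boot / 2), 1) - 1
    hi_idx = min(math.ceil((1 - alpha / 2) * n_boot), n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]
```

A call such as `bs_ego_ci(reports, n_boot=1000, alpha=0.05)` returns the lower and upper bounds of the 95% interval; fixing `seed` makes the interval reproducible.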

3. Experimental Design

3.1. Synthetic and Real-World Networks

We rigorously evaluate the performance of

{ECM}_{ac}

against the conventional ECM using a comprehensive framework spanning both synthetic and real-world networks.

Synthetic Networks. We generated Erdős–Rényi (ER) and Barabási–Albert (BA) networks of 10,000 nodes to represent homogeneous and heterogeneous degree distributions, respectively. Nodes were assigned a binary attribute (A or B) to achieve target proportions

P (A) \in \{0.1, 0.2, 0.3, 0.4\}

. Within these networks, we systematically control four key structural properties:

(1): Density ( $Ψ$ ): quantifies the overall connectivity of a network [33] and is calculated as

$Ψ = \frac{2 E}{N (N - 1)},$

(21)

where E is the number of edges, and N is the number of nodes in the network.
(2): Average Clustering Coefficient ( $C_{avg}$ ): is a measure of how nodes tend to cluster together [34]. For each node i, the local clustering coefficient is defined as

$C_{i} = \frac{2 Δ_{i}}{k_{i} (k_{i} - 1)},$

(22)

where $k_{i}$ is the degree of node i, and $Δ_{i}$ is the number of triangles that node i forms with its peers. The overall average clustering coefficient is then the mean of all individual $C_{i}$ :

$C_{avg} = \frac{1}{N} \sum_{i = 1}^{N} C_{i} .$

(23)
(3): Homophily (H): quantifies the extent to which nodes prefer connections within their own group rather than across groups. Let $S_{A A}^{*}$ denote the proportion of $A \to A$ links among all links originating from A-nodes [35,36]. The definition of H is as follows:

$S_{A A}^{*} = H + (1 - H) P_{A},$

(24)

when $H = 1$ , all A-nodes only connect to other A-nodes (perfect assortative mixing); when $H = 0$ , A-nodes connect to others proportionally to group sizes (random mixing); intermediate values $0 < H < 1$ indicate partial within-group preference. Negative values ( $H < 0$ ) correspond to disassortative mixing, i.e., a tendency to connect across groups [37].
(4): Activity Ratio (AR): is set to values in the range $[0.5, 2.5]$ by swapping attributes between high- and low-degree nodes to induce specific levels of degree–attribute correlation, while preserving both the network topology and marginal attribute counts [38].
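The first three properties can be computed directly from an adjacency structure, as in the plain-Python sketch below. The function names and the five-node toy graph are hypothetical; setting $C_{i} = 0$ for nodes with fewer than two peers is a common convention assumed here, and H is obtained by solving Equation (24) for H, i.e., $H = (S_{A A}^{*} - P_{A}) / (1 - P_{A})$.

```python
def density(adj):
    """Eq. (21): 2E / (N (N - 1)) for an undirected simple graph."""
    n = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * n_edges / (n * (n - 1))


def average_clustering(adj):
    """Eqs. (22)-(23); C_i is taken as 0 when k_i < 2 (assumed convention)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = sorted(nbrs)
        # Count connected pairs of neighbors, i.e., triangles through this node.
        triangles = sum(
            1
            for x in range(k)
            for y in range(x + 1, k)
            if nbr_list[y] in adj[nbr_list[x]]
        )
        total += 2 * triangles / (k * (k - 1))
    return total / len(adj)


def homophily(adj, attr):
    """Solve Eq. (24) for H using the observed share of A-to-A links."""
    a_nodes = [i for i in adj if attr[i] == 1]
    links_from_a = sum(len(adj[i]) for i in a_nodes)
    links_a_to_a = sum(attr[j] for i in a_nodes for j in adj[i])
    s_aa = links_a_to_a / links_from_a
    p_a = len(a_nodes) / len(adj)
    return (s_aa - p_a) / (1 - p_a)


# Hypothetical toy graph: edges 0-1, 0-2, 0-3, 1-2, 3-4; nodes 0 and 1 are in class A.
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
toy_attr = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}
psi = density(toy_adj)
c_avg = average_clustering(toy_adj)
h = homophily(toy_adj, toy_attr)
```

On this toy graph, two of the five link ends attached to A-nodes point to other A-nodes, so $S_{A A}^{*} = 0.4$ equals the A-share of the population and H = 0 (random mixing).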

Each network property was systematically tuned by modifying one factor at a time from its baseline BA configuration, ensuring that all other metrics remained stable within 1%. Specifically,

Ψ

was increased by randomly adding edges,

C_{avg}

was adjusted through targeted triad closure rewiring, and H was tuned by randomly reconnecting cross-group or within-group links. The controlled parameter ranges were

Ψ \in [0.002, 0.10]

(step t = 0.002),

C_{avg} \in [0.001, 0.03]

(t = 0.001),

H \in [- 0.30, 0.25]

(t = 0.05), and

A R \in [0.5, 2.5]

(t = 0.1).

3.2. Real-World Networks

To evaluate performance in practical settings, we selected six diverse real-world networks spanning both molecular and social domains, with broad variation in size, semantics, and degree heterogeneity. Each dataset provides node-level categorical labels and undirected connectivity from publicly available repositories.

AIDS: Derived from the Network Repository [39], representing molecular graphs where nodes are atoms and edges are chemical bonds (single, double, or aromatic).

PTC: From the Predictive Toxicology Challenge dataset [39], containing chemical compounds for carcinogenicity prediction, generated using the Chemistry Development Kit (v1.4).

Git: A GitHub developer network collected from the public API [40], where users are connected via mutual following relationships. Node metadata include location, employer, and starred repositories.

Flickr: An online community network [39], with users linked by shared interests or mutual following. Labels identify user groups or communities.

Tox: From the Tox21 toxicity database [39], comprising molecular graphs with atoms as nodes and bonds as edges. Labels correspond to atom types.

Twitter: A social interaction network from the Network Repository [39], where users are connected through interaction edges. Labels are derived from dominant textual themes (e.g., “love” and “sleep”).

For all datasets, the A-class is defined as the less frequent label to mimic imbalance in real populations. Summary statistics are reported in Table 1.

3.3. Sampling and Estimation Procedure

To implement our simulation protocol, we first draw a simple random sample of egos (10% of total nodes) without replacement from each network. For each sampled ego, we then collect neighborhood information using one of four distinct sampling strategies: full reporting (F); partial random sampling of 5 (P5) or 10 (P10) peers; and weighted sampling (W), where 10 peers are drawn with probabilities inversely proportional to their degrees.
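The sampling step can be sketched as below. Several details are assumptions of this illustration because the text does not specify them: peers under W are drawn without replacement, egos with fewer peers than the quota report all of them, and the function names `draw_egos` and `sample_peers` are hypothetical.

```python
import random


def draw_egos(nodes, fraction, rng):
    """Simple random sample of egos without replacement (10% in the text)."""
    n_egos = max(1, round(fraction * len(nodes)))
    return rng.sample(list(nodes), n_egos)


def sample_peers(neighbors, degree, strategy, rng):
    """Return the peers an ego reports on under strategy F, P5, P10, or W."""
    peers = list(neighbors)
    if strategy == "F":                      # full reporting
        return peers
    if strategy in ("P5", "P10"):            # uniform partial reporting
        quota = int(strategy[1:])
        return rng.sample(peers, min(quota, len(peers)))
    if strategy == "W":                      # weights proportional to 1 / degree
        pool, chosen = peers[:], []
        while pool and len(chosen) < 10:
            weights = [1.0 / degree[j] for j in pool]
            pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(pick))    # without replacement (assumed)
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")
```

Each peer of an ego has degree at least one, so the inverse-degree weights are always well defined.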

3.4. Evaluation Metrics

Estimator performance is evaluated using Bias, Standard Deviation (SD), and Root Mean Squared Error (RMSE) for point accuracy, and empirical coverage rates for interval reliability. We also report the percentage of trials where an estimator achieves the lowest error (

P_{best}

).
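These metrics can be computed from repeated trials as in the following sketch. The function names are hypothetical, the population form of the standard deviation is assumed (so that RMSE² = Bias² + SD² holds exactly), and ties in `p_best` are counted for neither estimator, which is a choice made here.

```python
import math
import statistics


def point_metrics(estimates, truth):
    """Return (bias, SD, RMSE) of repeated estimates against the true value."""
    bias = statistics.fmean(estimates) - truth
    sd = statistics.pstdev(estimates)  # population SD (assumed)
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    return bias, sd, rmse


def coverage(intervals, truth):
    """Share of (lower, upper) intervals that contain the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


def p_best(estimates_a, estimates_b, truth):
    """Share of paired trials in which estimator a has the smaller absolute error."""
    wins = sum(
        abs(a - truth) < abs(b - truth) for a, b in zip(estimates_a, estimates_b)
    )
    return wins / len(estimates_a)


# Hypothetical trial results for a true proportion of 0.40.
bias, sd, rmse = point_metrics([0.38, 0.42, 0.40, 0.44], 0.40)
```

For these four hypothetical estimates, the bias is 0.01 and the squared RMSE (6e-4) equals the squared bias (1e-4) plus the variance (5e-4).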

3.5. Bootstrap Confidence Intervals

We evaluate interval estimation through a simulation-based bootstrap experiment. For each network configuration with a given $P (A)$, 10% of nodes are randomly sampled to form ego-centric samples. Using the bootstrap procedure described in Section 2.5, 90% and 95% confidence intervals are constructed from $B = 1000$ resamples, and the process is repeated $R = 1000$ times to estimate empirical coverage. All simulations follow the same sampling settings described above.

4. Results

4.1. Performance on Synthetic Networks

We begin by examining the fundamental performance of the estimators in scale-free BA networks. Figure 3 illustrates a representative setting with homophily fixed at

H = 0.15

, AR

= 1.3

, and a true population proportion

P (A) = 0.40

. The results clearly show that the conventional ECM estimator is severely biased, with its estimate distribution peaking near 0.475, a substantial overestimation of the benchmark. In contrast, the

{ECM}_{ac}

distribution is centered on the true value, demonstrating its ability to correct for degree–attribute bias.

This finding is quantitatively substantiated in Table 2 and Table 3. Table 2 summarizes results across both BA and ER network models, showing that

{ECM}_{ac}

consistently achieves lower bias and RMSE than ECM whenever AR

\neq 1

. For instance, in BA networks with AR

= 1.5

and

P (A) = 0.1

,

{ECM}_{ac}

reduces the RMSE by more than 70% relative to ECM. As predicted by theory, when AR

= 1

the two estimators perform identically. Table 3 further demonstrates that

{ECM}_{ac}

’s advantage is robust across different sampling protocols (F, P5, P10, and W). Across all cases,

{ECM}_{ac}

not only yields the lowest RMSE but also attains the highest winning percentage (

P_{best}

), underscoring its practical utility.

4.2. Performance on Real-World Networks

Our empirical analysis across six diverse real-world networks further validates the effectiveness of

{ECM}_{ac}

, showing robustness in both molecular and social settings. We assess two complementary aspects: (i) how estimates converge as the sample fraction grows, and (ii) how bias distributions behave under different sampling strategies (F, P, W). Across all networks, ${ECM}_{ac}$ remains centered on the true $P (A)$ and exhibits stable performance, whereas the conventional ECM shows systematic deviations whenever AR departs from one (see Figure 4).

For example, in networks with

A R < 1

such as AIDS (AR

\approx 0.54

), Git (AR

\approx 0.49

), and Tox (AR

\approx 0.58

), the conventional ECM persistently underestimates

P (A)

. In Git, this downward bias is particularly severe, with ECM converging to an estimate nearly 30% below the true value; ${ECM}_{ac}$ effectively eliminates this discrepancy. Conversely, in networks with

A R > 1

, such as Twitter (AR

\approx 1.64

), ECM systematically overestimates the population proportion. Across all cases,

{ECM}_{ac}

produces estimates centered on the true value, underscoring that correcting for degree–attribute correlation is essential for accurate inference in empirical networks.

Furthermore,

{ECM}_{ac}

demonstrates strong robustness across different sampling strategies (F, P, W), as shown in Figure 5. This consistency is evident across all six networks. For instance, in Twitter (AR

\approx 1.64

), the distribution of

{ECM}_{ac}

estimates remains centered near the true value (

P (A) \approx 0.035

) under all strategies, whereas ECM consistently overestimates the proportion by approximately

+ 0.01

to

+ 0.015

. A symmetric pattern is observed in networks with

A R < 1

, such as AIDS (AR

\approx 0.54

) and Git (AR

\approx 0.49

), where ECM exhibits a downward bias of about

- 0.03

to

- 0.05

, while

{ECM}_{ac}

stays closely aligned with the true benchmark.

In summary, empirical evidence from these diverse real-world networks confirms that

{ECM}_{ac}

substantially reduces estimation bias relative to ECM, particularly when AR deviates substantially from one. These improvements establish

{ECM}_{ac}

as a reliable and practical estimator for network-based attribute inference in real-world applications.

For completeness, we provide a supplementary comparison with two representative baseline approaches (RDS-II and NSUM) on real-world networks in Appendix B. Because these methods rely on different sampling mechanisms and target estimands, the results are presented for reference only.

4.3. Bootstrap Coverage Rate

To evaluate interval reliability under controlled structural conditions, we conduct bootstrap experiments on synthetic networks generated by the BA model. We vary

A R

from 0.8 to 1.8 and

P (A)

from 0.10 to 0.40, and construct 90% and 95% percentile confidence intervals using three resampling schemes.

Table 4 and Table 5 report the empirical coverage. At the 95% level, BS-Ego is consistently closest to the nominal target across the

(A R, P (A))

grid, whereas BS-Tree is conservative and BS-Pool exhibits systematic under-coverage.

BS-Ego provides the best-calibrated inference for 95% confidence intervals, while BS-Tree only yields acceptable 90% coverage at the cost of wider intervals. This difference follows from the data-generating mechanism of ego-centric sampling: egos are the primary sampling units, with peer reports clustered within each ego-network. By resampling intact ego-networks, BS-Ego preserves this dependence structure and accurately reflects sampling variability, whereas BS-Tree inflates variability by resampling peers and BS-Pool ignores ego-level clustering.

5. Sensitivity Analysis

We now turn to a systematic analysis of how estimator performance varies with key network and attribute parameters. All supporting tables for the figures in this section are included in Supporting Information Tables S1–S5.

5.1. Population Proportion P(A)

We evaluate the effect of the population proportion on estimator accuracy using BA networks under fixed sampling settings, varying

P (A)

from 0.1 to 0.4 while keeping all other parameters constant. As shown in Figure 6, increasing

P (A)

leads both estimators to become more precise, as indicated by narrower 95% confidence intervals. Across all tested conditions, however,

{ECM}_{ac}

remains centered on the true population proportion, whereas ECM exhibits a persistent downward bias that becomes more pronounced when

P (A)

is small. For example, when

P (A) = 0.1

, the mean estimate of ECM is approximately 0.07, underestimating the true value by nearly 30%, while

{ECM}_{ac}

yields a mean of 0.10, closely matching the benchmark. When

P (A)

increases to 0.4, the mean estimate of ECM remains around 0.32, which is still about 20% below the true value, whereas

{ECM}_{ac}

stays nearly unbiased.

This demonstrates that even when sampling uncertainty decreases with larger

P (A)

, the degree–attribute bias inherent in ECM persists, while

{ECM}_{ac}

effectively eliminates it.

5.2. Activity Ratio (AR)

We examine the impact of degree–attribute imbalance by varying AR in both ER and BA networks while keeping other parameters fixed. Figure 7 presents results under the P5 and P10 sampling strategies, which represent realistic partial-reporting scenarios where egos disclose only a limited number of peers. The omitted full-reporting (F) and weighted (W) schemes exhibit nearly identical bias patterns, differing only in the overall width of confidence intervals rather than in systematic trends, with all showing the same direction of bias relative to AR.

Across both network types, ECM displays a distinct U-shaped error curve as AR deviates from one. In BA networks, the mean absolute error (MAE) of ECM rises sharply from approximately 0.03 at

A R = 1.0

to over 0.20 at

A R = 0.5

and

A R = 1.5

. By contrast,

{ECM}_{ac}

remains almost flat across all tested AR values, with MAE typically below 0.02 even under strong imbalance. This corresponds to an 85–90% reduction in MAE relative to ECM when

A R \leq 0.7

or

A R \geq 1.3

.

These results demonstrate that accounting for degree–attribute correlation is essential for reducing activity-induced bias, particularly under realistic partial-reporting conditions.

5.3. Network Density and Clustering

We further investigate how structural connectivity influences estimator performance by varying network density (

Ψ

) and average clustering coefficient (

C_{avg}

) while keeping

A R

and

P (A)

fixed. As shown in Figure 8, increasing either density or clustering slightly reduces estimation accuracy for both estimators. However, the deterioration of

{ECM}_{ac}

is much less pronounced. When network density increases from 0.002 to 0.10, the median bias of ECM rises gradually from approximately 0.03 to 0.08, whereas

{ECM}_{ac}

remains close to zero, with deviations below 0.01. A similar pattern is observed when

C_{avg}

reaches 0.03, where ECM’s bias stabilizes around 0.18 to 0.20, roughly six times higher than that of

{ECM}_{ac}

, whose bias remains below 0.03.

These results indicate that denser or more clustered networks amplify the bias arising from degree–attribute correlation, but the correction introduced by

{ECM}_{ac}

effectively stabilizes estimation performance across a wide range of structural conditions.

5.4. Combined Effects of AR and H

Building upon the previous single-factor analyses, we further examine how the interaction between degree–attribute imbalance and assortative mixing jointly shapes estimator bias. Specifically, we vary both AR and the homophily (H) to capture their combined influence on estimator performance.

Figure 9 illustrates these joint effects, showing that the bias field of ${ECM}_{ac}$ remains stable across the full range of structural conditions. For ECM, bias patterns form a pronounced gradient across the $(AR, H)$ plane, where negative homophily $(H < 0)$ and extreme AR values ($AR > 2$ or $AR < 0.7$) yield the largest deviations, with estimation errors reaching about +0.25 or −0.18. When homophily becomes positive, ECM tends to overestimate the true proportion, whereas negative homophily induces underestimation. In contrast, ${ECM}_{ac}$ exhibits an almost uniform error surface, with absolute bias remaining below 0.03 under both P5 and P10 sampling.

It is worth noting that in highly assortative networks ($H \to 1$), cross-group connection probabilities approach zero ($P_{A B}, P_{B A} \to 0$), theoretically risking numerical instability. In practice, this is rare in connected networks and can be addressed by applying a small smoothing factor, as implemented in our bootstrap procedure.

These findings demonstrate that ${ECM}_{ac}$ effectively mitigates the compounded bias arising from the coexistence of degree–attribute correlation and homophily.

5.5. Combined Effects of AR and P(A)

To further investigate how degree heterogeneity and group proportion jointly influence estimator performance, we analyze bias across the $(AR, P(A))$ parameter space under partial reporting. Figure 10 presents the resulting bias surfaces for both P5 and P10 sampling strategies.

For ECM, bias increases sharply when either $AR \neq 1$ or $P(A)$ becomes large, forming a pronounced ridge around $(AR, P(A)) \approx (2.5, 0.6)$, where the bias reaches approximately 0.24. Under P10 sampling, the mean bias for ECM is 0.104 (SD = 0.061). In contrast, ${ECM}_{ac}$ produces a much flatter bias surface, with values remaining below 0.04 and a mean of 0.026 (SD = 0.016). The lower panels show that local bias reduction can reach 0.25 at $(AR, P(A)) = (2.5, 0.35)$, confirming the consistent advantage of the activity-corrected estimator.

The smaller peer quota (P5) leads to slightly higher overall bias than P10, indicating that including more peers per ego improves estimator stability even when the number of egos is fixed. Overall, these results highlight that ${ECM}_{ac}$ remains robust under the combined effects of group imbalance and degree heterogeneity, maintaining low bias across a wide range of network conditions.

6. Conclusions

While the conventional ECM estimator relies on degree-based weighting to infer group proportions from ego-centric samples, it often exhibits systematic bias in heterogeneous networks, particularly when node degree is correlated with node attributes. To overcome this limitation, this study introduces the Activity Ratio Corrected ECM estimator (${ECM}_{ac}$), which explicitly incorporates degree–attribute correlation through an activity ratio adjustment, thereby enhancing both the adaptability and accuracy of proportion estimation in complex networks.

By leveraging the principle of reciprocity, we reformulated the group proportion estimation problem into an edge-based framework that estimates cross-group connection probabilities ($P_{A B}$ and $P_{B A}$) without requiring global network information. This formulation effectively mitigates the structural bias inherent in the traditional ECM approach and enables unbiased proportion estimation using only locally observed ego-network data.
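For reference, the reciprocity argument can be written out in one line. This is a sketch under assumed notation rather than a restatement of the paper's derivation: $N_A$ and $N_B$ denote group sizes, $\bar{k}_A$ and $\bar{k}_B$ mean degrees, and $AR = \bar{k}_A / \bar{k}_B$. The paper's own estimator is given in Equation (17), which may be parameterized differently.

```latex
% Each A-B edge is counted once from the A side and once from the B side:
N_A \bar{k}_A P_{AB} = N_B \bar{k}_B P_{BA}
\;\Longrightarrow\;
\frac{P(A)}{1 - P(A)} = \frac{N_A}{N_B} = \frac{P_{BA}}{AR \cdot P_{AB}}
\;\Longrightarrow\;
P(A) = \frac{P_{BA}}{P_{BA} + AR \cdot P_{AB}} .
```

Every quantity on the right-hand side is an edge-level proportion or a ratio of mean degrees, which is why only ego-level reports are needed.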

Simulation experiments demonstrate that ${ECM}_{ac}$ consistently reduces estimation bias and lowers RMSE across a broad range of network structures, including variations in density, clustering, and homophily. Sensitivity analyses further show that ${ECM}_{ac}$ maintains robustness across different attribute ratios and sampling designs, producing nearly flat bias surfaces and stable performance even under extreme network heterogeneity. Empirical validations on six real-world networks confirm that ${ECM}_{ac}$ accurately corrects the overestimation and underestimation observed in traditional ECM.

Overall, the proposed ${ECM}_{ac}$ estimator provides a theoretically sound and practically effective framework for attribute estimation in complex social networks. By addressing the key limitations of traditional ECM in heterogeneous environments, this method offers a reliable foundation for privacy-preserving surveys, social network inference, and other applications requiring indirect estimation from local sampling data.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/e28010116/s1, Table S1: Supporting results for Section 5.1 (Population Proportion P(A)). Representative ECM and ECM_ac estimates, biases, and network parameters for BA(AR-0.7); Table S2: Supporting results for Section 5.2 (Activity Ratio (AR)). Mean, SD, and 90% CI of ECM and ECM_ac across activity ratios; Table S3: Supporting results for Section 5.5 (Combined Effects of AR and P(A)). Mean bias of ECM and ECM_ac under P5 sampling; Table S4: Estimator error and 90% central range across homophily (H) and activity ratio (AR) settings—BA network; Table S5: Summary statistics for ECM and ECM_ac on the six networks, broken down by sampling strategy. References [41,42] are cited in the Supplementary Materials.

Author Contributions

Conceptualization, K.W., J.M. and X.L.; Methodology, K.W.; Software, K.W. and J.M.; Validation, K.W. and J.M.; Formal analysis, X.L.; Investigation, K.W.; Resources, X.L.; Data curation, J.M.; Writing—original draft preparation, K.W.; Writing—review and editing, X.L.; Visualization, J.M.; Supervision, X.L.; Project administration, X.L.; Funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (72025405, 72421002, 92467302, 72474223, 72301285), the National Science and Technology Major Project for Brain Science and Brain-like Intelligence Technology (2025ZD0215700), the Hunan Science and Technology Plan Project (2023JJ40685, 2024RC3133), and the Major Program of Xiangjiang Laboratory (24XJJCYJ01001).

Data Availability Statement

All data and materials used in this study are publicly available. The datasets can be accessed from their original repositories as cited in the main text. The full code required to reproduce all simulations and figures is available on GitHub at https://github.com/kkangyoudianhan/Peer-Reporting-Sampling-Design-and-Unbiased-Estimates (accessed on 11 November 2025). All analyses were conducted using Python 3.8.

Acknowledgments

We thank the reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Variance Estimation

Appendix A.1. Asymptotic Variance Estimation

An analytical approximation for the variance of the adjusted estimator can be derived using a first-order Taylor expansion, commonly known as the Delta method [43]. This approach provides theoretical insight into the estimator’s variability.

Let the function be $g (\hat{p} (x), \hat{AR}) = {\hat{P}}_{ac} (A)$, where $\hat{p} (x) = {\hat{P}}_{ECM} (A)$. The asymptotic variance of this function can be approximated as

$Var ({\hat{P}}_{ac} (A)) \approx {\left(\frac{\partial g}{\partial p (x)}\right)}^{2} Var (\hat{p} (x)) + {\left(\frac{\partial g}{\partial AR}\right)}^{2} Var (\hat{AR}) + 2 \left(\frac{\partial g}{\partial p (x)}\right) \left(\frac{\partial g}{\partial AR}\right) Cov (\hat{p} (x), \hat{AR}),$

(A1)

where the partial derivatives are evaluated at the true population values $(p (x), AR)$.

In practice, using this formula requires sample-based estimates of the variance and covariance terms, and closed-form expressions for these terms are difficult to derive because of the complex dependence structure of ego-network samples. We therefore rely primarily on the more direct and robust nonparametric bootstrap methods detailed in Appendix A.2 to construct confidence intervals in our experiments.
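Once an explicit form for $g$ is fixed, Equation (A1) is straightforward to evaluate. The sketch below assumes $g(p, AR) = p / (p + AR (1 - p))$, the reciprocity form that results when AR is the ratio of the mean degree of group A to that of group B. That form is an assumption made for illustration and should be checked against Equation (17). The function name and arguments are hypothetical, and the variance and covariance inputs must be supplied by the user.

```python
def delta_method_variance(p: float, ar: float,
                          var_p: float, var_ar: float,
                          cov_p_ar: float) -> float:
    """First-order (Delta method) variance of g(p, AR) = p / (p + AR * (1 - p)).

    Assumption: g has the reciprocity form above; p plays the role of the
    uncorrected estimate and ar the activity ratio.
    """
    d = p + ar * (1.0 - p)            # common denominator of g
    dg_dp = ar / d ** 2               # partial derivative of g w.r.t. p
    dg_dar = -p * (1.0 - p) / d ** 2  # partial derivative of g w.r.t. AR
    return (dg_dp ** 2 * var_p
            + dg_dar ** 2 * var_ar
            + 2.0 * dg_dp * dg_dar * cov_p_ar)
```

With AR = 1 and no uncertainty in AR, the expression reduces to the variance of the uncorrected estimate, as expected, since the correction is then the identity.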

Appendix A.2. Bootstrap Methods for Confidence Intervals

This appendix provides technical details for the three bootstrap resampling schemes evaluated in Section 4.3. Let the original sample consist of a set of $S$ egos, $\mathcal{S} = \{1, \dots, S\}$. For each ego $i \in \mathcal{S}$, we observe its attribute $Z_{i}$, degree $k_{i}$, and the list of reported neighbors $N_{i}$.

Appendix A.2.1. Resampling Designs

The three designs progressively relax the independence assumptions in ego-centric data—from the simplest (BS-Ego) to the most granular (BS-Pool).

(a) Ego-level Resampling (BS-Ego). This is the standard nonparametric bootstrap that treats each ego-network as a single independent sampling unit.

i.: Draw a bootstrap sample of egos $\mathcal{S}^{*}$ by sampling $S$ egos with replacement from the original ego set $\mathcal{S}$.
ii.: For each $i^{*} \in \mathcal{S}^{*}$, include its complete observed record (degree and alter list).
iii.: Recompute the estimator ${\hat{P}}_{ac}^{*} (A)$ using Equation (17) on the resampled ego set.

This approach preserves within-ego dependence but ignores the uncertainty arising from the second-stage alter sampling. It serves as the default bootstrap when no neighbor-level information is available.
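The three BS-Ego steps can be sketched in Python as follows. This is a minimal illustration, not the authors' released code: the record layout (`group`, `degree`, `alters`) and the function names are hypothetical, and `p_ac` assumes the reciprocity form $P_{BA} / (P_{BA} + AR \cdot P_{AB})$ with AR taken as the ratio of mean ego degrees (group A over group B), standing in for Equation (17).

```python
import random


def p_ac(egos):
    """Assumed activity-corrected estimate; returns None if it is undefined."""
    a = [e for e in egos if e["group"] == "A"]
    b = [e for e in egos if e["group"] == "B"]
    alters_a = [x for e in a for x in e["alters"]]
    alters_b = [x for e in b for x in e["alters"]]
    if not alters_a or not alters_b:
        return None                      # one group contributes no edges
    p_ab = alters_a.count("B") / len(alters_a)
    p_ba = alters_b.count("A") / len(alters_b)
    mean_a = sum(e["degree"] for e in a) / len(a)
    mean_b = sum(e["degree"] for e in b) / len(b)
    if mean_b == 0:
        return None
    denom = p_ba + (mean_a / mean_b) * p_ab
    return p_ba / denom if denom > 0 else None


def bs_ego(egos, n_boot=1000, seed=0):
    """Ego-level bootstrap: resample whole ego records with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        # Steps i and ii: draw S egos with replacement, keeping each
        # ego's degree and alter list intact.
        resample = [rng.choice(egos) for _ in egos]
        # Step iii: recompute the estimator on the resampled ego set.
        estimate = p_ac(resample)
        if estimate is not None:
            replicates.append(estimate)
    return replicates
```

Replicates for which the estimate is undefined (for example, a resample containing egos from only one group) are dropped here; that handling is a choice made for this sketch.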

(b) Hierarchical Tree Bootstrap (BS-Tree). A two-level resampling scheme that mirrors the hierarchical structure of ego-centric sampling. Each ego is kept fixed, while its alter list is resampled to capture within-ego stochasticity.

i.: Level 1 (Ego layer): Retain all S egos from the original sample.
ii.: Level 2 (Alter layer): For each ego $i$ in the original ego set $\mathcal{S}$, resample alters with replacement from its own alter list to form a bootstrap neighborhood $N_{i}^{*}$.
iii.: Recompute ${\hat{P}}_{ac}^{*} (A)$ on the reconstructed bootstrap dataset.
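A compact sketch of the BS-Tree steps as listed above, again in Python and under stated assumptions: ego records are dictionaries with a hypothetical `alters` key, and `estimator` is any user-supplied callable that maps a list of ego records to an estimate (or `None` when undefined). Neither name comes from the paper.

```python
import random


def bs_tree(egos, estimator, n_boot=1000, seed=0):
    """Two-level bootstrap: keep every ego, resample each ego's alter list."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot = []
        for ego in egos:                       # Level 1: retain all egos
            alters = ego["alters"]
            # Level 2: resample alters with replacement from this ego's
            # own list, keeping the list length unchanged.
            new_alters = [rng.choice(alters) for _ in alters] if alters else []
            boot.append({**ego, "alters": new_alters})
        estimate = estimator(boot)             # recompute on the rebuilt data
        if estimate is not None:
            replicates.append(estimate)
    return replicates
```

The original records are copied rather than modified, so the same sample can be passed to several bootstrap designs in turn.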

(c) Neighbor Pool Bootstrap (BS-Pool). This design resamples at the edge level, ignoring ego boundaries. To maintain directionality, edges are divided into two origin-based groups: those emanating from A-egos and those from B-egos. Within each group, alters are treated as conditionally i.i.d.

i.: Construct two disjoint edge pools:

$E_{A} = \{\text{edges from sampled A-egos}\}, \quad E_{B} = \{\text{edges from sampled B-egos}\} .$

Code each edge in $E_{A}$ as $Y_{A \to B} \in \{0, 1\}$ (1 if the alter is B), and each edge in $E_{B}$ as $Y_{B \to A} \in \{0, 1\}$ (1 if the alter is A). Let $s_{A} = | E_{A} |$, $s_{B} = | E_{B} |$, and $m_{A B} = \sum_{E_{A}} Y_{A \to B}$, $m_{B A} = \sum_{E_{B}} Y_{B \to A}$.
ii.: For each bootstrap replication, draw $s_{A}$ samples with replacement from $E_{A}$ and $s_{B}$ from $E_{B}$, obtaining

$m_{A B}^{*} = \sum Y_{A \to B}^{*}, \quad m_{B A}^{*} = \sum Y_{B \to A}^{*} .$

Each draw corresponds to one resampled edge indicator from the respective pool.
iii.: Compute the resampled proportions ${\hat{P}}_{A B}^{* (S)} = m_{A B}^{*} / s_{A}$ and ${\hat{P}}_{B A}^{* (S)} = m_{B A}^{*} / s_{B}$ and, keeping the sample activity ratio fixed, recompute ${\hat{P}}_{ac}^{*} (A)$.

This edge-level approach does not account for correlations within each ego’s neighborhood, but it avoids mixing edges that originate from different groups, which could otherwise distort the estimates of $P_{A B}$ and $P_{B A}$. Bootstrap samples in which no edges are drawn from either A- or B-egos ($s_{A} = 0$ or $s_{B} = 0$) are excluded from analysis. To prevent numerical errors when proportions are extremely close to 0 or 1, the resampled proportions are bounded within ${\hat{P}}_{A B}^{* (S)}, {\hat{P}}_{B A}^{* (S)} \in [10^{- 6}, 1 - 10^{- 6}]$, while the reported point estimates use the original, unbounded values.
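The pool-based design, including the bounding of resampled proportions, might look as follows in Python. This is a sketch under assumptions: the pools are passed as lists of 0/1 indicators, the activity ratio `ar` is computed beforehand and held fixed, the recomputation uses the assumed reciprocity form $P_{BA} / (P_{BA} + AR \cdot P_{AB})$ in place of Equation (17), and all function names are hypothetical.

```python
import random

EPS = 1e-6  # bound taken from the interval [1e-6, 1 - 1e-6] stated above


def clip(p, eps=EPS):
    """Bound a proportion away from 0 and 1 to avoid degenerate ratios."""
    return min(max(p, eps), 1.0 - eps)


def bs_pool(y_ab, y_ba, ar, n_boot=1000, seed=0):
    """Edge-pool bootstrap.

    y_ab: indicators for edges from A-egos (1 if the alter is B).
    y_ba: indicators for edges from B-egos (1 if the alter is A).
    ar:   sample activity ratio, kept fixed across replicates.
    """
    if not y_ab or not y_ba:
        # An empty pool means s_A = 0 or s_B = 0; such samples are excluded.
        raise ValueError("both edge pools must be non-empty")
    rng = random.Random(seed)
    s_a, s_b = len(y_ab), len(y_ba)
    replicates = []
    for _ in range(n_boot):
        m_ab = sum(rng.choices(y_ab, k=s_a))   # resampled count of A->B edges
        m_ba = sum(rng.choices(y_ba, k=s_b))   # resampled count of B->A edges
        p_ab = clip(m_ab / s_a)
        p_ba = clip(m_ba / s_b)
        replicates.append(p_ba / (p_ba + ar * p_ab))
    return replicates
```

Because both proportions are bounded away from zero, the denominator is always positive and every replicate lies strictly between 0 and 1.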

Appendix A.2.2. Confidence Interval Construction

For each bootstrap design, the overall procedure is as follows:

1. Compute the point estimate ${\hat{P}}_{ac} (A)$ from the original sample.
2. For $b = 1, 2, \dots, B$:
(i) Generate a bootstrap sample using one of the three designs: BS-Ego, BS-Tree, or BS-Pool.
(ii) Recompute the estimator on this replicate, denoted ${\hat{P}}_{ac}^{* (b)} (A)$.
3. Sort the bootstrap estimates in ascending order:

${\hat{P}}_{ac}^{* (1)} (A) \leq {\hat{P}}_{ac}^{* (2)} (A) \leq \dots \leq {\hat{P}}_{ac}^{* (B)} (A) .$
4. Construct the $(1 - α)$ percentile confidence interval as

${CI}_{1 - α} = [{\hat{P}}_{ac}^{* (⌈ α B / 2 ⌉)} (A), {\hat{P}}_{ac}^{* (⌈ (1 - α / 2) B ⌉)} (A)] .$
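The sorting and indexing steps translate directly into code. The Python sketch below takes any list of bootstrap replicates and returns the order statistics at ranks $⌈ α B / 2 ⌉$ and $⌈ (1 - α / 2) B ⌉$. The function name and the default `alpha` are hypothetical, and the small rounding guard is an implementation detail added here to keep floating-point noise from shifting a rank.

```python
import math


def percentile_ci(estimates, alpha=0.10):
    """Percentile interval from bootstrap replicates (ranks are 1-indexed)."""
    ordered = sorted(estimates)
    b = len(ordered)
    if b == 0:
        raise ValueError("no bootstrap replicates supplied")
    # Round before taking the ceiling so that, e.g., 2.0000000001 maps to 2.
    lower_rank = max(1, math.ceil(round(alpha * b / 2, 9)))
    upper_rank = min(b, math.ceil(round((1 - alpha / 2) * b, 9)))
    return ordered[lower_rank - 1], ordered[upper_rank - 1]
```

For example, with 20 replicates and `alpha=0.2`, the interval runs from the 2nd to the 18th ordered value.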

Appendix B. Supplementary Comparison with RDS and NSUM

This appendix reports supplementary comparisons between the proposed ${ECM}_{ac}$ estimator and two representative baseline approaches, RDS-II and NSUM, on the same set of real-world networks. We emphasize that these baselines are designed for different sampling mechanisms and inferential targets (e.g., respondent-driven settings or scale-up style inference), and therefore the results below should be interpreted as qualitative reference rather than as a definitive ranking.

Figure A1 compares the empirical distributions of estimates under three ego-centric reporting regimes (Full, Partial, and Weighted) across multiple networks. Across panels, ${ECM}_{ac}$ is generally more concentrated around the true proportion (dashed line), indicating improved stability and reduced bias relative to the conventional ECM. In contrast, ECM exhibits systematic deviations in networks where degree–attribute imbalance is present, while NSUM and RDS-II show noticeably larger dispersion in several cases, reflecting their mismatch with the ego-centric sampling context.

Figure A2 further examines how estimates evolve with increasing sample size. The curves suggest that ${ECM}_{ac}$ converges faster toward the ground-truth proportion and maintains narrower empirical variability bands. By comparison, the baseline methods tend to converge more slowly and stabilize at values that remain offset from the benchmark in some networks, consistent with the fact that their estimands and inclusion mechanisms differ from the ego-centric design studied in this paper.

Figure A1. Comparison of ECM, ${ECM}_{ac}$, NSUM, and RDS-II on several real-world networks under different sampling strategies (Full, Partial, and Weighted). Each panel corresponds to a distinct network. The dashed horizontal line indicates the true population proportion. ${ECM}_{ac}$ exhibits tighter concentration around the true value and reduced bias compared with the baselines across most networks and sampling regimes.

Figure A2. Performance of ECM, ${ECM}_{ac}$, NSUM, and RDS-II as a function of sample size on representative real-world networks. The dashed horizontal line indicates the true population proportion. Solid lines denote mean estimates across repetitions, while shaded regions indicate empirical variability. ${ECM}_{ac}$ shows faster convergence and improved stability relative to the baselines as sample size increases.

References

1. Beyrer, C.; Baral, S.D.; van Griensven, F.; Goodreau, S.M.; Chariyalertsak, S.; Wirtz, A.L.; Brookmeyer, R. Global Epidemiology of HIV Infection in Men Who Have Sex with Men. Lancet 2012, 380, 367–377.
2. Lu, X.; Qin, W. Informatics in the Era of AI. Innov. Inform. 2025, 1, 100002.
3. Wan, M.; Wang, J.; Wang, Y.; Cao, R.; Wang, Z.; Wang, Z.; Shi, P.; Zhao, Z. Understanding as Compression: A New Evaluation Framework for Large Language Models. Innov. Inform. 2025, 1, 100003.
4. Tourangeau, R.; Rips, L.J.; Rasinski, K. (Eds.) The Psychology of Survey Response; Cambridge University Press: Cambridge, UK, 2000.
5. Krumpal, I. Determinants of Social Desirability Bias in Sensitive Surveys: A Literature Review. Qual. Quant. 2013, 47, 2025–2047.
6. Cai, M.; Huang, G.; Kretzschmar, M.E.; Chen, X.; Lu, X. Extremely Low Reciprocity and Strong Homophily in the World Largest MSM Social Network. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2279–2287.
7. Ward, M.K.; Meade, A.W. Dealing with Careless Responding in Survey Data: Prevention, Identification, and Recommended Best Practices. Annu. Rev. Psychol. 2023, 74, 577–596.
8. Yan, T. Consequences of Asking Sensitive Questions in Surveys. Annu. Rev. Stat. Its Appl. 2021, 8, 109–127.
9. Kreuter, F.; Presser, S.; Tourangeau, R. Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity. Public Opin. Q. 2008, 72, 847–865.
10. Warner, S.L. Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. J. Am. Stat. Assoc. 1965, 60, 63–69.
11. Dalton, D.R.; Wimbush, J.C.; Daily, C.M. Using the Unmatched Count Technique (UCT) to Estimate Base Rates for Sensitive Behavior. Pers. Psychol. 1994, 47, 817–828.
12. Kowalczyk, B.; Niemiro, W.; Wieczorkowski, R. Item Count Technique with a Continuous or Count Control Variable for Analyzing Sensitive Questions in Surveys. J. Surv. Stat. Methodol. 2023, 11, 919–941.
13. Aquilino, W.S. Interview Mode Effects in Surveys of Drug and Alcohol Use: A Field Experiment. Public Opin. Q. 1994, 58, 210–240.
14. Corkrey, R.; Parkinson, L. A Comparison of Four Computer-Based Telephone Interviewing Methods: Getting Answers to Sensitive Questions. Behav. Res. Methods Instruments Comput. 2002, 34, 354–363.
15. Ehler, I.; Wolter, F.; Junkermann, J. Sensitive Questions in Surveys: A Comprehensive Meta-Analysis of Experimental Survey Studies on the Performance of the Item Count Technique. Public Opin. Q. 2021, 85, 6–27.
16. Laga, I.; Bao, L.; Niu, X. Thirty Years of the Network Scale-Up Method. J. Am. Stat. Assoc. 2021, 116, 1548–1559.
17. Salganik, M.J.; Mello, M.B.; Abdo, A.H.; Bertoni, N.; Fazito, D.; Bastos, F.I. The game of contacts: Estimating the social visibility of groups. Soc. Netw. 2011, 33, 70–78.
18. Maltiel, R.; Raftery, A.E.; McCormick, T.H.; Baraff, A.J. Estimating population size using the network scale up method. Ann. Appl. Stat. 2015, 9, 1247–1277.
19. Feehan, D.M.; Salganik, M.J. Generalizing the Network Scale-Up Method: A New Estimator for the Size of Hidden Populations. Sociol. Methodol. 2016, 46, 153–186.
20. Fisher, J.C.D.; Flannery, T.J. Designing Randomized Response Surveys to Support Honest Answers to Stigmatizing Questions. Rev. Econ. Des. 2023, 27, 635–667.
21. Blair, G.; Imai, K.; Zhou, Y.Y. Design and Analysis of the Randomized Response Technique. J. Am. Stat. Assoc. 2015, 110, 1304–1319.
22. Tourangeau, R.; Yan, T. Sensitive Questions in Surveys. Psychol. Bull. 2007, 133, 859–883.
23. Helleringer, S.; Adams, J.; Yeatman, S.; Mkandawire, J. Evaluating Sampling Biases from Third-Party Reporting as a Method for Improving Survey Measures of Sensitive Behaviors. Soc. Netw. 2019, 59, 134–140.
24. Heckathorn, D.D. Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations. Soc. Probl. 1997, 44, 174–199.
25. Lu, X.; Malmros, J.; Liljeros, F.; Britton, T. Respondent-driven sampling on directed networks. Electron. J. Stat. 2013, 7, 292–322.
26. Abdesselam, K.; Verdery, A.; Pelude, L.; Dhami, P.; Momoli, F.; Jolly, A.M. The development of respondent-driven sampling (RDS) inference: A systematic review of the population mean and variance estimates. Drug Alcohol Depend. 2020, 206, 107702.
27. Lu, X. Linked Ego Networks: Improving Estimate Reliability and Validity with Respondent-Driven Sampling. Soc. Netw. 2013, 35, 669–685.
28. Verdery, A.M.; Merli, M.G.; Moody, J.; Smith, J.A.; Fisher, J.C. Brief Report: Respondent-Driven Sampling Estimators Under Real and Theoretical Recruitment Conditions of Female Sex Workers in China. Epidemiology 2015, 26, 661–665.
29. Heckathorn, D.D.; Cameron, C.J. Network Sampling: From Snowball and Multiplicity to Respondent-Driven Sampling. Annu. Rev. Sociol. 2017, 43, 101–119.
30. Beaudry, I.S.; Gile, K.J. Correcting for differential recruitment in respondent-driven sampling data using ego-network information. Electron. J. Stat. 2020, 14, 2678–2713.
31. Chen, S.; Lu, X.; Liljeros, F.; Jia, Z.; Rocha, L.E.C.; Li, X. Indirect inference of sensitive variables with peer network survey. J. Complex Netw. 2021, 9, cnab034.
32. Baraff, A.J.; McCormick, T.H.; Raftery, A.E. Estimating Uncertainty in Respondent-Driven Sampling Using a Tree Bootstrap Method. Proc. Natl. Acad. Sci. USA 2016, 113, 14668–14673.
33. Newman, M. The Structure and Function of Complex Networks. SIAM Rev. 2003, 45, 167–256.
34. Watts, D.J.; Strogatz, S.H. Collective Dynamics of ‘Small-World’ Networks. Nature 1998, 393, 440–442.
35. McPherson, M.; Smith-Lovin, L.; Cook, J.M. Birds of a Feather: Homophily in Social Networks. Annu. Rev. Sociol. 2001, 27, 415–444.
36. Lu, X.; Bengtsson, L.; Britton, T.; Camitz, M.; Kim, B.J.; Thorson, A.; Liljeros, F. The Sensitivity of Respondent-Driven Sampling. J. R. Stat. Soc. Ser. A Stat. Soc. 2011, 175, 191–216.
37. Lu, X. Respondent-Driven Sampling: Theory, Limitations & Improvements. Ph.D. Thesis, Karolinska Institutet, Stockholm, Sweden, 2013.
38. Salganik, M.J.; Heckathorn, D.D. Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling. Sociol. Methodol. 2004, 34, 193–240.
39. Rossi, R.A.; Ahmed, N.K. The Network Data Repository with Interactive Graph Analytics and Visualization. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
40. Rozemberczki, B.; Allen, C.; Sarkar, R. Multi-Scale Attributed Node Embedding. J. Complex Netw. 2021, 9, cnab014.
41. Spiller, M.W.; Gile, K.J.; Handcock, M.S.; Mar, C.M.; Wejnert, C. Evaluating Variance Estimators for Respondent-Driven Sampling. J. Surv. Stat. Methodol. 2018, 6, 23–45.
42. Gile, K.J.; Beaudry, I.S.; Handcock, M.S.; Ott, M.Q. Methods for Inference from Respondent-Driven Sampling Data. Annu. Rev. Stat. Its Appl. 2018, 5, 65–93.
43. Ver Hoef, J.M. Who Invented the Delta Method? Am. Stat. 2012, 66, 124–127.

Figure 1. Workflow of the Ego-Centric Sampling Method (ECM) and the activity-corrected estimator ${ECM}_{ac}$. A random sample of respondents (sampled egos) is drawn from the full network population. Each sampled ego then reports on the attributes of their direct connections (reported peers), providing data on their partial ego-network. As illustrated in the inset, this local data is aggregated to compute the activity ratio (AR) and connection probabilities required by the ${ECM}_{ac}$ estimator for $P(A)$.

Figure 2. Bootstrap workflow for ${ECM}_{ac}$. Starting from the original ego-centric sample, bootstrap datasets are generated under three schemes, and for each dataset ${\hat{P}}_{ac} (A)$ is recomputed; percentile confidence intervals are then obtained from the empirical quantiles of the replicates. BS-Ego resamples whole egos together with their reported peers and treats egos as independent units. BS-Tree resamples egos and, within each sampled ego, resamples the reported peers to capture within-ego dependence, whereas BS-Pool resamples all reported ego–peer pairs as exchangeable edges.

Figure 3. Estimate distributions for ECM and ${ECM}_{ac}$ in a BA network with $H = 0.15$, $AR = 1.3$, and a true $P(A) = 0.40$. Under both full reporting (A) and partial reporting of 10 peers (B), the conventional ECM estimator shows significant bias, with its distribution peaking near 0.475. The proposed ${ECM}_{ac}$ estimator, however, remains centered on the true value in both cases.

Figure 4. Estimates of ECM and ${ECM}_{ac}$ across six real networks as sample size increases. Solid lines show the mean estimates and shaded bands the 90% empirical confidence interval (5th–95th percentile); ${ECM}_{ac}$ consistently tracks the true proportion (grey dashed line), while ECM converges to biased values in the presence of degree–attribute imbalance.

Figure 5. Distributions of ECM and ${ECM}_{ac}$ estimates under full, partial, and weighted ego-centric sampling across six real networks. The grey dashed line is the true $P(A)$. Dots are replicate estimates and boxplots summarize their spread. Across networks and strategies, ${ECM}_{ac}$ stays close to the truth and reduces bias relative to ECM.

Figure 6. Effect of $P(A)$ on estimator accuracy in BA networks under fixed sampling settings. Curves show mean estimates and shaded bands denote 95% confidence intervals. As $P(A)$ increases from 0.1 to 0.4, precision improves for both estimators, and ECM exhibits a persistent downward bias that is most pronounced at small $P(A)$, while ${ECM}_{ac}$ remains centered on the true value across all levels.

Figure 7. Sensitivity to AR, illustrating the MAE of ECM and ${ECM}_{ac}$ in ER and BA networks under P5 and P10 sampling. ECM shows a pronounced U-shaped MAE as AR departs from 1, whereas ${ECM}_{ac}$ remains nearly flat with substantially lower error (largest gains when AR ≤ 0.7 or ≥ 1.3). Error bars denote 95% confidence intervals.

Figure 8. Sensitivity to network density ($Ψ$) and clustering ($C_{avg}$). This plot shows the MAE of ECM and ${ECM}_{ac}$ as $Ψ$ and $C_{avg}$ increase. Although higher connectivity slightly impacts both estimators, ECM’s error escalates significantly. In contrast, ${ECM}_{ac}$ demonstrates strong robustness, maintaining a consistently low and stable error profile. Shaded areas represent 95% confidence intervals.

Figure 9. Joint effect of AR and H. Estimation error of ECM and ${ECM}_{ac}$ across the parameter space defined by AR and H, with bubble size encoding error magnitude. ECM’s error shows a strong, complex dependency on both factors, particularly at extreme AR values and under disassortative mixing ($H < 0$). In contrast, ${ECM}_{ac}$ displays a uniformly low error surface, highlighting its robustness to the interaction between network structure and degree–attribute correlations.

Figure 10. Bias surfaces of ECM and ${ECM}_{ac}$ in the $AR \times P(A)$ parameter space. The top and middle rows correspond to the P5 and P10 sampling quotas, respectively. ECM’s bias forms a pronounced ridge, escalating as AR deviates from 1 and $P(A)$ increases. In contrast, ${ECM}_{ac}$ produces a consistently flat, low-bias surface. The bottom row visualizes the point-wise bias reduction, highlighting that the correction is most effective precisely where ECM is most biased.

Table 1. Overview of real-world network datasets.

Network	Nodes	Edges	Density	Clustering	P(A)	AR
AIDS	31,385	32,390	0.00007	0.005	0.1750	0.54
PTC	5110	54,690	0.00040	0.006	0.1411	0.60
Git	37,700	289,003	0.00041	0.168	0.2583	0.49
Flickr	80,513	5,900,000	0.00182	0.165	0.4420	1.22
Tox	127,998	130,481	0.00001	0.003	0.1611	0.58
Twitter	580,800	1,400,000	0.00001	0.394	0.0331	1.64
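
The Density column in Table 1 can be related to the Nodes and Edges columns with a short calculation. The sketch below assumes the networks are treated as undirected simple graphs, so that density is the number of edges divided by the number of possible node pairs; the function name `density` is illustrative and is not taken from the article. Some edge counts in the table appear rounded, so agreement with the reported values is only approximate.

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Density of an undirected simple graph: realized edges over possible edges.

    Assumption: no self-loops and no multi-edges, so there are
    n_nodes * (n_nodes - 1) / 2 possible edges.
    """
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible


# (nodes, edges, reported density) copied from the Git and Flickr rows of Table 1
rows = {
    "Git": (37_700, 289_003, 0.00041),
    "Flickr": (80_513, 5_900_000, 0.00182),
}

for name, (n, m, reported) in rows.items():
    print(f"{name}: computed {density(n, m):.5f}, reported {reported:.5f}")
```

For these two rows the computed value agrees with the reported density to five decimal places under the stated assumption.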

Table 2. Bias (SD) and RMSE ($P^{best}$) of sample, ECM, and ${ECM}_{ac}$ estimators under varying AR in synthetic networks.

AR	Bias (SD): Sample	Bias (SD): ECM	Bias (SD): ${ECM}_{ac}$	RMSE ($P^{best}$): Sample	RMSE ($P^{best}$): ECM	RMSE ($P^{best}$): ${ECM}_{ac}$
BA network, $P (A) = 0.1$
0.7	0.020 (0.014)	0.037 (0.008)	0.010 (0.007)	0.024 (27.8%)	0.038 (0.6%)	0.013 (71.6%)
1	0.021 (0.015)	0.008 (0.006)	0.008 (0.006)	0.026 (19.6%)	0.010 (37.2%)	0.010 (43.2%)
1.3	0.020 (0.015)	0.028 (0.011)	0.007 (0.006)	0.025 (20.4%)	0.030 (6.8%)	0.009 (72.8%)
1.5	0.020 (0.015)	0.049 (0.012)	0.007 (0.005)	0.025 (23.8%)	0.050 (0.2%)	0.009 (76.0%)
ER network, $P (A) = 0.1$
0.7	0.025 (0.018)	0.027 (0.008)	0.009 (0.007)	0.030 (18.2%)	0.028 (6.8%)	0.011 (75.0%)
1	0.022 (0.016)	0.008 (0.006)	0.007 (0.006)	0.027 (14.2%)	0.010 (37.4%)	0.009 (48.4%)
1.3	0.022 (0.017)	0.026 (0.011)	0.007 (0.006)	0.028 (17.6%)	0.028 (6.8%)	0.009 (75.6%)
1.5	0.022 (0.017)	0.043 (0.012)	0.007 (0.005)	0.027 (19.6%)	0.045 (0.8%)	0.009 (79.6%)

Note: Bold values indicate the best performance (lowest absolute Bias or RMSE) in each row.
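
The quantities reported in Tables 2 and 3 can be illustrated with a small calculation. The sketch below is a hedged reading of the column headers, not the authors' code: it assumes Bias is the mean absolute error over repeated runs, SD is the standard deviation of that absolute error, RMSE is the root mean squared error, and $P^{best}$ is the share of runs in which an estimator has the smallest absolute error. The last reading is consistent with the percentages in each table row summing to approximately 100%. The function names and the toy estimates are hypothetical.

```python
import math
import statistics


def summarize(estimates, truth):
    """Return (mean absolute bias, SD of absolute error, RMSE) for one estimator."""
    abs_err = [abs(e - truth) for e in estimates]
    bias = statistics.fmean(abs_err)
    sd = statistics.pstdev(abs_err)
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    return bias, sd, rmse


def p_best(runs, truth):
    """Share of runs in which each estimator has the smallest absolute error.

    runs maps an estimator name to its list of estimates, aligned by run index.
    Ties go to the estimator listed first (an arbitrary choice for this sketch).
    """
    names = list(runs)
    n_runs = len(runs[names[0]])
    wins = dict.fromkeys(names, 0)
    for i in range(n_runs):
        best = min(names, key=lambda k: abs(runs[k][i] - truth))
        wins[best] += 1
    return {k: wins[k] / n_runs for k in names}


# Toy data (made up for illustration): four runs with true P(A) = 0.1
truth = 0.1
runs = {
    "sample": [0.12, 0.08, 0.13, 0.10],
    "ecm": [0.14, 0.13, 0.15, 0.14],
    "ecm_ac": [0.11, 0.10, 0.09, 0.12],
}

for name, est in runs.items():
    bias, sd, rmse = summarize(est, truth)
    print(f"{name}: bias={bias:.3f} (sd={sd:.3f}), rmse={rmse:.3f}")
print(p_best(runs, truth))
```

Because RMSE squares the errors before averaging, it is never smaller than the mean absolute error, which matches the pattern in every row of the tables.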

Table 3. Bias (SD) and RMSE ($P^{best}$) for different sampling strategies in synthetic networks.

Sampling Strategy	Bias (SD): Sample	Bias (SD): ECM	Bias (SD): ${ECM}_{ac}$	RMSE ($P^{best}$): Sample	RMSE ($P^{best}$): ECM	RMSE ($P^{best}$): ${ECM}_{ac}$
BA network, $P (A) = 0.1$
F	0.021 (0.016)	0.092 (0.014)	0.011 (0.008)	0.026 (29.1%)	0.094 (5.8%)	0.013 (65.1%)
P5	0.018 (0.014)	0.092 (0.013)	0.015 (0.010)	0.022 (41.8%)	0.093 (5.8%)	0.018 (52.4%)
P10	0.019 (0.015)	0.092 (0.013)	0.011 (0.009)	0.024 (33.0%)	0.094 (5.4%)	0.014 (61.6%)
W	0.021 (0.016)	0.092 (0.014)	0.010 (0.008)	0.026 (28.1%)	0.094 (5.7%)	0.013 (66.1%)
ER network, $P (A) = 0.1$
F	0.032 (0.024)	0.059 (0.012)	0.010 (0.008)	0.040 (19.4%)	0.061 (8.1%)	0.013 (72.6%)
P5	0.023 (0.017)	0.059 (0.012)	0.010 (0.008)	0.029 (26.9%)	0.060 (7.8%)	0.013 (65.3%)
P10	0.030 (0.022)	0.059 (0.012)	0.010 (0.008)	0.037 (20.6%)	0.060 (8.1%)	0.013 (71.2%)
W	0.032 (0.024)	0.059 (0.012)	0.010 (0.008)	0.040 (20.0%)	0.061 (7.7%)	0.013 (72.3%)

Note: Bold values indicate the best performance (lowest absolute Bias or RMSE) in each row.

Table 4. ${ECM}_{ac}$ coverage rates at the 90% nominal level. Each cell shows the coverage rates for BS-Ego (BS-Tree, BS-Pool). The value closest to the nominal 0.90 level in each cell is bolded.

P(A)	AR = 0.8	AR = 1.0	AR = 1.2	AR = 1.4	AR = 1.6	AR = 1.8
0.10	0.86 (0.95, 0.85)	0.83 (0.94, 0.83)	0.85 (0.98, 0.86)	0.93 (0.99, 0.91)	0.92 (0.99, 0.92)	0.88 (0.97, 0.87)
0.20	0.75 (0.92, 0.75)	0.83 (0.95, 0.77)	0.90 (0.96, 0.89)	0.83 (0.97, 0.88)	0.78 (0.94, 0.79)	0.81 (0.90, 0.81)
0.30	0.84 (0.95, 0.77)	0.84 (0.98, 0.86)	0.90 (0.97, 0.91)	0.85 (0.98, 0.84)	0.83 (0.95, 0.84)	0.89 (0.98, 0.88)
0.40	0.80 (0.93, 0.84)	0.86 (0.98, 0.88)	0.85 (0.96, 0.84)	0.84 (0.94, 0.84)	0.81 (0.94, 0.84)	0.82 (0.94, 0.83)

Note: Bold values indicate the coverage rate closest to the nominal 0.90 level in each cell.

Table 5. ${ECM}_{ac}$ coverage rates at the 95% nominal level. Each cell shows the coverage rates for BS-Ego (BS-Tree, BS-Pool). The value closest to the nominal 0.95 level in each cell is bolded.

P(A)	AR = 0.8	AR = 1.0	AR = 1.2	AR = 1.4	AR = 1.6	AR = 1.8
0.10	0.91 (0.98, 0.91)	0.89 (0.98, 0.90)	0.96 (1.00, 0.96)	0.96 (1.00, 0.97)	0.93 (0.99, 0.92)	0.95 (0.99, 0.95)
0.20	0.90 (0.99, 0.91)	0.91 (0.98, 0.92)	0.97 (0.99, 0.97)	0.97 (0.99, 0.96)	0.93 (0.99, 0.93)	0.91 (0.99, 0.93)
0.30	0.87 (0.98, 0.86)	0.97 (1.00, 0.96)	0.95 (0.99, 0.95)	0.96 (1.00, 0.95)	0.95 (0.99, 0.96)	0.89 (0.99, 0.90)
0.40	0.93 (0.99, 0.92)	0.93 (0.98, 0.92)	0.93 (0.99, 0.91)	0.94 (0.99, 0.93)	0.92 (0.99, 0.92)	0.85 (0.97, 0.91)

Note: Bold values indicate the coverage rate closest to the nominal 0.95 level in each cell.
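
The coverage rates in Tables 4 and 5 can be read as the fraction of repeated experiments whose confidence interval contains the true proportion. The sketch below shows that calculation and the "closest to nominal" comparison described in the table captions. It does not implement the BS-Ego, BS-Tree, or BS-Pool resampling designs; the intervals are made-up inputs, and the function names are hypothetical. The cell values used in the comparison are copied from Table 4 (P(A) = 0.10, AR = 0.8).

```python
def coverage_rate(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


def closest_to_nominal(rates, nominal):
    """Name of the method whose coverage rate is nearest the nominal level."""
    return min(rates, key=lambda name: abs(rates[name] - nominal))


# Toy intervals (made up for illustration) around a true P(A) of 0.10
intervals = [(0.08, 0.13), (0.09, 0.12), (0.11, 0.15), (0.07, 0.10)]
print(coverage_rate(intervals, truth=0.10))

# One cell of Table 4: BS-Ego (BS-Tree, BS-Pool) = 0.86 (0.95, 0.85)
cell = {"BS-Ego": 0.86, "BS-Tree": 0.95, "BS-Pool": 0.85}
print(closest_to_nominal(cell, nominal=0.90))
```

A rate below the nominal level means the intervals are too narrow on average, and a rate above it means they are conservative. This is one way to read the BS-Tree values in the tables, which sit at or above the nominal level in nearly every cell.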

Share and Cite

MDPI and ACS Style

Wen, K.; Mou, J.; Lu, X. Peer Reporting: Sampling Design and Unbiased Estimates. Entropy 2026, 28, 116. https://doi.org/10.3390/e28010116

