Estimation of Non-Linear Parameters with Data Collected Using Respondent-Driven Sampling

: Respondent-driven sampling (RDS) is a snowball-type sampling method used to survey hidden populations, that is, those that lack a sampling frame. In this work, we consider the problem of regression modeling and association for continuous RDS data. We propose a new sample weight method for estimating non-linear parameters such as the covariance and the correlation coefﬁcient. We also estimate the variances of the proposed estimators. As an illustration, we performed a simulation study and an application to an ethnic example. The proposed estimators are consistent and asymptotically unbiased. We discuss the applicability of the method as well as future research.


Introduction
Respondent-driven sampling (RDS) is a refined form of snowball sampling for collecting data from hidden populations that lack a sampling frame. Therefore, these populations are difficult to reach and cannot be dealt with using traditional sampling techniques. RDS was first introduced by Heckathorn [1] and was developed afterwards by Salganik and Heckathorn [2] and Volz and Heckathorn [3]. Some recent papers in parameter and variance estimation are given by [4][5][6]. Some popular examples of RDS are HIV at-risk people, the LGBTI community and injection drug users [7][8][9]. Other examples are given in [10,11].
A long-standing problem in non-probabilistic sampling is regression modeling for RDS data. Several authors have considered this issue from different perspectives. For instance, there have been a surge of studies in machine learning and statistical framework to solve similar problems. Wong et al. [12] studies the problem of biased standard errors of non-linear transport models. Imani et al. [13,14] use an approximation of a distribution using a Markov chain Monte carlo (MCMC) algorithm and an approximate MCMC implementation, respectively.
Model-fitting should incorporate sample weights as well as information about correlation between sample units [3]. Avery et al. [15] review some available methods in an RDS framework dealing with this problem in a simulation study with binary response. One popular method for addressing the problem of the correlation structure between recruits and recruiter in a network is clustering, that is, transforming RDS data into clustered data [16][17][18]. Clustering is connected with the homophily in the population, that is the tendency to associate with those with similar characteristics. Becket et al. [16] use clusters to consider correlation for estimating diabetes prevalence and discrete covariates in an empirical study. Nevertheless, using clustering can be problematic as sometimes clusters, if they exist, may be difficult to identify. Hubbart et al. [19] point some of the problems involved with using generalized estimating equations (GEE) when the number of clusters is small and Rao et al. [20] stress some of the problems involved with not adjusting properly to clusters.
Regression methods for analyzing data have not been fully validated [16]. Therefore, as no clear approach for regression modeling in RDS is available, some authors use standard statistical methods without adjusting for RDS data [21,22]. On the other hand, some authors use weighted regression for estimating prevalence of characteristics of interest in real-life examples [23][24][25][26]. Most of them use individual weights calculated (typically with the respondent driven sampling analytical tool (RDSAT)) and export them to standard statistical software to apply the weighted method. Methods incorporating sample weights will tend to improve their performance when homophily is small in the population, as they typically do not account for the potential dependence of the units. While this might be an issue in populations with high homophily, there is no clear reliable regression method in RDS accounting for clustering, that can be extensively used in applications. Adjusting for clusters requires knowledge of the population and if it is not well performed in practice, or even if clusters do not actually exist in the population, might result in biased estimates [20]. Our method addresses the problem of regression modeling and association between continuous variables by proposing a new sample weight estimation method for continuous data. The focus of our work was to propose a method for estimating non-linear parameters such as the covariance and the correlation coefficient. We derived expressions for the estimators that make use of the RDS estimators admitting continuous data and showed that they share properties with them, such as being consistent and asymptotically unbiased. A diagram is given in Figure 1. We also estimated the variances of the proposed methods. Our method may fill a gap as no such an approach has achieved in an RDS framework: most studies incorporate the weights using standard statistical software and unlike our proposal, they are focused on prevalence estimation.  We begin in the next section by introducing respondent-driven sampling. In Section 3, we propose estimators for the population covariance and for the correlation coefficient. Estimators of the variances of the proposed estimators are considered in Section 4. A simulation study was performed to illustrate their performance, described in Section 5. An application to study the living conditions of Indigenous, Montubios and Afro-Ecuadorian young people are presented in Section 6. Finally, Section 7 presents concluding remarks.

Background
The main idea behind the estimation in RDS [3] is to treat this sampling as a random walk on an undirected network. It is well known from Markov chain theory that the stationary (equilibrium) probability of a node is then proportional to its degree.
We assume the target population consists of N people (nodes) with labels 1, ..., N. We assume the target population is connected by a network of mutual relations with N × N adjacency matrix Z. That is z ij = z ji = 1 if i and j are connected and 0 otherwise. We define the nodal degree of a the person i, δ i = ∑ j z ij , as the number of network ties or alters of node i.
An small initial sample s is selected from the population members accessible to researchers that are called the seeds and comprise wave 0 of the sample. Each member of wave v is given a number of uniquely identified coupons to distribute among their alters. Coupon recipients returning their coupons to the study center are subsequently enrolled in the study. The wave number of a respondent is one more than that of their recruiter. This procedure is repeated until the desired sample size, n, is attained.
Let the N-vector y represent a variable of interest. If y have binary response and groups A and B, the more usual estimators in RDS are the RDS-I ratio estimator, the RDS-II estimator [3] and the Gile and Hanckock [27] version for sampling with replacement.
The RDS-I estimator for estimating proportions with y binary response and groups A and B is defined asp withĈ AB = r AB r AB +r AA , r AB is the number of people of A s recruiting B s in the sample, r AA the number of people of A s recruiting A s in the sample,Ĉ BA = r BA r BA +r BB , r BA is the number of people of B s recruiting A s in the sample, n A and n B the number of sample units belong to groups A and B respectively, andD A andD B are the average degree of people in groups A and B, respectively.
The RDS-SS [27] estimator for estimating proportions: withπ(δ k ) the estimated population distribution of degrees through successive sampling. The RDS-II estimator of the mean Y allows continuous variables and takes the form of the Hajek estimator as follows: with δ k the degree reported by respondent k.

Estimation of Some Non-Linear Parameters
The widespread use of regression based on sample survey data requires a careful assessment of the use of standard techniques. It is clear that usual estimators of parameters involved in regression are not valid in the case of RDS scheme. In this section, we develop some estimators for population variances, covariances and the correlation coefficient.

Estimation of the Variance and the Covariance
We define the population covariance as: We can write this parameter as: Similarly, the finite population variances are defined as Let us construct estimators for these parameters assuming that y k and x k are observed for the units of the RDS sample s.
If there existsθ 1 ,θ 2 andθ 3 consistent estimators of θ 1 , θ 2 and θ 3 , a consistent estimator of S yx will beŜ We can estimate these totals with the RDS-II estimator: Then, the estimator of the covariance iŝ If N is large,T yHH can be written in a more straightforward way that does not depends on N: Using the idea of the RDS-SS estimator, we propose to estimate the totals as: T ySS = ∑ s y kπ (δ k ) −1 ,T yxSS = ∑ s y k x kπ (δ k ) −1 andT yySS = ∑ s y 2 kπ (δ k ) −1 , beingπ(δ k ) the estimated population distribution of degrees through successive sampling.
Then, the estimator of the covariance iŝ If N is unknown, a consistent estimator for S yx is: RDS-SS and RDS-II estimators of a total are asymptotically unbiased, thus the proposed estimators will be asymptotically unbiased.

Estimation of the Correlation Coefficient
In this section, we consider the estimation of the correlation coefficient between two variables, say y and x, defined by Two estimators for this parameter can be obtained by using RDS-II and RDS-SS estimators which are previously defined: and

Estimation of the Variances
We consider the variance estimation of the covariance of S yx Using a Taylor linearization, we write . They are estimated byŵ 1 = w 1 ,ŵ 2 = −T x N(N−1) ,ŵ 3 = −T y N(N−1) .
Note: A more straightforward computational expression can be derived from formulae 5.5.10 in Särndal et al. [28] We estimate the variances and covariances of the above-mentioned totals for the RDS-II estimator asV The proposed RDS-II estimator is only analogous to the Hansen and Hurvitz estimator [29], but as data are correlated in an RDS framework, the above-mentioned estimators can perform poorly. Even though Volz and Heckathorn [3] derived a variance estimator that accounts the MCMC structure of the sample for categorical variables, we can not use this variance estimator in this context.
We estimate now the variances and covariances of the totals for the RDS-SS estimator by using the Deville and Särndal [30] method for estimating the variance of the Horvitz-Thompson estimator. The variances are estimated as − ∑ l∈s a l y l x l /π(δ l )) 2 , The covariances are estimated as where a k = (1 −π(δ k ))/ ∑ l∈s (1 −π(δ l )).
As the correlation coefficient estimators are ratio estimators, the estimators of their variances can be easily obtain by using Taylor linearization (see e.g., [31]).

Simulation Experiments
In this section, a limited simulation study was carried out to illustrate the performance of the proposed estimators under different scenarios. The main factor of interest was the estimation of the population covariance and the correlation between continuous covariates. We used our own code written in R to compute the proposed estimators. Programming details and code are available from the authors.
The simulated population size was N = 10000. A NxN network connection indicator matrix C was randomly generated, with c ij either 0 or 1, a connection indicator between node i and j, for i, j = 1, . . . , N. Resulting c ij will determine degree, as ∑ i∈U,i =j c ij = δ j . Ten seeds were selected at random from the network with probability proportional to their degree, with three maximal coupons issued for each participant.
The values of the variable of interest y were generated from a normal distribution y j ∼ N(5000, 500), for j = 1, . . . , 5000. Three auxiliary variables were then generated from the values of y, which were: x 1 = (y − e 1 )/0.5 with e 1 ∼ N(500, 500), x 2 = (y − e 2 )/0.5 with e 2 ∼ N(500, 700) and x 3 = (y − e 3 )/0.5, where e 3 ∼ N(500, 300). The resulting correlation coefficients were ρ = 0.7007 for x 1 , ρ = 0.571 for x 2 and ρ = 0.8579 for x 3 , respectively. The simulations were also performed for other different covariates and therefore different values of ρ, but the results were qualitatively similar and hence are not reported here. Sample size was n = 500 and samples were selected using simple random sampling without replacement, just like RDS is usually conducted in practice. For each regression model, we computed the two proposed estimators of the population covariance S yx and the correlation coefficient ρ. We investigated the percent relative bias rb% = E MC ( θ − θ)/θ * 100, and the percent relative mean squared error for each estimator S yx and ρ. Simulation results were based on B = 1000 samples and E MC denotes the average of the Monte Carlo replications.
The estimators of the covariance are approximately unbiased, as relative biases are around 1% for all scenarios considered, with even lower biases for the correlation coefficient estimates, with all of them less than 1%, as shown in Tables 1 and 2. Small relative efficiency values for estimating the parameters with quite similar results obtained with both estimators, indicating that they are effective in estimating these non-linear parameters.

Application to a Real Survey
In this section, the proposed estimators were applied to a real survey involving discrimination and the under-representation of young Indigenous, Montubios and Afro-Ecuadorian people in Ecuador. The RDS methodology was applied to a population of young (18 to 29 years old) Indigenous, Montubios and Afro-Ecuadorian people living in the city of Riobamba (Ecuador). They have historically been suffering from exclusion and under-representation and therefore, this group lacks a reliable sampling frame [32][33][34][35]. A total of 814 people were recruited in six waves and questioned on their social and economic background and living conditions using a dual system of incentives to motivate recruitment. The reported income of the household is the variable of interest and the age of the respondent is the covariate. This is unpublished data that is intended for publication in a manuscript that is in preparation [36].
Good overall performance of the two proposed estimators for the covariance and the correlation coefficient, with a bias approximately around 5% and similar small values of the relative mean squared error rmse, as shown in Tables 3 and 4.

Discussion
RDS were used extensively to study the prevalence of a disease. As more RDS practitioners are incorporating this methodology to their toolbox, model-fitting in an RDS framework has become an important issue of interest. We proposed a new sample weight estimation method for continuous data. Our approach is most appropriate for situations in which homophily is small. While we consider this is a novel approach for continuous RDS data, accounting for clustering remains an open question. It is possible to extend this methodology to adjusting to clusters, as part of future research.
As an illustration of the applicability of the proposed method, we performed a simulation study and an application to an ethnic example. Nevertheless, the focus of our work has been to propose a method for estimating non-linear parameters with new sample weights. We derived expressions of the variances and showed that the proposed estimators have desirable properties. Our simulation study does not show significant differences in terms of bias or root mean square error between the two proposed estimators. Furthermore, the calculation complexity of the two estimators is similar. There is therefore no objective reason to prefer one over the other.
Taken together, the results about the dependence between continuous variables presented in this paper add to the growing literature on respondent-driven sampling, allowing researchers to obtain better information about key hidden populations.
Author Contributions: The authors contributed equally to this work in conceptualization, methodology, software and original draft preparation. All authors have read and agree to the published version of the manuscript.