Design-Based Approach for Analysing Survey Data in Veterinary Research

Sample surveys are an essential approach used in veterinary research and investigation. A sample obtained from a well-designed sampling process along with robust data analysis can provide valuable insight into the attributes of the target population. Two approaches, design-based or model-based, can be used as inferential frameworks for analysing survey data. Compared to the model-based approach, the design-based approach is usually more straightforward and directly makes inferences about the finite target population (such as the dairy cows in a herd or dogs in a region) rather than an infinite superpopulation. In this paper, the concept of probability sampling and the design-based approach is briefly reviewed, followed by a discussion of the estimations and their justifications in the context of several different elementary sampling methods, including simple random sampling, stratified random sampling, and one-stage cluster sampling. Finally, a concrete example of a complex survey design (involving multistage sampling and stratification) is demonstrated, illustrating how finding unbiased estimators and their corresponding variance formulas for a complex survey builds on the techniques used in elementary sampling methods.


Introduction
Sample surveys, where data from a subset, or sample, of a population are used to make inferences about that population, are a traditional research methodology which has been widely used in veterinary research and investigation [1,2]. However, in this era of "big data", with modern techniques such as machine learning, bioinformatics, or other computer-based technologies being increasingly used in veterinary research [3] across areas such as animal behaviour [4] and disease detection [5] and prediction [6], the sample survey is in danger of appearing "old fashioned" and "out-dated".
However, the "old-fashioned" sample survey still has some advantages over cuttingedge big data methodologies. Firstly, in a sample survey, information or data can be collected actively in order to answer a specific research question, whereas the research question that can be answered by "big data" techniques is dependent on what information is available in the big data source. Secondly, in a well-planned sample survey the target population can be framed in advance and followed by a well-designed sampling process so that the samples are representative of the population [7]. This representativeness is often not achieved during the passive "big data" collection process, with data often being collected only from a particular subset of the target population-e.g., Revilla, et al. [8] analysed more than 10.5 million measurements from~13,000 pigs obtained using automatic feeding systems. However, this dataset was collected from only one boar testing station, making generalisation to the wider population potentially difficult. Finally, it is not economically feasible to collect "big data" when novel information is required for some specific research topics.
Once the specific data for a research topic are collected, a rigorous and robust data analysis is essential to gain insight from the sample survey. As per Alexis de Tocqueville: "when statistics are not based on strictly accurate calculations, they mislead instead of guide" [9]. Generally, there are two approaches for analysing survey data: the model-based approach and the design-based approach [10]. The former is possibly better understood by many veterinary researchers who have undertaken standard quantitative research methodology training, as mainstream statistical training usually treats the observed datae.g., production data or diagnostic test outcomes-as realisations of some relevant random variables. However, one important assumption in this method is usually overlooked-that the underlying population is treated as a "superpopulation" which contains infinitely many animals [11]. Strictly speaking, an estimated model parameter therefore refers to a property of the hypothetical "superpopulation" rather than a characteristic of the finite population which is of actual interest [7].
For example, suppose that simple random sampling had been implemented on a dairy farm to study the prevalence of bovine digital dermatitis using a large sampling fraction (~70% of the herd size). The analyst then fitted an intercept-only logistic regression to estimate the intercept. As the intercept represents the logit of the prevalence, using a suitable back-transformation, the analyst was then able to report the estimated prevalence of digital dermatitis on the farm. However, given a large sampling fraction, this estimate actually represents the prevalence of a hypothetical superpopulation from which the sample was drawn rather than the prevalence on the farm of interest [12]. To make an inference to the actual finite population, a superpopulation approach can be used [12]; however, although mathematically correct, explaining the approach is likely to create difficulties in communications with other rural professionals or companion animal practitioners.
The design-based approach for analysing survey data avoids the complexity in analysis and communication seen with the model-based approach. One key advantage of the design-based approach is that it focuses on inferences related to the target finite population(s) without introducing extra assumptions about the parametric form of the outcome variable. In addition, the analysis steps are consistent with the sampling steps, so the process of checking for potential mistakes during the analysis process is clearer [13]. Thirdly, the design-based approach has no requirement of an "assumed" probability distribution dictated by the design itself [14]. The aim of this review article is to provide a comprehensive introduction to the design-based approach for analysing survey data by (1) describing the analytical methods for elementary probability sampling methods, including simple random sampling, stratified random sampling, and cluster sampling, and (2) to demonstrate the key ideas necessary to understand and interpret those analytical methods, as well as how those ideas can be used to develop methods for any specific complex survey design.

Overview of Probability Sampling
First, we define a set U as a target population including M animals in a finite population (e.g., animals on a farm or all >2-year-old Jersey cows in a region). There are various ways to obtain samples which are just some subsets of U. Let us denote S for a particular sample chosen from U, then S ⊂ U. With the proper subset notation "⊂", we restrict the sample size to being smaller than the population size. Suppose we want to obtain a random sample with m animals, we could have M m = L different samples: S 1 , S 2 , S 3 , . . . We can then define a set "sample space" (denoted as Ω) that contains all these samples. With probability sampling, a probability can be explicitly assigned for each of the samples, with the constraint that ∑ L i=1 P(S i ) = 1, as the axiom states that the probability of a sample space is 1 and the union of all the samples forms the sample space. The probability of obtaining each of the L samples does not have to be constant-i.e., P(S 1 ) = P(S 2 ) is absolutely acceptable-and we can also restrict the probability of a particular sample to 0 if some animals within the sample are considered inappropriate as study units. The other feature of these samples is that two samples can include the same animals, and the probability of an animal k being selected (π k ) is calculated by summing the probabilities of all samples including this animal-i.e., π k = ∑ S:k∈S P(S). An intuitive numeric example is displayed in Figure 1. Eventually, we define the sampling weights w k as the reciprocal of the inclusion probability π k for any type of sampling method [15]. Generally, it is recommended that the veterinary researcher interprets the sampling weight of the animal k as the number of animals in the target population represented by this animal (a deeper treatment of sampling weights can be found in Gelman [16]; however, non-response adjustments are beyond the scope of this article).

Overview of Design-Based Approach
In the design-based approach, the observed value (production record or test outcome) y k is not considered to be a realisation generated from some data generation mechanism (or "population"); instead, it is regarded as a fixed constant, with the randomness arising solely from the sample selection [17]. In plainer language, although y k is fixed, it remains unknown unless the animal is selected in the sample [18]. If all animals are tested from a given population, then the test results for all the animals are known without any uncertainty. In contrast, if we only test a sample of these animals, the only outcomes that we know are the animals included in the sample. If we randomly draw samples of a fixed size repeatedly from a target population, a particular animal may not be sampled repeatedly due to the randomness in the sample selection; hence, the sample statistics can vary across the samples. That is the only source of randomness. Therefore, it is natural to define a Bernoulli random variable to indicate whether an animal in the target population is also in the sample. For example, if the kth animal is included in the sample, the Bernoulli random variable Z k = 1; otherwise, it is Z k = 0. This random variable maps the (hypothetical) animal ID in the population into numeric values for selection status. This idea is essential when studying the properties of the estimators (unbiasedness) and deriving the variances of these estimators [19].
The design-based approach is of particular value when the finite population characteristics are of interest, as when a design-based approach is used, the researcher can direct inferences about the finite target population, even if the sample size to population size ratio is not small (i.e., when the finite population correction must be considered). For example, the prevalence in a finite population is interpreted as the proportion of diseased animals in that population. Assuming that 70% of the population is sampled, a design-based approach gives a direct estimate of this population proportion, which is often a key target of veterinary investigations. In addition, the estimators (i.e., the rules or formulas) for estimating the finite population characteristics are consistent with the sampling method. Therefore, the estimation process is naturally understandable and easier to communicate with non-statistically inclined veterinarians and researchers [20]. Finally, with the designbased approach, the analyst does not need to decide which potential model generated the data, as the observed values are treated as fixed constants. For example, if the average milk production in a herd is of interest, one does not need to assume that milk yield from a cow is generated from a normal random variable, particularly when it is not. One just needs to calculate the sample mean as an estimate of the average milk production in the herd. Finally, whether an estimator is unbiased (i.e., whether its expected value and the true value of the parameter are effectively equivalent) is not dependent on the parametric form of the observed value.

Overview of Model-Based Approach
Although this approach may be more familiar to researchers, we do not advocate this approach in this paper from a practical point of view. As with most mainstream statistical methods, this approach treats the observed value as a realisation of an underlying random variable. For example, the test result y k of the kth animal is generated from a random variable Y k , whose parametric form must be decided. If Y k is a Bernoulli random variable, then with the model-based approach a likelihood-based method is the most common approach for estimating the probability that a random animal will test positive. However, this probability is not a finite population prevalence; it is more correctly interpreted as a hypothetical infinite superpopulation prevalence. Although extra steps can help to make an inference back to the finite population, this significantly adds complexity to aspects of analysis and in communicating the results to stakeholders without a statistical background [12]. The other major disadvantage of the model-based approach is that the estimated parameters may be biased if the model is mis-specified [20].

Simple Random Sampling
Simple random sampling (SRS) is the most basic form of probability sampling. In this process, all possible samples of a given size have the same probability of being selectedi.e., P(S) is constant for every possible sample. As a result, all the animals in the population have an equal probability of being included in the sample-i.e., the inclusion probability π is the same for every animal [21]. This sampling process has been applied to many veterinary studies, including recent investigations of lumpy skin disease [22], bovine mastitis [23], and foot-and-mouth disease [24]. Despite its simplicity, in the right situation it can be a powerful sampling method and provide the theoretical basis for more complicated sampling methods. There are two forms of SRS-with and without replacement. In this article, we will limit the discussion to SRS without replacement (the sample contains no duplicated animals,) as this is by far the most common practice in veterinary research.
The statistics in which a researcher is usually interested are the properties of the population, e.g., the average milk production of the herd or the prevalence of a disease within the herd. We denote this finite population mean as µ and prevalence p is just a special case of the finite population mean when the individual outcome value y k can only be 1 or 0. In the SRS setting, estimating the mean is straightforward. However, for other sampling processes this is not always the case; hence, it is easier to start to estimate the finite population total before moving on to the mean (which is a linear function of the total). To ensure a consistent methodology is used in this review, we will stick with the two-step process-estimating the total first and then the mean or prevalence.
Suppose we have a herd with M animals, of which a sample of m animals has been obtained using SRS. The Horvitz-Thompson (HT) estimator of the finite population total is [25]: (1) In the SRS setting, the sampling weight w k is a constant as the inclusion probability is the same for every animal, such that π k ≡ P(Z k = 1) = m M (see Appendix A for technical details), where Z k is the Bernoulli random variable for selection and Z k = 1 if the animal k is selected; otherwise, Z k = 0. The HT estimator is, by design, unbiased-i.e., its expected value is equal to the true value of the finite population characteristic [7]: where E[] is the expectation operator which takes all the possible values generated by the random variable and returns the weighted average value, so E[ The unbiased estimator for the mean y or the prevalencep is, therefore: The proof is trivial. By observing Equation (2), we see that, in SRS, the sample mean or sample proportion is the unbiased estimator for the population mean or prevalence. This means that, for other sampling strategies, building up the sample mean from SRS will also result in an unbiased estimator if done correctly.
To derive the variance of the estimator for the mean or the prevalence, it is also easier to start with the variance of the total. The detailed derivation can be seen in Appendix B; here, we only provide the formulas for the variances. First, the variance for the estimated population total is: where is the variance of the finite population. In the special case where we estimate prevalence, we can replace µ with p with some algebra (see Appendix B), resulting in σ 2 = Mp(1−p) M−1 . Therefore, the variances for y andp are given as follows: However, the finite population variance depends on an unknown quantity µ or p, which we are attempting to estimate; in practice, we often replace σ 2 with , which is the sample variance (or p withp). Therefore, the estimated variance for y andp is: where 1 − m M is usually referred to as the finite population correction factor [26]. To illustrate this process, consider an investigator who wants to estimate the prevalence of digital dermatitis in lactating cows in a dairy herd. A random sample of 100 cows is obtained from a herd of 300 cows, of which 35 sampled cows are diagnosed as diseased. These 35 cows have records y k = 1 and the remining 65 sampled cows have records y k = 0. The estimated prevalence is calculated using Equation (2), thus it is 0.35. The variance of this estimate is calculated using Equation (7). As the actual prevalence is unknown, we need to use the estimated prevalence to calculate the estimated variance:

Stratified Random Sampling
In the stratified random sampling procedure (STRRS), the target finite population (e.g., the total number of animals within a herd) is partitioned into non-overlapping groups based on some pre-defined attributes and each of the groups is referred to as a stratum. These strata constitute the entire population; therefore, each animal belongs to a specific stratum. Within each stratum, SRS is commonly used to sample animals, and the sampling processes in the different strata are independent [27]. There is no requirement to select all strata within a population. If only some strata are of interest (e.g., only those which include lactating cows), these can be selected and strata that are not of interest can be excluded. If this approach is used, it needs to be made clear that the target population is no longer the entire finite population, but rather the population represented by the selected strata.
The finite population mean or prevalence is then estimated by pooling the information from all the strata. Like SRS, STRRS is commonly used in veterinary research, for example stratification by area. This allows the researcher to investigate prevalences and associations across a country or a region-e.g., Heayns and Baugh [28] investigated the opinions of veterinarians across the UK about serological testing to assess revaccination requirements in dogs. In this study, each county of Great Britain was considered as a stratum and 10% of the small animal veterinary practices within each stratum were randomly selected (if there were fewer than 10 practices in a county, one practice was randomly chosen to represent the county). Similarly, Atuman et al. [29] investigated dog ecology, dog bites, and rabies vaccination rates in Bauchi, Nigeria, using STRRS. They stratified Bauchi into five areas, and within each area randomly selected 10% of the streets for direct street counts and the administration of a questionnaire. However, other sources of strata are also used-e.g., as part of a randomised clinical trial of footrot treatments in Kashmir, India, Kaler, et al. [30] allocated sheep with acute footrot to one of three treatments using STRRS, with the strata being based on each sheep's maximum footrot score. Stratification is useful to ensure that the sample includes individuals which could otherwise be missed by chance in SRS due to the limited number of individuals in their stratum. For example, at a certain period a pig farm in Hong Kong may keep few finisher pigs, but many piglets and sows are present on the farm. With SRS, it is likely that none of the finisher pigs is included in the sample, therefore one can argue that there is error in the representation of the population which could potentially dimmish the accuracy of the estimate. For this reason, it is also common to sample a fixed number of individuals in each stratum. Compared to SRS, however, extra information such as the variable used for stratification (membership) must be obtained for all sampling units.
If STRRS has been used, care is required when pooling the information from the strata in order to obtain an unbiased estimator for the finite population mean or prevalence. A "natural" estimator for the mean/prevalence might involve summing up all the observed values in the sample and dividing by the sample size (equivalent to the process of the SRS). However, this estimator is unbiased if the sample size in each of the strata is proportional to the actual size of the stratum-i.e., there has been proportional allocation (this is demonstrated in more detail in Appendix C). The more general common approach to obtain an unbiased estimator for the finite population mean or prevalence follows the two principles we have mentioned: (1) following the actual sampling process and (2) starting with the finite population total. Consider a farm with M animals. A researcher has created J strata based on the ages of the animals. For the jth stratum, there are M j animals, and clearly M = ∑ J j=1 M j . Suppose that m j animals are sampled using SRS independently from each of the strata and that the value of the variable of interest is denoted as y jk for the kth animal in the jth stratum.
The unbiased estimator (using weight notation) for the finite population total: where w jk is the sampling weight which is the reciprocal of the inclusion probability π jk . For STRRS, this is the probability of the kth animal in the jth stratum being selected. However, writing the estimator in this form is not very intuitive, and it can be rewritten into a different formula in order to provide a more intuitive and meaningful picture for veterinary researchers. As SRS has been implemented within each of the strata, the inclusion probability π jk for the kth animal in the jth stratum is simply the sample size m j divided by the stratum size M j , which leads to w jk = 1 π jk = M j m j . Now, Equation (8) can be rewritten as:τ This formula says that in order to estimate the finite population total, we need to first compute the mean/prevalence for each of the strata y j ≡p j = ∑ m j k=1 y jk m j using the estimator we have seen in SRS and then multiply it by the stratum size M j to obtain the estimated total for each stratum. We then sum up all these estimated stratum totals to obtain the estimated finite population total. This is consistent with and follows the actual sampling process, as well as producing an unbiased estimator: where Z jk is the Bernoulli random variable for selection, representing whether the kth animal in the jth stratum is selected with an inclusion probability π jk , and E Z jk = π jk = m j M j due to SRS. Once the estimated total is found, the estimated finite population mean or prevalence is just the total divided by the population size: Since each stratum is independently sampled, building on the SRS, the variances for y andp using STRRS are also straightforward: where both σ 2 j and p j are unknown quantities representing the population variance and prevalence in the jth stratum. Similar to the SRS, the estimated variances are obtained by substituting estimated quantities into the unknowns, such as: where s 2 j is the sample variance of the jth stratum and the formula is given in the SRS section.
To illustrate this, consider an investigation of the seroprevalence of pseudorabies on a farm where STRRS is used. First, pigs are divided into groups based on the five production stages (strata): piglets, weaners, growers, finishers, and sows (breeding herds). The total numbers of pigs in each stratum are 30, 30, 40, 20, and 60, respectively. Within each stratum, a fixed number of pigs (10) are sampled using SRS and the numbers of infected pigs are 5, 6, 3, 2, and 7. The estimated prevalence can then be calculated using Equation (10): 30 10 ×5+ 30 10 ×6+ 40 10 ×3+ 20 10 ×2+ 60 10 ×7 30+30+40+20+60 = 0.506. The variance of this prevalence estimate can then be estimated using Equation (14). This is carried out stratum by stratum; for example, for the . This process is then repeated for all the strata, and the estimated variance is the sum of the quantities calculated for each stratum. In the example, the final estimated variance is 0.004.

Cluster Sampling
In this sampling method, the animals in a finite population (animals in a herd, region, or country) are aggregated into larger sampling units: clusters. A cluster is similar to a stratum; however, the sampling process is different. In a cluster sampling procedure, a set of (n) clusters is sampled using SRS from a population with N clusters. These clusters are usually referred to as primary sampling units, and the members within each cluster as secondary sampling units. Within the primary sampling units, all secondary sampling units may be measured or observed (one-stage cluster sampling) or the secondary sampling units may be sampled using SRS (two-stage cluster sampling). The selected individuals within the selected clusters then form a sample of the finite population [26]. In contrast, in STRRS all strata of interest must be included, and SRS is usually used to sample individuals within each stratum. These different sampling strategies mean that the sources of variability in cluster sampling are different from those in STRRS. In STRRS, the variability of the estimated population mean/prevalence arises only from individual variability within a stratum. For cluster sampling, the variability of the estimated population mean/prevalence comes from one or more sources [27]. In one-stage cluster sampling, where all individuals in a selected cluster are included, the variability of the estimated population characteristic or quantity is dependent on the variability between clusters. In two-stage cluster sampling, where only a sub-sample is collected from selected clusters, the variability of the estimated population characteristic comes from two sources: the within-and between-cluster variabilities [31]. One advantage of cluster sampling is that it overcomes some of the logistics issues associated with SRS or STRRS and therefore generally requires less spending on administration and travel expenses. However, the estimates provided by cluster sampling are usually less precise than those provided by SRS, given the same sample size [27].
Cluster sampling is possibly the most widely used approach in livestock research. Usually, a farm or a herd is regarded as a cluster and a number of farms/herds are selected. This was the approach adopted by Getahun, et al. [32], who studied mastitis and antibiotic resistance patterns in dairy cows in central Ethiopia. This design treated a farm as a cluster and a number of farms were chosen using SRS; within each farm, all the dairy cows were sampled. A similar approach was later used to estimate the prevalence of bovine tuberculosis in southern Ethiopia [33]. In this study, the target population was only cows above 6 months of age, and all cows above 6 months old were included on the selected dairy farms (clusters). We list here three examples of two-stage cluster sampling in veterinary research for interested readers [34][35][36]. In the rest of this section, we will first provide insights into the estimation process for one-stage cluster sampling and do the same for a two-stage cluster sampling where STRRS instead of SRS is used at the second stage (essentially a complex sampling) with details.

One-Stage Cluster Sampling
In one-stage cluster sampling, all animals within a farm are sampled; therefore, the farm total τ i = ∑ M i k=1 y ik is directly measured, where y ik is the value of the variable of interest measured for the kth animal on the ith farm given the herd size of M i . Common research tasks might be to estimate the farm-level and animal-level averages, such as the average milk production or average number of positive animals per farm and average milk production per cow or overall prevalence at the animal level. Suppose n farms are sampled from N farms in a region using SRS. As before, to estimate the population mean or prevalence it is always recommended to start by estimating the total. Since SRS is used for sampling clusters, the unbiased estimator for the finite population total (e.g., the number of all diseased dairy cows in a region) is straightforward and therefore given without proof: The variance and estimated variance for this estimator can also be straightforwardly de- The estimated farm-level average and its corresponding variance and estimated variance are straightforward: The total number of animals in the region is M = ∑ N i=1 M i . Hence, the estimated average at the cow level is given by: The variances and estimated variances for the cow-level average or overall prevalence are given as: Note that at the farm level, we work on counts of positive animals instead of binary values even if we are estimating prevalence, therefore the variance formulas for y andp are indistinguishable.

Two-Stage Cluster Sampling
The main purpose of this section is to illustrate the estimation process for a complex survey-i.e., how to obtain the unbiased estimators and derive their corresponding variances. Suppose there are M dairy cows in a region with N dairy herds. The herd size for herd i is M i . The cows are separately managed based on a certain criterion; that is, within the ith herd there are J groups, and within each of the groups there are M ij cows. The groups can be treated as strata, as they are not overlapping and constitute the entire herd. A research team is interested in knowing the prevalence of a disease among cows in this region. Based on the demographic information, a two-stage cluster sampling is decided. First, n herds will be selected using SRS. Within each of the sampled herds, STRRS will be used to sample cows from each of the strata in each of the herds. Before going to the estimation process, we shall define some notations (Table 1). Table 1. Quantities used in a two-stage cluster sampling design, where stratified random sampling is implemented in the second stage.

N
The number of dairy herds in the region. n The number of dairy herds in the sample.

M ij
The number of cows in the jth stratum in the ith herd. m ij The sample size in the jth stratum in the ith herd.

M i
The number of cows in the ith herd (herd size for herd i), The sample size for herd i.

M
The total number of cows in the region, The disease outcome (1/0) of the kth cow in the the jth stratum in the ith herd.
τ i The total number of diseased cows in the ith herd, The herd prevalence for the ith herd, The total number of diseased cows in the region, The overall prevalence in the region, The ultimate goal for this sample survey is to estimate p; however, as in the previous examples it is the best to start by estimating the total τ. Additionally, the computation process needs to be consistent with the actual sampling steps. Thus, we start by estimating the total diseased animals in the jth stratum in the ith herd. Within each stratum, SRS is used, therefore the estimated total can be computed based on Equation (1). The second step is to estimate the total diseased animals in the ith herd. Because we used STRRS, this can be achieved by adopting Equation (9). Finally, we can estimate the total number of diseased animals in the region by using Equation (15), as SRS is used to select herds. Hence, the unbiased estimated region total is computed in the following way: To prove that the outcome of this process is unbiased, we simplify the notation, lettinĝ We know thatτ i is unbiased (namely, E[τ i ] = τ i ), because we have used STRRS. Secondly, we specify a binary indicator variable Z i that = 1 if herd i is selected or 0 if it is not. Let π i denote the probability that herd i is selected (inclusion probability of a herd); we then have π i ≡ P(Z i = 1) = E[Z i ] = n N , since SRS is used for the first stage of selection (i.e., the selection of herds). Given that sampling within any herd is independent of the sampling in any other herd and thatτ i is independent of Z = (Z 1 , Z 2 , . . . , Z N ), we have: N n Z i τ i (unbiased estimator for stratified random sampling for each herd) linear property of expectation). Therefore, the unbiased estimator for the overall prevalence is simply: In order to find the variance formula forp, it is easier to start withτ. It is necessary to first identify the sources of variability. In this two-stage cluster sampling process, we have between-and within-herd variances. The variance partition formula can thus decompose the total variance into two parts:  Var(τ i ). Since STRRS is implemented within each herd, Var(τ i ) can be easily obtained from Equation (11) or Equation (12) depending on the nature of y ijk . In our particular example, where y ijk takes a binary value (either 1 or 0), we have ij . Therefore, the general formula for the variance ofτ is given as: where , µ ij is the unknown mean of the jth stratum in the ith herd. When y ijk takes a binary value, the special form is given by applying the method introduced in the SRS section (see Equation (5)): Again, this variance depends on some unknown quantities which we have estimated. These estimates can then be used to replace these unknown quantities, as we have done previously. Thus, the estimated variance (general form) will be: where σ τ Note the difference between σ τ 2 and s 2 τ in the one-stage cluster sampling; s 2 ij = 1 m ij −1 ∑ m ij k=1 (y ijk − y ij ) 2 is the sample variance within the jth stratum in the ith herd, with y ij = 1 m ij ∑ m ij k=1 y ijk being the estimated sample mean. When y ijk takes a binary value, the special form is given as: Finally, the variance and estimated variance forp are found simply by multiplying the results of Equations (25) and (27) by a constant 1 M 2 . The same process can be applied to find the variance and estimated variance for y when y ijk is not limited to binary values. A numerical illustration example in this design would be tedious to present manually; we have therefore provided the Python code for computation (see the Supplementary Materials: Python code for the two-stage cluster sampling where stratification is implemented within the clusters).

Sample Size Consideration
Although the paper has focused principally on the estimation of outcomes of interest, sample size calculations are also a critical part of the study design process. For ready-to-use sample size calculation formulas, readers are directed to Stevenson 2021 [37]. However, for a complex survey where the formula needs to be derived on a case-by-case basis, it is of value to briefly introduce the principles behind the sample size calculation. The formulas for sample size calculations are closely related to the sampling distributions of the estimators. The investigator needs to come up with an expected value for the finite population characteristic of interest and then think about how precise the estimate needs to be. The narrower the sampling distribution of an estimator, the more precise the estimate needs to be (and thus the larger the sample size). Therefore, it is natural to think what the sampling distribution is and which quantity defines the spread of the distribution.
Here, we use the SRS as an example, as the SRS serves as the theoretical basis for other more complicated sampling methods. Suppose that in the SRS for estimating prevalence in a finite herd, the sampling distribution forp is approximately normal, with mean = p and variance = M−m M−1 p(1−p) m (see Equation (5) of this paper for the variance formula) [27]. Note that for sample size calculations as opposed to calculating the variance of an estimator from the empirical data, we use the theoretical variance formula instead of the estimated variance formula. Clearly, the variance determines the spread of a distribution and it is a function of the sample size m; thus, this is the equation we are targeting.
The investigator then needs to specify the expected prevalence p = p 0 and think about the quantiles (q α/2 and q 1−α/2 ) of this sampling distribution at the 1 − α confidence level (usually, we set 1 − α = 0.95). These quantiles can be interpreted as the farthest acceptable estimates from the expected prevalence, and q α/2 = −q 1−α/2 due to the symmetry of the normal density curve. This suggests that P(−q 1−α/2 ≤p ≤ q 1−α/2 ) = 1 − α. Let SE denote the square root of the variance in this specific example. Standardisation giveŝ p−p 0 SE ∼ N(0, 1); thus, ± q 1−α/2 −p 0 SE = ±z 1−α/2 , where ±z 1−α/2 are the quantiles of standard normal distribution that we know-for example, ±z 1−α/2 ≈ ±1.96 for 95% confidence. We thus need to find a sample size value which makes z 1−α/2 SE equal to the absolute difference (d) between the farthest acceptable estimate and the expected prevalence. Now, it is just a matter of solving this equation. With a little algebra, the sample size is:

Conclusions
This paper has provided a brief overview of the principles of probability sampling and the design-based approach for estimating finite population characteristics. In addition, we summarised the analytical methods for various commonly used sampling methods in detail. Instead of feeding the formulas to the readers, we have attempted to introduce and illustrate the ideas to help the readers understand, interpret, derive, and prove the unbiased estimators of the design-based samples and their corresponding variance formulas. We hope the ideas and methods presented in this paper can inspire the readers, so that the readers are encouraged to find the proper estimators and corresponding variances in their own sample surveys.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/vetsci8060105/s1, File S1: Python code for the two-stage cluster sampling where stratification is implemented within the clusters.  Here, we focus on deriving the variance formula for the within-herd component E[Var(τ|Z)]. We apply the conditional variance formula inside the expectation operator: E[Var(τ|Z)] = E E[τ 2 Z − E [τ|Z] 2 . We will work on the two terms inside the expectation operator, then apply the expectation. A gentle courtesy here is to explain the notations again. A Bernoulli random variable Z i ∼ Bern(π i ) represents whether the herd i is selected, and we have π i ≡ P(Z i = 1) = E[Z i ] = E Z 2 i = n N . Sampling within any herd is independent of the sampling in any other herd andτ i is independent of Z = (Z 1 , Z 2 , . . . , Z N ).
(τ i and Z are independent random variables, real valued functions f and g defined for τ i and Z are also independent random variables; this justifies E τ i 2 Z = E τ i 2 as f and g are a quadratic and an identity functions, respectively; more generallyτ i ,τ h and Z are independent, thus f (τ i ,τ h ) and g(Z) are independent; this justifies E[τ iτh |Z] = E[τ iτh ] as f (τ i ,τ h ) =τ iτh and g(Z) = Z) (sampling independently from the farms implies independence betweenτ i andτ h , for two independent random variables, Var(τ i ) (sampling independently from the farms).