Estimation of the Average Kappa Coefficient of a Binary Diagnostic Test in the Presence of Partial Verification

Abstract: The average kappa coefficient of a binary diagnostic test is a measure of the beyond-chance average agreement between the binary diagnostic test and the gold standard, and it depends on the sensitivity and specificity of the diagnostic test and on the disease prevalence. In this manuscript, the estimation of the average kappa coefficient of a diagnostic test in the presence of verification bias is studied. Confidence intervals for the average kappa coefficient are studied applying the methods of maximum likelihood and of multiple imputation by chained equations. Simulation experiments have been carried out to study the asymptotic behaviors of the proposed intervals, giving some application rules. The results obtained in our simulation experiments show that the multiple imputation by chained equations method provides better results than the maximum likelihood method. A function has been written in R to estimate the average kappa coefficient by applying multiple imputation. The results have been applied to the diagnosis of liver disease.


Introduction
A binary diagnostic test (BDT) is a medical test used to determine whether or not a patient has a certain disease. Scintigraphy for the diagnosis of liver disease is an example of a BDT. Sensitivity and specificity are the fundamental parameters to assess the effectiveness of a BDT. Sensitivity (Se) is the probability of a positive result for the BDT when the patient has the disease, and specificity (Sp) is the probability of a negative result for the BDT when the patient does not have the disease. When considering the losses associated with a misclassification by the BDT, the effectiveness of a BDT is measured with the weighted kappa coefficient [1,2], which depends on the Se and Sp of the BDT, on the disease prevalence and on a weighting index. The weighting index is a measure of the relative loss between the false positives and the false negatives; its value is set by the clinician, and it takes a value between 0.5 and 1 when the BDT is used as a screening test and a value between 0 and 0.5 when the BDT is used as a confirmatory test. Therefore, the investigator must assign a value to the weighting index according to the use of the BDT (screening test or confirmatory test). Roldán-Nofuentes and Olvera-Porcel [3] have defined a new measure to evaluate the effectiveness of a BDT based on the weighted kappa coefficient: the average kappa coefficient. The average kappa coefficient depends on the Se and Sp of the BDT and on the disease prevalence, but it does not depend on the weighting index; the average kappa coefficient therefore solves the problem of assigning a value to the weighting index.
In order to obtain unbiased estimators of the parameters of a BDT, it is necessary to know the disease status of each patient in a random sample. The medical test through which the disease status of a patient is known is called the gold standard (GS), and therefore the effectiveness of a BDT is assessed in relation to a GS. A biopsy for the diagnosis of liver disease is an example of a GS. The most common sampling design to evaluate the effectiveness of a BDT is cross-sectional sampling. This design consists of applying the BDT and the GS to all patients in a random sample. In this situation, the true disease state (disease present or disease absent) is known for all patients in the sample. In the cross-sectional sample there are no missing data, and therefore this design corresponds to a complete data situation.
In clinical practice, it is common that when evaluating a BDT, the GS is not applied to all patients in the sample, giving rise to a problem called partial verification of the disease [4]. If the GS is an expensive medical test or a medical test that involves risks for the patient, then the GS is not applied to all the patients in the sample. In this situation, if Se and Sp are estimated without considering the patients for whom the GS is unknown, the estimators are affected by so-called verification bias [4,5]. Begg and Greenes [4] deduced the maximum likelihood estimators of Se and Sp when the missing data mechanism is missing at random (MAR). The MAR assumption holds that the selection of a patient to verify their disease status with the GS depends only on the result of the BDT. Therefore, the true disease state (disease present or disease absent) is unknown for a subset of patients; the missing information is the true disease status for this subset of patients in the sample. Harel and Zhou [6] have studied the estimation of Se and Sp of a BDT through multiple imputation, assuming the MAR assumption, and they have shown through simulation experiments that multiple imputation provides better results than the method of Begg and Greenes [4]. A review of the impact of verification bias in estimating the accuracy of a BDT (and a continuous test) can be seen in Alonzo [7]. Roldán-Nofuentes and Luna [8] have studied the estimation of the weighted kappa coefficient in the presence of partial disease verification.
In this manuscript we study the estimation of the average kappa coefficient in the presence of verification bias. The manuscript is structured as follows: in Section 2, the weighted kappa coefficient and the average kappa coefficient of a BDT are presented. In Section 3 we study the estimation of the average kappa coefficient with complete data. In Section 4 we study the estimation of the average kappa coefficient when there are missing data, applying the maximum likelihood method and the multiple imputation by chained equations method. In Section 5, simulation experiments are carried out to study the asymptotic behaviors of the confidence intervals proposed in Section 4. In Section 6, we present a function written in R to estimate the average kappa coefficient in the presence of missing data. In Section 7, the results obtained are applied to an example on the diagnosis of liver disease, and in Section 8 the results are discussed.

Weighted Kappa Coefficient and Average Kappa Coefficient
Let us consider a BDT whose performance is assessed in relation to a GS. Let L be the loss that occurs when the BDT gives a negative result for a diseased patient, and let L' be the loss that occurs when the BDT gives a positive result for a non-diseased patient. Losses are assumed to be zero when the BDT correctly classifies a diseased patient or a non-diseased patient. Loss L is associated with a false negative and loss L' with a false positive. For example, let us consider the diagnosis of liver disease using scintigraphy as the diagnostic test. If the scintigraphy is positive for a non-diseased patient (false positive), the patient will undergo a biopsy which will finally be negative. Loss L' will be determined from the economic costs of the diagnosis, taking into account the risks, stress and anxiety caused for the patient. If the scintigraphy is negative for a diseased patient (false negative), the patient may be diagnosed later. In this situation, the disease can progress or get worse, decreasing the chance of successful treatment. Loss L will be determined from these considerations. Therefore, the losses L and L' are not only measured in economic terms but also in terms of risk, stress, anxiety, etc., and in practice it is not possible to determine their values. Next, we present the weighted kappa coefficient and the average kappa coefficient.

Weighted Kappa Coefficient
The weighted kappa coefficient κ(c) is a measure of the beyond-chance agreement between the BDT and the GS, and it is expressed [1,2] as

κ(c) = pqY / {cp(1 − Q) + (1 − c)qQ},

where p is the disease prevalence, q = 1 − p, Y = Se + Sp − 1 is the Youden index [9], Q = pSe + q(1 − Sp) = P(T = 1), and c = L/(L + L') is the weighting index. The weighted kappa coefficient can also be written as

κ(c) = κ(0)κ(1) / {cκ(0) + (1 − c)κ(1)}.

The value of the weighting index is set depending on the clinician's judgement about the false positives and the false negatives [1,2]. If the clinician is more concerned about false positives, as is the case in which the BDT is used as a confirmatory test prior to the application of a risky treatment (for example, a surgical operation), then L' > L and 0 ≤ c < 0.5. For example, if the clinician decides that a false positive is three times more important than a false negative, then L' = 3L and c = 1/(1 + 3) = 0.25. If the clinician is more concerned about false negatives, as is the case in which the BDT is used as a screening test, then L > L' and 0.5 < c ≤ 1. For example, if the clinician decides that a false negative is five times more important than a false positive, then L = 5L' and c = 5/(5 + 1) = 5/6. The value c = 0.5 is used for a simple diagnosis (false positives and false negatives have the same importance), κ(0.5) being the Cohen kappa coefficient.
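As an illustration, the expression of κ(c) can be evaluated directly. The following Python sketch (the paper's own software is written in R; the parameter values are made-up examples) computes the weighted kappa coefficient and checks that c = 0 and c = 1 recover κ(0) and κ(1):

```python
# Illustrative sketch (not the paper's software): weighted kappa coefficient
# kappa(c) = p*q*Y / {c*p*(1-Q) + (1-c)*q*Q}, with Y = Se + Sp - 1 and
# Q = p*Se + q*(1-Sp). All parameter values below are made-up examples.

def weighted_kappa(se, sp, p, c):
    """Weighted kappa coefficient of a BDT for weighting index c."""
    q = 1.0 - p
    y = se + sp - 1.0                  # Youden index
    big_q = p * se + q * (1.0 - sp)    # Q = P(T = 1)
    return p * q * y / (c * p * (1.0 - big_q) + (1.0 - c) * q * big_q)

# c = 0 and c = 1 recover kappa(0) = p*Y/Q and kappa(1) = q*Y/(1 - Q),
# which coincide with {Sp - (1 - Q)}/Q and (Se - Q)/(1 - Q).
se, sp, p = 0.90, 0.80, 0.30
q, y = 1 - p, se + sp - 1
Q = p * se + q * (1 - sp)
assert abs(weighted_kappa(se, sp, p, 0.0) - p * y / Q) < 1e-12
assert abs(weighted_kappa(se, sp, p, 1.0) - q * y / (1 - Q)) < 1e-12
```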
The weighted kappa coefficient can be classified in the following scale of values [10]: 0-0.20, the agreement is slight; 0.21-0.40, the agreement is fair; 0.41-0.60, the agreement is moderate; 0.61-0.80, the agreement is substantial; and 0.81-1, the agreement is almost perfect. Another scale based on levels of clinical significance is [11]: <0.40, poor; 0.40-0.59, fair; 0.60-0.74, good; and 0.75-1, excellent. The weighted kappa coefficient has the following properties: (a) if c = 0 then κ(0) = {Sp − (1 − Q)}/Q and if c = 1 then κ(1) = (Se − Q)/(1 − Q); (b) if Se = Sp = 1 then κ(c) = 1, and the agreement between BDT and GS is perfect; (c) if the sensitivity and the specificity are complementary (Se = 1 − Sp) then κ(c) = 0, and the BDT and the GS are independent (the BDT is random and therefore not informative); (d) the weighted kappa coefficient is a function of the index c, which is increasing if Q > p, decreasing if Q < p, or equal to the Youden index if Q = p.

Average Kappa Coefficient
From the weighted kappa coefficient, Roldán-Nofuentes and Olvera-Porcel [3] have defined a new measure to evaluate the performance of a BDT with respect to a GS: the average kappa coefficient. For fixed values of Se, Sp and p, the weighted kappa coefficient is a continuous function of the index c. If the clinician considers that L' > L, and therefore 0 ≤ c < 0.5, the average kappa coefficient is [3]

κ1 = 2 ∫ κ(c) dc, with the integral taken over 0 ≤ c < 0.5,

i.e., the average kappa coefficient (κ1) is the average value of κ(c) when 0 ≤ c < 0.5. If the clinician considers that L > L', and therefore 0.5 < c ≤ 1, the average kappa coefficient is [3]

κ2 = 2 ∫ κ(c) dc, with the integral taken over 0.5 < c ≤ 1,

i.e., the average kappa coefficient (κ2) is the average value of κ(c) when 0.5 < c ≤ 1. Solving these integrals, when p ≠ Q it holds that

κ1 = {2pqY/(p − Q)} ln{(p + Q − 2pQ)/(2qQ)} and κ2 = {2pqY/(p − Q)} ln{2p(1 − Q)/(p + Q − 2pQ)},

and when p = Q it holds that κ1 = κ2 = Y. As the weighted kappa coefficient is a measure of the beyond-chance agreement between the BDT and the GS, the average kappa coefficient is a measure of the beyond-chance average agreement between the BDT and the GS, and it does not depend on the weighting index c. The values of the average kappa coefficient can be classified on the same scales [10,11] as the values of the weighted kappa coefficient. The average kappa coefficients κ1 and κ2 have the following properties [3]: coefficient κ1 is greater than κ2 if p > Q, and κ1 is lower than κ2 if Q > p. Moreover, inverting the expression of κ(c), for a specific sample it is possible to calculate the value of the weighting index c associated with the estimated average kappa coefficient. Thus, the estimation of the average kappa coefficient allows us to estimate how much greater (or smaller) the loss due to the false negatives is than the loss due to the false positives.
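To make the definitions concrete, the following Python sketch (illustrative parameter values, assuming p ≠ Q) evaluates the closed-form expressions of κ1 and κ2 and checks κ1 against a direct numerical average of κ(c) over [0, 0.5):

```python
# Sketch: closed-form average kappa coefficients for p != Q, checked against
# a direct numerical average of kappa(c). Example values are made up.
import math

def kappa_c(se, sp, p, c):
    q, y = 1 - p, se + sp - 1
    Q = p * se + q * (1 - sp)
    return p * q * y / (c * p * (1 - Q) + (1 - c) * q * Q)

def average_kappas(se, sp, p):
    """Return (kappa1, kappa2); assumes p != Q."""
    q, y = 1 - p, se + sp - 1
    Q = p * se + q * (1 - sp)
    k1 = (2 * p * q * y / (p - Q)) * math.log((p + Q - 2 * p * Q) / (2 * q * Q))
    k2 = (2 * p * q * y / (p - Q)) * math.log(2 * p * (1 - Q) / (p + Q - 2 * p * Q))
    return k1, k2

k1, k2 = average_kappas(0.90, 0.80, 0.30)
# Numerical check: kappa1 is the average of kappa(c) on [0, 0.5) (midpoint rule)
num = sum(kappa_c(0.90, 0.80, 0.30, (i + 0.5) * 0.5 / 2000) for i in range(2000)) / 2000
assert abs(k1 - num) < 1e-6
```

Here Q = 0.41 > p = 0.30, so κ(c) is increasing in c and κ2 > κ1, in agreement with the properties stated above.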

Estimation with Complete Data
When the BDT and the GS are applied to all patients in a random sample sized n, the observed frequencies in Table 1 are obtained, where the variable T models the result of the BDT (T = 1 when the result is positive and T = 0 when it is negative) and the variable D models the result of the GS (D = 1 when the patient has the disease and D = 0 when the patient does not have the disease). In Table 1, each observed frequency xi (yi) is the number of diseased (non-diseased) patients with T = i, x = x1 + x0, y = y1 + y0, mi = xi + yi and n = x + y = m1 + m0, with i = 0, 1. In this situation, the disease status (disease present or disease absent) of all patients is verified by applying the GS, which corresponds to cross-sectional sampling.
In this situation, the maximum likelihood estimators (MLEs) of the sensitivity, the specificity and the prevalence are Ŝe = x1/x, Ŝp = y0/y and p̂ = x/n, so that the MLE of the weighted kappa coefficient [1,2] is obtained by substituting these estimators in the expression of κ(c), and the MLEs of κ(0) and κ(1) are

κ̂(0) = {Ŝp − (1 − Q̂)}/Q̂ and κ̂(1) = (Ŝe − Q̂)/(1 − Q̂),

where Q̂ = m1/n. Finally, the MLEs of the average kappa coefficients κ1 and κ2 are obtained [3] by substituting the parameters in the expressions of κ1 and κ2 with their MLEs. If x0 = y1 = 0, then κi cannot be estimated. If x1y0 = x0y1, then κ̂i = 0. If x1y0 < x0y1, or if x1 = 0 or y0 = 0, then Ŷ < 0 and it is necessary to interchange the results of the BDT (the positive result should be T = 0 and the negative result should be T = 1). A fundamental task in statistical inference is forming a confidence interval (CI) for an unknown parameter. In this context, Roldán-Nofuentes and Olvera-Porcel [3] have studied various CIs for κ1 and κ2. These CIs are approximate, and their asymptotic behaviors have been studied through simulation experiments. Following this work, the two confidence intervals for κ1 and κ2 studied by Roldán-Nofuentes and Olvera-Porcel (the Wald CI and the logit CI) are summarized, and a new CI (the arcsine CI) is also presented.
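The plug-in estimation with complete data can be sketched as follows (a Python sketch with a made-up 2 × 2 table; the special cases discussed above are handled explicitly):

```python
# Sketch of the plug-in (maximum likelihood) estimation with complete data,
# using a made-up 2x2 table: x1, x0 diseased and y1, y0 non-diseased patients
# with T = 1 and T = 0, respectively.
import math

def estimate_average_kappas(x1, x0, y1, y0):
    x, y = x1 + x0, y1 + y0
    n = x + y
    se, sp, p = x1 / x, y0 / y, x / n   # MLEs of Se, Sp and prevalence
    q, yind = 1 - p, se + sp - 1        # yind: estimated Youden index
    Q = (x1 + y1) / n                   # MLE of Q = P(T = 1)
    if yind <= 0:
        raise ValueError("interchange the results of the BDT (Youden index <= 0)")
    if abs(p - Q) < 1e-12:              # limiting case p = Q: kappa1 = kappa2 = Y
        return yind, yind
    k1 = (2 * p * q * yind / (p - Q)) * math.log((p + Q - 2 * p * Q) / (2 * q * Q))
    k2 = (2 * p * q * yind / (p - Q)) * math.log(2 * p * (1 - Q) / (p + Q - 2 * p * Q))
    return k1, k2

k1, k2 = estimate_average_kappas(x1=81, x0=9, y1=42, y0=168)
assert 0 < k1 < k2 < 1   # here Q > p, so kappa2 exceeds kappa1
```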

Wald CI
Based on the asymptotic normality of (κ̂i − κi)/√V̂ar(κ̂i), the 100(1 − α)% Wald CI for κi is

κi ∈ κ̂i ± z1−α/2 √V̂ar(κ̂i),

where z1−α/2 is the 100(1 − α/2)th percentile of the standard normal distribution. The expressions of the estimated variances are shown in Appendix A.

Logit CI
Based on the logit transformation of κ̂i, logit(κ̂i) = ln{κ̂i/(1 − κ̂i)}, the 100(1 − α)% CI for logit(κi) is logit(κ̂i) ± z1−α/2 √V̂ar{logit(κ̂i)}. Taking exponentials in this expression, the 100(1 − α)% logit CI for κi is obtained by transforming both limits back through the inverse logit, exp(·)/{1 + exp(·)}, where the estimated variance is obtained by applying the delta method, i.e., V̂ar{logit(κ̂i)} = V̂ar(κ̂i)/{κ̂i(1 − κ̂i)}².

Arcsine CI
The arcsine is a transformation that has been used to estimate parameters; see, for example, the work of Martín-Andrés et al. [12] on the estimation of a binomial proportion. A new CI for κi can be obtained by applying this transformation. Based on the asymptotic normality of sin⁻¹(√κ̂i), the 100(1 − α)% CI for sin⁻¹(√κi) is sin⁻¹(√κ̂i) ± z1−α/2 √V̂ar{sin⁻¹(√κ̂i)}, where the variance V̂ar{sin⁻¹(√κ̂i)} is easily obtained by applying the delta method, i.e., V̂ar{sin⁻¹(√κ̂i)} = V̂ar(κ̂i)/{4κ̂i(1 − κ̂i)}. Finally, undoing the transformation, the 100(1 − α)% arcsine CI for κi is

κi ∈ sin²{sin⁻¹(√κ̂i) ± z1−α/2 √V̂ar[sin⁻¹(√κ̂i)]}.
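The three interval constructions differ only in the transformation applied to κ̂i and in the delta-method variance. The following Python sketch builds the Wald, logit and arcsine intervals from a given estimate and estimated variance (both illustrative; the actual variance expressions are those of Appendix A):

```python
# Sketch of the Wald, logit and arcsine interval constructions for a given
# estimate k and estimated variance var (illustrative numbers, not the paper's).
import math

def wald_ci(k, var, z=1.959964):
    half = z * math.sqrt(var)
    return k - half, k + half

def logit_ci(k, var, z=1.959964):
    # delta method: Var(logit k) = Var(k) / {k (1 - k)}^2
    center = math.log(k / (1 - k))
    half = z * math.sqrt(var) / (k * (1 - k))
    lo, hi = center - half, center + half
    return math.exp(lo) / (1 + math.exp(lo)), math.exp(hi) / (1 + math.exp(hi))

def arcsine_ci(k, var, z=1.959964):
    # delta method: Var(arcsin sqrt(k)) = Var(k) / {4 k (1 - k)}
    center = math.asin(math.sqrt(k))
    half = z * math.sqrt(var / (4 * k * (1 - k)))
    return math.sin(center - half) ** 2, math.sin(center + half) ** 2

k_hat, var_hat = 0.57, 0.059 ** 2
for lo, hi in (wald_ci(k_hat, var_hat), logit_ci(k_hat, var_hat),
               arcsine_ci(k_hat, var_hat)):
    assert lo < k_hat < hi   # every interval covers the point estimate
```

Unlike the Wald interval, the logit and arcsine intervals always stay inside (0, 1).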

Estimation in the Presence of Partial Verification
The evaluation of a BDT in the presence of partial verification gives the frequencies in Table 2, where the variables T and D are the same as in Section 3, and the variable V models the verification process, i.e., V = 1 when the disease status of a patient is verified with the GS and V = 0 when it is not.

Table 2. Observed frequencies in the presence of partial verification.

Observed Frequencies of the 3 × 2 Table
Let λij be the probability of verifying with the GS the disease status of a patient with T = i and D = j, i.e., λij = P(V = 1 | T = i, D = j). Assuming that the missing data mechanism is missing at random (MAR) [13], then λij = λi = P(V = 1 | T = i). The MAR assumption means that the verification process depends only on the result of the BDT and not on the GS. This situation arises in two-phase studies: in the first phase, the BDT is applied to all patients in the sample; in the second phase, the GS is applied only to a subset of patients in the sample, depending only on the result of the BDT. Subject to the MAR assumption, the observed frequencies (s1, r1, u1, s0, r0, u0) are the product of a multinomial distribution whose probabilities are

P(s1) = λ1 pSe, P(r1) = λ1 q(1 − Sp), P(u1) = (1 − λ1)Q,
P(s0) = λ0 p(1 − Se), P(r0) = λ0 qSp, P(u0) = (1 − λ0)(1 − Q).

Next, the estimation of the average kappa coefficient applying the maximum likelihood (ML) method and applying multiple imputation (MI) is studied.
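The six cell probabilities under MAR can be checked to form a valid multinomial distribution. A short Python sketch with illustrative parameter values:

```python
# Sketch: the six cell probabilities of the 3x2 table under MAR.
# P(s1)=lam1*p*Se, P(r1)=lam1*q*(1-Sp), P(u1)=(1-lam1)*Q,
# P(s0)=lam0*p*(1-Se), P(r0)=lam0*q*Sp, P(u0)=(1-lam0)*(1-Q).
# Parameter values are illustrative.

def cell_probabilities(se, sp, p, lam1, lam0):
    q = 1 - p
    Q = p * se + q * (1 - sp)   # Q = P(T = 1)
    return {
        "s1": lam1 * p * se, "r1": lam1 * q * (1 - sp), "u1": (1 - lam1) * Q,
        "s0": lam0 * p * (1 - se), "r0": lam0 * q * sp, "u0": (1 - lam0) * (1 - Q),
    }

probs = cell_probabilities(se=0.90, sp=0.80, p=0.30, lam1=0.95, lam0=0.40)
assert abs(sum(probs.values()) - 1.0) < 1e-12   # a valid multinomial distribution
```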

Maximum Likelihood
Assuming that the missing data mechanism is MAR, the MLEs of the sensitivity, the specificity and the prevalence in the presence of partial verification are [4,5]

Ŝe_pv = {n1s1/(s1 + r1)} / {n1s1/(s1 + r1) + n0s0/(s0 + r0)},
Ŝp_pv = {n0r0/(s0 + r0)} / {n1r1/(s1 + r1) + n0r0/(s0 + r0)},
p̂_pv = {n1s1/(s1 + r1) + n0s0/(s0 + r0)}/n,

where ni = si + ri + ui and n = n1 + n0. Substituting the parameters in the expressions of κ1 and κ2 with their MLEs in the presence of partial verification, the MLEs of κ1 and κ2 in the presence of partial verification, κ̂_1pv and κ̂_2pv, are obtained; when p̂_pv = Q̂_pv, where Q̂_pv = n1/n, it holds that κ̂_1pv = κ̂_2pv = Ŷ_pv. The expressions of the estimators κ̂_1pv and κ̂_2pv are long and complicated when p̂_pv ≠ Q̂_pv, so statistical software is necessary to calculate them (see Section 6). Next, three asymptotic CIs for κi in the presence of partial verification are proposed.
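These estimators, due to Begg and Greenes [4], can be sketched as follows (a Python sketch; the frequencies are made up for illustration):

```python
# Sketch of the Begg-Greenes estimators under MAR (frequencies are made up):
# s_i, r_i = verified diseased / non-diseased with T = i; u_i = unverified with T = i.

def begg_greenes(s1, r1, u1, s0, r0, u0):
    n1, n0 = s1 + r1 + u1, s0 + r0 + u0
    n = n1 + n0
    d1 = n1 * s1 / (s1 + r1)   # estimated diseased among all patients with T = 1
    d0 = n0 * s0 / (s0 + r0)   # estimated diseased among all patients with T = 0
    se = d1 / (d1 + d0)
    sp = (n0 - d0) / ((n1 - d1) + (n0 - d0))
    p = (d1 + d0) / n
    return se, sp, p

se, sp, p = begg_greenes(s1=60, r1=20, u1=20, s0=10, r0=90, u0=100)
assert 0 < se < 1 and 0 < sp < 1 and 0 < p < 1
```

The intermediate quantities d1 and d0 "scale up" the verified diseased counts to the whole sample within each test-result group, which is where the MAR assumption enters.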

Wald CI
Based on the asymptotic normality of (κ̂_ipv − κi)/√V̂ar(κ̂_ipv), the 100(1 − α)% Wald CI for κi is κi ∈ κ̂_ipv ± z1−α/2 √V̂ar(κ̂_ipv). The expressions of the estimated variances are shown in Appendix B. These expressions are long and complicated, so it is necessary to use a statistical program to calculate them (see Section 6).

Logit CI
The logit CI is based on the asymptotic normality of {logit(κ̂_ipv) − logit(κi)}/√V̂ar{logit(κ̂_ipv)}. The logit CI for κi has a general expression similar to that obtained in Section 3.2, although the expressions of the estimators and of the variances are different. The expressions of the variances are shown in Appendix B, and it is necessary to use a statistical program to calculate them.

Arcsine CI
The arcsine CI is also based on the asymptotic normality of {sin⁻¹(√κ̂_ipv) − sin⁻¹(√κi)}/√V̂ar{sin⁻¹(√κ̂_ipv)}, and its general expression is similar to that given in Section 3.3, where the variances are shown in Appendix B.

Multiple Imputation
Multiple imputation (MI) [14–17] is a computational method used to solve estimation problems with missing data. MI consists of constructing M complete data sets, obtained by replacing the missing data with M independent sets of imputed values. In each complete data set, the estimators of the parameters and their standard errors are calculated, and these are combined appropriately to calculate the global estimators, their standard errors and their confidence intervals. Harel and Zhou [6] have applied MI to estimate the sensitivity (specificity) of a BDT in the presence of partial verification and have shown that this method provides CIs with better asymptotic behavior than the CIs obtained by applying the ML method. Montero-Alonso and Roldán-Nofuentes [18] have studied the estimation of the likelihood ratios of two BDTs in the presence of partial verification using the MI by chained equations (MICE) method and have also shown that this method provides CIs with better asymptotic behavior than the ML method.
In our context, from the 3 × 2 table given in Table 2, M 2 × 2 tables (as in Table 1) are imputed, and from each one of these M tables the estimator of κi, its standard error and the CIs given in Section 3 are calculated. The M results are then combined by applying the Rubin rules [14] and, in this way, the CI for κi is calculated. Regarding the imputation of the missing data, the MICE method was used. The MICE method requires the MAR assumption and can be used with different types of variables. In the problem posed in this article there are two binary random variables: the variable T and the variable D. The work by White et al. [19] explains in detail the imputation of binary variables using the MICE method. For the variable T there are no missing data, since the BDT is applied to all patients. However, the variable D is not observed in all patients, and therefore this variable has missing data. Firstly, all missing values are filled in at random. The variable D is then regressed on the variable T through a logistic regression, the estimation being restricted to the individuals with observed D. The missing values of D are then replaced by simulated draws from the posterior predictive distribution of the variable D. This process is called a cycle and, in order to stabilize the results, it is repeated for a determined number of cycles to obtain a set of imputed data. Applying multiple imputation, the estimator of κi is the mean of the estimators obtained in the M complete data sets, and its standard error is calculated by applying the Rubin rules [14]. In the situation studied in this article, the application of MICE requires that si > 0 and ri > 0.
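The imputation step can be illustrated with a deliberately simplified sketch. This is not the "mice" algorithm itself: because T is binary, the logistic regression of D on T reduces to one success probability per group T = i, which is drawn here from a Beta posterior before imputing the unverified patients (Python; the frequencies, M and the pooled estimand, here the prevalence, are all illustrative):

```python
# Simplified sketch of proper multiple imputation for a binary D (NOT the
# "mice" R package): per T group, draw P(D=1|T=i) from a Beta posterior,
# impute D for the unverified patients, estimate on each completed table,
# and pool the point estimates (Rubin's rules). Frequencies are made up.
import random

def impute_and_estimate(s1, r1, u1, s0, r0, u0, m=20, seed=7):
    rng = random.Random(seed)
    estimates = []
    for _ in range(m):
        # parameter draws from Beta(s_i + 1, r_i + 1) posteriors (uniform priors)
        p1 = rng.betavariate(s1 + 1, r1 + 1)
        p0 = rng.betavariate(s0 + 1, r0 + 1)
        # impute D for the u_i unverified patients in each T group
        d1 = sum(rng.random() < p1 for _ in range(u1))
        d0 = sum(rng.random() < p0 for _ in range(u0))
        x1, y1 = s1 + d1, r1 + u1 - d1   # completed 2x2 table
        x0, y0 = s0 + d0, r0 + u0 - d0
        n = x1 + y1 + x0 + y0
        # complete-data estimator in this imputed table (prevalence, for brevity;
        # the paper computes the estimator of kappa_i here instead)
        estimates.append((x1 + x0) / n)
    return sum(estimates) / m            # pooled point estimate

p_hat = impute_and_estimate(s1=60, r1=20, u1=20, s0=10, r0=90, u0=100)
assert 0 < p_hat < 1
```

In the full method, the within- and between-imputation variances would also be combined via Rubin's rules to obtain the pooled standard error.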

Simulation Experiments
Monte Carlo simulation experiments have been carried out to study the asymptotic behavior (coverage probability and average length) of the CIs studied in Section 4. The relative biases of the estimators of the average kappa coefficients obtained through ML and through MI have also been studied. These experiments consisted of the generation of 10,000 random samples of multinomial distributions sized n = {50, 100, 200, 500, 1000}, whose probabilities have been calculated from the expressions given in Section 4. These probabilities have been calculated in the following way. With respect to the verification probabilities, we have taken two sets of values, (λ1 = 0.70, λ0 = 0.25) and (λ1 = 0.95, λ0 = 0.40), which can be considered low and high verification probability values. As values of the disease prevalence we took p = {10%, 30%, 50%, 70%}, and as values of κ1 and κ2 we took {0.20, 0.40, 0.60, 0.80}. Once the values of κ1 and κ2 were set, the values of κ(0) and κ(1) were obtained by solving, with the Newton-Raphson method, the system formed by the expressions of κ1 and κ2 in terms of κ(0) and κ(1), only considering those solutions between 0 and 1. Once the values of κ(0) and κ(1) were obtained, as the value of the prevalence p had been set previously, the values of Se and Sp were calculated by solving the system formed by the equations κ(0) = pY/Q and κ(1) = qY/(1 − Q), and then the probabilities of the multinomial distributions were calculated. Therefore, the samples have been generated by fixing κ1 and κ2. The random samples have been generated in such a way that κ1 and κ2 and their standard errors can be estimated in all of them, also verifying that κ̂i > 0 (so that all the CIs can be calculated). For example, if, in a sample, a frequency si or ri is equal to 0, then MICE cannot be applied; in this situation, the sample has been ruled out and another one has been generated instead, until obtaining 10,000 samples.
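The recovery of the parameters from fixed values of κ1 and κ2 can be sketched with a two-dimensional Newton-Raphson iteration. In this Python sketch (illustrative values; for simplicity it solves directly for Se and Sp given p, rather than passing through κ(0) and κ(1)), the targets are generated from known Se and Sp and then recovered:

```python
# Sketch of the simulation set-up step: given target values of kappa1, kappa2
# and the prevalence p, recover Se and Sp with a 2D Newton-Raphson iteration
# (numerical Jacobian). The targets below are generated from Se=0.9, Sp=0.8.
import math

def kappas(se, sp, p):
    q, y = 1 - p, se + sp - 1
    Q = p * se + q * (1 - sp)
    k1 = (2 * p * q * y / (p - Q)) * math.log((p + Q - 2 * p * Q) / (2 * q * Q))
    k2 = (2 * p * q * y / (p - Q)) * math.log(2 * p * (1 - Q) / (p + Q - 2 * p * Q))
    return k1, k2

def solve_se_sp(k1_target, k2_target, p, se=0.85, sp=0.75, h=1e-6):
    for _ in range(100):
        f1, f2 = kappas(se, sp, p)
        r1, r2 = f1 - k1_target, f2 - k2_target
        if abs(r1) + abs(r2) < 1e-10:
            break
        a1, a2 = kappas(se + h, sp, p)       # numerical Jacobian, column 1
        b1, b2 = kappas(se, sp + h, p)       # numerical Jacobian, column 2
        j11, j21 = (a1 - f1) / h, (a2 - f1 + f1 - f2 - (f1 - f2)) / h if False else (a2 - f2) / h
        j12, j22 = (b1 - f1) / h, (b2 - f2) / h
        det = j11 * j22 - j12 * j21
        se -= (r1 * j22 - r2 * j12) / det    # Cramer's rule for the Newton step
        sp -= (j11 * r2 - j21 * r1) / det
    return se, sp

k1, k2 = kappas(0.90, 0.80, 0.30)
se, sp = solve_se_sp(k1, k2, 0.30)
assert abs(se - 0.90) < 1e-6 and abs(sp - 0.80) < 1e-6
```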
The simulation experiments have been carried out using the R program [20] and the "mice" library [21]. Regarding MICE, it has been carried out using M = 20 data sets and performing 100 cycles. The M = 20 complete data sets are generated in such a way that κ1 and κ2 (and their standard errors) can be estimated in all of them; thus, for example, if, in a complete data set, κ̂i < 0, then that complete data set is discarded and another is generated in its place, and so on until obtaining 20 complete data sets. In a first phase of these experiments, we considered M = 20 and M = 50 complete data sets, and also 100 and 200 cycles in each case, obtaining very similar results. Therefore, we have considered M = 20 and 100 cycles to save computation time and stabilize the results. In each sample generated, we have calculated the three CIs (95% confidence) given in Section 4 applying the ML method, and the three CIs given in Section 3 combined through the MICE method. From the results of these experiments we reach the following conclusions: (a) With respect to ML, the verification probabilities do not have a clear effect on the coverage probabilities (CPs) of the CIs. With respect to the CIs, in general terms their CPs far exceed 95% when the sample size is small (n = 50) or moderate (n = 100-200), fluctuating around 95% when the sample size is large (n = 500-1000). The Wald CI has a CP that fluctuates around 95% when the sample size is moderate or large. The logit CI has a higher CP than that of the Wald CI, especially when the sample size is small or moderate. The arcsine CI can have a CP of less than 90% when the sample size is small and κ1 is small (κ1 = 0.2), and it fluctuates around 95% when the sample size is large.
In general terms, the Wald CI is the interval with the best performance when the sample size is small or moderate, while the three CIs have a very similar asymptotic behavior when the sample size is large. (b) With respect to MICE, the verification probabilities do not have a clear effect on the CPs of the CIs. The Wald CI has a coverage probability that exceeds 95% when the sample size is small or moderate and the value of κ1 is small (κ1 = 0.2), fluctuating around 95% in the other situations and sample sizes. The logit CI has a CP that is slightly higher than that of the Wald CI, especially when the sample size is small or moderate. The arcsine CI has a CP closer to 95% when the sample size is small, and for the other sample sizes its CP is slightly higher than that of the Wald CI.

Function Eakcpv
We have written a function in R [20], called "eakcpv" (Estimation of the Average Kappa Coefficient in the presence of Partial Verification), to estimate the average kappa coefficient of a BDT in the presence of partial disease verification. The command to run the "eakcpv" function is "eakcpv(s1, r1, u1, s0, r0, u0, conf, imp, cycl)", where (s1, r1, u1, s0, r0, u0) are the observed frequencies, "conf" is the confidence level, "imp" is the number of complete data sets and "cycl" is the number of cycles. The complete data sets are generated in such a way that κ1 and κ2 (and their standard errors) can be estimated in all of them. Thus, for example, if, in a complete data set, κ̂i < 0, then that complete data set is discarded and another is generated in its place, and so on until obtaining "imp" complete data sets. The function always checks that the values are valid and that the analysis can be performed (e.g., that no frequency si or ri is equal to 0, etc.). The function estimates κ1 and κ2 applying MICE, along with the Wald and arcsine CIs. The function estimates the relative loss between the false positives and the false negatives, and it also estimates how much greater (or smaller) the loss associated with a false positive is than the loss associated with a false negative. The results obtained are recorded in a file called "results_eakcpv.txt" in the same folder from which the function is run. The function "eakcpv" is available as Supplementary Materials of this manuscript.

Example
The results obtained have been applied to the study of Drum and Christacopoulos [22] on the diagnosis of liver disease, in which a hepatic scintigraphy was used as the BDT and a biopsy as the GS. In Table 7 we show the observed frequencies, where the variable T models the result of the hepatic scintigraphy, the variable V models the verification process and the variable D models the result of the biopsy.

Table 7. Diagnosis of liver disease.

From these frequencies, it is obtained that κ̂pv(0) = 0.597 and κ̂pv(1) = 0.507. With respect to κ1, it is obtained that κ̂1mice = 0.572, its standard error is 0.059 and the 95% Wald CI for κ1 is (0.452, 0.691). The estimated relative loss between the false positives and the false negatives is ĉ = 0.252. As c = L/(L + L') = (L/L')/{1 + (L/L')}, it is possible to calculate which loss (L or L') is greater: the loss associated with the false positives (L') is 2.97 times greater than the loss associated with the false negatives (L). With respect to κ2, it is obtained that κ̂2mice = 0.526, its standard error is 0.066 and the 95% Wald CI for κ2 is (0.393, 0.660). The estimated relative loss between the false positives and the false negatives is ĉ = 0.752, so the loss associated with the false negatives (L) is 3.03 times greater than the loss associated with the false positives (L'). When the hepatic scintigraphy is to be used as a confirmatory test prior to a risky treatment (L' > L and 0 ≤ c < 0.5), the beyond-chance average agreement between the hepatic scintigraphy and the biopsy is moderate (κ̂1mice = 0.572) and, in terms of the Wald CI, it is a value between moderate and substantial (95% confidence).
Therefore, if the clinician considers that L' > L, then the beyond-chance average agreement between the hepatic scintigraphy and the biopsy is moderate (κ̂1mice = 0.572), and the loss that occurs when erroneously classifying a non-diseased patient with the hepatic scintigraphy is 2.97 times greater than the loss that occurs when erroneously classifying a diseased patient with the hepatic scintigraphy.

Observed Frequencies of the Study of Drum and Christacopoulos
When the hepatic scintigraphy is to be used as a screening test (L > L' and 0.5 < c ≤ 1), the beyond-chance average agreement between the hepatic scintigraphy and the biopsy is moderate (κ̂2mice = 0.526). In terms of the Wald CI, the beyond-chance average agreement between the hepatic scintigraphy and the biopsy is a value between fair and substantial (95% confidence). The estimated relative loss between the false positives and the false negatives is 0.752, so that the loss associated with the false negatives (L) is 3.03 times greater than the loss associated with the false positives (L'). Therefore, if the clinician considers that L > L', then the loss that occurs when erroneously classifying a diseased patient with the hepatic scintigraphy is 3.03 times greater than the loss that occurs when erroneously classifying a non-diseased patient with the hepatic scintigraphy.

Discussion
The average kappa coefficient is a measure of the beyond-chance average agreement between the BDT and the GS, and it depends only on the Se and Sp of the BDT and on the disease prevalence. The average kappa coefficient solves the problem of assigning values to the weighting index of the weighted kappa coefficient. In this manuscript, we study the estimation of the average kappa coefficient when the gold standard is not applied to all patients in a sample, a situation known as partial verification of the disease. The estimation of the average kappa coefficient has been carried out by applying two methods: the maximum likelihood method and the MICE method. Both methods require the verification process to be MAR, i.e., the verification process must not depend on the disease status.
We have carried out simulation experiments to study the asymptotic behavior of the proposed CIs, both applying the maximum likelihood method and applying MICE. The relative biases of the two estimators (maximum likelihood and MICE) of the average kappa coefficient have also been calculated. The MICE method along with the arcsine CI has been shown to have the best coverage probability when the sample size is small, while the MICE method along with the Wald CI has been shown to have the best coverage probability when the sample size is moderate or large. Regarding the relative biases, the difference between the relative biases of both types of estimators is small, so that both methods give rise to estimators that, on average, are very similar. Therefore, we recommend using MICE instead of the maximum likelihood method.
As in other studies [6,17], multiple imputation has proven to be a good method (and better than the maximum likelihood method) to estimate parameters of a binary diagnostic test in the presence of partial verification of the disease. In the situation studied here, the application of MICE has been carried out by generating 20 data sets. Rubin [14] recommended imputing five complete data sets in order to be able to apply multiple imputation. As our simulations have given stable values with 20 and 50 data sets, we decided, finally, to use 20.
The MICE method requires the missing data to be MAR, so if the verification process depends on the disease status, then the MAR assumption does not hold and MICE cannot be applied. Therefore, it is necessary to study other methods of estimating the average kappa coefficient when the MAR assumption does not hold. The application of the method used by Kosinski and Barnhart [23] may be a solution to this problem. Future research should also focus on estimating the average kappa coefficient when covariates are observed in all patients in the sample.
Finally, we have written a function in R to estimate the average kappa coefficient in the situation studied in this manuscript, applying MICE. The function is available as Supplementary Materials.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/math9141694/s1. The function "eakcpv" is a function written in R that allows estimating the average kappa coefficient by applying the MICE method.

Appendix A
Roldán-Nofuentes and Olvera-Porcel [3] have deduced (applying the delta method) the expressions of the estimated variances of the estimators of the average kappa coefficients when the BDT and GS are applied to all patients in a sample.
The variance V̂ar(p̂) and the covariances Ĉov(p̂, Ŝe) and Ĉov(p̂, Ŝp) are obtained by applying the delta method.

Appendix B

Let τ be the positive predictive value of the BDT, let υ be the negative predictive value of the BDT, let Q be the probability of a positive result of the BDT, and let ψ = (τ, υ, Q)ᵀ. Applying the delta method, the variance-covariance matrix Σψ of the estimator of ψ is obtained [25]. The MLEs of the predictive values in the presence of partial verification are [26] τ̂pv = s1/(s1 + r1) and υ̂pv = r0/(s0 + r0), and the MLE of Q is Q̂pv = n1/n. Therefore, in the presence of partial verification, the estimators of the predictive values coincide with the naïve estimators (those obtained disregarding the unverified patients) when the MAR assumption holds [26]. Let θ = (Se, Sp, p)ᵀ be the vector whose components are the sensitivity, the specificity and the prevalence. The sensitivity, the specificity and the prevalence can be written in terms of the predictive values and of Q as

p = τQ + (1 − υ)(1 − Q), Se = τQ/p and Sp = υ(1 − Q)/(1 − p).