Homogeneity Test of the First-Order Agreement Coefficient in a Stratified Design

Gwet's first-order agreement coefficient (AC1) is widely used to assess the agreement between raters. This paper proposes several asymptotic statistics for a homogeneity test of stratified AC1 in large samples. These statistics may perform unsatisfactorily, especially for small samples and high values of AC1. We therefore also propose three exact methods for small samples. Based on the numerical results, the likelihood ratio statistic is recommended for large samples. The exact E approaches based on the likelihood ratio and score statistics are more robust in small samples. Moreover, the exact E method remains effective for high values of AC1. We use two real examples to illustrate the proposed methods.


Introduction
In the medical field, it is necessary to judge the accuracy and interchangeability of different diagnostic methods. Inter-rater agreement is widely used to quantify the closeness of the ratings that two raters give to the same subjects. Before an efficient and economical method is recommended, a high degree of agreement between its results and those of the gold-standard method should be established. A simple example is that independent raters A and B assess each subject with a binary outcome (e.g., +/− or Yes/No). Let $n_{ij}$ ($i, j = 1, 2$) be the numbers of independent subjects judged by the two raters as (+, +), (−, +), (+, −), and (−, −), and let $P_{ij}$ ($i, j = 1, 2$) be the corresponding probabilities. Denote $n = n_{1+} + n_{2+} = n_{+1} + n_{+2} = \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij}$. The data can be arranged into a 2 × 2 table (Table 1).
Researchers have developed several indices to measure the degree of agreement between raters on a nominal scale, where the unordered categories are independent, mutually exclusive, and exhaustive. Denote $P_i$ = Pr(a subject is classified as "+" by rater $i$), $i = A, B$, and call $P_i$ the marginal probability. Cohen [1] argued that the $\chi^2$ test is indefensible as a measure of agreement because its null hypothesis is independence, not agreement, and he presented the kappa coefficient to quantify the extent of agreement between raters. For the problem of nominal scale agreement between raters A and B in Table 1, there are only two relevant quantities: the overall agreement probability $p_a$ and the chance-agreement probability $p_c$. Cohen's kappa coefficient is defined by $\kappa = (p_a - p_c)/(1 - p_c)$, where $p_a = P_{11} + P_{22}$ and $p_c = P_A P_B + (1 - P_A)(1 - P_B)$. That is to say, the coefficient $\kappa$ is the … $G$ = {both raters agree}, $R$ = {at least one rater performs random rating}.
Thus, the probability of agreement expected by chance can be defined by $p_e = P(G \cap R) = P(R)\,P(G \mid R)$. Generally, a random rating classifies an individual into either category with the same probability of 1/2. Since agreement may occur in either category, we have $P(G \mid R) = 2 \times (1/2)^2 = 1/2$. The probability of random rating $P(R)$ is approximated by a normalized measure of randomness,

$$\Psi = 4\pi_+(1 - \pi_+),$$

where $\pi_+$ represents the probability that a random rater classifies a randomly chosen individual into the "+" category. That is to say, $p_e$ can be quantified by $p_e^* = P(G \mid R)\,\Psi = 2\pi_+(1 - \pi_+)$. Then, the AC1 coefficient can be expressed as $\gamma = (p_a - p_e^*)/(1 - p_e^*)$, where $p_a$ denotes the agreement probability. In the above example, $\pi_+ = (123/125 + 120/125)/2 = 0.9720$ and $p_e^* = 0.0544$. By the definition of AC1, we have $\gamma = 0.9408$. Thus, the AC1 coefficient is more consistent with the observed extent of agreement than Cohen's $\kappa$ and Scott's $\pi$ coefficients. Agreement coefficients have been studied extensively [9–11].
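The arithmetic of this example can be checked with a short sketch. The function name `ac1_two_raters` is hypothetical, and the cell counts (118, 5, 2, 0) are an assumption: they are chosen only so that the marginal proportions equal 123/125 and 120/125 and the resulting coefficient matches the reported value; the text above does not list the cells.

```python
def ac1_two_raters(n11, n12, n21, n22):
    """AC1 for two raters and a binary outcome (illustrative sketch).

    n11 counts subjects rated (+, +) and n22 counts (-, -);
    n12 and n21 are the two kinds of disagreement.
    """
    n = n11 + n12 + n21 + n22
    p_a = (n11 + n22) / n                  # observed agreement probability
    plus_first = (n11 + n12) / n           # "+" proportion of one rater
    plus_second = (n11 + n21) / n          # "+" proportion of the other rater
    pi_plus = (plus_first + plus_second) / 2
    p_e = 2 * pi_plus * (1 - pi_plus)      # chance agreement p*_e
    gamma = (p_a - p_e) / (1 - p_e)        # AC1
    return pi_plus, p_e, gamma


# Assumed cell counts with marginals 123/125 and 120/125.
pi_plus, p_e, gamma = ac1_two_raters(118, 5, 2, 0)
print(round(pi_plus, 4), round(p_e, 4), round(gamma, 4))
```

Because $\pi_+$ averages the two marginal proportions, the result does not depend on which rater is assigned to which margin.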
As with Scott's $\pi$ coefficient, Ohyama [12] assumed that the two raters have a common marginal probability, that is, $P_A = P_B \triangleq \pi_+$. Thus, $P_{12} = P_{21}$, and Table 1 can be simplified as Table 2. Define

$$X_{ij} = \begin{cases} 1, & \text{if rater } i \text{ classifies subject } j \text{ into category } +, \\ 0, & \text{otherwise,} \end{cases}$$

for $i = A, B$ and $j = 1, 2, \ldots, n$. Suppose that the underlying probability of classifying a subject depends not on the raters but on the subjects, that is, $P(X_{ij} = 1 \mid j) = p_j$. We can obtain the overall agreement probability $p_a$ based on the idea of Vanbelle and Albert [13]. Agreement can occur as "(+, +)" or "(−, −)", and the corresponding probabilities for the $j$th subject are $P_{11j} = p_j^2$ and $P_{22j} = (1 - p_j)^2$, respectively. Thus, the agreement probability of the two raters for the $j$th subject is $p_{aj} = p_j^2 + (1 - p_j)^2$. We denote the mean of the positive classification probability as $E(p_j) = \sum_{j=1}^{n} p_j / n \triangleq \pi_+$ and the corresponding variance as $\mathrm{Var}(p_j) = \sum_{j=1}^{n} (p_j - \pi_+)^2 / n \triangleq \sigma^2$, where $n$ is the size of the population. Then, the probabilities of "(+, +)" and "(−, −)" ratings over the population are $P_{11} = E(p_j^2) = \mathrm{Var}(p_j) + (E(p_j))^2 = \sigma^2 + \pi_+^2$ and $P_{22} = E((1 - p_j)^2) = 1 - 2\pi_+ + \sigma^2 + \pi_+^2$. Finally, the agreement probability over the population is $p_a = P_{11} + P_{22} = 1 + 2(\sigma^2 - \pi_+(1 - \pi_+))$. Substituting $p_a$ and $p_e^* = 2\pi_+(1 - \pi_+)$ into $\gamma = (p_a - p_e^*)/(1 - p_e^*)$, the AC1 coefficient for a binary outcome judged by two raters is rewritten as

$$\gamma = 1 + \frac{2\left(\sigma^2 - \pi_+(1 - \pi_+)\right)}{1 - 2\pi_+(1 - \pi_+)}.$$

Up until now, the application [14,15] and the statistical inference [12] of AC1 have concentrated on the situation without stratification. However, ignoring confounding variables or covariates may lead to a biased conclusion. Researchers often stratify the data into multiple strata to control the influence of these factors. A stratified analysis is applied to evaluate the relationship between agreement and the nontreatment factors of a clinical trial (age, gender, severity of disease, etc.). A test of homogeneity is the first step of the stratified analysis.
When the homogeneity hypothesis is rejected, it is essential to analyze the factors that lead to heterogeneity. Suppose $K$ levels of the subject covariates are introduced into Table 2 for two raters with a binary outcome, so that the data can be arranged in a 3 × $K$ table of observed cell counts. Generally speaking, a sample can be classified as large or small according to its size. Hannah et al. [16] analyzed data on the alcohol-drinking status of twins. A subject is categorised as a nondrinker if he or she consumes less than 30 g of alcohol per week, and as a drinker otherwise. Thus, the binary outcome is the drinking status (drinker or nondrinker). Same-sex twins are stratified by zygosity into monozygotic (MZ) and dizygotic (DZ) pairs. Nam [17] used the kappa index to investigate the agreement of alcohol-drinking status between twins. The data structure of male twins is shown in Table 3. Large-sample inference for this type of data, including score, likelihood ratio, and Wald-type statistics, has been developed [18]. Honda and Ohyama [19] proposed score and goodness-of-fit tests for the homogeneity of stratified AC1. Unfortunately, both tests performed poorly because of conservative or liberal type I error rates, especially for small sample sizes. Meanwhile, a high AC1 may lead to conservative type I error rates for small and moderate sample sizes.

Table 2.
Category   Ratings               Frequency              Probability
1          (+, +)                $n_{11}$               $P_{11}$
2          (+, −) or (−, +)      $n_{12} + n_{21}$      $2P_{12}$
3          (−, −)                $n_{22}$               $P_{22}$
Total                            $n$                    1

Table 3. Agreement of alcohol drinking status between male twins stratified by zygosity.
Drinkers   MZ    DZ
Both       19     8
One        14    16
Neither    19     7
Total      52    31
In practice, we often encounter small samples of agreement data, for example, in a clinical trial on coronavirus disease 2019 (COVID-19) [20]. In this trial, the enzyme-linked immunosorbent assay (ELISA) and the gold-standard method are used to detect the novel coronavirus IgG and IgM antibodies, classifying each sample as either positive (+) or negative (−). The ELISA criterion for a positive result is that the sample's optical density (OD) value is greater than or equal to the critical value. The positive criterion of the gold-standard method is the appearance of two colored bands. Table 4 lists the data stratified by the IgG and IgM antibodies ($K = 2$), with 17 patients in each group. As in Table 3, the "One" entry is the number of "(+, −)" and "(−, +)" ratings.

Table 4. Agreement between ELISA and gold-standard methods stratified by antibody type.

Unfortunately, asymptotic test statistics are unreliable for small samples. Exact approaches, such as Fisher's exact test [21–23] and its extensions [24–26], are effective in this setting. The conservativeness of Fisher's exact method motivated the development of other exact approaches. We note that there are nuisance parameters in the model of the AC1 coefficient. Significant progress has been made over the past decades in the elimination of nuisance parameters [27–31]. By fixing the marginal totals in the contingency table, Mehta [27] extensively used the conditional test (referred to as the C approach) to analyze various classical categorical data. Liddell [28] derived a test based on the exact distribution of the difference in sample proportions. As an alternative, Storer and Kim [29] modified Liddell's exact test; their method is abbreviated as the E approach. Basu [30] provided a new procedure that maximizes the tail probability over the whole range of the parameters, called the M approach. Finding the global maximum is a challenge when the parameter space is not finite. Lloyd [31] pointed out the weakness of the M approach and suggested the so-called E+M approach, which defines the tail area with the E approach and then maximizes the tail probability over the parameter space. Generally, the E, M, and E+M approaches are called unconditional tests. Tang et al. [32] showed that the exact conditional approach was generally inferior to the exact unconditional approach for small samples. Shan and Wilding [33] compared asymptotic and exact procedures for the kappa coefficient in a 2 × 2 table. However, little work has been carried out on extending the exact approaches to test the homogeneity of AC1 coefficients across several independent strata.

This paper aims to propose asymptotic and exact methods for the homogeneity test of stratified AC1. The novelty and contribution lie in three main aspects, as follows.
(i) For large sample sizes, we propose two asymptotic statistics, the likelihood ratio and Wald-type tests, to extend the homogeneity test studied by Honda and Ohyama [19]. Our results show that the likelihood ratio test is more robust than the other tests regarding type I error rates, while the powers of the tests are close to each other. Thus, we recommend the likelihood ratio test for the homogeneity test of stratified AC1 in large samples.
(ii) Based on the asymptotic statistics, we derive three exact approaches (the E, M, and E+M methods) for small samples ($n = 10, 25$). These exact methods can effectively improve the performance of the homogeneity test with respect to type I error rates. Among them, the exact E approaches based on the likelihood ratio and score tests are more robust in small samples.
(iii) We investigate the strengths and weaknesses of the asymptotic and exact methods through extensive numerical analyses. Some useful conclusions are obtained from the analyses of real examples.

The rest of this paper is organized as follows. In Section 2, we review the AC1 coefficient in a stratified design and establish a probability model. The maximum likelihood method and an iterative algorithm are used to estimate the unknown parameters. In Section 3, we review the score statistic and derive two asymptotic test statistics for large samples. Based on these statistics, several exact methods are developed for small sample sizes in Section 4. In Section 5, we conduct numerical studies to investigate the performance of all the derived methods regarding type I error rates and powers. In Section 6, we study the aforementioned real examples of large and small samples to illustrate these methods. Finally, a brief conclusion is given in Section 7.

A Probability Model and Homogeneity Test
Following Ohyama [12], we introduce $K$ covariate levels into Table 2 and establish a probability model. Suppose that $N$ subjects are divided into $K$ independent strata. In the $k$th ($k = 1, 2, \ldots, K$) stratum, there are $n_{1k}$, $n_{2k}$, and $n_{3k}$ subjects in the three categories. Denote $n_k = n_{1k} + n_{2k} + n_{3k}$ as the total number of subjects in the $k$th stratum. Table 5 shows the data structure across the strata.
For the stratified analysis, we need to construct AC1 for each stratum. Let $X_{kij}$ be an indicator of the $i$th ($i = 1, 2$) rater's judgement for the $j$th ($j = 1, 2, \ldots, n_k$) subject in the $k$th ($k = 1, 2, \ldots, K$) stratum: $X_{kij} = 1$ if the classification is positive "(+)", and $X_{kij} = 0$ otherwise. Ohyama [12] assumed that the underlying probability of classifying a subject does not depend on the raters but on the subjects; that is, $\Pr(X_{kij} = 1 \mid j) = p_{kj}$. The $N$ subjects are classified into $K$ strata based on covariates, and every stratum has different subjects. Thus, the data of the strata are independent of each other. Denote $E(p_{kj}) = \sum_{j=1}^{n_k} p_{kj} / n_k \triangleq \pi_k$ and $\mathrm{Var}(p_{kj}) = \sum_{j=1}^{n_k} (p_{kj} - \pi_k)^2 / n_k \triangleq \sigma_k^2$. Then, the AC1 of the $k$th stratum is

$$\gamma_k = 1 + \frac{2\left(\sigma_k^2 - \pi_k(1 - \pi_k)\right)}{1 - 2\pi_k(1 - \pi_k)}, \quad k = 1, 2, \ldots, K.$$
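As a hedged illustration of this parameterization, the sketch below expresses the three cell probabilities of one stratum in terms of $(\gamma_k, \pi_k)$. The expressions are an assumption derived from the relations $P_{11} = \sigma^2 + \pi^2$, $P_{22} = 1 - 2\pi + \sigma^2 + \pi^2$, and the definition of $\gamma$; the function names are hypothetical. Because the stratum model then has two free parameters for three cells, matching the observed proportions gives plug-in estimates, which are assumed here to coincide with the unconstrained MLEs whenever they fall inside the parameter space.

```python
def cell_probs(gamma, pi):
    """Probabilities of (both +, exactly one +, both -) in one stratum."""
    c = 1 - 2 * pi * (1 - pi)      # 1 - p*_e
    p2 = (1 - gamma) * c           # disagreement probability, 1 - p_a
    p1 = pi - p2 / 2               # (+, +)
    p3 = (1 - pi) - p2 / 2         # (-, -)
    return p1, p2, p3


def plug_in_estimates(n1, n2, n3):
    """Estimates of (gamma, pi) obtained by matching observed proportions."""
    n = n1 + n2 + n3
    pi_hat = (2 * n1 + n2) / (2 * n)
    gamma_hat = 1 - (n2 / n) / (1 - 2 * pi_hat * (1 - pi_hat))
    return gamma_hat, pi_hat


# Male twin data of Table 3: (Both, One, Neither) for MZ and DZ.
for counts in [(19, 14, 19), (8, 16, 7)]:
    g, p = plug_in_estimates(*counts)
    print(round(g, 4), round(p, 4))
```

For the Table 3 data, these plug-in values can be compared with the unconstrained MLEs reported in the Applications section.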
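Next, we estimate the parameters $\gamma$ and $\pi = (\pi_1, \ldots, \pi_K)$ under the null hypothesis $H_0: \gamma_1 = \gamma_2 = \cdots = \gamma_K \triangleq \gamma$. The log-likelihood function is rewritten as

$$l_0(\gamma, \pi \mid N) = \sum_{k=1}^{K} l_{0k}(\gamma, \pi_k \mid n_k),$$

where $l_{0k}(\gamma, \pi_k \mid n_k)$ is the log-likelihood function of the $k$th stratum under $H_0$. Let $\tilde{\gamma}$ and $\tilde{\pi}_k$ ($k = 1, 2, \ldots, K$) be the constrained MLEs of $\gamma$ and $\pi_k$ under $H_0$. Similarly, we differentiate $l_0(\gamma, \pi \mid N)$ with respect to $\gamma$ and $\pi_k$ and set the derivatives to zero:

$$\frac{\partial l_0}{\partial \gamma} = 0, \qquad \frac{\partial l_0}{\partial \pi_k} = 0, \quad k = 1, 2, \ldots, K.$$

However, these equations have no closed-form solutions, so the Fisher scoring algorithm is used to obtain the constrained MLEs. The iteration process consists of three steps.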
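As a rough numerical stand-in for that iteration, the following sketch maximizes the null log-likelihood by coordinate-wise golden-section search. This is a swapped-in technique, not the Fisher scoring algorithm itself. The cell probabilities in `cell_probs` are an assumption derived from the definitions of $p_a$ and $\gamma$, all function names are hypothetical, and the likelihood ratio statistic is taken in its usual form, twice the gap between the unconstrained and constrained log-likelihoods, with the unconstrained fit assumed to reproduce the observed proportions.

```python
import math


def cell_probs(gamma, pi):
    """Assumed cell probabilities (both +, one +, both -) of one stratum."""
    p2 = (1 - gamma) * (1 - 2 * pi * (1 - pi))
    return pi - p2 / 2, p2, (1 - pi) - p2 / 2


def loglik(gamma, pis, tables):
    """Null log-likelihood (without multinomial constants)."""
    total = 0.0
    for pi, counts in zip(pis, tables):
        for n, p in zip(counts, cell_probs(gamma, pi)):
            if p <= 0:
                return -math.inf        # outside the parameter space
            total += n * math.log(p)
    return total


def golden_max(f, lo, hi, iters=80):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    r = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:                     # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
        else:                           # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
    return (a + b) / 2


def constrained_mle(tables, sweeps=100, eps=1e-6):
    """Coordinate ascent over the common gamma and each pi_k."""
    pis = [(2 * n1 + n2) / (2 * (n1 + n2 + n3)) for n1, n2, n3 in tables]
    gamma = 0.0
    for _ in range(sweeps):
        gamma = golden_max(lambda g: loglik(g, pis, tables), -1 + eps, 1 - eps)
        for k in range(len(tables)):
            def f(p, k=k):
                return loglik(gamma, pis[:k] + [p] + pis[k + 1:], tables)
            pis[k] = golden_max(f, eps, 1 - eps)
    return gamma, pis


def lr_statistic(tables):
    """2 * (unconstrained - constrained log-likelihood), saturated fit assumed."""
    l_hat = sum(n * math.log(n / sum(c)) for c in tables for n in c if n > 0)
    gamma, pis = constrained_mle(tables)
    return 2 * (l_hat - loglik(gamma, pis, tables)), gamma, pis


# Male twin data of Table 3: (Both, One, Neither) for MZ and DZ.
t_l, gamma_tilde, pi_tilde = lr_statistic([(19, 14, 19), (8, 16, 7)])
print(round(gamma_tilde, 4), [round(p, 4) for p in pi_tilde], round(t_l, 3))
```

Golden-section search needs no derivatives, which keeps the sketch short; it relies on the log-likelihood being unimodal along each coordinate, which is not guaranteed in general.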

Score Statistic $T_{SC}$
Honda and Ohyama [19] proposed the score statistic. Denote … Under $H_0$, the score test statistic can be represented as …, where $\tilde{\gamma}$ and $\tilde{\pi} = (\tilde{\pi}_1, \tilde{\pi}_2, \ldots, \tilde{\pi}_K)^T$ are the constrained MLEs. The $2K \times 2K$ Fisher information matrix $I_2$ is given in Appendix A.2. Through calculation, its simplified form is …

Wald-Type Statistic $T_W$
Denote $\beta = (\gamma_1, \gamma_2, \ldots, \gamma_K, \pi_1, \pi_2, \ldots, \pi_K)_{1 \times 2K}$ and … The null hypothesis $H_0$ is equivalent to $C\beta^T = 0$, where $0$ is a zero vector. Thus, we define the Wald-type statistic as

$$T_W = (C\hat{\beta}^T)^T \left(C I_3^{-1} C^T\right)^{-1} (C\hat{\beta}^T),$$

where $\hat{\beta}$ collects the unconstrained MLEs $\hat{\gamma}_k$ and $\hat{\pi}_k$. The Fisher information matrix $I_3$ is the same as that of the score test. We obtain the simplified form of $T_W$ as … Appendix A.3 provides the detailed process.

Under $H_0$, the three statistics $T_L$, $T_{SC}$, and $T_W$ are asymptotically distributed as a chi-square distribution with $K - 1$ degrees of freedom [34]. Given a significance level $\alpha$, $\chi^2_{K-1,1-\alpha}$ is the $100(1 - \alpha)$th percentile of the chi-square distribution with $K - 1$ degrees of freedom. For specific observed data $N^* = (n_1, n_2, \ldots, n_K)$, the p-values of these statistics are defined as

$$p^A_\theta = \Pr\left(\chi^2_{K-1} \ge T_\theta(N^*)\right), \quad \theta = L, SC, W,$$

where $T_\theta(N^*)$ is the value of the statistic for the observed data $N^*$. For convenience, $p^A_L$, $p^A_{SC}$, and $p^A_W$ are called asymptotic (A) approaches. Generally, asymptotic tests work well for large samples. However, they are conservative or liberal for small sample sizes. Thus, we propose several exact methods based on the above statistics.
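The asymptotic p-value only requires the upper tail of a chi-square distribution with $K - 1$ degrees of freedom. A minimal standard-library sketch is given below; the function name `chi2_sf` is hypothetical, and the implementation uses the closed forms for one and two degrees of freedom together with the standard recurrence of the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability Pr(chi2_df >= x) for an integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5   # closed form for df = 1
    else:
        q, a = math.exp(-half), 1.0              # closed form for df = 2
    # Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1), with h = x / 2
    while 2 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1
    return q


# Asymptotic p-values for the statistic values reported for the
# small-sample example (K = 2, so one degree of freedom).
for t in (2.0150, 1.9674, 2.0805):
    print(round(chi2_sf(t, 1), 4))
```

A statistics library would normally supply this function; the hand-written version only keeps the sketch free of dependencies.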

Exact Methods
Researchers often use the p-value to summarise the evidence against a null hypothesis. Thus, the key to an exact method is the calculation of the exact p-value. We uniformly denote the aforementioned test statistics $T_L$, $T_{SC}$, and $T_W$ as $T_\theta$ ($\theta = L, SC, W$). Instead of relying on the chi-square distribution, the exact test uses the true sampling distribution of $T_\theta$ to compute an exact p-value. The calculation proceeds as follows. First, we generate all possible tables. For given observed data $N^*$, the column margins $n_1, n_2, \ldots, n_K$ are fixed, and we enumerate all possible tables by varying the cell values. The detailed process is described as follows.
(i) Produce all possible values of each stratum, formed by all combinations $(n_{1k}, n_{2k}, n_{3k})$ such that $n_{1k} + n_{2k} + n_{3k} = n_k$ ($k = 1, \ldots, K$), where $n_k$ is fixed. We take $K = 2$ and $n_1 = n_2 = 2$ as an example. There are six combinations in the $k$th stratum: (0, 0, 2), (0, 1, 1), (0, 2, 0), (1, 0, 1), (1, 1, 0), and (2, 0, 0).
(ii) Enumerate all possible tables determined by the combinations of all strata. For $K = 2$ and $n_1 = n_2 = 2$, we obtain the 36 possible tables in Table 6.
Note that each column of Table 6 corresponds to a categorical table with $K$ strata. Through steps (i)-(ii), we can enumerate all possible tables for any observed data $N^* = (n_1, n_2, \ldots, n_K)$. We then identify the tail area from this reference set. The tail area includes all the tables whose statistic values equal or exceed the statistic of the observed data $N^*$. Finally, the exact p-value is calculated by summing the probabilities of all the tables in the tail area. This calculation requires eliminating the unknown parameters introduced in the previous section, and the following exact methods differ in how they eliminate them.
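Steps (i)-(ii) can be sketched in a few lines. The function names are hypothetical; the sketch simply lists every stratum outcome with a fixed total and takes the Cartesian product across strata.

```python
from itertools import product


def stratum_outcomes(n):
    """All (n1, n2, n3) with non-negative entries summing to n."""
    return [(n1, n2, n - n1 - n2)
            for n1 in range(n + 1)
            for n2 in range(n - n1 + 1)]


def all_tables(margins):
    """All tables for fixed stratum totals, one outcome per stratum."""
    return list(product(*(stratum_outcomes(n) for n in margins)))


print(stratum_outcomes(2))
print(len(all_tables((2, 2))))
```

A stratum of size $n$ has $(n + 1)(n + 2)/2$ outcomes, so the size of the reference set is the product of these counts over the strata.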

E Approach
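The E approach eliminates the unknown parameters by replacing them with the constrained MLEs. We first generate all possible tables. The tail area consists of all tables $N$ with $T_\theta(N) \ge T_\theta(N^*)$, and the exact p-value is

$$p^E_\theta = \sum_{N:\, T_\theta(N) \ge T_\theta(N^*)} L(\tilde{\gamma}^*, \tilde{\pi}^* \mid N), \quad \theta = L, SC, W,$$

where $\tilde{\gamma}^*$ and $\tilde{\pi}^* = (\tilde{\pi}^*_1, \tilde{\pi}^*_2, \ldots, \tilde{\pi}^*_K)$ are the constrained MLEs of $\gamma$ and $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$. The probability of a table in the tail area is $L(\tilde{\gamma}^*, \tilde{\pi}^* \mid N) = \exp\{l_0(\tilde{\gamma}^*, \tilde{\pi}^* \mid N)\}$, the likelihood function under the null hypothesis. For convenience, $p^E_L$, $p^E_{SC}$, and $p^E_W$ are collectively called the E approach.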
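The summation over the tail area can be sketched generically. In the sketch below, `exact_e_pvalue` and its arguments are hypothetical names: `statistic` stands for any of the test statistics, and `null_probs` holds the per-stratum cell probabilities evaluated at the constrained MLEs of the observed table, both assumed to be computed elsewhere. The probability of a table is taken to be the product of multinomial probabilities over strata, including the multinomial coefficients, which is an assumption about how the likelihood is normalized.

```python
import math
from itertools import product


def stratum_outcomes(n):
    """All (n1, n2, n3) with non-negative entries summing to n."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n - a + 1)]


def multinomial_pmf(counts, probs):
    """Multinomial probability of one stratum's counts."""
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        if c > 0:
            if p <= 0:
                return 0.0
            log_p += c * math.log(p)
        log_p -= math.lgamma(c + 1)
    return math.exp(log_p)


def exact_e_pvalue(observed, statistic, null_probs):
    """Sum of null probabilities over tables at least as extreme as observed.

    observed   : tuple of per-stratum counts (n1k, n2k, n3k)
    statistic  : callable mapping a table to a number
    null_probs : per-stratum cell probabilities at the constrained MLEs
    """
    t_obs = statistic(observed)
    reference = product(*(stratum_outcomes(sum(c)) for c in observed))
    p_value = 0.0
    for table in reference:
        if statistic(table) >= t_obs - 1e-12:      # tolerate rounding noise
            prob = 1.0
            for counts, probs in zip(table, null_probs):
                prob *= multinomial_pmf(counts, probs)
            p_value += prob
    return p_value


# Toy illustration with a made-up statistic (not one of the paper's tests):
# the absolute difference in the number of disagreements between two strata.
toy_stat = lambda t: abs(t[0][1] - t[1][1])
p = exact_e_pvalue(((1, 0, 1), (0, 2, 0)), toy_stat, [(0.25, 0.5, 0.25)] * 2)
print(p)
```

Passing the statistic as a callable keeps the enumeration and the choice of $T_\theta$ separate, so the same loop can serve the likelihood ratio, score, and Wald-type versions.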

M Approach
In Basu [30], the size of a test is understood as the maximum probability of a type I error. Accordingly, the M approach eliminates the unknown parameters by searching the whole range of $\gamma$ and $\pi$ for the values that maximize the sum of the probabilities of all the tables in the tail area. This maximum is the p-value of the M approach. Denote $\Theta = \{\pi : \pi_k \in [0, 1],\ k = 1, 2, \ldots, K\}$ and …, where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)$. As in the E approach, the tail area consists of the tables whose statistic values equal or exceed that of the observed data. Under these conditions, the exact p-value of the M approach can be defined as …, where $L(\gamma, \pi \mid N) = \exp\{l(\gamma, \pi \mid N)\}$ is the likelihood function under $H_a$. The M approaches based on the three statistics are denoted as $p^M_L$, $p^M_{SC}$, and $p^M_W$.

Table 6. All the possible tables for $K = 2$ and $n_1 = n_2 = 2$.

E+M Approach
The E approach is not always effective because its type I error rates can be unsatisfactory. Lloyd [31] added a maximization step to improve it, which gives the E+M approach. First, the p-value of the E approach is used as a test statistic to define the tail area, which therefore consists of the tables $N$ with $p^E_\theta(N) \le p^E_\theta(N^*)$. Then, the exact p-value is the maximum, over the parameter space, of the sum of the probabilities $L(\gamma, \pi \mid N)$ of all the tables in this tail area, where $L(\gamma, \pi \mid N)$ is the same likelihood function as in the M approach. The E+M approach includes $p^{EM}_L$, $p^{EM}_{SC}$, and $p^{EM}_W$.

Numerical Simulation
This section investigates the performance of the asymptotic and exact methods in terms of type I error rates and powers. Given a significance level of 0.05, the type I error rate is the probability of rejecting $H_0$ when $H_0$ is true. According to Tang et al. [35], a test is considered liberal when the type I error rate is larger than 0.06 and conservative when it is less than 0.04; otherwise, it is robust. In several tables of this paper, type I error rates within the robust region (0.04-0.06) are shown in bold to illustrate the performance of the statistics. The power is defined as the probability of rejecting $H_0$ when $H_a$ is true. A test is optimal if it is robust and has higher power.

Simulations of Asymptotic Methods
In the simulation, we first compare the performance of the test statistics $T_L$, $T_{SC}$, and $T_W$ in terms of empirical type I error rates under different parameter settings. Under $H_0: \gamma_1 = \gamma_2 = 0.1$, we take $K = 2$, $n_1 = n_2 = 50$, and $\pi_1 = \pi_2 = 0.3$ as an example to describe the detailed calculation of the empirical type I error rates.
(i) Substitute the given values of $\gamma_k$, $n_k$, and $\pi_k$ ($k = 1, 2$) into (3). Let $F$ be the cumulative distribution of the three types of ratings ($l = 1, 2, 3$) in the two strata. Through calculation, we have …
(ii) Produce an $n_1 \times K$ pseudorandom matrix drawn from the standard uniform distribution on the open interval (0, 1), denoted by $r = (r_{ik})$ ($i = 1, 2, \ldots, n_1$, $k = 1, 2, \ldots, K$). Define … Then, $n_{2k} = n_1 - n_{1k} - n_{3k}$ for $k = 1, 2, \ldots, K$. When $K = 2$, we obtain a sample (or table) with two strata as follows: … When 10,000 pseudorandom matrices are given, 10,000 samples are randomly produced under the null hypothesis $H_0: \gamma_1 = \gamma_2 = 0.1$.
(iii) For each sample, calculate the corresponding MLEs and construct the three statistics $T_L$, $T_{SC}$, and $T_W$. Given a significance level $\alpha$, we reject $H_0$ if the statistic exceeds $\chi^2_{K-1,1-\alpha}$.
(iv) The empirical type I error rate is the proportion of rejections of $H_0$, that is, the number of rejections divided by 10,000.
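The sampling part of this procedure, steps (i)-(ii), can be sketched as follows. The cell probabilities in `cell_probs` are an assumption derived from the definitions of $p_a$ and $\gamma$, the rule that maps uniform draws to categories through the cumulative probabilities is one plausible reading of the step, and all names are hypothetical.

```python
import random


def cell_probs(gamma, pi):
    """Assumed cell probabilities (both +, one +, both -) of one stratum."""
    p2 = (1 - gamma) * (1 - 2 * pi * (1 - pi))
    return pi - p2 / 2, p2, (1 - pi) - p2 / 2


def simulate_stratum(n, probs, rng):
    """Draw one stratum by comparing uniform draws with cumulative probabilities."""
    p1, p2, _ = probs
    n1 = n3 = 0
    for _ in range(n):
        r = rng.random()
        if r < p1:                # falls below F(1): both raters positive
            n1 += 1
        elif r >= p1 + p2:        # falls above F(2): both raters negative
            n3 += 1
    return n1, n - n1 - n3, n3    # n2 is obtained by subtraction


def simulate_table(gammas, pis, sizes, rng):
    """One table with K strata."""
    return tuple(simulate_stratum(n, cell_probs(g, p), rng)
                 for g, p, n in zip(gammas, pis, sizes))


rng = random.Random(2024)         # fixed seed for reproducibility
samples = [simulate_table((0.1, 0.1), (0.3, 0.3), (50, 50), rng)
           for _ in range(10_000)]
print(samples[0])
```

Each simulated table would then be passed to the chosen statistic, and the proportion of rejections over the 10,000 tables gives the empirical type I error rate.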
Through steps (i)-(iv), we can calculate the empirical type I error rates of the asymptotic test statistics $T_L$, $T_{SC}$, and $T_W$ under different parameter settings. In practice, the AC1 coefficient is usually positive. Under $H_0$: …, Table 7 shows the empirical type I error rates of the asymptotic statistics for $K = 2$ under the balanced and unbalanced $\pi$ settings. The corresponding results for $K = 3, 4$ are shown in Tables A1 and A2 of Appendix A.4. The tables show that the type I error rates of all the statistics approach the significance level of 0.05 as the sample size increases. When the sample sizes are relatively small, the type I error rates of the likelihood ratio and score statistics are smaller than 0.05, whereas the Wald-type test has a few liberal type I error rates. All three test statistics have conservative type I error rates for small and moderate samples when $\gamma$ is close to 1. As the number of strata increases, $T_W$ becomes more liberal. For the unbalanced $\pi$, some type I error rates under $\gamma = 0.1$ are larger than 0.06 for $K = 3$ and $K = 4$. Overall, $T_L$ is recommended because its type I error rates are the most robust among the three statistics.
We use three-dimensional figures to investigate the type I error rates of the asymptotic methods. For convenience, the sample sizes are given as $n_1 = n_2 = \cdots = n_K \triangleq n = 10, 50, 100$. The parameters are set to $\pi_k = \pi$ and $\gamma_k = \gamma$ ($k = 1, 2, \ldots, K$; $K = 2, 3, 4$). For each sample size, $\pi$ increases from 0.1 to 0.9 in steps of 0.04, and $\gamma$ increases from −0.9 to 0.9 in steps of 0.04. Figure 2 shows the surfaces of type I error rates for all tests under $K = 2$. The cases of $K = 3, 4$ are displayed in Figures A1 and A2 of Appendix A.4. From these figures, we observe that the type I error rates of these statistics are smaller than 0.05 when $\gamma$ is close to −1 or 1. For the same sample size, the type I error rates of the Wald-type statistic tend to be larger as the number of strata increases; thus, it is more liberal than the other two statistics. The empirical type I error rates of the likelihood ratio and score statistics move closer to 0.05 as the sample size increases. For large samples, $T_L$, $T_{SC}$, and $T_W$ are usually robust, and the type I error rates of $T_L$ are more concentrated around 0.05. Overall, the likelihood ratio statistic performs better under all configurations. However, when the sample sizes are small, most type I error rates of $T_L$ and $T_{SC}$ are smaller than 0.04, and those of $T_W$ are greater than 0.06.
Next, we analyze the powers of the proposed tests. They are calculated in the same way as the empirical type I error rates, except that the samples are generated under the alternative hypothesis $H_a$. Take $n_1 = n_2 = \cdots = n_K \triangleq n = 10, 50, 100$. The following parameter settings are considered for each sample size: … Here, $a : b : c$ means increasing from $a$ to $c$ in steps of $b$. Under the alternative hypothesis $H_a$, we randomly generate 10,000 samples for each design. The empirical power equals the proportion of samples in which $H_0$ is rejected. Figure 3 shows the empirical powers of the three asymptotic tests. The Wald-type test has higher empirical powers than the other two statistics, especially in small samples. The powers of the test statistics become higher as the difference between $\gamma_2$ and $\gamma_1$ ($\gamma_3$) becomes larger. The powers of the three statistics become closer to each other as the sample size increases.
Overall, $T_L$ is recommended for the homogeneity test of stratified AC1 in large samples because of its robust type I error rates and satisfactory powers.

Table 7. Empirical type I error rates of asymptotic statistics for $K = 2$ (balanced sample sizes).

Exact Methods Results
Considering the unsatisfactory performance of the asymptotic methods in small samples, we introduce the exact E, M, and E+M methods to improve effectiveness. The type I error rates and powers of these three methods are compared with those of the asymptotic approaches to investigate the advantages of the exact methods. The algorithm for the exact p-value is usually computationally intensive and time consuming, and it sometimes exceeds the memory limits of the computer. Thus, running time is an important determinant of the numbers of strata and sample sizes that can be considered in the numerical study of exact methods. For simplicity, we only focus on $n_1 = \cdots = n_K \triangleq n$, $\pi_k = 0.5$, and $\gamma_k = 0.1$ ($k = 1, 2, \ldots, K$). The average of 100 running times is used as the running time of the exact p-value. We study the running times under the following parameter settings: (i) $K = 2$, $n = 2, \ldots, 11$, and (ii) $n = 3$, $K = 2, 3, 4, 5$. The running times of the different methods are shown in Table 8. The results show that the running time increases exponentially with the number of strata $K$ and the marginal totals $n$, which is demanding in both memory and processing time. Thus, the cases $n = 10$ and $K = 2, 3$ are considered in our work.

Take $\pi_k = \pi$ and $\gamma_k = \gamma$ ($k = 1, 2, \ldots, K$), where $\pi = 0 : 0.02 : 1$ and $\gamma = -1 : 0.02 : 1$. Figure 4 shows the surfaces of type I error rates for $K = 2$. The case of $K = 3$ is provided in Figure A3 of Appendix A.4. The small diagrams in the upper right corner show the curves of the type I error rates under $\pi = 0.5$ and $\gamma = -1 : 0.02 : 1$. In the large diagrams, $p^A_L$ and $p^A_W$ have liberal type I error rates, while the type I error rates of $p^A_{SC}$ are smaller than 0.05. The M and E+M approaches produce conservative type I error rates. The E approaches under the likelihood ratio and score statistics are better than that under the Wald-type test when $K = 2$.
For $K = 3$, the surfaces of the E approaches under the three statistics are closer to the significance level when $\gamma$ is positive. In the small diagrams, the curves of type I error rates have bimodal shapes. To reveal the reason, we consider the case of $K = 2$, $\gamma_k = \gamma$, and $\pi_k = 0.5$. As a part of the tail probability, the quantity LL … has the same value for each $\gamma$ under a fixed sum of $n_{2k}$. Table 9 shows how LL changes as $\gamma$ and $(n_{21} + n_{22})$ increase. The table shows that, for a given sum of $n_{2k}$, increasing $\gamma$ changes the peak. A peak also occurs as the sum of $n_{2k}$ increases for a given $\gamma$. Meanwhile, the tail area of each method determines the location and values of the bimodal shape.

Next, we compare the type I error rates of the exact and asymptotic methods under several parameter settings. Let $K = 2$ and $n_1 = n_2 = 10, 25$. From Table 10, the type I error rates of $p^E_L$ and $p^E_{SC}$ are closer to 0.05 than those of $p^E_W$. Within the range of small sample sizes, the E approach works better as the sample size increases, and its type I error rates are close to 0.05 when $n_1 = n_2 = 25$. This reveals that the E approach is more effective than the asymptotic methods for high $\gamma$.
According to the relationship between $\gamma_k$ and $\pi_k$ ($k = 1, 2, \ldots, K$), the following parameter settings are considered under $H_a$: (i) $n_k = n = 10$, $K = 2$, $\pi = (0.5, 0.5)$, $\gamma_1 = 0.1$, $\gamma_2 = -0.9 : 0.05 : 0.9$; (ii) $n_k = n = 10$, $K = 3$, $\pi = (0.5, 0.5, 0.5)$, $\gamma_1 = \gamma_3 = 0.1$, $\gamma_2 = -0.9 : 0.05 : 0.9$. Figure 5 provides the power curves of the exact methods. For the A approach, $p^A_W$ has higher powers under the different parameter settings. The power becomes larger as the absolute value of $\gamma_2 - \gamma_1$ ($\gamma_3$) becomes larger. Conversely, the powers of each method tend to 0.05 as $\gamma_2$ approaches 0.1. The power curves of the E+M approaches are close to each other.

Then, we compare the powers of the asymptotic and exact methods under different parameter settings. For $K = 2$ and $n_1 = n_2 = \cdots = n_K = 10$, the values of $\gamma_k$ are taken as (i) $\gamma_1 = 0.1$, $\gamma_2 = 0.5 : 0.1 : 0.9$; (ii) $\gamma_1 = 0.3$, $\gamma_2 = 0.6 : 0.1 : 0.9$; and (iii) $\gamma_1 = 0.5$, $\gamma_2 = 0.8, 0.9$. When $K = 3$, let $\gamma_1 = \gamma_3$ and keep the other settings the same. Table 11 provides the power comparisons for $K = 2$ under the balanced $\pi$ conditions. Tables A3-A5 show the comparisons of these methods under the other settings. The powers of the exact methods are generally smaller than those of the asymptotic methods. However, $p^E_{SC}$ has higher powers than $p^A_{SC}$. Among the exact methods, the E approach has higher power than the other two, and within the E approach the Wald-type statistic gives higher power.

In summary, the exact E method can effectively improve the homogeneity test of stratified AC1 under small sample sizes and high $\gamma$. The E approaches under the likelihood ratio and score statistics perform better than that under the Wald-type statistic.

Applications
We revisit the two real examples from the introduction. Table 3 shows the data structure for 83 pairs of male twins. For this large sample, the hypotheses of the homogeneity test for AC1 across the two strata are $H_0: \gamma_1 = \gamma_2 \triangleq \gamma$ vs. $H_a: \gamma_1 \ne \gamma_2$.
By computation, the unconstrained MLEs of $\gamma$ and $\pi$ are $\hat{\gamma} = (0.4615, -0.0312)$ and $\hat{\pi} = (0.5000, 0.5161)$. The constrained MLEs of $\gamma$ and $\pi$ are $\tilde{\gamma} = 0.2788$ and $\tilde{\pi} = (0.5000, 0.5351)$. Table 12 provides the values of the statistics and the p-values. Given a significance level $\alpha = 0.05$, the values of the test statistics $T_L$, $T_{SC}$, and $T_W$ are all larger than $\chi^2_{1,0.95} = 3.8415$, and all the p-values of these three tests are smaller than 0.05. Thus, the null hypothesis $H_0$ is rejected at the significance level 0.05. There is a significant difference between the two AC1 coefficients, and we cannot merge the data of the two strata to compute a common coefficient. We then need to investigate how zygosity affects the consistency of the alcohol-drinking status of male twins.

Table 4 shows the small-sample data structure for the clinical COVID-19 trial. The A, E, M, and E+M methods are applied to test $H_0: \gamma_1 = \gamma_2 \triangleq \gamma$ vs. $H_a: \gamma_1 \ne \gamma_2$. By calculation, the unconstrained MLEs of the parameters $\gamma$ and $\pi$ are $\hat{\gamma} = (0.6656, 0.2197)$ and $\hat{\pi} = (0.6176, 0.6176)$. The constrained MLEs are $\tilde{\gamma} = 0.4537$ and $\tilde{\pi} = (0.5882, 0.6666)$. The values of the statistics are $T_L(N^*) = 2.0150$, $T_{SC}(N^*) = 1.9674$, and $T_W(N^*) = 2.0805$. Table 13 shows the corresponding p-values of the asymptotic and exact methods. The running time is about 8 min for two strata with $n_1 = n_2 = 17$ (see Table 8). None of the approaches finds a significant difference in the stratified AC1. Thus, we can merge the data in these two strata and estimate the common AC1 by its MLE, which is 0.4537.

Concluding Remarks
This article takes the stratified AC1 coefficients as the object of study and constructs the likelihood function of the observed data. The primary purpose is to derive various statistics for testing the homogeneity of stratified AC1 in the case of two raters with a binary outcome. We construct asymptotic and exact methods for large and small sample sizes, respectively. Two asymptotic test statistics and their explicit expressions are derived for large sample sizes: the likelihood ratio statistic ($T_L$) and the Wald-type statistic ($T_W$). The score statistic ($T_{SC}$) proposed by Honda and Ohyama [19] is also reviewed. The asymptotic p-values $p^A_\theta$ ($\theta = L, SC, W$) under these statistics are referred to as the A approach. Three exact methods (E, M, and E+M) based on $T_L$, $T_{SC}$, and $T_W$ are proposed for small sample sizes.
We conduct numerical studies to compare the performance of the above methods in terms of type I error rates and powers. For large samples, the type I error rates of the likelihood ratio statistic are closer to the predetermined significance level of 0.05, and the powers of the statistics improve as the sample size increases. Overall, the likelihood ratio statistic is optimal among the three statistics in large samples. However, asymptotic tests may generate unsatisfactory type I error rates for small samples and high $\gamma$. For small sample sizes with $K = 2$, $p^A_L$ and $p^A_W$ are liberal, and $p^A_{SC}$ has conservative type I error rates under different parameter settings. The type I error rates of the E approach are closer to the significance level of 0.05, whereas the M and E+M approaches have conservative type I error rates. Moreover, $p^E_L$ and $p^E_{SC}$ are more robust than $p^E_W$. When $K = 3$ and $\gamma$ is positive, the E approach has the most robust type I error rates among the three exact methods, with $p^E_L$ and $p^E_{SC}$ performing better. The type I error rates of the exact E method move closer to 0.05 as the sample size increases. The E approach can improve the effectiveness of homogeneity tests at high $\gamma$. Regarding power, the A approach has larger powers than the exact methods, and each exact method has higher power under the Wald-type statistic. Overall, the E approaches based on the likelihood ratio and score statistics perform robustly for small sample sizes.
The proposed methods can be used not only in medical research but also in biometrics and psychological measurement. Meanwhile, exact methods can be applied to other data types, such as binary outcomes with multiple raters. This work focuses on constructing parametric statistics through unconstrained and constrained MLEs. However, many problems remain to be solved. For example, how can optimal tests be constructed? How can exact methods be improved effectively for a larger $K$ or sample size $n_k$? How can the heavy computations caused by the consideration of all the tables be simplified? More studies should be conducted on these problems in the future.

Informed Consent Statement: Not applicable.
Data Availability Statement: Clinical data referred to are from Hannah et al. [16] and Hang et al. [20].

Acknowledgments:
We thank reviewers and editors for constructive and valuable advice for improving this article.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

The Simplification of $T_W$
To obtain the simplification of $T_W$, we calculate $C I_3^{-1} C^T$ and obtain …, $i = 1, 2, \ldots, K - 2$. By computation, the simplified form of $T_W$ is given as …

Appendix A.4. Other Tables and Figures

Table A1. Empirical type I error rates of asymptotic statistics for $K = 3$ (balanced sample sizes).