Omnibus Tests for Multiple Binomial Proportions via Doubly Sampled Framework with Under-Reported Data

Abstract: Previously, the Rahardja (2020) paper [1] developed a (pairwise) multiple comparison procedure (MCP) to determine which pairs of multiple binomial proportions (with under-reported data) the significant differences came from. Such an MCP test is, in general, the second part of a two-stage sequential test. In this paper, we derive two omnibus tests (i.e., tests of the overall equality of multiple proportions) to serve as the first part of that two-stage sequential test with under-reported data. Using two likelihood-based approaches, we acquire two Wald-type (omnibus) tests to compare multiple binomial proportions in the presence of under-reported data. Our closed-form algorithm is easy to implement and not computationally burdensome. We applied our algorithm to a vehicle-accident dataset to illustrate the two omnibus tests.


Introduction
As stated in the Abstract, this paper is a continuation of the Rahardja (2020) paper [1]. Within counts-and-proportions (categorical) data, the topic of interest here is miscategorization of categorical data, particularly binomial outcome data: for example, counts of disease/no-disease cancer screenings in a hospital, proportions of male/female vehicle-driver accidents in an insurance company, or counts of correct/incorrect audited payments in an accounting firm. In an ideal (error-free) world there is no miscategorization, and the simple, straightforward count-and-proportion estimators taught in an introductory statistics (Stat 101) course can be applied directly. In this imperfect world, however, miscategorization errors exist and cannot be ignored, because the presence of under-reported or over-reported errors may create severe bias [2,3]. The existence of such imperfections (under-reported or over-reported errors) is the motivation behind this research: to remedy the bias by incorporating a doubly sampled framework and thereby achieve model identifiability. Consequently (as a brief overview), our objective here is to provide a bias-remedy method, using a doubly sampled framework, in the presence of either under-reported or over-reported data. In the next paragraphs, we review the bibliography and the novelty of this research.
In reality, we frequently encounter a data structure with a binomial output variable and a nominal input variable with multiple groups or levels. When both the binomial output variable and the nominal input variable are counted perfectly, with no mistakes, one can simply use the Pearson chi-square statistic (taught in a Stat 101 course) to test the relationship between the input and output variables. Nevertheless, there are cases where the nominal input variable (for example, a patient-category input variable with g = 3 levels: male/female/pediatric) is categorized flawlessly with no error, while the binomial/binary output variable (e.g., disease/no-disease) is categorized using an error-prone classifier.
Therefore, in light of novelty, in this article we propose two likelihood-based methods (two Wald-type omnibus tests: naïve and modified) to analyze data obtained via a doubly sampled framework for g ≥ 3. Without loss of generality (WLOG), we limit the scope of our omnibus testing methods to data with only one error type (under-reported only or over-reported only). Again, the objective is to test the relationship between the nominal input variable (with multiple levels or groups) and the binomial output variable. Equivalently, this objective translates into testing H0: p1 = p2 = . . . = pg, i.e., the equality of the multiple binomial proportions among the g groups (the levels of the nominal input variable). If the proportions are not significantly different, i.e., H0 is not rejected, then there is no relationship between the nominal input variable and the binomial output variable; vice versa, if H0 is rejected, then there is a relationship. The rest of the article is organized as follows. In Section 2 we lay down the data structure.
In Section 3 we formulate two omnibus (Wald-type) test statistics to test the relationship between the two variables (input and output). In Section 4, as an example, we demonstrate our two omnibus tests on real data. In Section 5 we assess the Type I error rates and power of the tests via simulations. Finally, in Section 6, we present the conclusions, limitations, and future research.

Data Structure
In this section, we consider multiple-sample binomial data with only under-reported data, obtained using a doubly sampled framework. Since various discussions and methods have been proposed in the literature for g = 1 and g = 2, here we describe the kind of data structure that has been discussed (see Table 1) and generalize it to the g ≥ 3 scenario. As mentioned in Section 1, for each of the g groups the overall data structure is composed of two independent datasets: the original study and the validation substudy. For economic viability, the perfect classifier is used only in the validation substudy, while the imperfect classifier is used in both the original study and the validation substudy. We exhibit the summary counts for both studies in Table 1. We let mi and ni denote the numbers of observations in the original study and the validation substudy, respectively, for each group i (i = 1, . . . , g). We then define Ni = mi + ni to be the total number of units for group i.
In the original study, we define xi and yi to be the numbers of positive and negative classifications in group i using the error-prone classifier. In the validation substudy (for i = 1, . . . , g, j = 0, 1, and k = 0, 1), we denote by nijk the number of units in group i classified as j by the error-free classifier and as k by the error-prone classifier.
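As a concrete illustration of this data layout, the summary counts for one group can be held in a small container like the following. This is a hypothetical Python sketch (the paper's own code is in R, and these field names are ours); note that under the one-error-type model only three validation cells occur, so the cell ni10 is omitted:

```python
from dataclasses import dataclass

@dataclass
class GroupCounts:
    """Doubly sampled summary counts for one group i (illustrative names)."""
    x: int    # original study: positives by the error-prone classifier
    y: int    # original study: negatives by the error-prone classifier
    n00: int  # validation: error-free class 0, error-prone class 0
    n01: int  # validation: error-free class 0, error-prone class 1
    n11: int  # validation: error-free class 1, error-prone class 1

    @property
    def m(self) -> int:  # original-study size m_i = x_i + y_i
        return self.x + self.y

    @property
    def n(self) -> int:  # validation-substudy size n_i
        return self.n00 + self.n01 + self.n11

    @property
    def N(self) -> int:  # total units N_i = m_i + n_i
        return self.m + self.n
```

For example, `GroupCounts(x=50, y=150, n00=80, n01=10, n11=10)` describes a group with m = 200, n = 100, and N = 300.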
Here we define further notation. For the jth unit in the ith group (i = 1, . . . , g and j = 1, . . . , Ni), we designate Fij and Tij as the binomial output variables for the error-prone and error-free classifiers, respectively. Note that Fij is observable for all units in both the original study and the validation substudy, while Tij is observable only for units in the validation substudy. If the result is positive by the error-prone classifier, we mark Fij = 1; else, Fij = 0. Likewise, we mark Tij = 1 if the result is truly positive by the error-free classifier; else, Tij = 0. Clearly, misclassification occurs when Fij ≠ Tij.
Next, we define the parameters for group i. Again, in this article, WLOG we consider under-reported data only, since the same formulation applies in full generality to over-reported-only data as well. We let the true proportion parameters of interest be pi = Pr(Tij = 1), the proportion parameters of the error-prone classifier be πi = Pr(Fij = 1), and the false positive rates of the error-prone classifier be ϕi = Pr(Fij = 1 | Tij = 0) for i = 1, 2, . . . , g. Note that πi is not an additional unique parameter because it is obtainable from the other parameters. In particular, by the law of total probability (under the one-error-type model, in which Pr(Fij = 1 | Tij = 1) = 1), we have

πi = pi + ϕi qi, (1)

where qi = 1 − pi. In relation to the summary counts arranged in Table 1, we exhibit the associated cell probabilities in Table 2.
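The law-of-total-probability relation πi = pi + ϕi qi (which holds when the error-prone classifier makes only false-positive errors) and its inverse mapping, used later in the Methods section, can be sanity-checked numerically. The parameter values in this short Python check are made up for illustration:

```python
# Sanity check of pi = p + phi*q (law of total probability when
# Pr(F=1 | T=1) = 1) and the inverse mapping lambda = p/pi,
# phi = (1 - lambda)*pi/q.  Values are illustrative, not from the paper.
p, phi = 0.2, 0.1
q = 1.0 - p
pi = p + phi * q                 # proportion flagged positive by the error-prone classifier
lam = p / pi                     # Pr(T = 1 | F = 1)
phi_back = (1.0 - lam) * pi / q  # recovers the false-positive rate phi
```

Here pi works out to 0.28, and phi_back recovers phi, confirming that (p, ϕ) and (λ, π) carry the same information.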
Our goal is to test the relationship between the categorical (nominal) input variable and the (binomial) output variable; in other words, whether the levels within the (nominal) input variable matter (testing equal proportions) in predicting the binomial output variable. In particular, the statistical hypotheses of interest can be written as

H0: p1 = p2 = . . . = pg vs. H1: at least one pair is not equal. (2)

Again, this goal translates into testing the equality of the multiple binomial proportions among the g groups: if the proportions are not significantly different, i.e., H0 is not rejected, then there is no relationship between the categorical (nominal) input variable and the binomial output; vice versa, if H0 is rejected, then there is an association.

Methods
In this section we derive two (omnibus) statistical tests for the hypotheses in Expression (2). We begin by deriving the Maximum Likelihood Estimator (MLE), along with its asymptotic variance-covariance matrix, for each of the parameters based on the full-likelihood function. Next, we acquire two chi-square tests for the hypotheses in Expression (2), where the second test is essentially an improved version of the first.

The Maximum Likelihood Estimator (MLE)
We display the data structure under consideration in Table 1. For group i, the observed counts (ni00, ni01, ni11) from the validation substudy have a trinomial distribution with total size ni and the associated probabilities displayed in the corresponding cells of Table 2, i.e.,

(ni00, ni01, ni11) ~ Trinomial(ni; qi(1 − ϕi), qi ϕi, pi).

Note that the multinomial distribution [14] is the generalization of the binomial distribution [15]; the trinomial distribution [16] is another special case of the multinomial distribution [14]. Moreover, the observed counts (xi, yi) in the original study have the binomial distribution xi ~ Binomial(mi, πi), with yi = mi − xi. Since (ni00, ni01, ni11) and (xi, yi) are independent for group i, and these cell counts are independent across groups, up to a constant the full-likelihood function is

L(η) = ∏i [qi(1 − ϕi)]^ni00 (qi ϕi)^ni01 pi^ni11 πi^xi (1 − πi)^yi, (3)

where η = (p1, ϕ1, . . . , pg, ϕg) is the parameter vector. Directly maximizing (3) with respect to η is not straightforward. Instead of using numerical methods or a non-closed-form algorithm (which have been the historical challenges in the bibliography), we use a reparameterization of η and derive a closed-form solution. Specifically, we newly introduce

λi = pi/πi = Pr(Tij = 1 | Fij = 1). (4)

Then, we take the new set of parameters to be the vector γ = (λ1, π1, . . . , λg, πg). Via Expression (3), the full log-likelihood function in γ is

ℓ(γ) = Σi [(xi + ni•1) log πi + (yi + ni00) log(1 − πi) + ni11 log λi + ni01 log(1 − λi)], (5)

where ni•1 = ni01 + ni11. The corresponding score vector has components

∂ℓ/∂λi = ni11/λi − ni01/(1 − λi), ∂ℓ/∂πi = (xi + ni•1)/πi − (yi + ni00)/(1 − πi).

By setting the score vector to 0, we obtain the MLE for γ as λ̂i = ni11/ni•1 and π̂i = (xi + ni•1)/Ni. By solving Expressions (1) and (4) and applying the invariance property of the MLE, the MLE for the vector η is p̂i = π̂i λ̂i and ϕ̂i = (1 − λ̂i)π̂i/q̂i, where q̂i = 1 − p̂i.
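The closed-form solution λ̂i = ni11/ni•1, π̂i = (xi + ni•1)/Ni can be sketched in a few lines. This is an illustrative Python helper, not the authors' R code (the function name and argument order are ours):

```python
def mle(x, y, n00, n01, n11):
    """Closed-form MLEs (p_hat, phi_hat) for one group, via the
    reparameterization lambda_hat = n11/(n01 + n11) and
    pi_hat = (x + n01 + n11)/N, followed by the invariance property."""
    n_dot1 = n01 + n11            # validation units classified positive (F = 1)
    N = x + y + n00 + n01 + n11   # total units for the group
    lam_hat = n11 / n_dot1        # estimated Pr(T = 1 | F = 1)
    pi_hat = (x + n_dot1) / N     # estimated Pr(F = 1)
    p_hat = pi_hat * lam_hat      # invariance property of the MLE
    phi_hat = (1.0 - lam_hat) * pi_hat / (1.0 - p_hat)
    return p_hat, phi_hat
```

For example, with counts x = 50, y = 150, n00 = 80, n01 = 10, n11 = 10, the helper gives λ̂ = 0.5, π̂ = 70/300, and hence p̂ = 7/60 ≈ 0.117 and ϕ̂ = 7/53 ≈ 0.132.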

Naïve Wald (nWald) Test
Equivalently, the null hypothesis in Expression (2) can be written in contrast form as

H0: Cp = 0, (6)

where p = (p1, . . . , pg) and C is a (g − 1) × g contrast matrix of full row rank (e.g., successive differences). Under Expressions (2) or (6), asymptotically we have p̂ ≈ N(p, Σ), where p̂ = (p̂1, . . . , p̂g) and Σ̂ = diag(σ̂1^2, . . . , σ̂g^2), with σ̂i^2 the estimated asymptotic variance of p̂i obtained from the inverse Fisher information and the Delta method. The resulting Wald statistic is T1 = (Cp̂)ᵀ(CΣ̂Cᵀ)⁻¹(Cp̂). Hence, we reject the null hypothesis in Expressions (2) or (6) at Type I error level α if T1 > χ²g−1,1−α, the (1 − α) quantile of the chi-square distribution with g − 1 degrees of freedom. Note that we derive T1 by applying a naïve Wald test procedure, and we therefore name it the naïve Wald (nWald) test. Such tests are known to be liberal and tend to have Type I error rates above the nominal level.

Modified Wald (mWald) Test
To refine the naïve Wald test, we utilize the logit transformation si = log(pi/(1 − pi)), with estimate ŝi = log(p̂i/(1 − p̂i)). Since the range of si is the whole real line, the asymptotic distribution of ŝi is better approximated by a normal distribution than that of p̂i. Subsequently, we construct a test based on ŝi. By the Delta method, we have τi^2 = σi^2/[pi(1 − pi)]^2. We note that the hypotheses in Expression (2) are also equivalent to

H0: Cs = 0, (7)

where s = (s1, . . . , sg). Now, because under Expressions (2) or (7) we asymptotically have ŝ ≈ N(s, Ω), where ŝ = (ŝ1, . . . , ŝg) and Ω̂ = diag(τ̂1^2, . . . , τ̂g^2), the corresponding Wald statistic is T2 = (Cŝ)ᵀ(CΩ̂Cᵀ)⁻¹(Cŝ). We reject the null hypothesis in Expressions (2) or (7) at Type I error level α if T2 > χ²g−1,1−α. Such a testing method is called the modified Wald (mWald) test.
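When the covariance matrix is diagonal, an omnibus Wald statistic of this kind reduces to an inverse-variance-weighted sum of squared deviations from the pooled estimate, which makes the computation simple. The following Python sketch is our illustration with hypothetical inputs (the paper's σi^2 come from the Fisher information and the Delta method, which are not reproduced here); applied to the p̂i it yields a T1-type statistic, and applied to the logit-scale ŝi with τ̂i^2 it yields a T2-type statistic:

```python
import math

def wald_omnibus(est, var):
    """Wald chi-square for H0: all components equal, given asymptotically
    independent estimates `est` with estimated variances `var`.
    For a diagonal covariance matrix this equals the inverse-variance-
    weighted sum of squared deviations from the pooled mean; compare it
    to the chi-square critical value with g - 1 degrees of freedom."""
    w = [1.0 / v for v in var]
    pooled = sum(wi * e for wi, e in zip(w, est)) / sum(w)
    return sum(wi * (e - pooled) ** 2 for wi, e in zip(w, est))

def to_logit_scale(p_hat, var_p):
    """Map estimates and variances to the logit scale by the Delta method:
    s = log(p/(1-p)), tau^2 = var / [p(1-p)]^2."""
    s = [math.log(p / (1.0 - p)) for p in p_hat]
    tau2 = [v / (p * (1.0 - p)) ** 2 for p, v in zip(p_hat, var_p)]
    return s, tau2
```

For instance, `wald_omnibus([0.1, 0.3], [0.01, 0.01])` returns about 2.0, and identical estimates give a statistic of 0.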

Example
To demonstrate the two omnibus tests (nWald and mWald) derived in Section 3 (the first part/stage of our two-stage sequential test), here, for example-continuity purposes, we apply the same vehicle-accident data published by Hochberg in 1977 [17] and used by Rahardja in 2020 [1] to demonstrate the second part/stage of our two-stage sequential testing development.
To recap, the narrative of the vehicle-accident data [1,17] is as follows. There are two classifiers for categorizing accidents into two [output] categories: with injuries (=1) or without injuries (=0). The error-prone classifier is the police classifier, which is prone to error. The error-free classifier (gold standard) is the non-police classifier, which does not miscategorize (e.g., hospital records of injury examination and vehicle-insurance records); the summary counts are displayed in Table 3. In this vehicle-accident example [1,17], the objective is to contrast the probability of injury among the four (4) accident groups shown in Table 3.
Recall from Expression (2) that the null hypothesis in this context is that the chances of injury among the four (4) accident groups are equal. Using the two (omnibus) tests developed in Section 3, the estimated probabilities of injury for the four groups are p̂1 = 0.175, p̂2 = 0.204, p̂3 = 0.028, and p̂4 = 0.046. The chi-square statistics and p-values for testing the equality of the four proportions are nWald (T1 = 220.67, p-value < 0.0001) and mWald (T2 = 143.98, p-value < 0.0001). Since the p-values are less than 0.05 (hence, at the 5% level of significance), we conclude that the vehicle-accident study provided enough statistical evidence in support of an association/relationship between the output (injury risk) and the (nominal) input variable with four categories (A = Male and High vehicle damage, B = Female and High vehicle damage, C = Male and Low vehicle damage, and D = Female and Low vehicle damage), because the risks of injury (0.175, 0.204, 0.028, and 0.046) are significantly different among the four groups (A, B, C, D). In other words, the four-category input variable matters in predicting injury risk. Since there were statistically significant differences among the four levels of the input variable, it is necessary to proceed to the second part/stage (of the two-stage sequential testing procedure) to determine which pair(s) contribute the proportion differences. In addition to the numeric results described above (test statistics, p-values, and probabilities of injury), this is also evidenced by the graphical results in Figure 1: visually, the heights (probabilities) of the histogram show significant differences among the four accident groups (A, B, C, D).
As anticipated, both the numeric and graphical versions of the results are consistent in showing significant differences among the four accident groups. These results conclude the first part/stage of the two-stage sequential testing procedure, and one can then proceed to the second part/stage. Note again that this second part/stage testing was already completed and demonstrated in the Rahardja (2020) manuscript [1]; as a summary recap, the significant differences came from the following pairs: AC, AD, BC, and BD.

Simulation
In this Simulation section, we note that no software package and/or library is readily available for the methodology developed in Section 3. Generally, a methodology is developed first, before the software; typically, after a decade or more, once the method has become popular, commercial or non-commercial software companies or individuals start building packages and/or libraries, not the reverse. Hence, we resorted to writing our own code in R software [18], version 4.1.2. Since the methodology and algorithm developed in Section 3 are closed-form (unlike the non-closed-form algorithms formerly developed in the previous literature), they do not require intensive computing resources. Consequently, a regular laptop/desktop computer suffices; parallel supercomputers are not necessary. Here, we use a regular computer with the following specifications: a dual-core CPU with 4 GB of RAM and 1 TB of hard-drive storage; each simulation case described below takes only about 2 min to run.
Simulations were performed to evaluate and contrast the performance of the nWald and mWald tests under varying cases using the empirical Type I error rate. A two-sided nominal-level Type I error rate of α = 0.1 was used with g = 3 levels. Although not required by the two tests, for the sake of simplifying the simulations and the display of results, equal sample sizes N1 = N2 = N3 = N, substudy sample sizes n1 = n2 = n3 = n, and false-positive (over-reported) rates ϕ1 = ϕ2 = ϕ3 = ϕ were used in the simulations. For each simulation configuration, we generated K = 10,000 datasets.
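A simulation of this kind can be sketched as follows. This Python outline is our own illustration (the paper's simulations were written in R): the data-generating mechanism assumes the one-error-type model Pr(F=1 | T=1) = 1, Pr(F=1 | T=0) = ϕ, and the plug-in variance is a Delta-method approximation we derive here, not necessarily the paper's exact σi^2.

```python
import math
import random

CHI2_DF2_90 = -2.0 * math.log(0.10)  # 90th percentile of chi-square, 2 df (g = 3)

def draw_group(m, n, p, phi, rng):
    """Generate doubly sampled counts (x, n11, n01) for one group."""
    pi = p + phi * (1.0 - p)                      # Pr(F = 1)
    x = sum(rng.random() < pi for _ in range(m))  # original study: F only
    n11 = n01 = 0
    for _ in range(n):                            # validation substudy: (T, F)
        t = rng.random() < p
        f = t or (rng.random() < phi)
        n11 += t and f
        n01 += (not t) and f
    return x, n11, n01

def empirical_type1(g=3, N=300, n=100, p=0.5, phi=0.1, K=2000, seed=42):
    """Fraction of K null datasets (equal p across groups) rejected by an
    nWald-type statistic at alpha = 0.10 (illustrative sketch)."""
    rng = random.Random(seed)
    m = N - n
    reject = 0
    for _ in range(K):
        est, var = [], []
        for _ in range(g):
            x, n11, n01 = draw_group(m, n, p, phi, rng)
            ndot1 = n11 + n01                 # assumed > 0 for these settings
            lam = n11 / ndot1
            pi_hat = (x + ndot1) / N
            p_hat = pi_hat * lam
            # Delta-method plug-in variance (our sketch, not the paper's formula):
            v = lam**2 * pi_hat * (1 - pi_hat) / N + pi_hat * lam * (1 - lam) / n
            est.append(p_hat)
            var.append(max(v, 1e-12))
        w = [1.0 / v for v in var]
        pooled = sum(wi * e for wi, e in zip(w, est)) / sum(w)
        t1 = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, est))
        reject += t1 > CHI2_DF2_90
    return reject / K
```

With p = 0.5 and N = 300 the empirical rate should land near the nominal 0.10, mirroring the well-behaved scenario described below.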

Type I Error Rate
First, the performance of the two tests was investigated in terms of the Type I error rate by varying the total sample sizes. The upper two panels in Figure 2 display the Type I error rate of both tests versus N for p = 0.1 and p = 0.5, respectively. In the upper-left panel (p = 0.1), the binomial distributions are skewed and hence not sufficiently well behaved (the distribution is not symmetric); in this scenario, the two tests are not anticipated to perform well for relatively small samples. Nevertheless, the mWald test is conservative (its curve stays close to the nominal 10% horizontal line) for all sample sizes studied. Both tests have close-to-nominal Type I error rates when the sample size N is greater than 300 (i.e., as the sample size exceeds 300, the curves approach the nominal/horizontal line asymptotically). On the contrary, it is well known that for p = 0.5 (upper-right panel) binomial distributions are quite symmetric around their means, and therefore we anticipate both tests to perform well in this scenario. It is not surprising that the upper-right panel of Figure 2 shows both tests having close-to-nominal Type I error rates at all sample sizes (both curves stay near the nominal level, the horizontal line, across all sample sizes on the graph). The Type I error rate of the mWald test is consistently better than that of the nWald test (the curves consistently show the mWald test with smaller Type I error rates across all sample sizes studied). Second, we examine the performance of the two tests in terms of Type I error by varying p, with the false positive rate fixed at ϕ = 0.1. Figure 2. Type I error rate versus total sample size N for the upper two panels and versus the true proportion parameter p for the lower two panels.
In the upper two panels of Figure 3, we graph the Type I error rate of both omnibus tests versus r for N = 200 and N = 500, respectively. When N = 200, the upper-left panel of Figure 3 demonstrates that the nWald test (solid-line curve) has inflated Type I error rates for all r (the solid-line curve lies well above the nominal/horizontal line), especially when r is small. The nWald test has better coverage as r increases. On the contrary, the mWald test has close-to-nominal levels for all the r values studied here (the dashed-line curve is very close to the horizontal line). Similar observations can be made for N = 500, with both tests having better Type I error rates.
Fourth and finally, we assess the Type I error rate of the two tests by varying ϕ. The lower two panels in Figure 3 graph the Type I error rate of both tests versus ϕ for N = 200 and 500, respectively. For both N, the lower two panels of Figure 3 demonstrate that the nWald test has inflated Type I error rates (the solid-line curves lie above the horizontal line), while the mWald test has close-to-nominal levels for all the ϕ values studied here (the dashed-line curves almost overlap the horizontal line). The values of ϕ do not affect the Type I error rates for either test at either N (both curves are quite flat across all studied values of ϕ, as shown in the graph). Note that the above computer took 1.548624 min, using R version 4.1.2 software [18], to run the simulations displayed in Figure 3.


Power
We also examine the power of the two tests. Note that, theoretically, the binomial distribution is sufficiently well behaved when p is close to 0.5 (i.e., "fair-coin" behavior) and not sufficiently well behaved when p is further away from 0.5; this is why we selected the contrasting sets p = (0.1, 0.2, 0.3) and p = (0.4, 0.5, 0.6). For both sets of p and both r values, the nWald test is uniformly more powerful than the mWald test (on the graph, the solid-line curve lies uniformly above the dashed-line curve), especially when the sample size is small. For larger sample sizes, both tests have similar powers. However, note that when the sample size is small, the nWald test inflates the Type I error rate and therefore should not be used; hence the power comparison is not meaningful in those cases. Additionally, note that with the above computer specifications, the processing time in R software [18] is 1.904431 min for the simulation results displayed in Figure 4.

Summary Discussion, Limitations and Future Research
The summary is as follows. In this manuscript, we considered testing the relationship between a nominal input variable (with multiple levels or groups) and a binomial output variable in a doubly sampled framework with under-reported data. We began by deriving closed-form formulas for the MLEs of all the parameters via reparameterization. Subsequently, we constructed two omnibus tests (the nWald test and the mWald test) by applying Wald test procedures.
As an example to demonstrate the procedures, both tests were applied to the same vehicle-accident dataset used in the Rahardja (2020) paper [1], first introduced by Hochberg in 1977 [17]. Note that, to be deemed valid, most statistical methodologies developed and published historically are derived from asymptotic (large-sample) theory; this is why, in a standard university-level introductory statistics (Stat 101) course, as a general rule-of-thumb and simplified case, students are taught that a sample of 30 counts as large. Therefore, as anticipated by asymptotic (large-sample) theory, the two omnibus tests yielded similar results, owing to the large sample size of the Hochberg (1977) vehicle-accident dataset [17].
Note also that in small-sample situations, for example in the biology field or in a pilot study of a typical drug development, where practitioners often encounter small samples, the results should be interpreted with precaution. That is, if the results from the omnibus test and/or MCP test are not significant, it does not necessarily mean there is truly no significance; it could be due to an insufficient (small) sample size. Such a precaution should be noted in the labeling or study report. Subject matter experts (SMEs) in the field of study should then be consulted for the final decision for such a small-sample (biology or pilot) study: for example, a "Go/No-Go" decision on whether to proceed to the next-phase (large-sample) study by spending more funds to recruit more patients/samples (with the noted risk or precaution), or to terminate the study because it is not promising.
Simulation studies were performed to assess the empirical Type I error rate of both tests under various simulation configurations. Since the two tests were developed based on large-sample theory, both methods were anticipated to perform well with large samples. This anticipation was affirmed, since the resulting Type I error rates were close to the nominal level for large samples. When the sample size is relatively small and p is close to 0 or 1, the nWald test yielded a much more inflated Type I error rate, while the mWald test preserved the Type I error rate. Hence, the mWald test is recommended over the nWald test.
The limitations are as follows. When the sample size is relatively small and p is close to 0 or 1 (and, to some degree, even in larger samples), both methods can still deviate from the nominal Type I error rate, not only because of the asymptotic approximation but also because both tests can be conservative, i.e., slow to reject the null hypothesis (Figure 2). In standard/built-in statistical software packages such as R, SAS, etc., a continuity adjustment is available for the chi-square test; it is most useful for small sample sizes, to prevent overestimation of statistical significance for small data, and is chiefly applied when at least one cell of the table has an expected count smaller than five (5). Unfortunately, such a correction may tend to overcorrect. The use of the continuity adjustment is somewhat controversial, and it can produce an overly conservative result that causes the chi-square test to fail to reject the null hypothesis when it should (i.e., it is too conservative) for small sample sizes.
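The continuity adjustment mentioned above is easy to illustrate on a 2 × 2 table. The following is the generic textbook (Yates-adjusted) Pearson chi-square formula in Python, unrelated to the paper's own R code, with made-up counts:

```python
def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].
    With yates=True, |ad - bc| is reduced by n/2 (floored at 0) before
    squaring -- the continuity adjustment, which shrinks the statistic
    and so makes the test more conservative in small samples."""
    n = a + b + c + d
    num = abs(a * d - b * c)
    if yates:
        num = max(num - n / 2.0, 0.0)
    return n * num**2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For the table [[10, 20], [20, 10]], the uncorrected statistic is 20/3 ≈ 6.67, while the Yates-adjusted statistic drops to 5.4, illustrating how the adjustment can tip a borderline small-sample result from rejection to non-rejection.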
Finally, in this manuscript, as the first part of a two-stage sequential test (with over-reported data), we have contributed the development of two omnibus tests (first stage) for the statistical hypotheses in Expression (2), with an easy-to-implement, closed-form algorithm. Indeed, it is our closed-form algorithm via reparameterization that enabled us to derive such omnibus tests for the g ≥ 3 scenario with under-reported (or, WLOG, over-reported) data, which previously other researchers were not able to do. If the test in Expression (2) is not rejected, then we declare that there is no proportion difference among the g groups. Otherwise, if the test is rejected, then we may want to proceed to the second stage (of the two-stage sequential test) to find out which pair(s) contribute the difference. In that case, the Rahardja pairwise comparison method developed in 2020 [1], also known as pairwise Multiple Comparison Procedures (MCP) with over-reported adjustment incorporated via a doubly sampled framework, can further be applied. Note also that, without incorporating any misclassification (under-reported or over-reported) into the statistical model, such MCP tests are just the standard/classical procedures in the literature, as prescribed by Hsu in 1996 [19], Edwards and Hsu in 1983 [20], Hsu and Edwards in 1983 [21], and Hsu in 1982 [22]. Moreover, for future research, one can consider adding multiple covariates (with multiple levels, too) as input variables, in the presence of under-reported or over-reported (output) data. Examples of such multiple-level (input) covariates include: age category (below 25, 25-40, 40-60, above 60), height category, body-mass-index category, drinking category (never, sometimes, often), etc.

Conclusions
In conclusion, we have completed a two-stage sequential testing development for multiple binomial proportions using a doubly sampled framework with only under-reported or over-reported data. In this manuscript, we derived two omnibus tests for under-reported-only or over-reported-only data; these omnibus tests function as the first stage of the two-stage sequential test. Additionally, the pairwise MCP test, which functions as the second stage of the two-stage sequential test (with over-reported or under-reported data), was already developed by Rahardja in 2020 [1].