Adaptive Multiple Testing Procedure for Clinical Trials with Urn Allocation

This work combines Urn allocation and the O'Brien and Fleming multiple testing procedure to compare two treatments in clinical trials in a novel way. It is shown that this approach overcomes the constraints that previously made it challenging to apply the original adaptive design to clinical trials. The method provides unique flexibility, enabling trials to be stopped early if one treatment shows superiority, without compromising the efficiency of the original multiple testing procedure in terms of type I error rate and power. Experimental data and simulated case examples are used to illustrate the efficacy and robustness of this original approach and its potential for usage in a variety of clinical trial settings.


Introduction
Over the past decade, the Food and Drug Administration (FDA) has shown concern regarding new challenges faced in conventional experiments. These challenges include difficulty in finding eligible participants, late-stage drug failures, and an increase in the cost of confirmatory trials. These issues have led to a collapse in the drug industry due to economic and ethical factors. In 2004, the FDA reported a decline in medical products submitted for approval as part of the Critical Path Initiative (CPI). To combat this issue, the FDA proposed that researchers switch from classical designs to adaptive designs [1,2]. Adaptive designs are more flexible and allow interim analyses to be conducted periodically. These tests perform an analysis at the end of each stage and provide the flexibility to end the trial early. Classical designs, however, use the entire sample within a single stage [1].
The European Medicines Agency (EMA) published draft guidelines on various aspects of clinical trials in December 2016 [3]. In January 2017, the Center for Biologics Evaluation and Research (CBER) in the US FDA issued guidance on multiple endpoints in clinical trials; the report was intended for industry and explained the importance of multiple testing procedures and how alpha should be controlled to prevent type I errors from arising when multiple analyses are conducted [2,4]. In 2018, the FDA published "Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry," illustrating the importance of using these designs rather than conventional ones with a single primary endpoint [5].
Various testing procedures have been recommended and generalized based on previous recommendations for clinical trials, mainly based on multiple endpoints, since these align with the principles of group sequential design. Another relevant contribution came from Zhenming Shun, who investigated the effects of combining group sequential analysis, sample size re-estimation, and negative stopping (stochastic curtailment) in a single interim analysis focusing on normal data. These modifications offer valuable refinements and expansions to the existing multiple-testing procedures [22][23][24].
Additionally, Urach and Posch's contributions to multi-arm sequential designs with the simultaneous stopping rule have brought advancements to the critical boundaries in multi-arm experiments. Their innovative methodologies have improved multi-arm studies' efficiency, reliability, and decision-making processes, leading to more robust and effective clinical research outcomes [25].
Returning to the basis of this work, O'Brien and Fleming (1979) and Pocock (1977) have served as significant inspirations for the modifications made to group sequential tests [13,26]. Their procedure incorporates type I error control and statistical power, resembling a Chi-square test in the case of a one-stage design. However, it also allows for the early termination of a trial if the treatment demonstrates valuable results, achieved through the utilization of critical values.
Researchers in various studies have leveraged the critical values proposed by O'Brien and Fleming. For instance, Hammond et al. [27] have shown that treating COVID-19 patients with nirmatrelvir plus ritonavir early in their illness can significantly decrease progression to severe disease; this treatment has also been proven to quickly reduce the SARS-CoV-2 viral load in patients. These findings highlight the importance of early intervention and provide hope for better outcomes in the fight against COVID-19. Similarly, Goldberg et al. [28] conducted a study to compare three different chemotherapy treatments for metastatic colorectal cancer, using O'Brien and Fleming's critical values to monitor the results. The study found no significant difference in overall survival between the three treatments, but two of the treatments, FOLFOX (folinic acid, fluorouracil, oxaliplatin) and FOLFIRINOX (folinic acid, fluorouracil, irinotecan, oxaliplatin), had higher response rates and longer progression-free survival than the third treatment, FOLFIRI (folinic acid, fluorouracil, irinotecan). Additionally, Marcus et al. [29] utilized O'Brien and Fleming's methodology to conclude early the gallium trials involving treatment-naive follicular lymphoma patients.
These applications highlight the broad impact and versatility of O'Brien and Fleming's group sequential testing procedure in the realm of clinical trials, facilitating more efficient decision-making and potentially reducing patient exposure to ineffective or harmful treatments.
The critical values are subsequently used to assess the significance of observed differences between treatment groups and to draw statistical inferences about the treatment effects. Critical values play important roles in the design, analysis, and interpretation of clinical trial results.
Hammouri addressed an important issue in 2013 regarding the stopping bounds of the O'Brien and Fleming procedure, particularly their non-monotonic behavior. To overcome this challenge, Hammouri increased the number of simulations to generate critical values exhibiting monotonic behavior. This adjustment was necessary because critical values should progressively increase with the addition of more interim periods in a clinical trial. Ensuring that critical values follow this pattern makes the control of type I error more robust [30].
In recent years, the statistical method of urn allocation has gained attention for its potential use in clinical trial design. This method involves randomly assigning treatment allocations to patients based on the probability of success for each treatment arm. Compared to traditional fixed allocation designs, urn allocation has been shown to provide greater flexibility and efficiency. However, its use in clinical trials has been limited due to concerns about maintaining the type I error rate and power [31,32].
Randomization is crucial for controlling bias in treatment comparisons. Complete randomization, which is similar to tossing a coin, can minimize or even eliminate bias. However, in small or moderate-sized trials, complete randomization may lead to a significant imbalance in the number of patients assigned to each group. In fact, according to expert advice [33,34], complete randomization in single-stratum trials with a target sample size of less than 200 is not recommended. There have been various proposed rules for restricted randomization, such as the permuted-block design [35], the biased-coin design and adaptive biased-coin design [36,37], and the urn design [38].
The urn design is a randomization method that ensures balance in small trials while behaving like complete randomization in larger trials. Treatment assignments within a sequence generated by the urn design are therefore less predictable than those from other restricted randomization procedures, which reduces the risk of selection bias. In summary, the urn design has unique properties that promote balance and reduce bias.
In this study, we present a framework that allows for a more effective comparison of two treatments. The framework integrates the O'Brien and Fleming testing procedure with Urn allocation, resulting in a flexible procedure for effectively detecting differences between treatments. More adaptive features have been incorporated into the procedure, further enhancing its performance. Examples of implementing the new procedure are also provided.

Methodologies
In this section, we outline the initial protocols for carrying out clinical trials: the O'Brien and Fleming procedure, as well as the Urn allocation. We then introduce our novel procedure, the Urn Multiple Testing Procedure (UMP), which blends the Urn allocation technique with the O'Brien and Fleming procedure to improve the accuracy of treatment effect assessments while keeping a check on the total type I error rate. Lastly, we delve into the process of evaluating the type I error and statistical power.

The Original O'Brien and Fleming Procedure
The data are tested and reviewed periodically, with n1 and n2 subjects receiving treatment 1 and treatment 2, respectively, over K stages and N = K(n1 + n2), where N is the maximum number of subjects. We use the usual Pearson Chi-square statistic χ²(i), the test size α, and P(K, α), which is O'Brien and Fleming's critical value. At any stage i, if (i/K) χ²(i) ≥ P(K, α), the null hypothesis is rejected, and the experiment is terminated. Otherwise, if the test statistic does not exceed P(K, α), the subsequent subjects are randomized, their measurements are observed, and the steps are repeated. After completing K tests, if (i/K) χ²(i) never exceeds P(K, α), the study is terminated by concluding that the hypothesis of no difference cannot be rejected at the α significance level.
For further details about computing and estimating the original and corrected critical points, see [30,39].
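As a rough illustration of the decision rule above, the following Python sketch simulates a K-stage two-arm trial with binary outcomes and applies the (i/K) χ²(i) ≥ P(K, α) test at each stage. The critical value used below is a placeholder, not one of the published O'Brien and Fleming values, and the function names are our own.

```python
import random

# Placeholder critical value; the actual P(K, alpha) values come from the
# O'Brien and Fleming tables (corrected versions discussed in [30,39]).
CRIT = 4.16  # hypothetical stand-in for P(K = 5, alpha = 0.05)

def chi_square_2x2(s1, f1, s2, f2):
    """Pearson Chi-square statistic for a 2x2 success/failure table."""
    n = s1 + f1 + s2 + f2
    den = (s1 + f1) * (s2 + f2) * (s1 + s2) * (f1 + f2)
    return n * (s1 * f2 - f1 * s2) ** 2 / den if den else 0.0

def obrien_fleming_trial(p1, p2, K, n_stage, crit):
    """K-stage trial: at stage i, reject H0 when (i / K) * chi2 >= crit."""
    s = [0, 0]  # cumulative successes per arm
    f = [0, 0]  # cumulative failures per arm
    for i in range(1, K + 1):
        for arm, p in enumerate((p1, p2)):
            succ = sum(random.random() < p for _ in range(n_stage // 2))
            s[arm] += succ
            f[arm] += n_stage // 2 - succ
        if (i / K) * chi_square_2x2(s[0], f[0], s[1], f[1]) >= crit:
            return ("reject H0", i)  # early termination
    return ("fail to reject H0", K)
```

With a clear treatment difference, such a trial typically stops well before stage K, which is the source of the design's sample-size savings.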

Urn Allocation
In a clinical trial comparing two treatments (treatment 1 and treatment 2) with binary responses (success and failure), patients are enrolled sequentially, and their responses to the treatments are observed immediately. Urn allocation was given in Zelen [40], Wei and Durham [41], and Ivanova [42], where δ is simply the limiting fraction of patients on treatment 1: δ(p1, p2) = (1 − p2)/(2 − p1 − p2) = q2/(q1 + q2), where pi represents the success probability of a patient on treatment i and qi = 1 − pi represents the failure probability, for i = 1, 2. This study design, famously known as the play-the-winner (PW) rule, was introduced by Zelen in 1969. According to this rule, if a patient experiences success with a specific treatment, the subsequent patient receives the same treatment, while if a patient's treatment is a failure, the next patient is assigned to the alternate treatment. Let N_{n,k} be the number of subjects assigned to treatment k in the first n assignments, and N_n = (N_{n,1}, . . ., N_{n,K}). The statistical behavior of the proportions N_{n,k}/n, k = 1, . . ., K, is of interest, with asymptotic variance σ²_PW = q1 q2 (p1 + p2)/(q1 + q2)³. If treatment 1 is better, the PW rule favors treatment 1 [43].
Another researcher reached the same result with more details [44]: under an adaptive design, the probability that the superior treatment is allocated to each patient is more than 0.5. Although its demonstration may be difficult in general, they showed it for the deterministic play-the-winner design (Zelen [40]). Under this design, the first patient is allocated to either treatment by simple randomization. Then the same treatment is applied after a success, and the other treatment is used after a failure. Assuming clinical equipoise (Freedman [45]), the principle of interchangeability is satisfied when the initial patient is enrolled.
Suppose the deterministic play-the-winner rule is followed. Let Δ = p1 − p2 and K = p1 + p2. To avoid triviality, assume that K ≠ 2. The first allocation probability is P1 = 1/2, and for any integer n ≥ 0, the subsequent probabilities of allocation to treatment 1 follow by mathematical induction. This sequence of allocation probabilities has the following properties.
Properties 1, 2, 4, and 5 say that the probability of using the superior treatment is more than 50%. Property 3 indicates that p is simply the asymptotic fraction of patients on treatment 1 (here p is the same as δ). This outcome aligns with the findings of Zelen [40] and Wei and Durham [41]. Property 6 shows that if p1 (or p2) is sufficiently large, treatment 1 (or 2) will eventually be identified as the superior treatment. On the other hand, if there is no difference between the two treatments, we eventually randomize between them. Property 7 assumes that p2 = 1 − p1. Suppose that p1 > 0.5. If we call it a "success" when treatment 1 is allocated and a "failure" when treatment 2 is used, we essentially have a binomial experiment with p1 as the probability of success.
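The play-the-winner allocation and the limiting fraction δ = q2/(q1 + q2) can be checked empirically with a short simulation. This is a sketch with hypothetical success probabilities; `play_the_winner` is our own helper, not code from the cited works.

```python
import random

def play_the_winner(p1, p2, n, seed=0):
    """Deterministic play-the-winner rule: keep the same treatment after a
    success, switch after a failure; returns the fraction of the first n
    patients assigned to treatment 1."""
    rng = random.Random(seed)
    probs = (p1, p2)
    arm = rng.randrange(2)              # first patient by simple randomization
    on_treatment_1 = 0
    for _ in range(n):
        on_treatment_1 += (arm == 0)
        if rng.random() >= probs[arm]:  # failure -> switch treatments
            arm = 1 - arm
    return on_treatment_1 / n

# Limiting fraction delta = q2 / (q1 + q2) from the text,
# with hypothetical success probabilities p1 = 0.7 and p2 = 0.4
p1, p2 = 0.7, 0.4
delta = (1 - p2) / ((1 - p1) + (1 - p2))   # = 0.6 / 0.9
frac = play_the_winner(p1, p2, 200_000)    # empirical fraction, close to delta
```

Because treatment 1 is the better arm here, more than half of the patients end up on it, in line with the properties discussed above.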

The Proposed Method: Urn Multiple Testing Procedure (UMP)
Our framework combines Urn allocation with the O'Brien and Fleming procedure, enabling researchers to monitor and analyze data collected at each interim analysis stage. This empowers them to make informed decisions based on predefined stopping rules and efficacy boundaries. Our procedure integrates the strengths of Urn allocation and the O'Brien and Fleming procedure, resulting in a comprehensive and sophisticated methodology that enhances the reliability and efficiency of studies. By leveraging these methods, our procedure leads to more meaningful and accurate results, ultimately providing a valuable contribution to the field.

The New Methodology
The new procedure involves periodically reviewing and testing the collected data in K stages, each with a designated sample size, denoted as n_i, where i = 1, . . ., K. The stage sample sizes are determined using equal allocation across stages.
To carry out the Urn allocation method, we split the stage sample size n_i into two portions, n_i1 and n_i2, which are assigned to treatment 1 and treatment 2, respectively. This allocation guarantees that the sum of n_i1 and n_i2 equals the overall stage sample size n_i.
The equal allocation method is used to determine the subsample sizes in the first stage; this entails dividing the first-stage sample size n_1 evenly into two subgroups. The Urn allocation method is employed in subsequent stages to determine the subsample sizes. This procedure takes into account data from previous stages and adjusts the allocation based on observed treatment outcomes. By using this method, more participants are allocated to the treatment with a greater chance of success, resulting in increased study efficiency.
The O'Brien and Fleming procedure is used in the remaining steps to determine whether to continue or stop testing. This process involves comparing the observed test statistic, usually the Pearson Chi-square χ²(i), with the corrected critical values provided in Table 1. These values set the thresholds for statistical significance at each study stage.
If the observed test statistic exceeds the critical value, indicating a significant treatment effect, the trial may be terminated early with the conclusion that the treatment is effective. Conversely, if the observed test statistic falls below the critical value, suggesting no significant treatment effect yet, the trial continues and additional data are collected to gather more evidence.

1. To begin, consider the overall sample size (N) and the number of stages involved (K). It is important to note that N should be an even number, as equal allocation may be used if Urn allocation fails, and this requires even sample sizes across stages.

2. Each stage sample size is then found by calculating n_i = N/K.

3. For i = 1, the first-stage sample size is n_1, and equal allocation is used to compute the subsamples, where n_11 = n_12 = n_1/2.

4.
For i = 2, . . ., K, the subsamples are given by n_i1 = round(δ_i · n_i), with δ_i = q_{i−1,2}/(q_{i−1,1} + q_{i−1,2}), where q_{i−1,j} = 1 − p_{i−1,j} and p_{i−1,j}, for j = 1, 2, are the success rates over the cumulative previous stages for treatment 1 and treatment 2, respectively, and n_i2 = n_i − n_i1. In cases where n_i1 or n_i2 equals zero, equal allocation is used instead.

5.
Then, in each stage i, starting from the first one, subjects are randomized, and their measurements are observed. Each subsample is added to the previous subsamples for the same treatment. (i/K) χ²(i) is calculated and compared to P(K, α). i.
If (i/K) χ²(i) ≥ P(K, α), then the study ends, and the null hypothesis is rejected. Otherwise, if (i/K) χ²(i) < P(K, α) and i < K, the procedure proceeds to the next stage. ii.
If (i/K) χ²(i) < P(K, α) and i = K, the study is terminated, and the null hypothesis fails to be rejected.
The methodology for the new procedure is graphically illustrated in the following flow chart, Figure 1.
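The steps above can be sketched in Python as follows. This is a minimal illustration of the UMP logic under our own naming and simplifying assumptions: binary responses are generated in one batch per arm per stage, and the fallback to equal allocation also covers the degenerate case where an urn subsample would absorb the whole stage.

```python
import random

def ump_trial(p1, p2, N, K, crit, seed=0):
    """Urn Multiple Testing Procedure (sketch).

    Stage 1 uses equal allocation; stages 2..K allocate treatment 1 the
    urn fraction q2 / (q1 + q2) computed from cumulative failure rates.
    Rejects H0 at stage i when (i / K) * chi2 >= crit."""
    rng = random.Random(seed)
    n_stage = N // K
    s = [0, 0]  # cumulative successes per arm
    f = [0, 0]  # cumulative failures per arm
    for i in range(1, K + 1):
        if i == 1:
            n1 = n_stage // 2                       # equal allocation
        else:
            q = [f[j] / (s[j] + f[j]) for j in (0, 1)]
            if q[0] + q[1] == 0:
                n1 = n_stage // 2
            else:
                n1 = round(q[1] / (q[0] + q[1]) * n_stage)
            if n1 == 0 or n1 == n_stage:            # fall back to equal split
                n1 = n_stage // 2
        sizes = (n1, n_stage - n1)
        for arm, p in enumerate((p1, p2)):
            succ = sum(rng.random() < p for _ in range(sizes[arm]))
            s[arm] += succ
            f[arm] += sizes[arm] - succ
        n = s[0] + s[1] + f[0] + f[1]
        den = (s[0] + f[0]) * (s[1] + f[1]) * (s[0] + s[1]) * (f[0] + f[1])
        chi2 = n * (s[0] * f[1] - f[0] * s[1]) ** 2 / den if den else 0.0
        if (i / K) * chi2 >= crit:
            return ("reject H0", i, n)              # early stopping
    return ("fail to reject H0", K, s[0] + s[1] + f[0] + f[1])
```

The third returned value is the number of subjects actually used, which is where early stopping saves sample size.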

Type I Error and Power
The purpose of this section is to prove the accuracy of the procedure by examining the new type I error and power. Monte Carlo simulation will be used to assess the effectiveness of the procedure. This method helps us to examine the power and the error rate for conducting the new procedure, UMP, and compare it to the original procedure. With Monte Carlo simulations, we can demonstrate how well the UMP performs in detecting significant effects and ensure that it maintains the same type I error rate, which upholds statistical standards.

Type I Error Procedure
A simulation written in SAS (Statistical Analysis System, version 9.4 (SAS Institute Inc., Cary, NC, USA)) was used to calculate type I errors to evaluate the new procedure. The simulations take in a range of success probabilities (p = 0.1, 0.2, 0.3, 0.4, 0.5), test sizes (α = 0.01 and 0.05), and critical values (P(K, α)) for K values ranging from 1 to 5.
By conducting these simulations, the performance of the UMP under different scenarios can be evaluated. The varying success probabilities allow an exploration of the procedure's robustness and adaptability across a wide range of conditions, as do the different test sizes and critical values.

In order to determine whether the new procedure is effective at rejecting the null hypothesis, i.e., detecting a significant difference between groups, we create multiple subsamples from a single binomial distribution with p = 0.1, 0.2, 0.3, 0.4, and 0.5, with sample sizes N = 120, 300, and 720 for α = 0.01 and N = 120, 360, and 720 for α = 0.05. Equal success rates were used (ensuring that no difference can be found and H0 is true). This process is repeated 500,000 times, and the percentage of times H0 is rejected is calculated to determine the type I error rate.

We created Figure 2 to visually represent the process of determining the type I error for the UMP.

Power Procedure
The SAS program was employed to determine the power values for a specific analysis. In the initial step, a probability value, P1, was selected as 0.1, along with success probability values, P2, of 0.15, 0.2, 0.25, and 0.3. The program executed 500,000 iterations with significance levels α = 0.01 and α = 0.05. The analysis incorporated the corrected O'Brien and Fleming critical values, P(K, α). Using the conventional Chi-square test, sample sizes were determined to achieve power values around 0.8.
For α = 0.05, the sample sizes were 1380, 396, 200, and 120, while for α = 0.01, the sizes were 2040, 600, 300, and 180, to ensure a significant difference. Two subsamples were generated for each case of K = 1, . . ., 5 from different binomial distributions with different means. The new procedure was used to test whether the null hypothesis (H0), indicating that there is no difference between the two groups, could be rejected in favor of the alternative hypothesis (Ha). The entire process was repeated 500,000 times to calculate the proportion of rejections of H0, representing the power rate, given that Ha was guaranteed to be true. The graphical representation of the power computation process can be found in Figure 2.
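A scaled-down version of this Monte Carlo evaluation can be sketched as follows. For brevity it uses equal allocation at every stage rather than the full UMP urn step, and far fewer than 500,000 iterations; with p1 = p2 the returned rate estimates the type I error, and with p1 ≠ p2 it estimates the power. All names and default values are ours.

```python
import random

def rejection_rate(p1, p2, N, K, crit, n_sims=2000, seed=0):
    """Monte Carlo estimate of the staged test's rejection rate.

    With p1 == p2 the estimate is the type I error; with p1 != p2 it is
    the power.  Equal allocation is used at every stage for brevity
    (the full UMP switches to Urn allocation after stage 1)."""
    rng = random.Random(seed)
    n_stage = N // K
    rejections = 0
    for _ in range(n_sims):
        s = [0, 0]
        f = [0, 0]
        for i in range(1, K + 1):
            for arm, p in enumerate((p1, p2)):
                succ = sum(rng.random() < p for _ in range(n_stage // 2))
                s[arm] += succ
                f[arm] += n_stage // 2 - succ
            n = s[0] + s[1] + f[0] + f[1]
            den = (s[0] + f[0]) * (s[1] + f[1]) * (s[0] + s[1]) * (f[0] + f[1])
            chi2 = n * (s[0] * f[1] - f[0] * s[1]) ** 2 / den if den else 0.0
            if (i / K) * chi2 >= crit:
                rejections += 1   # early or final-stage rejection
                break
    return rejections / n_sims
```

Running it once with equal success rates and once with different rates gives rough empirical counterparts of the type I error and power tables discussed below.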

Results
This section presents the results of our analysis of the type I error and power levels for the proposed procedure. Our findings show that the type I error rate was acceptable, indicating that the procedure rarely rejected the null hypothesis when it was true. Additionally, the power of the test was high, indicating that we correctly rejected the null hypothesis when the alternative was true. These results demonstrate the validity and reliability of our proposed procedure and provide confidence in the accuracy of our findings.

Type I Error Results
After conducting an extensive analysis of various sample sizes, it was observed that the results remained consistent irrespective of the sample size. Furthermore, when comparing these results to the conventional Chi-square test, it was found that the modified procedure reduced type I error values. This reduction became more pronounced as the value of K increased.
The rationale behind this trend lies in the fact that as K increases, the Chi-square statistic is multiplied by a factor that is less than one. Consequently, the resulting value is compared with an increasingly larger critical value. As a result, it becomes more difficult to reject the null hypothesis H0. In other words, the larger the value of K, the stronger the evidence required to reject the null hypothesis.
The results for sample sizes 360 and 300 are presented in Table 2, with significance levels of 0.05 and 0.01, respectively. In the first case, where the sample size was 360 and the significance level was 0.05, we observed that the type I error ranges from 0.04907 to 0.050808 when K = 2. For K = 5, the type I error values range from 0.048838 to 0.05047, which are very close to the significance level of 0.05. This improvement in error control over the usual Chi-square procedure is considered acceptable. The type I error showed similar patterns for the same significance level (alpha) with sample sizes of 120 and 720: for the sample size of 120, the type I error ranges from 0.042836 to 0.051654, and for the sample size of 720, from 0.049216 to 0.05063.
When using a significance level of α = 0.01, the values range between 0.008882 and 0.009922 in the second stage and between 0.008866 and 0.009976 in the last stage. Importantly, these values remain below 0.01, which is considered satisfactory, as the errors do not exceed this threshold.
Furthermore, we calculated the type I error values for sample sizes 120 and 720 using a significance level of α = 0.01. The results indicated that the new procedure performs appropriately in terms of type I error. Specifically, for a sample size of 120, all type I error values fall within the range of 0.008034 to 0.009992. Similarly, for a sample size of 720, the type I error values range from 0.009494 to 0.010064.
Tables 2-4 provide a comprehensive overview of the remaining type I error values for different sample sizes and significance levels, allowing for a thorough examination of the procedure's performance in various scenarios. Figure 3a,b shows the type I error values from the 500,000 simulations.

Power Results
As a result, it was noticed that the initial power values with K = 1 and α = 0.05 were between 0.80346 and 0.81579. Moreover, the values showed a decreasing behavior as the value of K increased: the power values with α = 0.05 and K = 5 were between 0.78067 and 0.79628, with marginal errors of less than 0.01933 between the power values with K = 5 and 0.8.
Power values with an α of 0.01 and a K value of 1 range from 0.80087 to 0.81310. Upon further analysis, we found again that the power values decreased as K values increased. Specifically, when K was equal to 5, the power values ranged from 0.78204 to 0.8030, with marginal errors of less than 0.01796 between the power values with K = 5 and 0.8. In both cases, the marginal error is very small and can be overlooked compared to the benefits gained from using this method.
The remaining values and power values behaviors have been presented in Table 5 and visually represented in Figure 4a,b.

Calculating Rejection Rates for Each Stage
In this section, our focus was to determine at which stage the rejection occurs by calculating the rejection rates and identifying the sample size required to conclude the rejection of the null hypothesis (H0).

Calculating Rejection Rates for Each Stage When the Difference Is Present
With a standard power of 0.8 and considering different probabilities of success (0.1 and 0.15) with an α = 0.05 threshold, a total sample size of 1380 was determined. Using 500,000 iterations at each stage (i), the results were summarized in Table 6 and illustrated in Figure 5.


For K = 2, the majority of rejections happened in the second stage (74.3%), indicating that the entire sample size was required to reject the H0 hypothesis in 25.7% of the cases.
Moving on to K = 3, it was observed that a sample size of 920 was needed to reject the H0 hypothesis in the second stage. This resulted in a 50.9% rejection rate, which was the highest rejection rate among the three stages. The highest rejection rate for K = 4 occurred in the third stage, with a rate of 45%, the sample size required to reject the H0 hypothesis being 1036. Lastly, for K = 5, the highest rejection rate was 35%, observed in the fourth stage, with a sample size of 1104.
Based on these findings, it can be concluded that the proposed procedure effectively minimized the necessary sample size for statistical significance, leading to a significant decrease in expenses and effort. This outcome indicates the efficiency and practicality of the suggested procedure. Table 6 is recapped in Figure 5 to summarize the results.

Calculating Rejection Rates for Each Stage When the Difference Is Not Present
The rejections were calculated with α = 0.01, and the sample size was 360. Based on 500,000 iterations, Table 7 and Figure 6 illustrate the required sample size and the number of rejections at stage K.
Moreover, it needs to be noted that these percentages are out of the 5% rejection rate. The decision rules for this multiple testing procedure are nearly identical to the usual Chi-square one-stage procedure in the absence of early termination when H0 is true.

Examples

Example 1: Computational Example
We conducted a trial on simulated data of 1200 patients divided into two groups (600 each). Group 1 data were simulated using a binomial distribution with a success rate of 0.25, and group 2 data were simulated using a binomial distribution with a success rate of 0.55.
The UMP was used to find the significant association between the treatment in use and success for K = 1 to 5 and an alpha of 0.05. We analyzed the data and present case five as part of our findings to illustrate the calculations.
For case five, the UMP was used with stage sample sizes n_i = 240 and a critical value of 4.1602 (from Table 1). For the first stage, n_1 = 240, and the subsamples were distributed equally, with n_11 = n_12 = 120. The Chi-square statistic was 8.6; multiplying the Chi-square value by one-fifth gave 1.72, which is not greater than the critical value, so we failed to reject the H0 hypothesis. For the second stage, with n_2 = 240, the cumulative subsamples were distributed using Urn allocation from the observed success rates p_{1,1} = 34/120 ≈ 0.28 and p_{1,2} = 56/120 ≈ 0.47. The Chi-square statistic was then 23.35; multiplying it by two-fifths gave 9.34, which is greater than the critical value, so we rejected the H0 hypothesis.
In total, the experiment was terminated after two stages with 480 participants out of 1200 patients due to significant differences between the two groups.
The complete information regarding each K is presented in Table 8. In this example, the trial can be terminated using only 400 of 1200 patients, as shown in Figure 7.
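The stage-one arithmetic of case five can be reproduced directly. The helper below is our own and simply evaluates the 2×2 Pearson Chi-square formula for the observed counts of 34/120 and 56/120 successes.

```python
def chi_square_2x2(s1, f1, s2, f2):
    """Pearson Chi-square statistic for a 2x2 success/failure table."""
    n = s1 + f1 + s2 + f2
    den = (s1 + f1) * (s2 + f2) * (s1 + s2) * (f1 + f2)
    return n * (s1 * f2 - f1 * s2) ** 2 / den

# Stage 1 of case K = 5: 34 successes out of 120 vs. 56 out of 120
chi2 = chi_square_2x2(34, 120 - 34, 56, 120 - 56)  # ~8.60, as in the text
stage_stat = (1 / 5) * chi2                        # ~1.72 < 4.1602, so continue
```

Since 1.72 does not exceed the critical value 4.1602, the sketch agrees with the text: the trial continues to the second stage.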


Example 2: Experimental Data Example
For this example, an experimental study was conducted by distributing a brief survey among university students in April 2023 at Jordan University of Science and Technology. The purpose of this study was solely illustrative, and no inferential conclusions were intended to be drawn from the data. The study aimed to demonstrate our methodology, which was designed to minimize the number of participants required for the experiment. The primary focus was to investigate the association between children's smoking habits and their parents' smoking history.
The survey data included 240 participants: 120 smokers and 120 non-smokers. Each participant was asked about their smoking status, and their parents' smoking histories were recorded for analysis. Data collection stopped once 120 participants were reached in each group; a total of 386 participants were sampled to reach this count.
Instead of utilizing the entire sample (i.e., K = 1), the UMP was employed to show that the same conclusion could be reached with fewer than 240 participants. The data were analyzed for K = 2, 3, 4, and 5 at an alpha of 0.05; in each scenario, the sample was divided into K stages and the UMP was applied.
To demonstrate the procedure, we detail the case K = 4, where significance was reached after two stages, with 120 of the 240 participants.
In detail, for the first stage with K = 4 we have n 1 = 60, with the subsample divided equally (n 11 = n 12 = 30). The Chi-square statistic was 8.5; multiplying it by one-fourth gave 2.13, which is not greater than the critical value of 4.0961, so we failed to reject H 0. For the second stage, n 21 and n 22 were recalculated using the Urn allocation.
The subsamples were n 21 = 46 and n 22 = 14, giving cumulative subsamples of 76 and 44 for group 1 and group 2, respectively. Two-fourths of the cumulative Chi-square statistic was 15.81, which is greater than the critical value, so we rejected H 0.
In this case, only 120 of the 240 patients were needed to end the experiment with a significant difference between the two treatments.
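The second-stage subsamples come from the urn. This excerpt does not spell out the paper's exact urn composition, so the sketch below uses the classic randomized play-the-winner updating (start with one ball per arm; each success adds a ball for its own arm, each failure adds a ball for the other arm) purely as an assumed stand-in, seeded with the first-stage counts from case five (34/120 and 56/120 successes). The function name and fixed seed are our own.

```python
import random

def urn_allocate_next_stage(n_next: int, s1: int, f1: int, s2: int, f2: int,
                            seed: int = 0) -> tuple:
    """Split the next stage's n_next patients between two arms by drawing
    from an urn built with randomized play-the-winner updating
    (illustrative; not necessarily the paper's exact urn parameters)."""
    balls1 = 1 + s1 + f2  # balls for arm 1: own successes + arm 2's failures
    balls2 = 1 + s2 + f1  # balls for arm 2: own successes + arm 1's failures
    p1 = balls1 / (balls1 + balls2)
    rng = random.Random(seed)
    n1 = sum(rng.random() < p1 for _ in range(n_next))  # draws with replacement
    return n1, n_next - n1

# Case five, first stage: 34 successes / 120 on arm 1, 56 / 120 on arm 2.
n21, n22 = urn_allocate_next_stage(240, 34, 86, 56, 64)
print(n21, n22)  # second-stage split of the 240 patients
```

The better-performing arm accumulates more balls, so later stages tend to assign it more patients, which is the ethical motivation for the urn discussed below.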
For more details regarding each K, see Table 9. Similarly, the trial in the second example can be terminated using only 80 of the 240 patients, as shown in Figure 8.

Discussion
Clinical trials aim to enhance efficiency by employing sequential procedures that incorporate multiple primary critical points. By considering multiple critical endpoints, trials can characterize treatment effectiveness comprehensively. Several forms of these procedures have been proposed to optimize the analysis.
O'Brien and Fleming proposed one of the earliest multiple-testing procedures in 1979. However, Hammouri [30] noted that the original critical values of this procedure required modification, based on 1,000,000 iterations, to ensure monotonic behavior. This study specifically combined the O'Brien and Fleming procedure with Urn allocation.

These modifications were introduced to improve the performance and reliability of the procedure when dealing with multiple primary critical values.By using Urn allocation, the procedure is tailored to address specific considerations and optimize the decision-making process during the clinical trial.
Overall, using multiple critical endpoints and modifying the O'Brien and Fleming procedure with Urn allocation are strategies employed to enhance the efficiency of clinical trials.These advancements aim to provide more robust and informative results, ultimately contributing to advancing medical research and patient care.
The decision to use Urn allocation in this study was motivated by ensuring that more subjects receive the most effective treatment. Allocating patients toward the effective treatment in clinical trials maximizes health outcomes, promotes equity, optimizes resources, and advances evidence-based practice. The urn design is a popular adaptive design that has been extensively researched. Such designs strike a balance between complete randomization and perfect balance in treatment assignments, thereby reducing experimental bias. The urn design forces a small trial toward balance but becomes increasingly randomized as the trial size grows, making it less susceptible to experimental bias than other restricted randomization procedures. In clinical trials, it may be difficult to assume that the study subjects represent a random sample from a uniform population; a randomization model provides a preferred basis for statistical inference in such instances.
In any new statistical procedure, the reduction in type I error is crucial as it ensures that false positive results are minimized, thereby enhancing the reliability of the findings.At the same time, maintaining power is essential to ensure that the study has sufficient sensitivity to detect true treatment effects.
To assess the performance of the proposed procedure in controlling type I error, a detailed analysis was conducted by considering alpha = 0.05 and 0.01 and various sample sizes.
In the first case, an alpha of 0.05 and a sample size of 360 were employed. The type I error results were promising: none exceeded 0.050808, and most values were under 0.05. To further investigate the procedure's performance, type I error rates were also calculated for several other sample sizes; all values fell below the 0.052 threshold.
In addition to the previous analyses, further investigations assessed the type I error values at an alpha level of 0.01. All type I error values were below the specified threshold of 0.011, and most were under 0.01, indicating successful control of type I errors. These findings emphasize the effectiveness of the proposed procedure: the calculated type I error rates consistently fell within acceptable ranges, and most were less than the desired alpha, affirming the reliability and validity of the modified procedure. By employing Urn allocation, the new procedure offered improved control over type I errors, enhancing the integrity of clinical trial outcomes and decision-making.
The type I error results were anticipated because, in any study with multiple endpoints, critical values are adjusted to maintain the overall type I error rate. A type I error occurs when a true null hypothesis is rejected. Adjusting the critical values makes it harder to reject the null hypothesis, reducing the chance of a type I error when H 0 is true.
Furthermore, multiple comparisons were made using different success rates and sample sizes to determine whether the proposed implementation maintained an acceptable rate of accepting the alternative hypothesis H a (power). The study considered success rates of 0.1, 0.15, 0.2, 0.25, and 0.3, with corresponding sample sizes chosen to give a power of approximately 0.8 under the Chi-square test. Despite a slight decrease in the values, the new implementation maintained acceptable power, with values greater than 0.7723 at the 0.05 alpha level and 0.78067 at the 0.01 alpha level, confirming the maintenance of satisfactory power levels.
These results demonstrate that, despite slight decreases, the proposed implementation effectively maintains acceptable power values for alpha levels 0.05 and 0.01 across various test sizes and sample sizes.The observed differences between the values remain within reasonable margins, indicating the reliability and robustness of the modified procedures.
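The Monte Carlo estimation of type I error and power described above can be sketched as follows. The stage structure and the scaled Chi-square rule come from the text; the success rates, per-stage sample size, iteration counts, and the equal (rather than urn-based) allocation in this sketch are simplifying assumptions of ours.

```python
import random

def chi2_2x2(s1: int, n1: int, s2: int, n2: int) -> float:
    """Pearson Chi-square for a 2x2 success/failure table."""
    f1, f2 = n1 - s1, n2 - s2
    n, row_s, row_f = n1 + n2, s1 + s2, f1 + f2
    if row_s == 0 or row_f == 0:
        return 0.0
    return n * (s1 * f2 - s2 * f1) ** 2 / (n1 * n2 * row_s * row_f)

def reject_rate(p1: float, p2: float, n_per_stage: int, k: int,
                critical: float, iters: int = 2000, seed: int = 1) -> float:
    """Monte Carlo estimate of the UMP rejection rate: type I error when
    p1 == p2, power when p1 != p2. Equal allocation is used here for
    brevity; the paper reallocates later stages via the urn."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(iters):
        s1 = s2 = m1 = m2 = 0
        for stage in range(1, k + 1):
            half = n_per_stage // 2
            s1 += sum(rng.random() < p1 for _ in range(half))
            s2 += sum(rng.random() < p2 for _ in range(half))
            m1 += half
            m2 += half
            # Scaled O'Brien-Fleming-style rule: reject and stop early.
            if (stage / k) * chi2_2x2(s1, m1, s2, m2) > critical:
                rejections += 1
                break
    return rejections / iters

# Illustrative calls (rates, sizes, and iteration counts are placeholders):
print(reject_rate(0.2, 0.2, 72, 5, 4.1602, iters=1000))  # should sit near the nominal 0.05
print(reject_rate(0.1, 0.4, 72, 5, 4.1602, iters=1000))  # large effect: high power
```

Repeating such runs across the grid of success rates, sample sizes, and K values yields tables of the kind summarized above.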
In conclusion, the results of this study suggest that the UMP is potentially more effective than O'Brien and Fleming's multiple testing procedure and the single-sample method, offering improved control over type I error while maintaining acceptable power levels.
The results also highlight the effectiveness of employing multiple critical endpoints in the O'Brien and Fleming procedure while modifying it with Urn allocation to enhance clinical trial efficiency. The modification introduced in this study successfully controlled type I errors, ensuring the reliability and validity of the findings. By utilizing Urn allocation, the procedure addressed specific considerations and optimized the decision-making process during the trial, leading to improved treatment allocation and patient outcomes. The proposed implementation maintained an acceptable power level, showing its effectiveness in detecting true treatment effects. These findings provide valuable insights into the potential of modified procedures and open avenues for future research.
Future work will compare the UMP with other binary-outcome testing methods, particularly the Optimal Weighted Multiple Testing Procedure (OWMP) discussed earlier [39]. This comparison will provide further insight into the relative strengths and weaknesses of these procedures. Furthermore, the proposed procedure will be extended to more than two treatments; this extension will involve additional iterations based on the modifications applied to the original O'Brien and Fleming procedure. Expanding the scope of the study will allow a more comprehensive evaluation of the UMP's performance and applicability. By advancing statistical procedures in clinical trials, we can contribute to the advancement of medical research and ultimately improve patient care.

Figure 1 .
Figure 1. The algorithm of the UMP procedure.



Figure 2 .
Figure 2. The algorithm to calculate the type I error and power values using Monte Carlo simulations with the UMP.


Figure 5 .
Figure 5. Percent of rejections of H 0 occurring at stage i with 500,000 iterations.

Figure 6 .
Figure 6. Percent of rejections of H 0 occurring at stage i with 500,000 iterations.


Mathematics 2023, 21

Figure 7 .
Figure 7. Sizes of samples necessary to reach the rejection of H 0 for the computational example.


Figure 8 .
Figure 8. Sizes of samples necessary to reach the rejection of H 0 for the experimental data example.

Table 1 .
O'Brien and Fleming corrected and original critical values.

Table 2 .
Type I error values from Monte Carlo simulations for α = 0.05 with sample size 360 and α = 0.01 with sample size 300 using the UMP.


Table 5 .
Power estimates for different cases for implementing Urn allocation and equal weights to the O'Brien Fleming procedure with α = 0.05 and α = 0.01.

Table 6 .
Sample sizes and percentages of accepting H a occurring at stage k with 500,000 iterations.


Table 7 .
Sample sizes and percentages of rejecting H 0 occurring at stage i with 500,000 iterations.
3.3.2. Calculating Rejection Rates for Each Stage When the Difference Is Not Present


Table 8 .
The results of the UMP for the computational example.

Table 9 .
The results of the UMP for the experimental data example.