Estimation and Testing of Wilcoxon–Mann–Whitney Effects in Factorial Clustered Data Designs

Abstract: Clustered data arise frequently in practical applications whenever units are repeatedly observed under a certain condition. A typical example is animal experiments, where several animals share the same cage and should not be assumed to be completely independent. Standard methods for the analysis of such data are Linear Mixed Models and Generalized Estimating Equations; however, checking their assumptions is not easy, especially in scenarios with small sample sizes and highly skewed, count, ordinal, or binary data. In such situations, Wilcoxon–Mann–Whitney-type effects are suitable alternatives to mean-based or other distributional approaches, since no specific data distribution, symmetric or asymmetric, is required. Within this work, we present different estimation techniques for such effects in clustered factorial designs and discuss quadratic- and multiple-contrast-type testing procedures for hypotheses formulated in terms of Wilcoxon–Mann–Whitney effects. Additionally, the framework allows for the occurrence of missing data: estimation and testing of hypotheses are based on all available data instead of complete cases only. An extensive simulation study investigates the precision of the estimators and the behavior of the test procedures in terms of their type-I error control. A real-world dataset exemplifies the applicability of the newly proposed procedures.


Introduction
Clustered data are commonly encountered in medical research and various other disciplines and occur whenever a subject is observed not only once under a certain condition, but multiple times. For instance, animals sharing the same cage, students in classes, or multiple skin irritations on the same patient are different examples of clustered data. In these situations, subjects provide multiple, possibly dependent observations (not necessarily equally sized clusters). Standard methods for the analysis of independent observations (e.g., t-test, linear regression, Analysis of Variance (ANOVA)) are not applicable in such scenarios. Ignoring the clustered structure may result in bias (such as inflated type-I error rates and estimation bias), and therefore appropriate models are necessary for making inference. Reducing the clusters to a single point by, e.g., computing their mean or median typically results in a loss of power and decreased precision of point estimates [1][2][3]. Furthermore, estimation of treatment effects becomes an issue because of the presence of intra-cluster correlations and unequally sized clusters, see, e.g., Gao [4]. Under certain assumptions such as multivariate normality and linear relationships, Linear Mixed Models and Generalized Estimating Equations can be used. However, testing these assumptions is difficult in practice, as noted by Fitzmaurice et al. [5] and Johnson and Wichern [6], especially when dealing with small sample sizes. Further, if count, ordinal, or highly skewed data are present, mean-based approaches are not applicable and another type of effect measure is needed. In contrast, Wilcoxon–Mann–Whitney-type effects p = P(X < Y) + ½ P(X = Y) [7] are purely non-parametric quantities, which can be used for the definition of a treatment effect for metric, discrete, ordinal, and even dichotomous data in a unified way. In particular, the response variable is not assumed to be symmetrically distributed.
Here, X and Y represent two independent random variables coming from different populations. In the literature, p is also called relative effect [8] or probabilistic index [9,10]; see Brunner et al. [11] for an overview. It is the aim of this paper to discuss different estimation techniques for such effects in factorial repeated measures designs with a clustered structure. We hereby distinguish between non-informative and informative cluster sizes by presenting weighted and unweighted estimators; informative cluster sizes means that the cluster sizes might be related to the outcome. In addition to estimation, we further introduce different statistical inference methods for testing hypotheses formulated in terms of these effects. All methods allow for the occurrence of missing data and take all available data into consideration, which is a novelty, since previous methods by Akritas et al. [12], Fong et al. [13], Domhof et al. [14], and Amro et al. [15] can only be used for testing hypotheses in terms of distribution functions in scenarios with missing data and do not allow for clustered data. Akritas and Brunner [16] and Brunner et al. [17] propose ranking procedures for testing hypotheses formulated in terms of distribution functions in clustered data designs; see Brunner et al. [18] for an excellent overview. Thus, the aim of this work is to provide a framework for estimating Wilcoxon–Mann–Whitney-type effects, as well as to present test procedures for hypotheses formulated in terms of these effects in factorial repeated measures designs with clustered data. The paper is organized as follows. First, a real-world example is introduced in Section 2 that motivates the development of the methods. Next, in Section 3, the factorial repeated measures model with a clustered data structure is introduced. Subsequently, point estimators and their asymptotic distributions are derived in Sections 4 and 5.
Further, test procedures and multiple hypotheses in this framework are presented in Sections 6 and 7. The results of extensive simulation studies are presented in the following Section 8. Finally, the motivating example is analyzed by using the newly proposed methods in Section 9 and a discussion and conclusion about the findings is given in Section 10. All proofs can be found in the Appendix A.
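The Wilcoxon–Mann–Whitney-type effect p = P(X < Y) + ½ P(X = Y) introduced above can be estimated from two independent samples by simple counting. The following minimal Python sketch (the function name `wmw_effect` is our own illustration, not from the paper) computes the plug-in estimate over all n·m pairs:

```python
import numpy as np

def wmw_effect(x, y):
    """Plug-in estimate of p = P(X < Y) + 0.5 * P(X = Y) by direct counting.

    Works for metric, ordinal (numerically coded), and binary data alike.
    """
    x = np.asarray(x, dtype=float)[:, None]  # shape (n, 1)
    y = np.asarray(y, dtype=float)[None, :]  # shape (1, m)
    less = (x < y).mean()   # proportion of pairs with X < Y
    ties = (x == y).mean()  # proportion of tied pairs
    return float(less + 0.5 * ties)
```

A value of 0.5 indicates no tendency toward larger or smaller values in either sample; for two identical samples the ties term yields exactly 0.5.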

Motivating Example
In order to motivate the development of methods for factorial repeated measures data with a clustered structure, we consider the secondary/exploratory outcome analysis of retinal thickness in the 'Sunphenon in progressive forms of multiple sclerosis' (SUPREMES) trial by Klumbies et al. [19], which was a relatively small clinical trial in patients with progressive multiple sclerosis (MS). MS is the most common autoimmune disorder of the central nervous system, affecting approximately 2.8 million people worldwide [20]. In MS, immune cells attack the myelin sheaths of nerve fibers, often resulting in neurodegeneration and thus permanent disability. In most cases, the disease starts with a relapsing-remitting course, followed by a progressive course around 15-20 years after diagnosis. Around 15% of patients feature a progressive course from the onset of the disease [21]. MS is associated with visual impairment caused by optic nerve and posterior visual pathway damage, which can be quantified with optical coherence tomography (OCT)-based thickness measurements of the retinal nerve fiber layer, the combined ganglion cell and inner plexiform layer (GCIP), and the inner nuclear layer (INL) [22]. Retinal thickness analysis has been suggested as an outcome parameter in MS clinical trials [23]. In animal models of MS, epigallocatechin gallate (EGCG), an anti-inflammatory agent, showed neuroprotective properties. The recent paper of Klumbies et al. [19] investigated the effect of EGCG on retinal thickness as an indicator of treatment response in progressive MS. For this motivating example, only the parameter peripapillary retinal nerve fiber layer (pRNFL) thickness will be investigated.
Longitudinal OCT data were available from 31 patients, of whom 15 were assigned to the intervention group and 16 to the control group. For most patients, both eyes were investigated, leading to possibly dependent observations, since assuming independence of the eyes of the same patient would be dubious. Further, missing values occur: of the 61 patients in the SUPREMES trial, only 31 contributed to the final analysis, and at the 3-year follow-up, only 8 patients remained in the study. Table 1 displays the numbers of patients with pRNFL measurements in each group at each time point. Due to the extremely small number of measurements after 3 years, this time point was discarded from the analysis. Since the sample size was quite small and many missing values occurred, the authors tested, among other hypotheses, whether there exists a statistical interaction of treatment and time using the non-parametric analysis of longitudinal data in factorial experiments proposed by Brunner et al. [18]. One disadvantage of this procedure is that it can handle neither missing values nor a clustered data structure. Therefore, Klumbies et al. [19] conducted a complete-case analysis and modeled the eyes as a second sub-plot factor besides the sub-plot factor time. In order to close this methodological gap, a general factorial model with repeated measurements, allowing for possibly correlated dependent replicates, will be introduced in the next section.
To account for metric, discrete, ordinal, and ordered categorical data in a unified way, we use the normalized version of the distribution function, F_is(x) = ½(F_is^−(x) + F_is^+(x)), which is the average of the left-continuous version F_is^−(x) = P(X_is1 < x) and the right-continuous version F_is^+(x) = P(X_is1 ≤ x) of the distribution function, and which was first introduced by Ruymgaart [24]. In model (1), the number of non-missing observations under time point s in group i, the overall sample size, and the minimal number of observations over all groups and time points are given by λ_is = ∑_{k=1}^{n_i} λ_isk, N = ∑_{i=1}^{a} n_i, and λ_min = min(λ_11, . . . , λ_ad), respectively.
We propose to use unweighted relative effects p_is = P(Z < X_is1) + ½ P(Z = X_is1) = ∫ G dF_is, with G = (1/(ad)) ∑_{i=1}^{a} ∑_{s=1}^{d} F_is being the unweighted mean distribution function and Z ∼ G independent of X_isk. The relative effect p_is models the relationship of the distribution F_is to the average distribution G. If p_is > p_jt, then data coming from F_is tend to be larger than data coming from F_jt. If p_is = p_jt, then there is no tendency toward larger or smaller values between the two distributions. For more information on unweighted and weighted relative effects, we refer to Brunner et al. [25] and Brunner et al. [26]. In the following, we always refer to unweighted relative effects whenever relative effects are mentioned.
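To make these definitions concrete, the following self-contained Python sketch (our own illustration; the representation of the a·d cells as a flat list of samples and the function names are assumptions) evaluates the normalized distribution function and the unweighted relative effects of each cell relative to the unweighted mean distribution G:

```python
import numpy as np

def normalized_cdf(sample, x):
    """Normalized distribution function: average of the left-continuous
    version P(X < x) and the right-continuous version P(X <= x)."""
    sample = np.asarray(sample, dtype=float)
    left = (sample < x).mean()    # F^-(x)
    right = (sample <= x).mean()  # F^+(x)
    return 0.5 * (left + right)

def relative_effects(cells):
    """Unweighted relative effect of each cell: mean of G evaluated at the
    cell's observations, with G the unweighted average of all cell CDFs."""
    cells = [np.asarray(c, dtype=float) for c in cells]
    effects = []
    for c in cells:
        g_vals = [np.mean([normalized_cdf(other, x) for other in cells])
                  for x in c]
        effects.append(float(np.mean(g_vals)))
    return effects
```

If all cells follow the same distribution, every effect equals 0.5; a cell that tends toward larger values than the pooled mixture receives an effect above 0.5.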

General Factorial Model with Clustered Data
In the following, we introduce a general factorial longitudinal model with clustered data. In comparison with model (1), we observe random vectors X_ik = (X_i1k, . . . , X_idk)′, where m_isk denotes the number of possibly dependent replicates of subject k in group i at time s and m_is = ∑_{k=1}^{n_i} m_isk λ_isk denotes the total number of possibly dependent replicates in group i at time s. Thus, the number of dependent replicates may vary for each subject and may not be under experimental control. Note that we do not assume any correlation structure for the dependent replicates. Similarly as in (2), we define the relative effect p_is. Note that model (1) is contained within this model as a special case with m_isk ≡ 1. In order to derive asymptotic results, we impose the following model assumptions: Assumption A1.1 ensures that none of the groups vanishes asymptotically, whereas Assumptions A1.2 and A1.3 ensure that the total sample size N and the numbers of clustered observations m_is are bounded. In the following section, estimators for the relative effect p will be derived.

Estimators and Their Asymptotic Distributions
We will first study estimators of relative effects in factorial repeated measures designs without clustered data (i.e., m_isk ≡ 1). In order to account for possible missing values, we define the empirical distribution function of the data under time point s in group i as the average over the all-available data. Here, c(u) denotes the count function with c(u) = 0, ½, 1 according as u < 0, u = 0, u > 0. By plugging in the empirical counterparts F̂_is, we obtain estimators p̂_is and define p̂ = (p̂_11, . . . , p̂_1d, p̂_21, . . . , p̂_ad)′.

Effect Estimation in Factorial Designs with Clustered Data
To generalize the estimation of the empirical distribution functions and relative effects to the case of clustered data, we follow the idea of Roy et al. [1], who proposed two different approaches for estimating the distribution functions by using the cluster sizes as weighting schemes. In the first version of the estimator of the relative effect p, larger clusters contribute more weight to the estimation than smaller ones, whereas in the second version, all clusters contribute the same weight, disregarding their size. Analogously to Roy et al. [1], these estimators are called the weighted and the unweighted estimator, respectively. Note that Obuchowski [27] also used the weighted version in the two-sample case.
The two different versions of the empirical distribution functions (unweighted and weighted) are defined as follows: F̂_is(x) is the unweighted estimator of F_is(x), where the average of the count function is calculated separately for each cluster and these cluster averages are then averaged again, while F̃_is(x) is the weighted estimator of F_is(x), where the counts are averaged over all observations. In order to write the estimators in a unified way, we define weights; an estimator for F_is(x) and G(x) then follows, and an estimator of p_is is obtained by plugging in these estimated distribution functions. Note that, in order to derive the theory of these estimators, the weights need to fulfill the following properties. Proposition 1. If Assumption A1.3 is fulfilled, it holds that both properties are satisfied. Furthermore, note that any weights fulfilling both properties can, in principle, be applied. For example, Zou [28] developed an 'optimal' estimator, which incorporates information on cluster sizes and intra-cluster correlations via a mixed model approach.
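The difference between the two weighting schemes can be sketched in a few lines of Python (our own illustration, assuming each subject's cluster is given as a 1-D array; the count function c(u) is the one defined in the previous section):

```python
import numpy as np

def ecdf_at(value, clusters, weighted):
    """Evaluate the (normalized) empirical distribution function at `value`.

    clusters : list of 1-D arrays, one per subject (possibly unequal sizes).
    weighted=False : average the count function within each cluster first,
                     then average the cluster means (equal weight per cluster).
    weighted=True  : pool all observations, so larger clusters weigh more.
    """
    def count(u):  # c(u) = 0, 1/2, 1 for u < 0, u = 0, u > 0
        return (u > 0) + 0.5 * (u == 0)
    per_cluster = [np.mean(count(value - np.asarray(c, dtype=float)))
                   for c in clusters]
    if weighted:
        sizes = np.array([len(c) for c in clusters], dtype=float)
        return float(np.sum(sizes * per_cluster) / sizes.sum())
    return float(np.mean(per_cluster))
```

For clusters of unequal sizes with different value profiles, the two versions differ: a large cluster pulls the weighted estimate toward its own distribution, while leaving the unweighted estimate unchanged.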
First, we will study the asymptotic properties of general estimators for p in the following proposition. Proposition 2. The estimator p̂* = (p̂*_11, . . . , p̂*_1d, p̂*_21, . . . , p̂*_ad)′ is asymptotically unbiased and strongly consistent. Subsequently, the asymptotic distribution of the statistic √N(p̂* − p) will be derived. The next theorem indicates that, under A1.1 and A1.2, √N(p̂* − p) asymptotically has the same distribution as a random vector √N B*, based on suitably defined random variables Ψ*_hk. The expectation of Ψ*_hk can be written in terms of the pairwise relative effects p_(is,ht) := ∫ F_is dF_ht between groups i and h and time points s and t.
From this representation, the asymptotic covariance matrix of √N(p̂* − p) follows; the asymptotic multivariate normality of the linear statistic √N(p̂* − p) is given in the next theorem.
However, this covariance matrix is typically unknown in practical applications and must be estimated in order to make statistical inference. In Section 5, we will derive a consistent and positive semi-definite estimator of the covariance matrix.

Informative Cluster Sizes
In many applications, the cluster sizes m_isk might depend on the outcome of interest, i.e., the following assumption of ignorable cluster sizes may be violated:

Assumption 2.
E(X_isku) = E(X_isku | m_isk), i = 1, . . . , a; s = 1, . . . , d; k = 1, . . . , n_i; u = 1, . . . , m_isk. If this equality fails, the cluster sizes are called non-ignorable or informative [29]. As an example, consider the periodontal disease study [30]; periodontitis is an inflammation of the gums and of the bone that surrounds and supports the teeth. Severe periodontitis results in tooth loss and, thus, the cluster sizes (the numbers of a patient's teeth) depend on the clinical outcome. Hoffmann et al. [29], among others, suggested a Within-Cluster Resampling (WCR) method for the analysis of informative clustered binary data. This approach is also applicable in the rather general model considered here and will be described in the following: a randomly chosen observation X^q_isk is sampled from cluster X_isk. This is done for each of the N · d clusters, resulting in a dataset involving single observations only. This is repeated Q times, e.g., Q = 10,000, and for each of the Q datasets, the vector of relative effects p is estimated by adapting Equation (4). An estimator and its asymptotic distribution are given in the following theorem.
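The resampling step described above can be sketched as follows (a minimal Python illustration of the WCR idea for a two-group comparison; the function names and the generic `estimator` argument are our own assumptions, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

def wcr_estimates(clusters_x, clusters_y, estimator, Q=200):
    """Within-Cluster Resampling sketch: draw one observation per cluster,
    estimate the effect on the resulting independent dataset, repeat Q times,
    and average the Q estimates.

    clusters_x / clusters_y : lists of 1-D arrays (clusters of the two groups).
    estimator : callable taking two flat samples and returning a scalar effect.
    """
    estimates = []
    for _ in range(Q):
        x = [rng.choice(c) for c in clusters_x]  # one random obs per cluster
        y = [rng.choice(c) for c in clusters_y]
        estimates.append(estimator(np.asarray(x), np.asarray(y)))
    return float(np.mean(estimates))
```

Because every cluster contributes exactly one observation per resample, each cluster receives equal weight regardless of its size, which mirrors the equivalence to the unweighted estimator noted below.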

Theorem 3.
Let p̂_∆ = (1/Q) ∑_{q=1}^{Q} p̂_q denote the average of the Q WCR estimates. A consistent variance estimator is provided in the next theorem.

Theorem 4.
Let N and p̂_∆ be defined as in Theorem 3. Then, an estimator of Σ_∆ is given in terms of Σ̂_q, the estimated covariance matrix from the q-th analysis (see the following section for the derivation of such an estimator). The proofs of Theorems 3 and 4 can be found in the appendix of Hoffmann et al. [29]. The WCR approach proposed by Hoffmann et al. [29] is computationally intensive and can lead to negative variance estimates due to the subtraction in the variance estimation in Equation (7). Hoffmann et al. [29] noted that this occurs rarely and concluded that, in these scenarios, the number of resampled datasets Q or the number of clusters N may be too small for making inference. However, the WCR-based approach is equivalent to the unweighted estimation of the relative effects p as proposed by Roy et al. [1]: in both analyses, all clusters are given equal weight, regardless of their size. Thus, the use of the unweighted estimator should be preferred over the WCR-based approach, since its computation is less intensive and it always leads to positive variance estimates. It should further be noted that Assumption 2 of ignorable cluster sizes is never imposed during the development of the theory in this work. Therefore, all weighting schemes that fulfill Assumptions A2.1 and A2.2 can be applied in the case of non-ignorable cluster sizes; however, the resulting estimators then have a different interpretation.

Estimation of the Covariance Matrix
Now, an estimator of the covariance matrix V*_N is derived. As in Rubarth et al. [31], the random variables Ψ*_is,hk are not observable; otherwise, an estimator of V*_N could be obtained directly as their empirical covariance matrix. Therefore, we replace the unknown Ψ*_is,hk by observable random variables: we define the vectors Ψ̂*_hk by plugging in the estimators of the pairwise relative effects p_(is,ht). Finally, an estimator for the unknown covariance matrix V*_N,h is obtained; its properties are presented in the next theorem.

Multiple Hypotheses
In this section, the formulation of hypotheses for main and interaction effects in the factorial repeated measures framework will be outlined. Let C = (c′_1, . . . , c′_q)′ ∈ R^{q×ad} be an arbitrary contrast matrix and consider the family of hypotheses H_0^(ℓ): c′_ℓ p = 0, ℓ = 1, . . . , q, where c_ℓ denotes the ℓ-th row vector of C. Which contrast matrix is appropriate depends on the specific research question. Well-known types of contrast matrices are the Tukey-type contrast matrix, used for all-pairwise comparisons, and the Dunnett-type contrast matrix, which is used for the comparison of several groups to one control group. User-specified contrast matrices can also be applied, as long as they have the defining property of a contrast matrix, namely that each row of C sums up to 0 (i.e., ∑_{m=1}^{ad} c_ℓm = 0 for all ℓ = 1, . . . , q). Since the layout in this paper is multifactorial, we briefly demonstrate how to define appropriate contrast matrices for testing the main effects of group membership and time and the interaction effect between group membership and time.

• Main effect group membership (G): in order to make comparisons in terms of group membership, it is necessary to center over the groups and average over the repeated measures. Thus, a contrast matrix to test for no group effect is defined with C_g being a contrast matrix for the group effect without the time structure.
• Main effect time (T): similarly, for the time effect, the measurements need to be centered across the time points and averaged over the groups, leading to a contrast matrix to test for no time effect, where C_t denotes a contrast matrix for the effect over time without the group structure.
• Interaction effect (G × T): for the test of no interaction between group membership and time, the centering matrix is used for both factors.
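The three constructions above can be sketched via Kronecker products (a minimal Python illustration under the standard convention of combining a centering matrix P_n = I_n − J_n/n with an averaging vector; the concrete matrices in the paper may differ in scaling):

```python
import numpy as np

def centering(n):
    """P_n = I_n - J_n / n, the n x n centering matrix (rows sum to zero)."""
    return np.eye(n) - np.ones((n, n)) / n

# Hypothesis matrices for an a x d (group x time) layout:
def main_effect_group(a, d):
    """Compare groups, averaging over the d time points."""
    return np.kron(centering(a), np.ones((1, d)) / d)

def main_effect_time(a, d):
    """Compare time points, averaging over the a groups."""
    return np.kron(np.ones((1, a)) / a, centering(d))

def interaction(a, d):
    """Group x time interaction: centering in both factors."""
    return np.kron(centering(a), centering(d))
```

Each returned matrix is a valid hypothesis matrix in the sense of the text: every row sums to zero, so applying it to a constant effect vector yields the null vector.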

Test Statistics
In this section, we present different test procedures for testing global and multiple hypotheses concerning the null hypothesis H^p_0 : Cp = 0, with C being an appropriate contrast matrix tailored to the specific research question. First, we propose two quadratic test procedures, a Wald-type statistic (WTS) and an ANOVA-type statistic (ATS), as already described by Brunner et al. [17], Domhof et al. [14], and Rubarth et al. [31]. These procedures can only be used to test the global null hypothesis and cannot be inverted to obtain (simultaneous) confidence intervals. Therefore, we additionally present a Multiple Contrast Test Procedure (MCTP), introduced by Konietschke et al. [32] in a general nonparametric factorial framework and by Rubarth et al. [31] for the case of incompletely observed data. Using this procedure, multiple hypotheses can be tested simultaneously, and adjusted confidence intervals and p-values are directly obtained.

Quadratic Test Procedures
Following Konietschke et al. [33] and Rubarth et al. [31], we consider the Wald-type statistic (WTS), which can be approximated by a χ²_f distribution with f = rank(C V*_N C′) degrees of freedom (see the discussion on further assumptions on V*_N in Brunner et al. [25]). Here, [·]^+ denotes the Moore–Penrose inverse of a matrix. However, simulation studies by Konietschke et al. [33], Domhof et al. [14], and Rubarth et al. [31] indicate that the WTS is very liberal in small or moderate sample size scenarios. Therefore, Akritas et al. [34] and Brunner et al. [25], among others, approximate the (asymptotic) distribution of the ANOVA-type statistic (ATS) by a scaled χ² distribution with estimated degrees of freedom f̂ = [tr(M V*_N)]² / tr(M V*_N M V*_N). Here, M = C′(CC′)⁻C and (CC′)⁻ denotes a generalized inverse of CC′. Since M is a projection matrix, it holds that Mp = 0 ⇐⇒ Cp = 0. The unknown traces tr(M V*_N) and tr(M V*_N M V*_N) are estimated by replacing V*_N with V̂*_N; see Brunner et al. [17] for the derivation.
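Both quadratic statistics can be sketched compactly (our own Python illustration, assuming an estimated effect vector `p_hat`, contrast matrix `C`, covariance estimate `V`, and sample size `N`; the standard forms WTS = N (Cp̂)′[C V C′]^+ (Cp̂) and ATS = N p̂′Mp̂ / tr(MV) with the Box-type degrees of freedom given above):

```python
import numpy as np

def wald_type_statistic(p_hat, C, V, N):
    """WTS = N (C p)' [C V C']^+ (C p); [.]^+ is the Moore-Penrose inverse.
    Returns the statistic and its chi-square degrees of freedom."""
    Cp = C @ p_hat
    CVC = C @ V @ C.T
    wts = N * Cp @ np.linalg.pinv(CVC) @ Cp
    return float(wts), int(np.linalg.matrix_rank(CVC))

def anova_type_statistic(p_hat, C, V, N):
    """ATS = N (p' M p) / tr(M V) with M = C'(C C')^- C, plus the Box-type
    estimated degrees of freedom f = tr(M V)^2 / tr(M V M V)."""
    M = C.T @ np.linalg.pinv(C @ C.T) @ C
    ats = N * (p_hat @ M @ p_hat) / np.trace(M @ V)
    f = np.trace(M @ V) ** 2 / np.trace(M @ V @ M @ V)
    return float(ats), float(f)
```

With a single contrast row, both statistics coincide, which provides a quick sanity check of an implementation.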

Multiple Contrast Test Procedure
To overcome the above outlined disadvantages of the quadratic test procedures, Konietschke et al. [32] proposed a rank-based MCTP for factorial designs, whereas Rubarth et al. [31] proposed a procedure for repeated measures designs with missing values.
For each contrast, a test statistic T*_ℓ = √N c′_ℓ(p̂* − p) / (c′_ℓ V̂*_N c_ℓ)^{1/2} is considered, with c_ℓ being the ℓ-th row vector of C. All test statistics are collected in the vector T* = (T*_1, . . . , T*_q)′. Note that the test statistics T*_ℓ and T*_m (ℓ ≠ m) are not necessarily independent, depending on the chosen contrast and the repeated measures.
Each statistic T*_ℓ is asymptotically standard normal. It then follows from Theorem 2 and Slutzky's theorem that T* asymptotically, as N → ∞, follows a multivariate normal distribution with expectation 0 and correlation matrix R* = D*^{−1/2} C V*_N C′ D*^{−1/2}, with D* being the diagonal matrix of the diagonal elements of C V*_N C′. For large samples, the local null hypothesis H_0^(ℓ): c′_ℓ p = 0 is rejected if |T*_ℓ| exceeds the corresponding two-sided equicoordinate quantile of this multivariate normal distribution [35]. By inverting the corresponding test statistic, simultaneous confidence intervals for the effects δ_ℓ = c′_ℓ p can be obtained. It follows directly that the global null hypothesis H^p_0 : Cp = 0 is rejected whenever at least one local hypothesis is rejected. Analogously to Konietschke et al. [36] and Rubarth et al. [31], the correlation matrix is unknown but can be consistently estimated; again, D̂* denotes the diagonal matrix obtained from the diagonal elements of C V̂*_N C′. We note that the method asymptotically controls the family-wise error rate α in the strong sense. However, the proposed procedure is only valid for large sample sizes, and the convergence of T* to its asymptotic distribution is rather slow [32]. Therefore, we follow Konietschke et al. [32], who proposed a small sample approximation using a central multivariate T(ν, 0, R*) distribution with ν degrees of freedom and correlation matrix R*. We define, for each linear contrast c_ℓ = (c_ℓ11, . . . , c_ℓad)′, ℓ = 1, . . . , q, random variables Φ*_ℓhk = c′_ℓ Ψ*_hk. Their expectation follows directly, and by independence of Φ*_ℓhk and Φ*_ℓhk′ (k ≠ k′), the variance is obtained. We follow Gao et al. [37] and estimate the degrees of freedom ν̂_ℓ for each contrast ℓ = 1, . . . , q.
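The per-contrast statistics and the estimated correlation matrix can be computed as follows (our own Python sketch; `p0` plays the role of the hypothesized effect vector, and the quantile lookup for the multivariate normal or t reference distribution is omitted):

```python
import numpy as np

def mctp_statistics(p_hat, p0, C, V, N):
    """Per-contrast statistics T_l = sqrt(N) c_l'(p_hat - p0) / sqrt(c_l' V c_l)
    and their estimated correlation matrix R = D^{-1/2} (C V C') D^{-1/2},
    where D holds the diagonal elements of C V C'."""
    CVC = C @ V @ C.T
    d = np.sqrt(np.diag(CVC))          # per-contrast standard deviations
    T = np.sqrt(N) * (C @ (p_hat - p0)) / d
    R = CVC / np.outer(d, d)           # scale to a correlation matrix
    return T, R
```

Simultaneous confidence intervals then follow by replacing the per-contrast quantile with the equicoordinate quantile of the multivariate reference distribution with correlation matrix R.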

Simulation Study
Within this section, the precision of the unweighted and weighted estimators and the behavior of the introduced test procedures in terms of their type-I error control are examined. The investigated precision metrics are the Mean Squared Errors (MSEs) and biases of the estimated relative effects with respect to the true effects p. As already pointed out by Domhof et al. [14], Konietschke et al. [33], and Rubarth et al. [31], the WTS requires large sample sizes to maintain the type-I error rate. Therefore, only the ATS and the MCTP are examined.

Set-Up
The simulation study was conducted in R. The correlation of the dependent replicates within a cluster was set to be
• ρ_isk ≡ 0, ρ_isk ≡ 0.3, ρ_isk ≡ 0.9 (same correlation within each cluster);
• ρ_isk drawn as realizations of a scaled Binomial distribution, Binom(10, 0.6)/10 (different correlations within each cluster).
Data were generated by drawing from multivariate normal distributions with expectation µ_ik = (µ_i1, . . . , µ_i1, . . . , µ_id, . . . , µ_id)′ ∈ R^{m_ik} (m_ik = ∑_{s=1}^{d} m_isk) and covariance matrices Σ_ik ∈ R^{m_ik × m_ik}. The components σ_isk are obtained from homo- and heteroscedastic covariance matrices of multivariate normal distributions. Since the simulation study of Rubarth et al. [31] indicated that the performance of the procedure does not depend on the distribution, no other data-generating distributions were considered. Analogously, we restricted our simulations to Missing-Completely-At-Random (MCAR) scenarios, since no different behavior of the methods of Rubarth et al. [31] in MAR scenarios compared to MCAR scenarios could be observed. Thus, the indicators λ_isk were generated by drawing from Bernoulli distributions with success probability 1 − r, with r being the percentage of missing values, r = (r_1, r_2) ∈ {(0%, 0%), (0%, 20%), (10%, 10%), (30%, 30%)}. Since the power of the methods was already investigated in detail by Rubarth et al. [31], the simulation study of the present paper focused solely on type-I error rates. Further, we additionally investigated the precision of the unweighted and weighted estimators.
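One simulation cell of this design can be sketched as follows (a minimal Python illustration, assuming an exchangeable compound-symmetry covariance within each cluster, consistent with a single within-cluster correlation ρ; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cluster(mu, m, rho, sigma=1.0):
    """One cluster of m exchangeable replicates with compound-symmetric
    covariance sigma^2 * [(1 - rho) * I + rho * J]."""
    cov = sigma**2 * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))
    return rng.multivariate_normal(np.full(m, mu), cov)

def mcar_mask(n, r):
    """MCAR missingness indicators lambda_k ~ Bernoulli(1 - r),
    with r the targeted missing rate."""
    return rng.binomial(1, 1 - r, size=n)
```

Repeating `simulate_cluster` for each subject, group, and time point, and multiplying by the MCAR indicators, reproduces one replication of the simulation set-up described above.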

Results-Type-I Error Rate
First, Figure 1 gives an overview of the impact of different sample sizes in scenarios with completely observed data. It can be readily seen that both procedures control the type-I error quite well even if the sample size is as low as n_1 = n_2 = 15. Interestingly, the MCTP works better if sample sizes are unbalanced, whereas the ATS works better in case of balanced sample sizes. Next, the impact of missing values is inspected. In Figure 2, the sample sizes were fixed at n_1 = n_2 = 30. The empirical type-I error rates of both procedures increase if missing values occur, and the higher the relative frequency of missing values, the higher the type-I error rates. Furthermore, the simulation study indicates that the MCTP is more affected by the occurrence of missing values than the ATS, as already noted by Rubarth et al. [31]. The relationship between the cluster sizes m_isk and the type-I error rates is depicted in Figure 3. For this comparison, the sample sizes were again fixed at n_1 = n_2 = 30 and only completely observed data were investigated. It can be readily seen that type-I error rates decrease if two dependent replicates of each subject are present, in comparison to a dataset without a clustered structure. However, the type-I error rates of the ATS increase if the number of dependent replicates m_isk is arbitrary with an expected number of dependent replicates of 5 or 4, respectively. In contrast, the type-I error rates of the MCTP decrease on median in these scenarios.
Next, the influence of intra-cluster correlations ρ_isk on type-I error rates is investigated in scenarios with sample sizes n_1 = n_2 = 30 and without missing data (Figure 4). The type-I error rates of the ATS decrease if non-arbitrary higher intra-cluster correlations are present, whereas the type-I error rates of the MCTP increase in case of higher (non-arbitrary) intra-cluster correlations ρ_isk. Interestingly, if arbitrary correlations ρ_isk with a mean correlation of 0.6 are present, the type-I error rates of the ATS increase in comparison to scenarios with fixed correlations, whereas the type-I error rates of the MCTP are on the same level as in scenarios with a fixed correlation ρ_isk = 0.9. Figure 5 depicts the impact of homo- and heteroscedastic covariance matrices Σ_1 and Σ_2 in settings with n_1 = n_2 = 30 and without missing data. Again, it can be seen that the type-I error rates of the ATS are on median smaller than those of the MCTP for both homo- and heteroscedastic covariance matrices. Type-I error rates of the ATS seem to be slightly smaller in case of homoscedasticity, whereas type-I error rates of the MCTP seem to be slightly larger in homoscedastic scenarios.

Figure 5. Boxplots of type-I error rates in relation to covariance matrices Σ_1 and Σ_2 in various settings with n_1 = n_2 = 30 and without missing data.
Next, the relationship between unweighted and weighted estimation and the type-I error rates is inspected in scenarios with n_1 = n_2 = 30 and without missing data (Figure 6). It can be readily seen that the type-I error rates of the ATS do not differ on median between weighted and unweighted estimation of the relative effect p; only the interquartile range is increased in case of weighted estimation. In contrast, the type-I error rates of the MCTP are on median smaller in case of unweighted estimation of the relative effect p, but without an enlargement of the respective interquartile range.
To conclude, an analysis of the joint impact of unweighted versus weighted estimation of the relative effect p and the fixed intra-cluster correlation ρ_isk is presented in Figure 7 (scenarios with n_1 = n_2 = 30 and without missing data). The type-I error rates of the ATS decrease if the intra-cluster correlations ρ_isk increase, as already depicted in Figure 4. Interestingly, the medians of the type-I error rates are comparable for unweighted and weighted estimation if no intra-cluster correlation is present. However, in these scenarios, the interquartile range of the type-I error rates in case of weighted estimation is strongly enlarged in comparison to the case of unweighted estimation. If a medium intra-cluster correlation is present, type-I error rates of the unweighted estimator are on median smaller than those of the weighted estimator; here, the interquartile ranges are quite comparable. In scenarios with high intra-cluster correlations, the weighted estimator yields smaller type-I error rates on median, as well as a larger interquartile range.
Again, as already outlined in Figure 4, type-I error rates of the MCTP increase with higher intra-cluster correlations. The type-I error rates of the unweighted and weighted estimators are quite comparable if no intra-cluster correlation is present. However, they are quite different if a medium correlation is present: type-I error rates obtained with the unweighted estimator tend to be smaller on median than those obtained with weighted estimation. If high correlations are present, the unweighted estimator also yields smaller type-I error rates than the weighted version.

Figure 7. Boxplots of type-I error rates in relation to unweighted and weighted estimation of the relative effect p and fixed intra-cluster correlations ρ_isk in various settings with n_1 = n_2 = 30 and without missing data.

Results-Precision
Analogously to the previous section, we first explore the impact of the sample size on the precision of the unweighted and weighted estimators in scenarios with completely observed data. It can be readily seen in Figure 8 that the MSEs of the unweighted and weighted estimators are quite comparable. The MSEs decrease as sample sizes increase; balanced settings exhibit smaller MSEs than unbalanced settings. Regarding the bias of the estimators, scenarios with smaller sample sizes tend to exhibit biases in the negative direction, whereas scenarios with larger sample sizes exhibit biases in the positive direction. Interestingly, the interquartile range of the biases is considerably enlarged in scenarios with n_1 = 40 and n_2 = 20. Next, the impact of missing data on the precision of the estimators is inspected in scenarios with n_1 = n_2 = 30 (see Figure 9). As seen before, the MSEs of the unweighted and weighted estimators are quite comparable. The MSEs increase with an increasing missing rate. Interestingly, the interquartile ranges of the biases of the two estimators are quite different; the weighted estimator exhibits a larger interquartile range (especially if 10% of the data are missing) than the unweighted estimator. Further, biases of the unweighted estimator tend to be positive (except if the missing rate is 30%). In contrast, the biases of the weighted estimator tend to be negative. In Figure 10, the influence of the number of dependent replicates m_isk on the precision is presented (again in scenarios with n_1 = n_2 = 30 and without missing data). As already pointed out, the distributions of the MSEs of the unweighted and weighted estimators are very similar. Interestingly, there is very little variation if only one observation per subject and time point is available, compared to scenarios with more possibly dependent replicates.
Further, the MSEs decrease considerably as soon as two possibly dependent observations are available, and the more clustered data are available, the smaller the MSEs become. The same observations can be made regarding the biases of both estimators.
Finally, the relationship between the intra-cluster correlations and the precision of the estimators is inspected (see Figure 11, again for scenarios with n_1 = n_2 = 30 and without missing data). Again, the MSEs of the unweighted and the weighted estimator do not differ much. It can be readily seen that in scenarios with fixed cluster correlations, the MSEs increase with increasing intra-cluster correlations. In contrast, the biases of both estimators approach 0 with increasing fixed intra-cluster correlations, but the two estimators exhibit a different behavior: the biases of the unweighted estimator tend to be positive, whereas the biases of the weighted estimator tend to be negative.

Figure 11. MSEs and biases of the estimators (5) in relation to intra-cluster correlations ρ_isk in scenarios with n_1 = n_2 = 30 and without missing data.
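The distinction between the unweighted estimator (each cluster contributes with equal weight) and the weighted estimator (each observation contributes with equal weight) examined above can be sketched in a simplified two-sample Monte Carlo experiment. This is an illustration under assumed data-generating parameters, not the paper's full factorial design with the weights w*_isk; the two functions below implement only the simplest cluster-size-agnostic and size-proportional weighting schemes.

```python
import numpy as np

rng = np.random.default_rng(7)

def phi(x, y):
    # normalized count: 1 if x < y, 1/2 for ties, 0 otherwise, over all pairs
    return (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])

def p_unweighted(x_clusters, y_clusters):
    """Every pair of clusters gets the same weight, regardless of cluster sizes."""
    return float(np.mean([phi(x, y).mean()
                          for x in x_clusters for y in y_clusters]))

def p_weighted(x_clusters, y_clusters):
    """Every pair of observations gets the same weight (large clusters dominate)."""
    return float(phi(np.concatenate(x_clusters), np.concatenate(y_clusters)).mean())

def mc_precision(n_sim=500, n1=15, n2=15, cluster_sd=0.5):
    """Monte Carlo MSE and bias of both estimators; the true effect is p = 1/2."""
    est = {"unweighted": [], "weighted": []}
    for _ in range(n_sim):
        # both groups follow the same clustered normal model, so p = 1/2
        xc = [rng.normal(rng.normal(0, cluster_sd), 1, rng.integers(1, 5))
              for _ in range(n1)]
        yc = [rng.normal(rng.normal(0, cluster_sd), 1, rng.integers(1, 5))
              for _ in range(n2)]
        est["unweighted"].append(p_unweighted(xc, yc))
        est["weighted"].append(p_weighted(xc, yc))
    return {name: {"mse": np.mean((np.array(e) - 0.5) ** 2),
                   "bias": np.mean(e) - 0.5} for name, e in est.items()}
```

Calling `mc_precision()` returns the MSE and bias of both estimators; summaries of such quantities over many settings are what Figures 8-11 display.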

Analysis of the Motivating Example
The parameter pRNFL from the SUPREMES study introduced in Section 2 can now be analyzed using the newly proposed methodology for factorial repeated measures designs with dependent replicates. It can be readily seen in Figures 12 and 13 that the pRNFL baseline values in the placebo group tend to be smaller than those in the EGCG group. Further, Figure 13 indicates that pRNFL values seem to increase at the 2-year follow-up; however, this needs to be interpreted with caution, since almost 50% of the patients could not be measured at this time point. These were mostly patients with smaller baseline pRNFL values, which could possibly violate the Missing Completely at Random (MCAR) assumption. However, the simulation study of Rubarth et al. [31] indicated that the proposed method is robust against violations of the MCAR assumption. In order to account for the baseline differences between the two groups, and following Klumbies et al. [19], the analysis is not based on raw pRNFL values but on differences to baseline. In contrast to the analysis of Klumbies et al. [19], the MCTP is applied to the baseline differences. The Tukey contrast was chosen to compare all pairwise differences; thus, the contrast matrix for testing the null hypothesis H_0^p : Cp = 0 is as follows: The relative effects were estimated using the unweighted version of the estimator, leading to p = (p_11, p_12, p_21, p_22) = (0.525, 0.445, 0.505, 0.525), indicating that the differences to baseline are largest at the 1-year follow-up simultaneously in group 1 (intervention group) and group 2 (placebo group) and smallest at the 2-year follow-up in group 1. The results, including the values of the test statistics T, the p-values, and 95% simultaneous confidence intervals, are displayed in Table 2.

Table 2. Point estimators of the differences p_is − p_jt (i, j: group; s, t: time point), simultaneous confidence intervals, t-values, and p-values for Tukey-type contrasts in relative effects in the SUPREMES trial.

Comparison    Estimator    95%-Confidence Interval    t-Value    p-Value

Note that no adjustment for multiplicity was necessary, since the same critical value obtained from the MCTP was used for each comparison. It follows that there is no evidence to reject the global null hypothesis H_0^p : Cp = 0, resulting in the same conclusion as already pointed out by Klumbies et al. [19]: the retinal thickness analysis did not reveal neuroprotective effects of EGCG, especially when considering the contrasts p_21 − p_11 = 0 and p_22 − p_12 = 0, which allow a direct comparison of the differences to baseline at the 1-year and 2-year follow-up in the two groups.
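The Tukey-type comparisons reported in Table 2 rest on an all-pairs contrast matrix C for the four relative effects. The following sketch shows how such a matrix can be built and applied to the reported point estimates; the row ordering and sign convention used here are our assumption and may differ from the matrix in the paper.

```python
import numpy as np
from itertools import combinations

# effects ordered as in the text: (p_11, p_12, p_21, p_22)
labels = ["p11", "p12", "p21", "p22"]
p_hat = np.array([0.525, 0.445, 0.505, 0.525])  # unweighted estimates from the text

# all-pairs (Tukey) contrast matrix: one row per pairwise comparison
pairs = list(combinations(range(len(labels)), 2))
C = np.zeros((len(pairs), len(labels)))
for row, (i, j) in enumerate(pairs):
    C[row, j], C[row, i] = 1.0, -1.0  # contrast p_j - p_i

# every row of a contrast matrix sums to zero
assert np.allclose(C.sum(axis=1), 0.0)

estimated_differences = C @ p_hat
for (i, j), d in zip(pairs, estimated_differences):
    print(f"{labels[j]} - {labels[i]} = {d:+.3f}")
```

For instance, this yields p21 − p11 = −0.020 and p22 − p12 = +0.080, the two contrasts highlighted in the text.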

Discussion and Conclusions
In the present paper, we presented different estimation techniques for Wilcoxon-Mann-Whitney effects in factorial repeated measures designs with clustered data. In a first step, the information on whether cluster sizes are informative or non-informative is key and plays a major role in precise effect size estimation. Furthermore, as indicated by Zou [28], the use of the intraclass correlation improves the precision of the estimators. Besides estimation, we furthermore discussed how to test global and multiple hypotheses in terms of Wilcoxon-Mann-Whitney effects using any of the aforementioned estimators. Here, no specific data distribution (symmetric or asymmetric) is required. The presented estimators and test procedures should be preferred over standard parametric methods such as Linear Mixed Models or Generalized Estimating Equations in scenarios with small sample sizes, heteroscedastic variances, or count and ordinal outcomes. We recommend using the MCTP instead of quadratic-type test procedures in most practical applications, since testing global hypotheses does not usually answer the research questions. However, the MCTP presented in this paper does not precisely hold the nominal type-I error rate in the case of very small sample sizes, high correlations, and strong heteroscedasticity between groups or time points. Recently, Friedrich et al. [39] proposed novel resampling methods in purely nonparametric designs. Extending these ideas to the present designs will be part of future research, in order to improve especially the MCTP in "extreme" scenarios.
Although an extensive simulation study was conducted to evaluate the precision and type-I error rates of the procedures in several scenarios, we advise conducting further simulation studies for the specific scenario at hand in practical applications, e.g., in the planning or data analysis phase of a study. Further examinations indicate that the methods are applicable even in situations with rather small sample sizes, such as n = 10. The actual level and accuracy depend, however, on the design of interest.
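For readers less familiar with the effect measure underlying all of these procedures, the Wilcoxon-Mann-Whitney parameter p = P(X < Y) + 1/2 P(X = Y) can be estimated for two independent (unclustered) samples via pooled midranks. This minimal sketch is not the paper's clustered estimator, but it illustrates why the unified approach needs neither a tie nor a continuity correction.

```python
import numpy as np
from scipy.stats import rankdata

def wmw_effect(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) via pooled midranks.

    Midranks handle ties automatically, so the estimate is valid for
    metric, ordinal, and even dichotomous data in a unified way.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    ranks = rankdata(np.concatenate([x, y]))  # midranks of the pooled sample
    mean_rank_y = ranks[len(x):].mean()
    return (mean_rank_y - (len(y) + 1) / 2) / len(x)

print(wmw_effect([0, 0, 1, 0], [1, 1, 0, 1]))  # dichotomous data: 0.75
```

A value of 1/2 indicates no tendency of either sample toward larger values, which is why confidence intervals for p are judged against 1/2.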
Furthermore, it is important to note that many scientists in applied research fields, e.g., biomedicine, hold the misconception that the Wilcoxon-Mann-Whitney (WMW) test is a test for the equality of means or medians when outcomes are metric with skewed distributions, or ordinal, and that this test is the non-parametric equivalent of the classic two-sample t-test [40]. As investigated by Fagerland et al. [41], the true significance level of the WMW test deviates enormously from the nominal level when the test is used for comparing means or medians in scenarios that deviate from a pure shift model (two populations having equal shapes and scales). In practical applications, the pure shift model is rarely present, since skewed distributions with different means most likely also have different variances. Especially in such scenarios, the Brunner-Munzel test [8] should be applied. Another disadvantage of the application of WMW tests, as noted by Bergmann et al. [42], is that many versions and implementations exist in statistical software packages, e.g., large-sample approximations, exact permutation forms, versions with or without corrections for continuity or ties, and different algorithm variants, all leading to possibly different p-values and eventually to different conclusions. In this work, we present a unified approach that needs neither a correction for ties nor one for continuity. Furthermore, the WMW test and its p-value are rarely accompanied by the corresponding effect estimate, the Wilcoxon-Mann-Whitney parameter p, and its corresponding confidence interval. As noted by Fay et al. [43], classic confidence interval procedures for the WMW parameter are not compatible with exact WMW tests, meaning that the test rejects a hypothesis at significance level α while the confidence interval for p includes 1/2. Thus, Fay et al. [43] developed compatible confidence intervals for asymptotic WMW tests and for some exact WMW tests. Furthermore, Fay et al.
[44] indicate that the WMW parameter p can be framed as a causal parameter (the probability that a randomly chosen subject from one population, e.g., the treatment group in a randomized controlled trial, will have a larger response than a subject from the other population, e.g., the control group). However, this parameter is not equal to another closely related and non-identifiable causal effect, the probability that a randomly chosen subject will have a larger response under treatment than under control [44]. This paradox was first described by Hand et al. [45]. Therefore, caution must be exercised when interpreting effect estimates from non-parametric procedures, and the literacy of scientists working in applied fields in non-parametric statistics should be fostered. This work aims to contribute to this by first introducing and explaining Wilcoxon-Mann-Whitney parameters p in special designs and then providing a flexible model for the analysis of factorial repeated measures designs with a clustered structure, allowing for missing values in a second step. For user-friendly applications of the methods, it is planned to enrich the R software package nparLD [46].

Author Contributions: Conceptualization, K.R. and F.K.; methodology, K.R., P.S. and F.K.; formal analysis, K.R.; software, K.R. and P.S.; data curation, H.G.Z.; writing-original draft preparation, K.R. and P.S.; writing-review and editing, K.R., P.S., F.K. and H.G.Z.; funding acquisition, F.K. All authors have read and agreed to the published version of the manuscript.

Appendix A
In this section, we present the proofs of the theoretical results.
Proof of Proposition 1. For the concrete weights, we calculate

Finally, we calculate

Since these are all the necessary properties of the weights, the following results hold for all kinds of weights fulfilling these properties.
Moreover, it becomes clear that condition A1.3 is only necessary for the first inequality and for the weighted estimator. Since these inequalities are required for nearly all of the following statements, the condition is assumed therein. Note that if the assumption of bounded cluster sizes is difficult to justify in practical applications, the unweighted estimator can be used without any restrictions.

Proposition A1. For the empirical distributions, under condition A1.3, it holds
Proof. First, we demonstrate the pointwise almost sure convergence of the empirical distribution function F*_is. Denote by Ξ_is = {k = 1, . . . , n_i : λ_isk > 0} the set containing the indices of the subjects from group i with an existing observation in the s-th component. It is clear that |Ξ_is| = λ_is. Moreover, for fixed x ∈ R we define independent random variables Y*_isk := λ_is w*_isk ∑_{u=1}^{m_isk} c(x − X_isku). Then it holds that

For the expectation of this sum, we calculate

Because of |c(x)| ≤ 1, we obtain for the variance of Y*_isk

For the application of the strong law of large numbers, we finally consider

Thus, it holds that

Replacing c(x) by c^(+)(x) and c^(−)(x), respectively, leads to the same convergence for the right-continuous and left-continuous versions of these distribution functions. Now, this pointwise convergence has to be extended to the supremum norm. This was already done by Domhof [47] for a similar setting and only requires the pointwise convergence proven above. The result for G* and G follows from this with the triangle inequality.
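For the reader's convenience, the count function c and its one-sided versions used in the proof above can be written, under the normalized-version convention standard in this framework (we assume the paper's main text defines them equivalently), as

```latex
c^{(-)}(x) = \begin{cases} 0, & x \le 0, \\ 1, & x > 0, \end{cases}
\qquad
c^{(+)}(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0, \end{cases}
\qquad
c(x) = \tfrac{1}{2}\bigl(c^{(+)}(x) + c^{(-)}(x)\bigr),
```

so that c(x − X_isku) equals 0, 1/2, or 1 according to whether x is below, equal to, or above X_isku; in particular, |c(x)| ≤ 1, which yields the variance bound used above.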

Proof of Proposition 2.
To prove the asymptotic unbiasedness, we consider the single components and calculate

It is clear that through conditions A1.1 and A1.2, we can also substitute this with O(N^{-1}).
For the second part, we consider

From Proposition A1, we know that the differences between the distribution functions and the empirical distribution functions converge to zero, independently of the kind of weights. Therefore, as a finite sum of zero sequences, it holds that p̂*_is − p_is → 0 almost surely. From this, it directly follows that p̂* − p → 0 almost surely.

Proof of Theorem 1. It is clear that
Analogously to Theorem 1 of Rubarth et al. [31], for the s-th component of the i-th group it holds that

where the stochastic convergence to zero, denoted by o_P(1), holds as λ_min → ∞.

Considering now the expectations, this leads to

with p^(is,ht) := ∫ F_is dF_ht. Therefore, we can calculate

as well as, for h = i,

In total, we obtain

Together with the other equation, we obtain

[ψ*_hk − E(ψ*_hk)] + o_P(1).

Proof of Theorem 2.
Through the construction of Ψ*_is,hk and A1.2, it holds that |Ψ*_is,hk| ≤ O(N^0); thus, these random variables have finite moments. Due to the independence of all ψ*_hk, we obtain by the Lindeberg-Feller theorem that

is asymptotically distributed like a normally distributed random vector with expectation vector 0 and covariance matrix V*_N,h. Hereby, two of the requirements of the Lindeberg-Feller theorem are obviously fulfilled, since the random variables are uniformly bounded and centered. However, it is not guaranteed that V*_N,h, which depends on N, converges to a fixed matrix. Through the fact that all random variables are bounded, it follows that the covariance matrices are also bounded, and from each sequence we can extract a subsequence along which the covariance matrix converges. For each of these subsequences, the asymptotic distribution of B*_h coincides with the stated normal distribution. Therefore, the result holds in general. Through the independence of the B*_h and √N B* = ∑_{h=1}^{a} (√N/√n_h) √n_h B*_h, together with condition A1.2, the result follows.
Since the covariance matrix is unknown and the random variables are not observable, we use estimated versions of these random variables to estimate the covariance matrix. They are defined as

Ψ̂*_is,hk := ad ∑_{j≠i}^{a} ∑_{t=1}^{d} m_isk λ_isk w*_isk p̂*_(jt,is) + (n_h / ad) ∑_{t=1}^{d} m_isk λ_isk w*_isk p̂*_(it,is) − m_itk λ_itk w*_itk p̂_(is,it), else.
Based on these variables, we define an estimator V̂*_N,h for the unknown covariance matrix V*_N,h through

whereby an estimator for V*_N := ∑_{h=1}^{a} κ_h^{-1} · V*_N,h is given by V̂*_N.

Proof of Theorem 5. 1. Let y = (y_1, . . . , y_ad) be an arbitrary vector. Then, it holds that

Since V̂*_N is a convex combination of positive semi-definite matrices, it is also positive semi-definite.
2. It is clear that V̂*_N,h is a consistent estimator for V*_N,h. To demonstrate consistency, we use the triangle inequality and prove that V̂*_N,h − V*_N,h → 0 almost surely. First, recall that |Ψ*_hk| ≤ O(N^0), and with the same arguments it follows that |Ψ̂*_hk| ≤ O(N^0), |β*_hk| ≤ O(N^0), and |β̂*_hk| ≤ O(N^0). Then, we consider the single components, where for the diagonal elements it holds that