Two-Sample Tests Based on Data Depth

In this paper, we focus on the homogeneity test that evaluates whether two multivariate samples come from the same distribution. This problem arises naturally in various applications, and there are many methods available in the literature. Based on data depth, several tests have been proposed for this problem but they may not be very powerful. In light of the recent development of data depth as an important measure in quality assurance, we propose two new test statistics for the multivariate two-sample homogeneity test. The proposed test statistics have the same χ2(1) asymptotic null distribution. The generalization of the proposed tests into the multivariate multisample situation is discussed as well. Simulations studies demonstrate the superior performance of the proposed tests. The test procedure is illustrated through two real data examples.


Introduction
Multivariate two-sample homogeneity tests arise naturally in many applications. Consider two multivariate random samples x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n drawn from distributions F and G, respectively. The goal is to test H 0 : F = G. The corresponding alternative hypothesis is that there are some discrepancies between the two distributions. The common formulation of the alternative hypothesis is based on location shift, scale change, or both. There is plenty of existing research on this topic. For example, the classical multivariate analysis of variance (MANOVA), an extension of the univariate analysis of variance (ANOVA), provides a general tool for testing multiple multivariate samples under normality and homogeneity of covariance matrices assumptions. The power of the MANOVA test may dramatically decrease when the assumptions are violated. Recently, some nonparametric methods were proposed such as the Cramér test, originally proposed by [1] and implemented in the R package cramer, and the Energy Distance test [2] implemented in the R package energy. Another work on nonparametric two-sample testing was given by [3], which draws connections between the multivariate Wasserstein test (implemented in the R package otinference) and the Energy Distance test.
In this paper, we consider another type of nonparametric test based on statistical depth. Let F(x) be a distribution in d-dimensional space and x ∈ R d . Let D(x; F) denote a depth function which is a transformation from R d into [0, 1]. A larger depth value indicates that the sample point is closer to the centrality of a given distribution or data cloud. There are some advantages of using statistical depth.

1.
Depth is free of strong distributional assumptions. Unlike the MANOVA test, the depth-based test does not require the normality assumption. Moreover, there is a robust version of depth obtained with minimum covariance determinant (MCD) estimators [4,5]. The work in [6] listed four weak conditions for the depth function.

3.
Depth can provide a ranking of multivariate data. The authors of [12,13] developed the so-called Kruskal-Wallis test based on the rank of univariate samples. Unlike univariate data, multivariate data lack a natural ranking. Recently, the authors of [14] generalized the Kruskal-Wallis test to a multivariate multisample homogeneity test and proposed a depth-based rank (DbR) test.
Statistical depth has become a popular and powerful tool in nonparametric inference and has been used in numerous fields. The work in [8] summarized four properties of depth: affine invariance, maximality at center, monotonicity relative to the deepest point, and vanishing at infinity. The author of [15] emphasized the advantage of depth for a location-scale model. The authors of [16] addressed the outlier detection problem based on depth. Further classification is treated in [17]. The work in [18] amended the halfspace depth and proposed a new illumination depth. For recent extensions to functional depth and regression depth, see [19][20][21].
In this paper, we mainly consider the improvement of the power of current statistical depth for a two-sample test. Ref. [7] approached the two-sample problem from the angle of quality assurance and proposed the quality index Q. The Q statistic is designed for the quality control problem, where F represents the "good" population, and G represents a future "check" population. In the context of quality control, F is naturally used as a reference distribution, and the Q statistic measures the overall "outlyingness" of G relative to the reference distribution F. We investigate the use of the quality index Q beyond the context of quality control but to the general two-sample homogeneity problem. In the general two-sample problem, either of the two samples can be regarded as a reference distribution; we need to consider pairwise quality indexes in order to capture the disparity between the two populations effectively.
The structure of the paper is outlined as follows. In Section 2, we propose two new depth-based test statistics constructed from pairwise quality indexes for the general two-sample homogeneity test. Interestingly, due to the asymptotical symmetry of the pairwise quality indexes, we show that the two proposed test statistics share the same asymptotic null distribution. In Section 3, simulation studies have demonstrated the superior performance of the proposed tests. In Section 4, we discuss the generalization in multivariate multisample cases. The test procedure is illustrated through two real data examples in Section 5. We draw conclusions and suggest future work in Section 6.

Main Results
The proposed tests are based on statistical depth. There are a variety of statistical depth functions proposed in the literature. In the present work, we focus on Mahalanobis depth, spatial depth, and projection depth, which are implemented in the R package ddalpha.
Consider any d-dimensional distribution F with mean µ and covariance matrix Σ. For any x ∈ R d , the Mahalanobis depth of a sample point x is defined as If the underlying distribution F is unknown, it can be replaced by its corresponding empirical distribution, i.e., replacing µ and Σ with the sample mean and sample variance, respectively. The spatial depth is defined as where X has a distribution F. The projection depth is defined as where O(x; F), representing the outlyingness of a point x, is defined as where Med denotes the median function and X has a distribution F. Our work was inspired by the Q statistic proposed by [7], so we review it in the following. For two d-dimensional distributions F and G, the Q statistic is defined as When F and G are unknown, the Q statistic can be estimated by where F m and G n represent the empirical distributions of F and G, respectively, R(y i ; F m ) is defined as the sample proportion of x 1 , x 2 , . . . , x m having D(x j ; F m ) ≤ D(y i ; F m ). Here, we refer to F m as the reference distribution. Note that Q(F m , G n ) = Q(G n , F m ) in general because a different reference distribution is used to compute the depth value. The authors of [7] showed that Similarly, when we take G n as the reference distribution, we have In practice, Q(F m , G n ) may perform better than Q(G n , F m ), and vice versa. We have the issue to choose a better one from the two pairwise quality indexes Q(F m , G n ) and Q(G n , F m ). Hence, we construct the following two new test statistics, the weighted average statistics W m,n (w 1 , w 2 ) and the maximum statistics M m,n , to capture the disparity between the two populations efficiently: where w 1 , w 2 ≥ 0 and w 1 + w 2 = 1, and If two distributions are different, at least one of (Q(F m , G n ) − 1 2 ) 2 and (Q(G n , F m ) − 1 2 ) 2 is large, so the weighted average statistics may be acceptable, and the maximum statistics are deemed to be superior. Note that prior information about the weights is needed to calculate the weighted average statistic W m,n (w 1 , w 2 ). When w 1 = 1, w 2 = 0 or w 1 = 0, w 2 = 1, we treat G or F as a reference distribution. One may set the weights according to the sample sizes. When m = n, one may use equal weights (i.e., w 1 = w 2 = 0.5); otherwise, set w 1 = n/(m + n) = 1 − w 2 . In the general two-sample problem, both proposed test statistics can achieve better power than the original Q statistic under various alternative hypotheses, as demonstrated by our simulation studies. We recommend using the maximum statistic M m,n when no prior information on the weights is available.
For a general depth function, the authors of [6] studied sufficient conditions to guarantee the asymptotic null distributions in (7) and (8). Under the same conditions, we establish the asymptotically symmetrical properties of the pairwise quality indexes, which is a key result that leads to the asymptotic null distributions of the proposed test statistics. For the sake of the completeness of the presentation, we list these conditions below. Let x 1 , . . . , x m and y 1 , . . . , y n be independent samples from F and G, respectively, where F and G are any distribution functions. Let F m and G n be the corresponding empirical distribution function. For any point x and y and distribution H in R d , let D(·; ·) be a given (affine invariant) depth function with 0 D(x; H) 1.
. Based on the asymptotic symmetry property of the pairwise quality indexes, we show that both proposed test statistics are asymptotic pivotal and have a very simple χ 2 (1) asymptotic null distribution, as stated in the following theorem. The proof of the theorem is given in Appendix A. Theorem 1. Given two random samples x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n drawn from distributions F and G, respectively, let F m and G n be the corresponding empirical distribution function. Consider the two statistics M m,n and W m,n defined in (9) and (10), respectively. If min(m, n) → ∞ and m/n tends to a positive constant, under conditions A1-A4 and when null hypothesis is true, we

Simulation Studies: Two-Sample Cases
In this section, simulation studies are conducted to examine the finite sample performance of the proposed tests in the general two-sample problem. We generate two random samples x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n from distributions F and G, respectively.
We first consider the Type I error of three statistics W m,n ( 1 2 , 1 2 ), W m,n ( n m+n , m m+n ), and M m,n . Let F = G = N(0, I 2×2 ), where N(0, I 2×2 ) represents the bivariate normal distribution with mean vector 0 and two-by-two identity covariance matrix. We set m = 100, 200, . . . , 1000 and n = m or m/2. By Theorem 1, all these statistics have the same χ 2 (1) asymptotic null distribution. We consider the upper 95% quantile of χ 2 (1), 3.84, and compare it with the empirical quantiles of three statistics for different configurations of m and n with 100,000 repetitions. Figure 1 shows the convergence rate of empirical quantiles to the theoretical one. When m = n, W m,n (0.5, 0.5) and W m,n ( n m+n , m m+n ) are the same and have a fast rate of convergence; otherwise, both rates are slower as shown in the first two rows. The convergence rate for M m,n is the slowest. Other simulations (not shown here for the sake of brevity) show that the quantile of M m,n approaches the nominal value when m exceeds 5000. Overall, all three depth functions lead to comparable results, except that M m,n has a faster rate of convergence when Mahalanobis depth is used for computation.
Next, we consider the power of six test statistics: W m,n (1, 0), W m,n (0, 1), W m,n (0.5, 0.5), W m,n ( n m+n , m m+n ), M m,n , and a depth-based rank (DbR) statistic [14]. The estimated power is calculated based on the simulated upper quantile α = 0.05 for each statistic with Mahalanobis depth, spatial depth, and projection depth as the underlying depth function. The power study is conducted for the following three alternatives: (1) Two bivariate normal distributions with a scale change: One sample is taken from F = N(0, I 2×2 ) , and the other sample comes from G = N(0, I 2×2 + 0.5Ĩ 2×2 ), wherẽ I 2×2 = ((0, 1) , (1, 0) ). Figure 2 shows the power comparison for different sample sizes and various depth functions with 1,000 repetitions. Compared with the difference between balanced sample size and unbalanced sample size, the difference among three depth functions is very small. The maximum statistic M m,n and one particular weighted statistic W m,n (1, 0) have the largest power but W m,n (0, 1) has the smallest power. In this case, F is preferred to be the reference distribution. In practice, since it is not clear which reference distribution should be used to capture the disparity between the two samples more effectively, the maximal statistic is recommended. It is interesting to see that the DbR statistic is comparable to the weighted statistic with equal weights.
(2) Two bivariate normal distributions with a mean change: One sample is taken from F = N(0, I 2×2 ) and the other sample comes from G = N((0.35, 0.35) , I 2×2 ). Different from the first scenario, Figure 3 shows that the maximum statistic M m,n outperforms all other five statistics to detect a location shift between the two samples.
(3) Two bivariate normal distributions with both mean and scale changes: One sample is taken from F = N(0, I 2×2 ), and the other sample is drawn from G = N((0.3, 0.3) , I 2×2 + 0.4Ĩ 2×2 ). Figure 4 shows that M m,n and W m,n (0, 1) are comparable, and both outperform all other four test statistics. We also compare the Mahalanobis-based maximum test denoted as M m,n and its robust version obtained with MCD estimators denoted as M * m,n , MANOVA test, Cramér test, Energy test, and Wasserstein test for different m = n = 100, 200, . . . , 500 under two alternative hypotheses. For the first alternative hypothesis shown in the left panel of Figure 5, in the first sample, we draw 50% data from N(0, I 2×2 ) and 50% data from N ((1, 1) , I 2×2 ), and in the second sample, we draw 95% data from N(0.5 + 0, I 2×2 ) and 5% data from N(1 + 0, I 2×2 ). Mahalanobis-based tests outperform all other four test statistics. The MANOVA test has almost no power.   For the second alternative hypothesis shown in the right panel of Figure 5, in the first sample, we draw a multivariate t distribution with mean 0, scale matrix I 2×2 , and degrees of freedom of 2. In the second sample, we used a multivariate t distribution with mean 0, scale matrix I 2×2 + 0.6Ĩ 2×2 , and degrees of freedom of 3. The Mahalanobis-based tests still perform the best, while the Cramér test and the Energy test also perform the same for large sample sizes. The MANOVA has no power for non-normal data.

Multisample Comparison
As demonstrated in the last section, the maximal statistic M m,n is more powerful than the weighted statistic in a two-sample comparison, especially against location shift alternatives. In this section, we extend the maximum statistic M m,n for comparing multisample multivariate distributions. Consider k random samples that are drawn from distributions F (j) with sample sizes n j and empirical distribution F ( Similar to the proof of Theorem 1, for x > 0 we have where Z 1 , Z 2 , . . . , Z k are independent from N(0, 1), The asymptotic properties of the generalized maximum statistic for k-sample comparison are left for future research. Here, we conduct simulation studies to evaluate the power performance of the maximum statistic M m,n,k , and the DbR statistic. The estimated power is based on the simulated upper quantile α = 0.05 for each statistic with Mahalanobis depth, spatial depth, or projection depth as the underlying depth function. We consider the following two alternatives: (1) Three bivariate normal distributions: Two samples are taken from F (1) = F (2) = N(0, I 2×2 ), and the third sample comes from F (3) = N((0, 0) , I 2×2 + 0.5Ĩ 2×2 ). Figure 6 compare the power of two statistics M m,n,k and DbR statistic for m = 100, 200, . . . , 1000, three depth functions and two combinations n = k = m and n = 2k = m/2 for 1000 repetitions. We can see that the maximal statistics M m,n,k outperforms the DbR statistic. (2) Three distinguished bivariate normal distributions: F (1) = N(0, I 2×2 ), F (2) =  N((0.3, 0.3) , I 2×2 ), and F (3) = N((0, 0) , I 2×2 + 0.5Ĩ 2×2 ). Figure 7 shows that the maximal statistic M m,n,k outperforms the DbR statistic for all the cases considered.
In summary, our simulation studies show that the maximal statistic M m,n,k is more powerful than the DbR statistic under various alternatives in three-sample comparisons. The choice of depth function does not significantly change the performance of the tests.

Analysis of Two Real-Data Sets
In this section, we illustrate our proposed maximum statistic for three-sample comparison in the analysis of two real-data examples.

Beans Data
In industry, automatic methods are required for testing. The authors of [22] analyzed seven different kinds of beans: Barbunya, Bombay, Cali, Dermason, Horoz, Seker and Sira, where 12-dimensional and 4-shape features were obtained for comparison. It is interesting to see whether our proposed test can distinguish Dermason and Sira because the difference between them only lies in the end: round and flat. There are 3546 Dermason beans and 2636 Sira beans. Those two features, perimeter and major axis length, may describe the difference well and are included here for a two-sample comparison.
To visualize the dispersion of distributions of two different kinds of beans, we compare scale curves proposed by [23] and implemented in the R package DepthProc, which allows us to compare the scale of different distributions graphically. Let D α (F) be the α-trimmed region with respect to distribution F, that is, Similarly, we have sample versions of D(x; F m ) and D α (F m ). Let V(α; F m ) be the volume of convex region D α (F m ). Define a sample scale curve by taking the plot of 1 − α versus the volume V(α; F m ). The faster-growing scale curve is associated with a larger scale of distribution. Figure 8 presents the scale curves for the three species of iris under Mahalanobis depth. We can see that the scale curves of the two kinds of beans are slightly different, which requires confirmation through a formal statistical hypothesis test. In fact, the asymptotic p-value by (13) is also zero for each depth function. This is also true for MANOVA, Cramér, Energy, and Wasserstein tests.

Egyptian Skulls Data
The researchers have suggested that changes in skull size over time are evidence of interbreeding between a resident population and a migrant population. The R package HSAUR includes the male Egyptian skulls data with four measurements (maximal breadth of the skull, basibregmatic height of the skull, basialveolar length of the skull, and nasal height of the skull) from five different time periods: 4000 B.C., 3300 B.C., 1850 B.C., 200 B.C., and 150 A.D. There are 30 measurements for each time period. We wish to determine if there are any differences in the skull sizes in the last three time periods. Figure 9 presents the scale curves of skulls from the last three different time periods under Mahalanobis depth. Since the values of V(α; F m ) depend on the units of observations, it is reasonable to have distinct ranges of V(α; F m ) in iris and skull datasets. We can see that the scale curves of skulls from 1850 B.C. and 4000 B.C. are very close. Furthermore, we have estimated the p-values for maximum statistic M 30,30,30 and DbR statistic for each depth function under 10,000 repetitions. Table 1 shows all estimated p-values which are greater than 0.05. This confirms that there is no strong evidence to show the interbreeding between a resident population and a migrant population in those three time periods. We note that the asymptotic p-value by (13) is 0.027, 0.015, and 0.052 for Mahalanobis, Spatial, and Projection depth, respectively, which are much smaller than the corresponding estimated p-values.

Conclusions
In this paper, we propose two new test statistics for testing the homogeneity of two multivariate samples. Unlike other existing depth-based tests, our proposed test statistics were inspired by the quality index in the context of quality assurance. Constructed based on the pairwise quality indexes, our test statistics are shown to have the same χ 2 (1) asymptotic null distribution. Simulations studies demonstrate the superior performance of the proposed tests. The generalization of the proposed tests into the multivariate multisample situation is discussed as well, along with some simulation studies for three sample comparison. When the number of samples grows, the number of pairwise quality indexes increases. It would also be of interest, although challenging, to consider the higherorder approximation of the asymptotic distributions, as the performance of the proposed tests can be improved.