Random Permutations, Non-Decreasing Subsequences and Statistical Independence

Abstract: In this paper, we show how the longest non-decreasing subsequence, identified in the graph of the paired marginal ranks of the observations, allows the construction of a statistic for an independence test in bivariate vectors. The test works for both discrete and continuous data. Since the procedure does not require continuity of the variables, it extends the proposal introduced in Independence tests for continuous random variables based on the longest increasing subsequence (2014). We show the efficiency of the procedure in detecting dependence in real cases and through simulations.


Introduction
In this article, we use an expanded structure of the symmetric group $S_n$, the set of permutations from $\{1, \ldots, n\}$ to $\{1, \ldots, n\}$, to develop a procedure for detecting dependence in bivariate random vectors. The procedure is based on identifying the longest non-decreasing subsequence (LNDSS) in the graph of the paired marginal ranks of the observations. It records the length of that subsequence and assesses how likely such a length is in the expanded space of $S_n$, under the assumption of independence between the variables. The procedure requires no assumption about the type of the two random variables being tested: they may be both discrete, both continuous, or mixed (one discrete, one continuous).
When deciding whether independence between random variables can be discarded, it is usually necessary to establish the nature of the variables, that is, whether they are continuous or discrete. For continuous random variables there are several procedures, for example Hoeffding's test and those based on dependence coefficients (Spearman's, Pearson's, Kendall's, etc.). For the discrete case the options are few, the most popular being Pearson's Chi-squared test. Tests based on the Kendall and Spearman coefficients, with corrections that account for ties, can also be used to test independence between two discrete ordinal variables (see [1,2]); in general, they are recommended for small sample sizes. Moreover, some derivations of the Chi-squared statistic have been proposed to test independence between two nominal variables, as is the case of Cramér's V statistic, see [3].
The goal of this article is to present an independence test developed from the notion of the LNDSS among the ranks of the observations, see [4]. The main notion was introduced in [5], with a different implementation from the one proposed here. The alterations proposed in this paper aim to improve the procedure's performance. The methodology places no restriction on the type of the two random variables being tested, which can be continuous or discrete.

The Procedure
We start this section with the construction of the test statistic. For that, we introduce the LNDSS notion.

Definition 1. Given the set $Q = \{q_1, \ldots, q_n\}$ of cardinality $n$ such that $q_i \in \mathbb{R}$, $\forall i \in \{1, \ldots, n\}$,
i. the subsequence $\{q_{i_1}, \ldots, q_{i_k}\}$ of $Q$ is a non-decreasing subsequence of $Q$ if $1 \le i_1 < \cdots < i_k \le n$ and $q_{i_1} \le q_{i_2} \le \cdots \le q_{i_k}$;
ii. the length of a subsequence verifying i. is $k$;
iii. $lnd_n(Q) = \max_k \{1 \le k \le n : \{q_{i_1}, \ldots, q_{i_k}\} \in S_n\}$, where $S_n$ is the set of subsequences of $Q$ verifying i.

That is, $lnd_n(Q)$ is the length of the longest non-decreasing subsequence of $Q$. Using the next definition, we adapt this notion to the context of random samples.
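As an illustration of item iii. of Definition 1, the following minimal sketch computes the length of the longest non-decreasing subsequence of a finite sequence. The function name `lnd` and the toy input are ours and do not come from the paper; the algorithm (a patience-sorting scan with binary search) is one standard way to obtain this length and is not claimed to be the authors' implementation.

```python
from bisect import bisect_right


def lnd(q):
    """Length of the longest non-decreasing subsequence of the sequence q."""
    # tails[k] is the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for value in q:
        # bisect_right places equal values after existing ones, so ties
        # are allowed to extend a subsequence (non-decreasing, not strict).
        pos = bisect_right(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


# Toy sequence (hypothetical): one longest subsequence is 1, 1, 3, 3, 6.
print(lnd([1, 1, 3, 3, 2, 6]))  # 5
```

Using `bisect_left` instead of `bisect_right` would give the strictly increasing variant, which is the notion that requires the absence of ties.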

Definition 2.
Consider $(X, Y)$ a random vector with joint cumulative distribution function $H$, and let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be independent realizations of $(X, Y)$. We denote by $LND_n$ the random variable built from iii. of Definition 1 as $LND_n = lnd_n(Q_D)$, where $Q_D$ is built from the paired marginal ranks of the observations.

Consider $S_n = \{\pi : \pi \text{ is a permutation}, \ \pi : \{1, \ldots, n\} \to \{1, \ldots, n\}\}$. The set $Q_D$ given by Definition 2, when there are no ties, is a specific case of the finite set $S_n$. Moreover, $S_n$ is an algebraic group under the law of composition of permutations. Given two permutations $\pi_1, \pi_2$, the composition $\pi_2 * \pi_1$ is applied from right to left: first $\pi_1$ is applied, and then $\pi_2$ is applied to the result; the composition is again a permutation. The law of composition is associative, has an identity element, and every member of $S_n$ has an inverse. By definition, the symmetric group over a set is the group whose elements are all the bijections from the set to itself; hence $S_n$ is the symmetric group of the set $\{1, \ldots, n\}$, since it is composed of all the bijections from $\{1, \ldots, n\}$ to $\{1, \ldots, n\}$. Since $\{1, \ldots, n\}$ is finite, the bijections are permutations.

Through the next example, we show the construction of the LNDSS in a set $Q_D$ related to fictional observations.

Example 1. Table 1 shows an artificial data set with $n = 6$, already ordered by the magnitude of the $x_i$ values. We show the graphical construction of $LND_n$ ($n = 6$) from the plot of the ranks of the observations, shown in Figure 1. The value of $LND_6$ for this example is 5. We note that the indicated trajectory corresponds to 1 → 1, 3 → 1, 3 → 3, 5 → 3, 6 → 6, which is no longer a permutation in the traditional sense, since it allows repetition both in the domain and in the image.
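The group properties of $S_n$ described above can be checked numerically for a small $n$. The sketch below is only an illustration: it represents a permutation as a tuple of images on $\{0, \ldots, n-1\}$ (a zero-based relabelling of $\{1, \ldots, n\}$, chosen for convenience), and the helper names `compose` and `inverse` are ours.

```python
from itertools import permutations


def compose(pi2, pi1):
    """Return pi2 * pi1, applied right to left: first pi1, then pi2."""
    return tuple(pi2[pi1[i]] for i in range(len(pi1)))


def inverse(pi):
    """Return the permutation that undoes pi."""
    inv = [0] * len(pi)
    for i, image in enumerate(pi):
        inv[image] = i
    return tuple(inv)


n = 3
S_n = list(permutations(range(n)))   # all n! bijections of {0, ..., n-1}
identity = tuple(range(n))

# Closure: the composition of two permutations is again a permutation.
closed = all(compose(a, b) in S_n for a in S_n for b in S_n)
# Associativity of the law of composition.
associative = all(
    compose(a, compose(b, c)) == compose(compose(a, b), c)
    for a in S_n for b in S_n for c in S_n
)
# Every element has an inverse with respect to the identity.
invertible = all(compose(a, inverse(a)) == identity for a in S_n)

print(len(S_n), closed, associative, invertible)  # 6 True True True
```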

Remark 2.
Note that the construction of the statistic $LND_n$ is symmetric, in the sense that exchanging the roles of $X$ and $Y$ gives the same result. Formally, this is a consequence of the following property. Consider a sample $\{(X_i, Y_i)\}_{i=1}^{n}$ and an increasing set of indexes $\{I_1, \ldots, I_k\} \subseteq \{1, \ldots, n\}$ such that the trajectory $(X_{I_1}, Y_{I_1}) - (X_{I_2}, Y_{I_2}) - \cdots - (X_{I_k}, Y_{I_k})$ constitutes a non-decreasing subsequence (as illustrated by Example 1). This occurs if and only if $X_{I_i} \le X_{I_{i+1}}$ and $Y_{I_i} \le Y_{I_{i+1}}$, $1 \le i \le k - 1$; hence the trajectory $(Y_{I_1}, X_{I_1}) - (Y_{I_2}, X_{I_2}) - \cdots - (Y_{I_k}, X_{I_k})$ also constitutes a non-decreasing subsequence.
The example shows that the procedure operates in an extended space of the symmetric group $S_n$. Below we motivate identifying dependence through trajectories such as those used by Definition 2 and exemplified in Figure 1. The dependence in a bivariate vector can be represented by the ranks of the observations, as the following simple illustration shows.
On the left of Figure 2 we see an apparent relationship between the random variables. This illusion disappears in the graph on the right: when the ranks of the observations are computed, the marginal stochastic structure is neutralized, exposing the actual dependence structure between $X$ and $Y$. In this case, $X$ and $Y$ are independent, since they were generated in this way. On the other hand, if $X$ and $Y$ were dependent, the right panel of Figure 2 would expose a pattern, and traces of it would be captured by the $LND_n$ notion. The hypothesis of independence between the random variables is then formulated as

$$H_0 : X \text{ and } Y \text{ are independent.} \tag{1}$$

Here follows the test statistic built from Definition 2.

Definition 3. Let $D = \{(X_i, Y_i)\}_{i=1}^{n}$ be the sample. The statistic $JLND_n$ is defined by

$$JLND_n = \frac{1}{n} \sum_{(u,v) \in D} LND_n(u, v),$$

where $LND_n(u, v) = lnd_{n-1}(Q_{D_{(u,v)}})$ as given by Definition 2, and $D_{(u,v)} = D \setminus \{(u, v)\}$.

That is, we consider the notion given by Definition 2 for each set $D_{(u,v)}$, which includes the entire sample except one observation, allowing us to build $Q_{D_{(u,v)}}$. Then we define $LND_n(u, v)$, and the test statistic is the average over all the cases $LND_n(u, v)$. Next, we introduce the most frequent formulation of the estimate of the two-sided p-value in a context such as that given by the $JLND_n$ statistic.
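A leave-one-out average of this kind can be sketched as follows. The sketch rests on an assumption that we state explicitly: the LNDSS on the graph of paired ranks is taken to be the longest chain of observations in which both coordinates are non-decreasing (the characterization used in Remark 2). Because ranks preserve order and ties, the chain can be computed directly on the raw pairs. The names `lnd`, `lnd_pairs` and `jlnd` are ours, and the toy data are hypothetical: the first five pairs reproduce the trajectory quoted in Example 1 and the sixth is made up.

```python
from bisect import bisect_right


def lnd(q):
    """Length of the longest non-decreasing subsequence of q."""
    tails = []
    for value in q:
        pos = bisect_right(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def lnd_pairs(points):
    """Longest chain of pairs with both coordinates non-decreasing.

    Assumption: this is how the LNDSS on the rank graph is read here.
    Sorting lexicographically (by x, then y) reduces the problem to a
    longest non-decreasing subsequence of the y values.
    """
    ordered = sorted(points)
    return lnd([y for _, y in ordered])


def jlnd(points):
    """Average of the leave-one-out chain lengths (jackknife-type statistic)."""
    n = len(points)
    total = sum(lnd_pairs(points[:i] + points[i + 1:]) for i in range(n))
    return total / n


# Hypothetical toy data: five pairs from the trajectory of Example 1
# plus one made-up pair, (2, 5).
points = [(1, 1), (3, 1), (3, 3), (5, 3), (6, 6), (2, 5)]
print(lnd_pairs(points))  # 5
print(jlnd(points))
```

Swapping the coordinates of every pair leaves `lnd_pairs` unchanged, which is the symmetry noted in Remark 2.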

Definition 4.
The estimator of the two-sided p-value for the statistical test of independence between $X$ and $Y$ (see (1)) is defined by

$$\text{p-value} = 2\hat{F}_{JLND_n}(jlnd_0)\, I_{\{\hat{F}_{JLND_n}(jlnd_0) < 1/2\}} + 2\left(1 - \hat{F}_{JLND_n}(jlnd_0)\right) I_{\{\hat{F}_{JLND_n}(jlnd_0) \ge 1/2\}},$$

where $jlnd_0$ is the value of $JLND_n$ calculated in the sample (see Definition 3), $\hat{F}_{JLND_n}$ is the empirical cumulative distribution function of $JLND_n$ under independence, and $I_A$ is the indicator function of the set $A$.
In the following subsection, we analyze the performance of two proposals to estimate $F_{JLND_n}$: one introduced in [5] and the other proposed in this paper.
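A two-sided p-value of this type can be computed from a list of replicates of the statistic obtained under independence. The sketch below assumes the usual doubling of the smaller tail of the empirical distribution, which agrees with the upper-tail expression $2(1 - \hat{F}(jlnd_0))$ used later in the paper; the function names and the replicate values are hypothetical.

```python
def ecdf(replicates, q):
    """Empirical cumulative distribution function of the replicates at q."""
    return sum(r <= q for r in replicates) / len(replicates)


def two_sided_p_value(jlnd0, null_replicates):
    """Double the smaller tail of the empirical null distribution at jlnd0."""
    f = ecdf(null_replicates, jlnd0)
    return 2 * f if f < 0.5 else 2 * (1 - f)


# Hypothetical replicates of the statistic under independence.
null = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]

print(round(two_sided_p_value(6.5, null), 4))  # 0.2 (upper tail)
print(round(two_sided_p_value(3.0, null), 4))  # 0.2 (lower tail)
```

At $\hat{F} = 1/2$ both branches give the value 1, so the choice of which branch contains the boundary has no effect.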

$\hat{F}_{JLND_n}$ Estimates
$F_{JLND_n}$ can be estimated by bootstrap, see for instance [5]. Denote this kind of estimate by $\hat{F}^{B}_{JLND_n}$. The procedure to build $\hat{F}^{B}_{JLND_n}$ under the hypothesis $H_0$ is replicated here. Let $B$ be a positive integer. We compute $B$ resamples of size $n$, with replacement, of $X_1, X_2, \ldots, X_n$ and of $Y_1, Y_2, \ldots, Y_n$ separately, since we assume that $H_0$ is true. That is, for $b = 1, 2, \ldots, B$ we generate $X^b_1, X^b_2, \ldots, X^b_n$ by resampling from $X_1, X_2, \ldots, X_n$ and $Y^b_1, Y^b_2, \ldots, Y^b_n$ by resampling from $Y_1, Y_2, \ldots, Y_n$, and from the sample $\{(X^b_i, Y^b_i)\}_{i=1}^{n}$ we compute the notion $JLND_n$ of Definition 3, say $JLND^b_n$. Then, if $|A|$ denotes the cardinality of $A$, set

$$\hat{F}^{B}_{JLND_n}(q) = \frac{|\{b : JLND^b_n \le q\}|}{B}. \tag{2}$$

In Table 2, we show the performance of the $JLND_n$ test based on the computation of the p-value (Definition 4) by the bootstrap technique of Equation (2). We generated $n$ independent pairs from discrete Uniform distributions on 1 to $m$, and we computed, over 1000 simulations, the proportion showing a p-value (Definition 4) $\le \alpha$, indicating the rejection of $H_0$. Such a proportion is expected to be close to $\alpha$ in order to control the type 1 error. As we can see, when the number of categories $m$ increases, the $\alpha$ level is no longer respected, since the registered proportion always exceeds $\alpha$.

To improve the control of the type 1 error, this paper proposes an alternative way to estimate $F_{JLND_n}$. The bootstrap method described above and used in [5] can be modified to avoid the removal of any of the observations, following the strategy of swapping them. We consider $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ separately. Given a positive integer $B$, for each $b \in \{1, \ldots, B\}$ consider a permutation $\pi_b : \{1, \ldots, n\} \to \{1, \ldots, n\}$ and define $X_{\pi_b(1)}, \ldots, X_{\pi_b(n)}$, paired with the $Y$ observations; from that sample compute the notion $JLND_n$ of Definition 3, say $JLND^{\pi_b}_n$, and set

$$\hat{F}^{\pi}_{JLND_n}(q) = \frac{|\{b : JLND^{\pi_b}_n \le q\}|}{B}. \tag{3}$$

The bootstrap estimate of Equation (2) uses samples drawn with replacement, which tends to increase the number of ties. For example, if the original sample has no ties, the bootstrap procedure tends to create ties, leading to longer non-decreasing subsequences. The permutation-based procedure behind Equation (3) lacks such a tendency, which makes it a more suitable strategy.

Table 2. The proportion of p-value $\le \alpha$ computed from Definition 4 and Equation (2).

In Table 3, we show the performance of the $JLND_n$ test based on the computation of the p-value (Definition 4) according to Equation (3). We use the same settings as in Table 2, and we also include simulations for $m = 2, 3, 4, 5$. Using Equation (3) gives better control of the type 1 error: in most cases the proportion does not exceed $\alpha$, and when it does, it remains close to $\alpha$.
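The difference between the two resampling schemes can be seen in a short sketch. It assumes, as one reading of the text, that a permutation replicate permutes the $X$ values and pairs them with the $Y$ values in their original order; the function names and the toy sample are ours. The number of distinct $X$ values is used only as a simple indicator of ties.

```python
import random


def bootstrap_replicate(x, y, rng):
    """Resample x and y separately, with replacement (scheme behind Eq. (2))."""
    n = len(x)
    xb = [x[rng.randrange(n)] for _ in range(n)]
    yb = [y[rng.randrange(n)] for _ in range(n)]
    return list(zip(xb, yb))


def permutation_replicate(x, y, rng):
    """Permute x and pair it with y (assumed scheme behind Eq. (3))."""
    xp = list(x)
    rng.shuffle(xp)
    return list(zip(xp, y))


def empirical_cdf(replicates, q):
    """Proportion of replicates that are <= q, as in Equations (2) and (3)."""
    return sum(r <= q for r in replicates) / len(replicates)


rng = random.Random(0)
# Hypothetical sample without ties.
x = list(range(20))
y = [v * 0.5 for v in range(20)]

boot = bootstrap_replicate(x, y, rng)
perm = permutation_replicate(x, y, rng)

# Sampling with replacement can only repeat values, never add new ones;
# a permutation keeps every original value exactly once.
print(len({a for a, _ in boot}), "distinct x values after bootstrap")
print(len({a for a, _ in perm}), "distinct x values after permutation")
```

Applying the statistic to $B$ replicates of either kind and passing the results to `empirical_cdf` gives the corresponding estimate of the null distribution.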
Returning to the construction of the hypothesis test (Equation (1)), we note that the hypothesis $H_0$ is used in the construction of both types of estimates of the cumulative distribution, Equations (2) and (3): both treat the samples separately, $\{X_i\}_{i=1}^{n}$ on one side and $\{Y_i\}_{i=1}^{n}$ on the other. Then, the distribution of the length of the LNDSS under $H_0$ is estimated by both procedures, which allows us to compute the evidence against $H_0$ given by the value observed in the originally paired sample $\{(X_i, Y_i)\}_{i=1}^{n}$, applying Definition 4. Moreover, type 1 error control refers to limiting how often a procedure rejects $H_0$ when $H_0$ is valid; such a rejection is an unwanted outcome that must be controlled. In the study presented in Tables 2 and 3, based on the two ways of estimating the cumulative distribution of $JLND_n$ (under $H_0$), by Equations (2) and (3) respectively, we see that for a fixed level $\alpha$, Equation (3) offers better performance than Equation (2), since it keeps the type 1 error at the pre-established level. For this reason, the test based on the statistic $JLND_n$ with the implementation given by Equation (3) is the more advisable in practice.
The following section describes the behavior of the test in different simulated situations, in order to identify its strengths and weaknesses.

Table 3. The proportion of p-value $\le \alpha$ computed from Definition 4 and Equation (3).
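The kind of Monte Carlo check of the type 1 error reported in Tables 2 and 3 can be sketched end to end. Everything below is an illustration under stated assumptions: the LNDSS is computed as the longest chain with both coordinates non-decreasing, the null distribution is obtained by permuting $X$ against $Y$, and the p-value doubles the smaller tail. The function names are ours, and the settings (`n`, `m`, `sims`, `B`) are deliberately tiny so that the sketch runs quickly; they are not the settings of the paper.

```python
import random
from bisect import bisect_right


def lnd(q):
    """Length of the longest non-decreasing subsequence of q."""
    tails = []
    for value in q:
        pos = bisect_right(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def jlnd(points):
    """Average leave-one-out chain length (assumed reading of Definition 3)."""
    n = len(points)
    total = 0
    for i in range(n):
        rest = sorted(points[:i] + points[i + 1:])
        total += lnd([y for _, y in rest])
    return total / n


def permutation_p_value(x, y, B, rng):
    """Two-sided p-value from B permutation replicates of the statistic."""
    observed = jlnd(list(zip(x, y)))
    xp = list(x)
    replicates = []
    for _ in range(B):
        rng.shuffle(xp)
        replicates.append(jlnd(list(zip(xp, y))))
    f = sum(r <= observed for r in replicates) / B
    return 2 * min(f, 1 - f)


def rejection_rate(n, m, alpha, sims, B, rng):
    """Proportion of independent discrete-uniform samples with p <= alpha."""
    rejections = 0
    for _ in range(sims):
        x = [rng.randint(1, m) for _ in range(n)]
        y = [rng.randint(1, m) for _ in range(n)]
        if permutation_p_value(x, y, B, rng) <= alpha:
            rejections += 1
    return rejections / sims


rate = rejection_rate(n=12, m=5, alpha=0.05, sims=20, B=49, rng=random.Random(1))
print(rate)
```

With so few simulations the printed proportion is only a rough indication; a study comparable to the tables would use far larger values of `sims` and `B`.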

Simulations
To investigate the performance of the $JLND_n$-based procedure, we aim to determine its ability to reject $H_0$ in scenarios with dependence. We focus on the procedure that uses Equation (3) to compute the p-value, given the justification of Section 2.1. We begin by considering the discrete distributions described below, together with some mixtures or disturbances of them.
We take discrete uniform distributions on different regions: with fixed values $m$, $b$ and $a$ such that $m, b, a \in \mathbb{Z}_{>0}$, these regions define the distributions D1, D2 and D3.

As expected, for distribution D1 the $JLND_n$ procedure shows maximum performance for all sample sizes and variants of $m$ and $a$. For distribution D2, its performance improves and reaches the maximum as the sample size increases, for all variants of $m$ and $a$. For distribution D3, we notice a deterioration in the performance of the test compared with D1 and D2; despite this, the procedure responds adequately to the sample size, increasing its ability to detect dependence as the sample size grows.
Mi is the distribution that results from disturbing Di, so it makes sense to compare the effect of the disturbance, which in the illustrated cases is 20% from U(m). For the distribution M1, the $JLND_n$-based procedure shows optimal performance, as in the case D1. In cases M2 and M3, the performance deteriorates when compared with D2 and D3, respectively. Despite this, within the framework given by M2, the good properties of the procedure are preserved when the sample size is increased.
In the following simulations, we investigate the dependence between discrete and continuous variables. The types explored are denoted by D4 and D5; Figure 4 illustrates the cases. Note that when using p = 1 in ix (or x) we recover D4 (or D5). Tables 7 and 8 show the performance over 1000 simulations of size $n$, from M4(m, 0.5) in Table 7 and M5(m, 0.5) in Table 8. On the left of each table (with p = 1) are simulated cases similar to those illustrated in Figure 4, D4 and D5. Table 7 shows that for distribution D4 the procedure is very efficient, and that when the distribution is disturbed (by including 20% from W, on the right of Table 7) the procedure maintains its efficiency in detecting dependence. For distribution D5, Table 8 shows two effects: the one produced by the sample size $n$ and the one produced by the value of $m$. As $n$ and $m$ increase, the procedure gains power quickly. The same effect is observed for the M5 distribution (the disturbance of D5), with a certain deterioration in the power of the test.

The $JLND_n$ statistic is built on the graph of the paired ranks of the observations, and it is given by the size of the LNDSS found in this graph (see Figure 1). The proposal induces a region where the statistic can find evidence of dependence: the diagonal of the graph. The simulation study indicates that the procedure has detection power in situations with an increasing pattern in the direction in which the $JLND_n$ statistic is built. Moreover, the concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure, since the statistic $JLND_n$ is formulated considering the expanded $S_n$ space provided with the uniform distribution. See Tables 4-8, in which we observe that the detection capacity of $JLND_n$ is preserved as the sample size increases. Also, looking at the right side of those tables, we verify the robustness of the procedure when inspecting cases with a concentration of points on the diagonals that suffer contamination, provided the sample size grows.
In the next section, we apply the test to real data and compare our results with other procedures.

Applying the Test in Real Data
As already noted, some data sets have ties produced by the precision used in data collection. This is the case of the wine data set (from the gclus R package), composed of 178 observations. For example, consider the cases (i) Alcohol vs. Flavanoids (see Figure 5, left) and (ii) Flavanoids vs. Intensity (see Figure 5, right). In each case both variables are continuous but recorded with a precision of two decimal places. We use known procedures from the area of continuous variables. All computations are carried out in the R software environment. The "hoeffd" function in the "Hmisc" package is used to compute the p-value of Hoeffding's test. The "cor.test" function in the "stats" package is used to compute the p-value for the Pearson, Spearman and Kendall tests, see also [8]. Finally, we use the "indepTest" function from the "copula" package to compute the "Copula" test. In case (i) (Figure 5, left), all the procedures report a p-value less than 0.02. Using $JLND_n$ ($jlnd_0 = 31.843$), we obtain p-value = 0.0160 and p-value = 0.0004, applying Equations (2) and (3), respectively. That is, the $JLND_n$-based procedures detect dependence without the possible contraindications that the other procedures have, given the ties in the data set.
From the appearance of the scatter plot (Figure 5, right), it is understandable that the tests based on the Spearman and Kendall coefficients have difficulty in registering dependence, see Table 9. We also see that the other procedures capture the signs of dependence, as does the one proposed in this paper ($jlnd_0 = 29.904$). In both situations (cases (i) and (ii)), the only procedure without contraindication and with a p-value significant enough to reject $H_0$ is $JLND_n$.

We also inspect the dependence between the variables Duration (duration of the eruption) and Interval (time until the following eruption), both measured in minutes, corresponding to 222 eruptions of the Old Faithful Geyser during August 1978 and August 1979. The data come from [9] and form a traditional data set used in regression analysis with the aim of predicting the time of the next eruption from the duration of the most recent eruption (see [10]). Figure 6 clearly shows the high number of ties, which compromises procedures designed for continuous variables. We ran the $JLND_n$ test ($jlnd_0 = 63.797$) using several values of $B$: $B$ = 1000, 2000, 5000, 10,000. In all cases the p-value is less than 0.00001, using both versions of the estimate of the cumulative distribution, Equations (2) and (3). Hence the hypothesis of independence between Duration and Interval is rejected.

Figure 6. Duration vs. Interval of geyser data set [9].
The data set cdrate is composed of 69 observations given in the 23 August 1989 issue of Newsday; it consists of the three-month certificate of deposit rates (Return on CD) for 69 Long Island banks and thrifts. The variables are Return on CD and Type = 0 (bank), 1 (thrift); source: [9]. Table 10 shows the data arranged by the values of the attribute Return on CD and divided into the two cases of the variable Type. That table shows sparseness, an issue reported in the literature that compromises the performance of tests based on Pearson's Chi-squared statistic (Table 11), see [1].

Table 11. Independence tests between Return on CD and Type of cdrate data set from [9]. In Equations (2) and (3), B = 5000.

In Table 11 we see the results for testing $H_0$. According to the $JLND_n$ test we must reject $H_0$, which seems to be confirmed by Figure 7, which compares the behavior of the variable Return on CD for the two values of the variable Type.

We conclude this section with a case from the wine data set, Class vs. Alcohol. Figure 8 shows the relationship for which we wish to verify whether independence can be rejected. Class takes 3 possible values and Alcohol has been recorded with low precision, which leads to ties. The observed value of $JLND_n$ is $jlnd_0 = 99.438$. The p-values given by Equations (2) and (3) indicate the rejection of $H_0$: by Equation (2) we obtain a p-value = 0.0004, and by Equation (3) a p-value < 0.00001.

We note that in the cases of Figure 5 the $JLND_n$ test through Equation (3) yields a lower p-value than the version given by Equation (2). In the cases of Figures 6 and 8, it may simply be an effect of computational precision. For the other cases, it is necessary to take into account that the bootstrap version, by tending to create more ties, tends to underestimate the cumulative distribution; in other words, $\hat{F}^{B}_{JLND_n}(q) \le F_{JLND_n}(q)$, where $F_{JLND_n}(\cdot)$ is the true cumulative distribution. Due to the increasing tendency shown by the cases addressed (see Figure 5), the observed value of the statistic, $jlnd_0$, is expected in each case to lie in the upper tail of the distribution, so that the p-value is given by $2(1 - \hat{F}^{B}_{JLND_n}(jlnd_0))$, see Definition 4. As a consequence, $2(1 - \hat{F}^{B}_{JLND_n}(jlnd_0)) \ge 2(1 - F_{JLND_n}(jlnd_0))$. With the proposal made through Equation (3), we seek to correct the underestimation, since it does not favor the proliferation of ties, which would explain the relationship between the p-values.

Concluding Remarks
In this article, we investigate the performance of the $JLND_n$ statistic in identifying dependence in bivariate random vectors from a paired sample of size $n$. The procedure requires identifying the LNDSS that can be found on the graph of the marginal ranks of the paired observations, see Definitions 1 and 2. The goal is to compare the length of such a subsequence (Definition 3) with the lengths of all possible subsequences under the assumption of independence. This amounts to imposing a uniform distribution on the expanded $S_n$ space. The formulation of the procedure requires an estimate of the distribution of the statistic $JLND_n$ under the assumption of independence, which in this paper is given by Equation (3) (see also Definition 4). The estimate proposed in this paper shows improved performance compared with the one given in [5], see Section 2.1. The concept of the longest non-decreasing subsequence allows us to build a tool with no restriction on the type of variable, continuous or discrete, to which it can be applied.
From the simulation study we confirm that the procedure has detection power in situations with an increasing pattern from left to right and from bottom to top, which is the direction in which the $JLND_n$ statistic is sought (see Figure 1). The observations can be associated with continuous or discrete variables without affecting the power of the test. The concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure if the sample size is large enough. We also verify the robustness of the procedure when inspecting cases that suffer contamination that could conceal the dependence (see Tables 4-8). Finally, we use different real data sets that show the versatility of the procedure in rejecting independence in situations such as (a) the presence of ties, (b) the presence of sparseness, and (c) mixed situations.