Next Article in Journal
New Identities Dealing with Gauss Sums
Next Article in Special Issue
Advanced Approach for Estimating Failure Rate Using Saddlepoint Approximation
Previous Article in Journal
Role of Ergonomic Factors Affecting Production of Leather Garment-Based SMEs of India: Implications for Social Sustainability
Previous Article in Special Issue
On Taylor Series Expansion for Statistical Moments of Functions of Correlated Random Variables
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Random Permutations, Non-Decreasing Subsequences and Statistical Independence

by
Jesús E. García
and
Verónica A. González-López
*
Department of Statistics, University of Campinas, Sérgio Buarque de Holanda, 651, Campinas 13083-859, São Paulo, Brazil
*
Author to whom correspondence should be addressed.
Symmetry 2020, 12(9), 1415; https://doi.org/10.3390/sym12091415
Submission received: 1 August 2020 / Revised: 14 August 2020 / Accepted: 21 August 2020 / Published: 26 August 2020

Abstract

:
In this paper, we show how the longest non-decreasing subsequence, identified in the graph of the paired marginal ranks of the observations, allows the construction of a statistic for the development of an independence test in bivariate vectors. The test works in the case of discrete and continuous data. Since the present procedure does not require the continuity of the variables, it expands the proposal introduced in Independence tests for continuous random variables based on the longest increasing subsequence (2014). We show the efficiency of the procedure in detecting dependence in real cases and through simulations.

1. Introduction

In this article, we use an expanded structure of the symmetric group S n , over the set of permutations from { 1 , , n } to { 1 , , n } , to develop a dependence detection procedure in bivariate random vectors. The procedure is based on identifying the longest non-decreasing subsequence (LNDSS) detected in the graph of the paired marginal ranks of the observations. It records the size of the subsequence and verifies the chances that it has to occur in the expanded space of S n , under the assumption of independence between the variables. The procedure does not require assumptions about the type of the two random variables being tested, such as being both discrete, both continuous or a mixed structures (discrete-continuous).
When we face the challenge of deciding whether the independence between random variables can be discarded, it is necessary to establish the nature of the variables, whether they are continuous or discrete. For continuous random variables, we have several procedures, for example, Hoeffding’s test and those based on dependence’s coefficients (Spearman’s coefficient, Pearson’s coefficient, Kendall’s coefficient, etc.). Instead, for the discrete case, the options are few, the most popular is Pearson’s Chi-squared test. Also, the tests based on Kendall and Spearman coefficients going through corrections that consider ties can be used to test for independence between two discrete and ordinal variables (see [1,2]). In general, recommended for small sample sizes. Moreover, some derivations of the Chi-squared statistic have been projected to test independence between two nominal variables, as is the case of the Cramér’s V statistic, see [3].
The goal of this article is to show an independence test, developed from the notion of the LNDSS among the ranks of the observations, see [4]. The main notion was introduced previously in [5] with a different implementation from the one proposed in this paper. The alterations proposed in this paper aim to improve the procedure’s performance. This methodology works without limitation on the type of the two random variables being tested, which can be continuous/discrete.
The existence of ties in a dataset cast doubts about the use of matured tests for continuous variables, see, for instance, [6] for a discussion on this issue. The use of procedures preconized for continuous random variables, in cases with repetitions in the observations due to the precision used to record the data, may have unforeseen consequences on the performance of the procedure. If the ties are eliminated, the use of asymptotic distributions can be compromised, if the ties are considered (by means of some correction), the control of type 1 and 2 errors can be put at risk (increasing the false positives/negatives of the procedure). Another frequent situation is when one of the variables is continuous, and the other is discrete. For some test of independence, problems may arise from this situation forcing the practitioner to apply some arbitrary data categorization. Under this picture, one of the most popular procedures is the Pearson’s Chi-squared statistic. The traditional tests are based on some of the following statistics Pearson’s Chi-squared, likelihood ratio [1], and Zelterman’s [7] for the case in which the number of categories is too large for the available sample size. Moreover, Zelterman’s [7] do not work well when one of the variables is continuous. [1] shows several examples of independent data, where Pearson’s Chi-squared, likelihood ratio, and Zelterman statistics fail. In [1] is shown that to be reliable, those tests require that each cell in the frequency table should have a minimal (non zero) frequency, which can depend on the total size of the data set. It is shown in [7], that in some situations, with a large number of factors, Pearson’s Chi-squared statistic will behave as a normal random variable with summaries as variance and mean that are unassociated to the Chi-squared distribution, even with large sample size. Those situations are similar to the case of continuous random variables registered with limited precision, which in fact, is similar to a discrete random variable with a large number of categories producing sparseness (or sparse tables).
This article is organized as follows. Section 2 introduces the formulation of the test showing the new strategy, in comparison with the implemented in [5]. Section 3 simulates different situations showing the performance of the procedure. The purpose is to show situations in which the statistic proposed in this paper is efficient in detecting dependence. We consider in the simulations settings concentrating points in the diagonals, the variables being continuous or discrete. We also consider perturbations of such situations, which will show the maintenance or loss of power of the test developed here. Section 4 applies the new procedure to real data, and Section 5 presents the final considerations.

2. The Procedure

We start this section with the construction of the test’s statistics. For that, we introduce the LNDSS notion.
Definition 1.
Given the set Q = { q 1 , , q n } of cardinality n such that q i R , i { 1 , , n } ,
i.
the subsequence { q i 1 , , q i k } of Q is a non-decreasing subsequence of Q if 1 i 1 < < i k n and q i 1 q i 2 q i k ;
ii.
the length of a subsequence verifying i . is k ;
iii.
l n d n ( Q ) = max k { 1 k n : { q i 1 , , q i k } S n } , where S n is the set of subsequences of Q verifying i .
l n d n ( Q ) (item iii., Definition 1) is the length of the LNDSS of Q . Here, consider two illustrations of Definition 1. Suppose that Q = { 1.3 , 0.2 , 2 , 2.1 , 1.2 } . Then the LNDSS are { 1.3 , 2 , 2.1 } and { 0.2 , 2 , 2.1 } , then l n d 5 ( Q ) = 3 . Consider now a collection Q with replications Q = { 1.5 , 2.4 , 1.1 , 2.4 , 3 , 3.1 } , so the LNDSS is { 1.5 , 2.4 , 2.4 , 3 , 3.1 } and l n d 6 ( Q ) = 5 .
Using the next Definition we adapt this notion to the context of random samples.
Definition 2.
Consider ( X , Y ) a random vector with joint cumulative distribution function H , let ( X 1 , Y 1 ) , ( X 2 , Y 2 ) , , ( X n , Y n ) be independent realizations of ( X , Y ) , we denote by L N D n the random variable built from iii. of Definition 1 as
L N D n = l n d n Q D ,
where D = ( X i , Y i ) i = 1 n and Q D = { q r a n k ( X i ) = r a n k ( Y i ) , i = 1 , , n } .
Remark 1.
i.
Note that without the presence of ties, the set Q D is a particular case of all the permutations of the values in the set { 1 , , n } .
ii.
With ties, there is more than one way of defining ranks. We apply the minimum rank notion. For example, the sample 6.1 , 2.1 , 5.3 , 4.7 , 5.5 , 6.2 , 5.3 , 4.7 has ranks 7 , 1 , 4 , 2 , 6 , 8 , 4 , 2 .
If we consider S n = π permutations such that π : { 1 , , n } { 1 , , n } , the subset Q D given by Definition 2 and without ties is a specific case of the finite set S n . Also, S n is an algebraic group if it is considered operating with the law of composition among the possible permutations. Given two permutations π 1 , π 2 the composition between them results when applying π 2 π 1 from right to left, it means first applying π 1 and to its result applying π 2 , that composition also is a permutation. The law of composition is associative, with an identity element and with the existence of an inverse element for each member of S n . By Definition a symmetric group defined over any set is the group whose elements are all the bijections from the set to itself, then S n is the symmetric group of the set { 1 , , n } since, it is composed by all the bijections from { 1 , , n } to { 1 , , n } . Since { 1 , , n } is finite, the bijections are permutations.
Through the next example, we show the construction of the LNDSS in a set Q D related to fictional observations.
Example 1.
Table 1 shows an artificial data with n = 6 and already ordered in terms of the magnitude of x i values. We show the graphical construction of L N D n ,
This data defines a Q D = { 5 , 1 , 1 , 3 , 3 , 6 } . The maximal non-decreasing subsequence is { 1 , 1 , 3 , 3 , 6 } given by the trajectory ( 0 , 0 ) ( 1 , 1 ) ( 3 , 1 ) ( 3 , 3 ) ( 5 , 3 ) ( 6 , 6 ) from the plot between the ranks of the observations, shown in Figure 1. The value of L N D 6 for this example is 5. We note that the indicated trajectory refers to the correspondence of 1 1 , 3 1 , 3 3 , 5 3 , 6 6 , which is no longer a permutation in the traditional sense since, it allows repetition both in the domain and in the image.
Remark 2.
Note that the construction of the statistic L N D n is symmetric in the sense that if we exchange the roles of X and Y , we obtain the same result. Formally, this characteristic is a consequence of the following property. Consider a sample { ( X i , Y i ) } i = 1 n and the increasing set of indexes { I 1 , , I k } { 1 , , n } such that the trajectory ( X I 1 , Y I 1 ) ( X I 2 , Y I 2 ) ( X I k , Y I k ) constitutes a non-decreasing subsequence (as illustrated by Example 1), this occurs if and only if X I i X I i + 1 and Y I i Y I i + 1 , 1 i k 1 , then the trajectory ( Y I 1 , X I 1 ) ( Y I 2 , X I 2 ) ( Y I k , X I k ) constitutes a non-decreasing subsequence also.
The example shows that the procedure operates in an extended space of the symmetric group S n . Below we show a motivation to identify the dependence by trajectories such as those used by Definition 2 and exemplified in Figure 1. The dependence on a bivariate vector can be represented by the ranks of the observations; let’s see a simple motivation.
We see on the left of Figure 2 an apparent relationship between the random variables, this illusion of relationship disappears in the graph on the right, since when computing the ranks of the observations, the marginal stochastic structure is neutralized, showing the dependence between X and Y . And, in this case, X and Y are independent, since they have been generated in this way. On the other hand, if the variables X and Y were dependent, Figure 2 on the right should expose a pattern, and traces of it would be captured by the L N D n notion.
The formulation of the conjectures of independence between the random variables is then given by
H 0 : X and Y are independent H 1 : X and Y are dependent .
Here follows the test’s statistic build from Definition 2.
Definition 3.
Let D = ( X i , Y i ) i = 1 n be replications of ( X , Y ) . Define J L N D n = 1 n ( u , v ) D L N D n ( u , v ) , where L N D n ( u , v ) = l n d n 1 ( Q D ( u , v ) ) as given by Definition 2, and D ( u , v ) = D ( u , v ) , with ( u , v ) D .
That is, we consider the notion given by the Definition 2 for each set D ( u , v ) , which include the entire sample except one, allowing to build Q D ( u , v ) . Then, we define L N D n ( u , v ) and, the test statistic is the average between all the cases L N D n ( u , v ) . Next we introduce the most frequent formulation of estimation of the two-sided p-value in a context such as that given by the J L N D n statistic.
Definition 4.
The estimator of the two sided p-value for the statistical test of independence between X and Y (see (1)) is defined by,
min 2 F ^ J L N D n ( j l n d 0 ) I F ^ J L N D n ( j l n d 0 ) 1 2 + 2 ( 1 F ^ J L N D n ( j l n d 0 ) ) I F ^ J L N D n ( j l n d 0 ) > 1 2 , 1 ,
where j l n d 0 is the value of J L N D n calculated in the sample, see Definition 3. F ^ J L N D n is the empirical cumulative distribution function of J L N D n , under independence, and I A is the indicator function of the set A .
In the following subsection, we analyze the performance of two proposals to estimate F J L N D n , one introduced in [5] and the other proposed by this paper.

2.1. F J L N D n Estimates

F ^ J L N D n can be estimated by using bootstrap, for instance see [5]. Denote this kind of estimation as F ^ J L N D n B . The procedure to buid F ^ J L N D n B under H 0 hypothesis is replicated here. Let be B a positive and integer value, we compute B size n resamples with replacement of X 1 , X 2 , , X n and Y 1 , Y 2 , , Y n separately, since we assume that H 0 is true. That is, we generate X 1 b , X 2 b , , X n b for b = 1 , 2 , , B , resampling from X 1 , X 2 , , X n , and, we generate Y 1 b , Y 2 b , , Y n b for b = 1 , 2 , , B , resampling from Y 1 , Y 2 , , Y n . Then, for each b define D b = ( X i b , Y i b ) i = 1 n and from that sample compute the notion J L N D n , from Definition 3, say J L N D n b . Then, if | A | denotes the cardinal of A , set
F ^ J L N D n B ( q ) = | { b : J L N D n b q } | B .
In Table 2, we show the performance of the J L N D n ’ s test based on the computation of the p-value (Definition 4) according to the Bootstrap technique, given by Equation (2). We generated n independent pairs of discrete Uniform distributions from 1 to m , and we computed in 1000 simulations, the proportion of them showing a p-value (Definition 4) α , indicating the rejection of H 0 . Such a proportion is expected to be close to α , in order to control type 1 error. As we can see, when increasing the number of categories m , the α level is no longer respected, since the registered proportion always exceeds α . In order to improve the control of type 1 error, in this paper is proposed an alternative way to estimate F J L N D n . The Bootstrap method described above and used in [5] can be modified in order to avoid the removal of any of the observations, following the strategy of swapping them. We consider X 1 , X 2 , , X n and Y 1 , Y 2 , , Y n separately, given B Z , for each b 1 , , B consider a permutation π b : { 1 , , n } { 1 , , n } and define X π b ( 1 ) , , X π b ( n ) . Similarly, consider a permutation σ b : { 1 , , n } { 1 , , n } and define Y σ b ( 1 ) , , Y σ b ( n ) . Then, for each b define D π b , σ b = ( X π b ( i ) , Y σ b ( i ) ) i = 1 n and from that sample compute the notion J L N D n , from Definition 3, say J L N D n π b σ b . Then, set
F ^ J L N D n B , π , σ ( q ) = | { b : J L N D n π b σ b q } | B .
Bootstrap generates the estimate by Equation (2), it considers samples with replacement, which tends to increase the number of ties. For example, if the original sample has no ties, the Bootstrap procedure tends to create ties, leading to longer non-decreasing subsequences. The permutation-based procedure that allows the formulation of Equation (3) lacks such a tendency, and this principle seems to be a more suitable strategy.
In Table 3, we show the performance of the J L N D n ’ s test based on the computation of the p-value (Definition 4) according to Equation (3). We implement the same settings used in Table 2, also we include simulations for m = 2 , 3 , 4 , 5 . The impact of Equation (3) allows better control of the type 1 error, we see that in most cases the proportion does not exceed α and when it does it remains close to α .
Returning to the construction of the hypothesis test (Equation (1)), we note that the hypothesis H 0 is used in the construction of both types of estimates of the cumulative distribution, Equations (2) and (3). For both cases, the observed values { ( X i , X i ) } i = 1 n are treated separately, as being independent { X i } i = 1 n by one side and { Y i } i = 1 n for other side. Then, the distribution of the length of the LNDSS, under H 0 , is estimated by both procedures, which allows computing the evidence against H 0 given by the observed value in the originally paired { ( X i , Y i ) } i = 1 n sample and applying Definition 4. Moreover, the type 1 error control refers to the ability of a procedure to reject H 0 under its validity. In other words, it represents an unwanted situation, which we must control. In the study presented by Table 2 and Table 3, based on the two ways of estimating the cumulative distribution of J L N D n (under H 0 ) by Equations (2) and (3) respectively, we see that fixed a level α , Equation (3) offers better performance than Equation (2) since it maintains type 1 error at pre-established levels. For this reason, the test based on the statistic J L N D n with the implementation given by Equation (3) is more advisable in practice.
The following section describes the behavior of the test in different simulated situations, in order to identify its strengths and weaknesses.

3. Simulations

To investigate the performance of the J L N D n -based procedure, we will aim to determine the rejection ability of the procedure in scenarios with dependence. Our research focuses on the procedure that uses the Equation (3) to compute the p-value, given the justification of Section 2.1. We begin our study considering discrete distributions that we describe below and some mixtures or disturbances of them.
We take discrete uniform distributions on different regions, consider m , b and a fixed values such that m , b , a Z > 0 , and set
i.
D 1 ( m , a ) : Uniform on A = ( x , y ) { 1 , , m } 2 : | x y | a ;
ii.
D 2 ( m , a ) : Uniform on A = ( x , y ) { 1 , , m } 2 : | x y | a o r | x + y m 1 | a ;
iii.
D 3 ( m , a , b ) : Uniform on A = { ( x , y ) { 1 , , m } 2 : | x y | a o r | x y + b | a o r | x y b | a or   | x y 2 b | a   or   | x y + 2 b | a } .
The performance of the distributions i.- iii. is illustrated in Figure 3, using m = 20 , a = 1 and b = 6 . Denote by U ( m ) the Uniform distribution on A = ( x , y ) { 1 , , m } 2 , given p [ 0 , 1 ] consider now the next three mixture of distributions
iv.
M 1 ( m , a ) : p D 1 ( m , a ) + ( 1 p ) U ( m ) ;
v.
M 2 ( m , a ) : p D 2 ( m , a ) + ( 1 p ) U ( m ) ;
vi.
M 3 ( m , a , b ) : p D 3 ( m , a , b ) + ( 1 p ) U ( m ) .
where the notation p D 1 ( m , a ) + ( 1 p ) U ( m ) represents that the bivariate vector is given in proportion p by the distribution D 1 ( m , a ) and in proportion ( 1 p ) by the distribution U ( m ) . Note that if p = 1 we recover the distributions D i , i = 1 , 2 , 3 . From Table 4, Table 5 and Table 6, settings have projected that increase the number of categories of the discrete Uniform distribution U ( m ) , and also increase the parameters a and b .
Table 4, Table 5 and Table 6 show the rejection rates, obtained through 1000 simulations of samples of size n . We inspect the distributions D i , i = 1 , 2 , 3 and the mixtures M i , i = 1 , 2 , 3 with p = 0.8 . What is sought is to obtain high proportions evidencing the control of type 2 error.
As expected, for distribution D 1 the procedure J L N D n shows maximum performance, for all sample sizes and variants of m and a . For distribution D 2 , the performance of the procedure J L N D n improves and reaches maximum performance as the sample size increases, for all variants of m and a . For distribution D 3 , we noticed a deterioration in the performance of the test when compared to the other two cases D 1 and D 2 , despite this, the procedure responds adequately to the sample size, increasing its ability to detect dependence with increasing sample size.
M i is a distribution that results from disturbing D i , so it makes sense to compare the effect of the disturbance, which in the illustrated cases is 20% from U ( m ) . For the distribution M 1 the J L N D n -based procedure shows optimal performance, as occurs in the case D 1 . In cases M 2 and M 3 , there is a deterioration in the performance of the procedure J L N D n when compared to D 2 and D 3 , respectively. Despite this, within the framework given by M 2 , we see that the good properties of the procedure are preserved when the sample size is increased.
In the following simulations, we investigate the dependence between discrete and continuous variables. The types explored are denoted by D 4 and D 5 , Figure 4 illustrates the cases. Consider m Z > 0 , a R > 0 and set
vii.
D 4 ( m , a ) : Uniform distribution on A = ( x , y ) { 1 , , m } × [ 0 , m + 1 ] : | x y | a ;
viii.
D 5 ( m , a ) : Uniform on A = ( x , y ) { 1 , , m } × [ 0 , m + 1 ] : | x y | a o r | x + y m 1 | a .
Denote by W ( m ) the Uniform distribution on A = ( x , y ) { 1 , , m } × [ 0 , m + 1 ] given p [ 0 , 1 ] consider now the next two mixture of distributions
ix.
M 4 ( m , a ) : p D 4 ( m , a ) + ( 1 p ) W ( m ) ;
x.
M 5 ( m , a ) : p D 5 ( m , a ) + ( 1 p ) W ( m ) .
Note that when using p = 1 in ix (or x) we recorver D 4 (or D 5 ).
Table 7 and Table 8 show the performance of 1000 simulations of size n , from M 4 ( m , 0.5 ) in Table 7, M 5 ( m , 0.5 ) in Table 8. To the left of each Table (with p = 1 ), are simulated cases similar to the illustrated in Figure 4, D 4 and D 5 . Table 7 shows that in the case of distribution D 4 , the procedure is very efficient and, we see that when the distribution is disturbed (by including 20% from W , to the right of Table 7) the procedure maintains its efficiency in detecting dependence. In relation to the distribution D 5 , we see from Table 8 that two effects occur, the one produced by the sample size n and the one produced by the value of m . By increasing n and m the procedure gains power quickly. The same effect is observed in the M 5 distribution ( D 5 disturbance), with a certain deterioration in the power of the test.
The J L N D n statistic is built in the graph of the paired ranks of the observations, and it is given by the size of the LNDSS found in this graph (see Figure 1). The proposal induces a region where this statistic can found evidence of dependence, in the diagonal of the graph. The simulation study points that the detection power of the procedure occurs in situations with an increasing pattern in the direction in which the J L N D n statistic is built. Even more, the concomitant presence of increasing patterns and decreasing patterns does not necessarily nullify the detection capacity of the procedure, since the statistic J L N D n is formulated considering the expanded S n space provided with the uniform distribution. See Table 4, Table 5, Table 6, Table 7 and Table 8 in which we observe that by increasing the sample size, the detection capacity of J L N D n is preserved. Also, looking at the right side of the tables already cited, we verify the robustness of the procedure, when inspecting cases with a concentration of points in the diagonals and suffering contamination, if the sample size grows.
In the next section, we apply the test to real data and compare our results with other procedures.

4. Applying the Test in Real Data

As it has already been commented, in some data sets, we have ties, produced by the precision used in data collection. This is the case of the wine data set (from the glus R-package), composed of 178 observations. For example, consider the cases (i) Alcohol vs. Flavonoids (see Figure 5, left) and (ii) Flavanoids vs. Intensity (see Figure 5, right). For each case (i) and (ii) both variables are continuous but recorded with a precision of two decimal places. We use known procedures in the area of continuous variables. For all the computations is used the R-project software environment. The “hoeffd” function in the “Hmisc” package is used to compute the p-value in the case of Hoeffding’s test. The “cor.test” function in the “stat” package is used to compute the p-value for Pearson, Spearman and Kendall tests, see also [8]. Finally, we use the “indepTest” function, from the “copula” package to compute the “Copula“ test.
In case (i) of Figure 5 (left) all the procedures report p-value less than 0.02. Using J L N D n ( j l n d 0 = 31.843 ) we obtain p-value = 0.0160 and p-value = 0.0004, applying Equations (2) and (3), respectively. That is, J L N D n -based procedures detect dependence without the possible contraindications that the other procedures have, since we see ties in the dataset.
From the appearance of the scatter plot (Figure 5, right), it is understandable that the tests based on the Spearman and Kendall coefficients show difficulties in recording dependence, see Table 9. We also see that the other procedures capture the signs of dependence as well as the one proposed in this paper ( j l n d 0 = 29.904 ). In both situations (cases (i) and (ii)) the only procedure, without contraindication, with significant p-value to reject H 0 is J L N D n .
We inspect also the dependence between the variables Duration: duration of the eruption and Interval: time until following eruption, both measures in minutes, corresponding to 222 eruptions of the Old Faithful Geyser during August 1978 and August 1979. The data is coming from [9] and it is a traditional data set used in regression analysis with the aim of predicting the time of the next eruption using the duration of the most recent eruption (see [10]).
Figure 6 clearly shows the high number of ties, which compromises procedures designed for continuous variables. We have run the J L N D n test ( j l n d 0 = 63.797 ), using various values of B, B = 1000, 2000, 5000, 10,000. In all cases the p-value is less than 0.00001 and using both versions to estimate the cumulative distribution, Equations (2) and (3). Then the hypothesis of independence between Duration and Interval is rejected.
The data set, cdrate is composed by 69 observations given in the 23 August 1989, issue of Newsday, it consists of the three-month certificate of deposit rates Return on CD for 69 Long Island banks and thrifts. The variables are Return on CD and Type = 0 (bank), 1 (thrift), source: [9]. Table 10 shows the data arranged based on the values of the attribute Return on CD and divided into the two cases of the variable Type. That table shows sparseness, an issue reported in the literature, that compromises the performance of tests Pearson’s Chi-squared based (Table 11), see [1].
In Table 11 we see the results for testing H 0 . We see that according to J L N D n ’s test we must reject H 0 , which seems to be confirmed by Figure 7. Figure 7 comparatively shows the performance of variable Return on CD, for the two values of variable Type.
We conclude this section with a case of the wine data set, Class vs. Alcohol. Figure 8 shows the relationship in which we wish to verify whether independence can be rejected. Class registers 3 possible values and Alcohol has been registered with low precision, which leads to observing ties. The observed value of J L N D n is j l n d 0 = 99.438 . The p-value given by the Equations (2) and (3) indicate the rejection of H 0 . By the Equation (2) we obtain a p-value = 0.0004 and by the Equation (3) we obtain a p-value < 0.00001.
We note that in the cases of Figure 5 we have verified that the test J L N D n through Equation (3) offers lower p-value than the version given by Equation (2). In the cases of Figure 6 and Figure 8, it may simply be an effect of computational precision. For the other cases, it is necessary to take into account that the Bootstrap version, by tending to create more ties, shows a tendency to underestimate the cumulative distribution, in other words, F ^ J L N D n B ( q ) F J L N D n ( q ) where F J L N D n ( · ) is the true cumulative distribution. Due to the increasing tendency shown by the cases addressed (see Figure 5), it is expected that the observed value of the statistic J L N D n , j l n d 0 , in each case is positioned in the upper tail of the distribution, which leads to the p-value be given by 2 ( 1 F ^ J L N D n B ( j l n d 0 ) ) , see Definition 4. As a consequence 2 ( 1 F ^ J L N D n B ( j l n d 0 ) ) > 2 ( 1 F J L N D n ( j l n d 0 ) ) . With the proposal made through Equation (3), we seek to correct the underestimation, since it does not favor the proliferation of ties. Which would explain the relationship between the p-value.

5. Concluding Remarks

In this article, we investigate the performance of the J L N D n statistic to identify dependence on bivariate random vectors from a paired sample of size n . The procedure requires identifying the LNDSS that can be found on the graph between the marginal ranks of the paired observations, see Definitions 1 and 2. The goal is to compare the length of such subsequence (Definition 3) with the length of all possible subsequences, under the assumption of independence. This means, imposing an uniform distribution on the expanded S n space. For the formulation of the procedure, it is required to estimate the distribution of the statistic J L N D n , under the assumption of independence and, in this paper it is given by Equation (3) (see also Definition 4). The estimation proposed in this paper shows an improved performance compared with the one given in [5], see Section 2.1. The concept, longest non-decreasing subsequence, allows us to build a tool without restrictions over the type of variable, continuous or discrete in which it can be applied. From the simulation study we confirm that the detection power of the procedure occurs in situations with an increasing pattern from left to right and from bottom to top, which is the direction in which the J L N D n statistic is sought (see Figure 1). The observations can be associated with continuous or discrete variables, not affecting the power of the test. The concomitant presence of increasing patterns and decreasing patterns does not necessarily nullify the detection capacity of the procedure if the size of the samples is big enough. We also verify the robustness of the procedure when inspecting cases that suffer contamination that could conceal the dependence. See Table 4, Table 5, Table 6, Table 7 and Table 8, we use different real data sets that expose the versatility of the procedure to reject independence in situations such as (a) in the presence of ties, (b) in the presence of sparseness, (c) in mixed situations.

Author Contributions

Conceptualization, J.E.G. and V.A.G.-L.; methodology, J.E.G. and V.A.G.-L.; software, J.E.G. and V.A.G.-L.; validation, J.E.G. and V.A.G.-L.; formal analysis, J.E.G. and V.A.G.-L.; investigation, J.E.G. and V.A.G.-L.; data curation, J.E.G. and V.A.G.-L.; writing–review and editing, J.E.G. and V.A.G.-L. Both authors have read and agreed to the published version of the manuscript.

Funding

No funds were received for the execution of this work.

Acknowledgments

The authors wish to thank the two referees for their many helpful comments and suggestions on an earlier draft of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agresti, A. Categorical Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  2. Kendall, M.G. The treatment of ties in ranking problems. Biometrika 1945, 33, 239–251. [Google Scholar] [CrossRef] [PubMed]
  3. Cramir, H. Mathematical Methods of Statistics; Princeton U. Press: Princeton, NJ, USA, 1946; Volume 500. [Google Scholar]
  4. Romik, D. The Surprising Mathematics of Longest Increasing Subsequences; Cambridge University Press: New York, NY, USA, 2015; Volume 4. [Google Scholar]
  5. Garca, J.E.; González-López, V.A. Independence test for sparse data. AIP Conf. Proc. 2016, 1738, 140002. [Google Scholar]
  6. Garca, J.E.; González-López, V.A. Independence tests for continuous random variables based on the longest increasing subsequence. J. Multivar. Anal. 2014, 127, 126–146. [Google Scholar] [CrossRef]
  7. Zelterman, D. Goodness-of-fit tests for large sparse multinomial distributions. J. Am. Stat. Assoc. 1987, 82, 624–629. [Google Scholar] [CrossRef]
  8. Hollander, M.; Wolfe, D. Nonparametric Statistical Methods; John Wiley & Sons: New York, NY, USA, 1973; pp. 185–194. [Google Scholar]
  9. Simonoff, J.S. Smoothing Methods in Statistics; Springer: New York, NY, USA, 1996. [Google Scholar]
  10. Weisberg, S. Applied Linear Regression, 4th ed.; John Wiley & Sons: Minneapolis, MN, USA, 2005; Volume 528. [Google Scholar]
Figure 1. The LNDSS of Q D , from Table 1.
Figure 1. The LNDSS of Q D , from Table 1.
Symmetry 12 01415 g001
Figure 2. (left) X vs. Y . (right) r a n k s ( X ) vs. r a n k s ( Y ) . The values of X and Y are simulated from two independent exponential distributions, λ = 10 for X and λ = 20 for Y , n = 100 .
Figure 2. (left) X vs. Y . (right) r a n k s ( X ) vs. r a n k s ( Y ) . The values of X and Y are simulated from two independent exponential distributions, λ = 10 for X and λ = 20 for Y , n = 100 .
Symmetry 12 01415 g002
Figure 3. (left) D 1 ( 20 , 1 ) , n = 80 , (middle) D 2 ( 20 , 1 ) , n = 80 , (right) D 3 ( 20 , 1 , 6 ) , n = 80 .
Figure 3. (left) D 1 ( 20 , 1 ) , n = 80 , (middle) D 2 ( 20 , 1 ) , n = 80 , (right) D 3 ( 20 , 1 , 6 ) , n = 80 .
Symmetry 12 01415 g003
Figure 4. (left) D 4 ( 5 , 0.5 ) , n = 80 , (right) D 5 ( 5 , 0.5 ) , n = 80 .
Figure 4. (left) D 4 ( 5 , 0.5 ) , n = 80 , (right) D 5 ( 5 , 0.5 ) , n = 80 .
Symmetry 12 01415 g004
Figure 5. (left): Alcohol vs. Flavanoids. (right): Flavanoids vs. Intensity. Variables coming from wine data set from gclus R-package.
Figure 5. (left): Alcohol vs. Flavanoids. (right): Flavanoids vs. Intensity. Variables coming from wine data set from gclus R-package.
Symmetry 12 01415 g005
Figure 6. Duration vs. Interval of geyser data set [9].
Figure 6. Duration vs. Interval of geyser data set [9].
Symmetry 12 01415 g006
Figure 7. Boxplots of Return on CD by variable Type. See cdrate data set [9].
Figure 7. Boxplots of Return on CD by variable Type. See cdrate data set [9].
Symmetry 12 01415 g007
Figure 8. Class vs. Alcohol, see wine data set from gclus R-package.
Figure 8. Class vs. Alcohol, see wine data set from gclus R-package.
Symmetry 12 01415 g008
Table 1. Artificial data set, and its marginal ranks.
Table 1. Artificial data set, and its marginal ranks.
x i y i Rank ( x i )Rank ( y i )
5.310.215
5.39.311
6.19.331
6.110.133
7.110.153
7.311.066
Table 2. The proportion of p-value α computed from Definition 4 and Equation (2), in 1000 simulations of size n , of two independent and discrete Uniform distributions in { 1 , , m } . On top, results for α = 0.01 , on bottom results for α = 0.05 .
Table 2. The proportion of p-value α computed from Definition 4 and Equation (2), in 1000 simulations of size n , of two independent and discrete Uniform distributions in { 1 , , m } . On top, results for α = 0.01 , on bottom results for α = 0.05 .
n m = 10 m = 20 m = 50 m = 100
200.0130.0210.0220.032
400.0210.0380.0370.041
α = 0.01 600.0250.0330.0430.050
800.0190.0400.0530.050
1000.0280.0340.0440.059
n m = 10 m = 20 m = 50 m = 100
200.0840.0890.1120.100
400.0910.1050.1340.143
α = 0.05 600.1040.1140.1480.149
800.0950.1240.1390.159
1000.1130.1110.1250.143
Table 3. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n , of two independent and discrete Uniform distributions in { 1 , , m } . On top, results for α = 0.01 , on bottom results for α = 0.05 .
Table 3. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n , of two independent and discrete Uniform distributions in { 1 , , m } . On top, results for α = 0.01 , on bottom results for α = 0.05 .
n m = 2 m = 3 m = 4 m = 5 m = 10 m = 20 m = 50 m = 100
200.0040.0060.0080.0140.0110.0070.0070.004
400.0040.0090.0050.0080.0070.0080.0100.011
α = 0.01 600.0050.0040.0090.0070.0060.0040.0100.014
800.0060.0050.0100.0110.0120.0090.0060.008
1000.0050.0110.0110.0090.0070.0090.0120.008
n m = 2 m = 3 m = 4 m = 5 m = 10 m = 20 m = 50 m = 100
200.0210.0320.0410.0470.0380.0480.0420.043
400.0190.0380.0300.0430.0300.0460.0450.037
α = 0.05 600.0290.0410.0420.0440.0400.0320.0560.044
800.0390.0310.0450.0460.0510.0460.0460.052
1000.0310.0410.0480.0490.0520.0540.0530.053
Table 4. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 20 , a = 1 , b = 6 .
Table 4. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 20 , a = 1 , b = 6 .
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.3490.0280.9940.1790.018
401.0000.7980.0501.0000.5680.034
α = 0.01 601.0000.9830.1361.0000.8580.078
801.0000.9990.2521.0000.9630.109
1001.0001.0000.3521.0000.9900.181
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.5370.1011.0000.3660.064
401.0000.9060.1771.0000.7570.125
α = 0.05 601.0000.9930.3061.0000.9340.192
801.0001.0000.4681.0000.9850.250
1001.0001.0000.6011.0000.9970.368
Table 5. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 50 , a = 2 , b = 12 .
Table 5. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 50 , a = 2 , b = 12 .
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.4310.0440.9930.2290.024
401.0000.9020.1721.0000.7190.079
α = 0.01 601.0000.9890.3741.0000.9240.180
801.0001.0000.6151.0000.9860.342
1001.0000.9990.7621.0000.9980.515
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.6100.1260.9980.4040.097
401.0000.9490.3681.0000.8370.196
α = 0.05 601.0000.9970.6081.0000.9630.370
801.0001.0000.8231.0000.9930.571
1001.0001.0000.9181.0000.9990.711
Table 6. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 100 , a = 5 , b = 30 .
Table 6. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . m = 100 , a = 5 , b = 30 .
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.4090.0380.9970.1790.024
401.0000.8650.1371.0000.6230.063
α = 0.01 601.0000.9840.2921.0000.8840.141
801.0000.9990.4731.0000.9690.247
1001.0001.0000.6551.0000.9910.394
p = 1.0 p = 0.8
n D 1 D 2 D 3 M 1 M 2 M 3
201.0000.5970.1101.0000.3450.090
401.0000.9330.3001.0000.7710.194
α = 0.05 601.0000.9960.5201.0000.9490.326
801.0001.0000.7151.0000.9900.468
1001.0001.0000.8481.0000.9980.620
Table 7. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . Distribution M 4 with a = 0.5 .
Table 7. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . Distribution M 4 with a = 0.5 .
p = 1.0 p = 0.8
n m = 2 m = 3 m = 5 m = 10 m = 2 m = 3 m = 5 m = 10
201.0001.0001.0001.0000.7130.9360.9850.996
401.0001.0001.0001.0000.9931.0001.0001.000
α = 0.01 601.0001.0001.0001.0001.0001.0001.0001.000
801.0001.0001.0001.0001.0001.0001.0001.000
1001.0001.0001.0001.0001.0001.0001.0001.000
p = 1.0 p = 0.8
n m = 2 m = 3 m = 5 m = 10 m = 2 m = 3 m = 5 m = 10
201.0001.0001.0001.0000.8820.9770.9990.999
401.0001.0001.0001.0000.9991.0001.0001.000
α = 0.05 601.0001.0001.0001.0001.0001.0001.0001.000
801.0001.0001.0001.0001.0001.0001.0001.000
1001.0001.0001.0001.0001.0001.0001.0001.000
Table 8. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . Distribution M 5 with a = 0.5 .
Table 8. The proportion of p-value α computed from Definition 4 and Equation (3), in 1000 simulations of size n . On top, results for α = 0.01 , on bottom results for α = 0.05 . Distribution M 5 with a = 0.5 .
p = 1.0 p = 0.8
n m = 3 m = 5 m = 10 m = 3 m = 5 m = 10
200.0700.2090.4460.0310.1050.209
400.2110.6340.9260.1180.3740.708
α = 0.01 600.4460.9100.9960.2240.6550.945
800.6110.9771.0000.3640.8520.994
1000.7280.9991.0000.5280.9490.999
p = 1.0 p = 0.8
n m = 3 m = 5 m = 10 m = 3 m = 5 m = 10
200.1610.3850.6380.1120.2390.388
400.3580.7910.9710.2310.5780.828
α = 0.05 600.6120.9700.9990.3840.8100.973
800.7360.9951.0000.5290.9510.998
1000.8500.9991.0000.6710.9811.000
Table 9. p-value of Copula’s test, Hoeffding’s test, J L N D n ’s test ( B = 5000 ); p-value and coefficient of Kendall’s test, Pearson’s test and Spearman’s test. Case (ii) Flavanoids vs. Intensity.
Table 9. p-value of Copula’s test, Hoeffding’s test, J L N D n ’s test ( B = 5000 ); p-value and coefficient of Kendall’s test, Pearson’s test and Spearman’s test. Case (ii) Flavanoids vs. Intensity.
CopulaHoeffdingSpearmanPearsonKendall JLND n Equation (2) JLND n Equation (3)
p-value0.00050.00000.56950.02140.57130.03800.0044
coefficient −0.0429−0.17240.0287
Table 10. Data set cdrate from [9] organized by attributes Return on CD and Type = 0 (bank), 1 (thrift).
Table 10. Data set cdrate from [9] organized by attributes Return on CD and Type = 0 (bank), 1 (thrift).
Return on CDType = 0Type = 1Return on CDType = 0Type = 1Return on CDType = 0Type = 1
7.51018.15018.4903
7.56108.17108.5019
7.57108.20018.5110
7.71108.25028.5201
7.75018.30128.5510
7.82208.33218.5710
7.90118.34018.6520
8.00738.35028.7001
8.05208.36018.7110
8.06108.40168.7501
8.11108.45018.7801
Table 11. Independence tests between Return on CD and Type of cdrate data set from [9]. In Equations (2) and (3), B = 5000.
Table 11. Independence tests between Return on CD and Type of cdrate data set from [9]. In Equations (2) and (3), B = 5000.
Test χ 2 JLND n (Equation (2)) JLND n (Equation (3))
p-value0.05580.03200.0068
Statistic’s value45.6450 (df = 32)50.27550.275

Share and Cite

MDPI and ACS Style

García, J.E.; González-López, V.A. Random Permutations, Non-Decreasing Subsequences and Statistical Independence. Symmetry 2020, 12, 1415. https://doi.org/10.3390/sym12091415

AMA Style

García JE, González-López VA. Random Permutations, Non-Decreasing Subsequences and Statistical Independence. Symmetry. 2020; 12(9):1415. https://doi.org/10.3390/sym12091415

Chicago/Turabian Style

García, Jesús E., and Verónica A. González-López. 2020. "Random Permutations, Non-Decreasing Subsequences and Statistical Independence" Symmetry 12, no. 9: 1415. https://doi.org/10.3390/sym12091415

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop