Canonical Correlations and Nonlinear Dependencies

Canonical correlation analysis (CCA) is the default method for investigating the linear dependence structure between two random vectors, but it might not detect nonlinear dependencies. This paper models the nonlinear dependencies between two random vectors by the perturbed independence distribution, a multivariate semiparametric model where CCA provides an insight into their nonlinear dependence structure. The paper also investigates some of its probabilistic and inferential properties, including marginal and conditional distributions, nonlinear transformations, maximum likelihood estimation and independence testing. Perturbed independence distributions are closely related to skew-symmetric ones.


Introduction
Canonical correlation analysis is a multivariate statistical method designed to analyze the correlation structure between two random vectors x ∈ R^p and y ∈ R^q. It obtains two linear transformations z = Ax and w = By, where the only nonnull correlations are those between components of z and w with the same indices, that is, between the components of the pairs $(Z_1, W_1), \ldots, (Z_r, W_r)$, with r = min(p, q).
The random vector $(Z_1, W_1)$ is the first canonical pair and the correlation between its components, that is, the first canonical correlation, is the highest among all correlations between a projection of x and a projection of y. Similarly, the random vector $(Z_i, W_i)$ is the i-th canonical pair and the correlation between its components, that is, the i-th canonical correlation, is the highest among all correlations between a projection of x and a projection of y that are orthogonal to the previous canonical pairs, for i ∈ {2, . . . , r}.
Canonical correlation analysis is particularly appropriate when the joint distribution of the vectors x and y is multivariate normal but it often performs poorly when the data are nonnormal [1]. The problem has been addressed nonparametrically [2], semiparametrically [1] and parametrically [3]. In this paper we introduce a semiparametric model to investigate the nonlinear dependence structure by means of canonical correlations. Kernel canonical correlation analysis (KCCA) and distance canonical correlation analysis (DCCA) play a prominent role among nonparametric generalizations of CCA aimed at addressing nonlinear dependencies (see, e.g., [4,5]).
The main contributions of the paper are as follows. Firstly, it defines the perturbed independence distribution as a statistical model for the joint distribution of two random vectors. The proposed model is somewhat reminiscent of copula models, in that the parameters addressing the dependence structure between two random vectors do not appear in the marginal distributions of the vectors themselves; however, the generating mechanism of perturbed independence distributions is very different from those of ordinary copulas.
Secondly, the perturbed independence model allows for flexible and tractable modeling of the nonlinear dependence structure between two random vectors, since the conditional distribution of a random vector with respect to the other is skew-symmetric. The proposed model provides a parametric interpretation of KCCA and DCCA, which are commonly regarded as nonparametric multivariate methods.
Thirdly, some appealing properties of canonical correlation analysis that hold true in the normal case still hold true in the perturbed independence case. For example, the first (second) component of a canonical pair is independent from the second (first) component of any other canonical pair. Further, if the marginal distributions of the two given vectors are normal, any canonical pair is independent of any other canonical pair.
Fourthly, the paper investigates the bivariate perturbed independence models within the framework of positive and negative association. In particular, it shows that the canonical pairs obtained from a perturbed independence distribution have the desirable property of being positive quadrant dependent, under mild assumptions on the perturbing function.
The rest of the paper is structured as follows. Section 2 defines perturbed independence distributions and states some of their probabilistic and inferential properties. Section 3 connects perturbed independence distributions, canonical correlation analysis, positive dependence orderings and ordinal measures of association. Section 4 uses both theoretical and empirical results to find nonlinear transformations that increase correlations. Appendix A contains all proofs.

Model
This section defines the perturbed independence model, states its invariance properties and the independence properties of its canonical pairs. The theoretical results are illustrated with the bivariate distribution 2φ(x)φ(y)Φ(λxy) introduced by [6,7], where φ(·) and Φ(·) denote the probability density function and the cumulative distribution function of a standard normal distribution, while λ is a real value. Ref. [8] thoroughly investigated its properties and proposed some generalizations.
A p-dimensional random vector x is centrally symmetric (simply symmetric, henceforth) if there is a p-dimensional real vector ξ such that x − ξ and ξ − x are identically distributed [9]. A real-valued function π(·) is a skewing function (also known as a perturbing function) if it satisfies the equality π(−a) = 1 − π(a) and the inequalities 0 ≤ π(a) ≤ 1 for any real vector a [10]. The probability density function of a perturbed independence model is twice the product of two symmetric probability density functions and a skewing function evaluated at a bilinear function of the outcomes. A more formal definition follows.
Definition 1. Let the joint distribution of the random vectors x and y be
$$f(x, y) = 2\,h(x - \mu)\,k(y - \nu)\,\pi\{(y - \nu)^{\top}\Psi(x - \mu)\},$$
where h(·) is the pdf of a p-dimensional, centrally symmetric distribution, k(·) is the pdf of a q-dimensional, centrally symmetric distribution, Ψ is a q × p matrix and π(·) is a function satisfying 0 ≤ π(−a) = 1 − π(a) ≤ 1 for any real value a. We refer to this distribution as a perturbed independence model, with components h(·) and k(·), location vectors µ and ν, perturbing function π(·) and association matrix Ψ.
In the bivariate distribution 2φ(x)φ(y)Φ(λxy), both components coincide with the normal pdf, both location vectors coincide with the origin, the perturbing function is the standard normal cdf and the association matrix is the scalar parameter λ.
Random numbers having a perturbed independence distribution can be generated in a very simple way. For the sake of simplicity, we illustrate the method in the case where µ and ν are null vectors and π(·) is the cumulative distribution function of a distribution symmetric at the origin. First, generate the vectors u and v from the densities h(·) and k(·). Second, generate the scalar r, independently of u and v, from the distribution whose cumulative distribution function is π(·). Third, let the vector w be $(u^{\top}, v^{\top})^{\top}$ if the bilinear form $v^{\top}\Psi u$ is greater than r, and either $(-u^{\top}, v^{\top})^{\top}$ or $(u^{\top}, -v^{\top})^{\top}$ in the opposite case. Then the distribution of w is perturbed independence with components h(·) and k(·), null location vectors, perturbing function π(·) and association matrix Ψ.
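The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_perturbed_independence`, its arguments, and the choice of standard normal components with π(·) = Φ(·) in the usage lines are hypothetical and not part of the paper. The association matrix is stored as a q × p nested list, so that the bilinear form is v'Ψu.

```python
import random

def sample_perturbed_independence(draw_u, draw_v, draw_r, psi, n, seed=0):
    """Sketch of the three-step generator with null location vectors.

    draw_u, draw_v: callables rng -> list, sampling from h(.) and k(.)
    draw_r: callable rng -> float, sampling from the cdf pi(.)
    psi: q x p association matrix as a nested list (hypothetical layout)
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        u, v, r = draw_u(rng), draw_v(rng), draw_r(rng)
        # Bilinear form v' Psi u.
        bilinear = sum(v[i] * psi[i][j] * u[j]
                       for i in range(len(v)) for j in range(len(u)))
        if bilinear <= r:
            u = [-a for a in u]  # opposite case: flip the sign of u
        draws.append((u, v))
    return draws

# Usage (assumed setting): p = q = 2, standard normal components,
# pi(.) equal to the standard normal cdf, diagonal association matrix.
def normal_vec(rng):
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]

def normal_scalar(rng):
    return rng.gauss(0.0, 1.0)

psi = [[1.5, 0.0], [0.0, 0.5]]
draws = sample_perturbed_independence(normal_vec, normal_vec, normal_scalar, psi, 20000)
# Share of draws whose first components have the same sign.
same_sign = sum(1 for u, v in draws if u[0] * v[0] > 0) / len(draws)
print(round(same_sign, 3))
```

With a positive first diagonal entry of Ψ, the share of draws whose first components agree in sign should exceed one half, which is a quick sanity check on the sign-flipping step.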
The bivariate distribution 2φ(x)φ(y)Φ(λxy) might be generated as follows. First, generate three mutually independent, standard normal random numbers U, W and Z. Second, set X equal to U and Y equal to W if the product λUW is greater than Z. Otherwise, set X equal to −U and Y equal to W. Then the joint distribution of X and Y is 2φ(x)φ(y)Φ(λxy).
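The recipe for the bivariate case translates directly into code. The sketch below uses only the Python standard library; the function name `sample_perturbed_normal`, the seed and the sample size are illustrative assumptions.

```python
import random

def sample_perturbed_normal(lam, n, seed=0):
    """Draw n pairs (X, Y) following the recipe for 2*phi(x)*phi(y)*Phi(lam*x*y)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        w = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        # Keep (U, W) when lam*U*W > Z, otherwise flip the sign of U.
        x = u if lam * u * w > z else -u
        pairs.append((x, w))
    return pairs

sample = sample_perturbed_normal(lam=2.0, n=20000)
# Share of pairs whose components have the same sign.
frac_same_sign = sum(1 for x, y in sample if x * y > 0) / len(sample)
# Sample mean of X, which should be close to zero.
mean_x = sum(x for x, _ in sample) / len(sample)
print(round(frac_same_sign, 3), round(mean_x, 3))
```

For positive λ, most simulated pairs should fall in the first and third quadrants, while the sample mean of X should stay close to zero.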
A p-dimensional probability density function 2g(a − ξ)π(a − ξ) is skew-symmetric with kernel g(·) (i.e., a probability density function symmetric at the origin), location vector ξ and skewing function π(·). The function g(·) would be more precisely denoted by $g_p(\cdot)$, since it depends on the dimension of the corresponding random vector. However, we use g(·) instead of $g_p(\cdot)$ to relieve the notational burden. Ref. [11] discusses hypothesis testing on g(·) for any choice of the function π(·). The most widely studied skew-symmetric distributions are the linearly skewed distributions, where the skewing function depends on a − ξ only through the linear function $\alpha^{\top}(a - \xi)$, as happens in the multivariate skew-normal case. Refs. [12,13] investigated their inferential properties. Ref. [14] used them to motivate kurtosis-based projection pursuit.
In the notation of the above definition, the first part of the following theorem states that the marginal distributions of x and y are h(x − µ) and k(y − ν). Thus, perturbed independence distributions separately model the marginal distributions and the association between two random vectors, and constitute an alternative to copulas. The second part of the following theorem states that the conditional distribution of a component with respect to the other is linearly skewed. Hence, the association between the two components has an analytical form, which has been thoroughly investigated.
Theorem 1. Let the random vectors x and y have a perturbed independence distribution with components h(·), k(·) and location vectors µ, ν. Then the following statements hold true.

• The marginal probability density functions of x and y are h(x − µ) and k(y − ν).
• The conditional probability density functions of x given y and y given x are skew-symmetric with kernels h(·) and k(·), while the associated location vectors are µ and ν.
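As a worked check of the first statement, consider the bivariate density 2φ(x)φ(y)Φ(λxy). The derivation below is a sketch that uses only the symmetry of φ(·), the identity Φ(−t) = 1 − Φ(t) and the change of variable u = −y on the negative half-line.

```latex
\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} 2\,\phi(x)\,\phi(y)\,\Phi(\lambda x y)\,dy \\
       &= 2\,\phi(x)\left[\int_{0}^{\infty}\phi(y)\,\Phi(\lambda x y)\,dy
          + \int_{0}^{\infty}\phi(-u)\,\Phi(-\lambda x u)\,du\right] \\
       &= 2\,\phi(x)\int_{0}^{\infty}\phi(y)\,\bigl\{\Phi(\lambda x y) + 1 - \Phi(\lambda x y)\bigr\}\,dy \\
       &= 2\,\phi(x)\cdot\tfrac{1}{2} = \phi(x).
\end{aligned}
```

The same argument with the roles of x and y exchanged gives the marginal density of Y, and dividing the joint density by φ(y) gives the conditional density 2φ(x)Φ(λxy) of X given Y = y.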
The marginal distributions of 2φ(x)φ(y)Φ(λxy) are standard normal: X ∼ N(0, 1) and Y ∼ N(0, 1). The conditional distributions are skew-normal, X|Y = y ∼ SN(λy) and Y|X = x ∼ SN(λx), with probability density functions 2φ(x)Φ(λxy) and 2φ(y)Φ(λxy). The sign of the correlation between X and Y is the same as the sign of λ, but the two random variables are nonlinearly dependent [7].
There is a close connection between order statistics and either skew-normal distributions or their generalizations. For example, any linear combination of the minimum and the maximum of a bivariate, exchangeable and elliptical random vector is skew-elliptical [15]. In particular, any skew-normal distribution might be represented as the maximum or the minimum of a bivariate, normal and exchangeable random vector. At present, it is not clear whether there exists a meaningful connection between order statistics and perturbed independence distributions, which would ease both the interpretation and the application of these distributions.
The mean vector m and the covariance matrix S of the n × d data matrix X are statistically independent if the rows of X are a random sample from a multivariate normal distribution. As a direct consequence, the components of the pairs $(m_1, S_2)$ and $(m_2, S_1)$ are statistically independent, too, where $m_1$ and $S_1$ ($m_2$ and $S_2$) are the mean vector and the covariance matrix of $X_1$ ($X_2$), that is, the data matrix whose columns coincide with the first p (the last d − p) columns of X, with 0 < p < d. The same property holds true for perturbed independence models, as a corollary of the following theorem.
Theorem 2. Let the random vectors x and y have the perturbed independence distribution with location vectors µ and ν. Then any even function of x − µ is independent of y. Similarly, any even function of y − ν is independent of x.
Let the joint distribution of the random variables X and Y be 2φ(x)φ(y)Φ(λxy). Then Y and $X^2$ are mutually independent. Similarly, X and $Y^2$ are mutually independent.
The components of the canonical covariates $z = (Z_1, \ldots, Z_p)^{\top}$ and $w = (W_1, \ldots, W_q)^{\top}$ are uncorrelated when their indices differ:
$$\rho(Z_i, Z_j) = \rho(W_i, W_j) = \rho(Z_i, W_j) = 0 \quad \text{for } i \neq j.$$
A p-dimensional random vector v is said to be sign-symmetric if there is a p-dimensional real vector u such that v − u and U(v − u) are identically distributed, where U is any p × p diagonal matrix whose diagonal elements are either 1 or −1 [9]. For example, spherical random vectors are sign-symmetric. The following theorem shows that the canonical covariates belonging to different canonical vectors and with different indices are independent, if the joint distribution of the original variables is perturbed independence with sign-symmetric components.
Theorem 3. Let the random vectors x ∈ R^p and y ∈ R^q have a perturbed independence distribution with sign-symmetric components. Further, let $z = (Z_1, \ldots, Z_p)^{\top}$ and $w = (W_1, \ldots, W_q)^{\top}$ be the canonical covariates of x and y. Then $Z_i$ and $W_j$ are independent when i ≠ j.
Under normal sampling, the components of different canonical pairs are statistically independent. The following corollary of the above theorem shows that the same property still holds true when the original variables have a perturbed independence distribution with normal components.
Corollary 1. Let the random vectors x ∈ R^p and y ∈ R^q have a perturbed independence distribution with normal components. Further, let $z = (Z_1, \ldots, Z_p)^{\top}$ and $w = (W_1, \ldots, W_q)^{\top}$ be the canonical covariates of x and y. Then the variables $Z_i$, $Z_j$, $W_i$ and $W_j$ are pairwise independent when i ≠ j.
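The independence between Y and an even function of X can be checked empirically. The sketch below is a simulation under assumed settings (λ = 2, n = 40,000, threshold 1); the helper names are hypothetical. It compares the behaviour of Y after conditioning on |X| > 1, which is an event defined by an even function of X, with its behaviour after conditioning on X > 1.

```python
import random

def sample_pair(lam, rng):
    """One draw (X, Y) from 2*phi(x)*phi(y)*Phi(lam*x*y) via sign flipping."""
    u, w, z = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (u if lam * u * w > z else -u), w

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)
sample = [sample_pair(2.0, rng) for _ in range(40000)]

# Conditioning on an even function of X (here |X| > 1) should not shift Y ...
y_given_large_abs_x = mean([y for x, y in sample if abs(x) > 1.0])
# ... whereas conditioning on X itself (here X > 1) does.
y_given_large_x = mean([y for x, y in sample if x > 1.0])
# Pr(Y > 0) with and without conditioning on X^2 > 1.
p_uncond = mean([1.0 if y > 0 else 0.0 for _, y in sample])
p_cond = mean([1.0 if y > 0 else 0.0 for x, y in sample if x * x > 1.0])
print(round(y_given_large_abs_x, 3), round(y_given_large_x, 3))
print(round(p_uncond, 3), round(p_cond, 3))
```

A simulation of this kind can only suggest independence, not prove it; it is meant as a quick plausibility check of the statement above.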
As remarked by [16], the default measures of multivariate skewness and kurtosis are those introduced by [17]. Mardia's skewness is the sum of all squared, third-order, standardized moments, while Mardia's kurtosis is the fourth moment of the Mahalanobis distance of the random vector from its mean. Mardia's kurtosis of 2φ(x)φ(y)Φ(λxy) increases with the squared correlation between X and Y [18].
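Let $f_{xy}$ be the joint probability density function of the p-dimensional random vector x and of the q-dimensional random vector y. Further, let $f_x$ and $f_y$ be the marginal probability density functions of x and y. The distance covariance between x and y with respect to the weight function w is
$$V^2(x, y, w) = \int_{R^p}\int_{R^q} \left| f_{xy}(t, s) - f_x(t)\,f_y(s) \right|^2 w(t, s)\,ds\,dt,$$
where t ∈ R^p, s ∈ R^q and w(t, s) ≥ 0 [20]. If the joint distribution of x and y is a perturbed independence model with components h(·) and k(·), location vectors µ and ν, perturbing function π(·) and association matrix Ψ we have
$$f_{xy}(t, s) - f_x(t)\,f_y(s) = h(t - \mu)\,k(s - \nu)\left[2\pi\{(s - \nu)^{\top}\Psi(t - \mu)\} - 1\right].$$
A little algebra leads to the identities
$$h(t - \mu)\,k(s - \nu)\left[\pi\{(s - \nu)^{\top}\Psi(t - \mu)\} - \pi\{-(s - \nu)^{\top}\Psi(t - \mu)\}\right] = \tfrac{1}{2}\left\{f_{xy}(t, s; \mu, \nu, \Psi) - f_{xy}(t, s; \mu, \nu, -\Psi)\right\}.$$
Hence, for perturbed independence models, the distance covariance depends on the joint density only through half the difference between $f_{xy}(t, s; \mu, \nu, \Psi)$ and $f_{xy}(t, s; \mu, \nu, -\Psi)$, which are the probability density functions of $(x^{\top}, y^{\top})^{\top}$ and $(x^{\top}, -y^{\top})^{\top}$.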
In particular, if the joint distribution of the random variables X and Y is 2φ

Concordance
This section investigates the bivariate perturbed independence models within the framework of positive and negative association. In particular, it shows that the canonical pairs obtained from a perturbed independence distribution have the desirable property of being positive quadrant dependent, under mild assumptions on the perturbing function. The seminal paper by [21] started a vast literature on dependence orderings and their connections with ordinal measures of association. For the sake of brevity, here we mention only some thorough reviews of the concepts in this section: [22–27].
Two random variables are said to be either concordant, positively associated or positively dependent if larger (smaller) outcomes of one of them often occur together with larger (smaller) outcomes of the other random variable. Conversely, two random variables are said to be either discordant, negatively associated or negatively dependent if larger (smaller) outcomes of one of them often occur together with smaller (larger) outcomes of the other random variable. For example, financial returns from different markets are known to be positively dependent (see, e.g., [28][29][30]). The degree of concordance or discordance is assessed with ordinal measures of association, of which the most commonly used are Pearson's correlation (simply correlation, for short), Spearman's rho and Kendall's tau.
The correlation is the best known measure of ordinal association. The correlation between two random variables X and Y is
$$\rho(X, Y) = \frac{E\{(X - \mu_X)(Y - \mu_Y)\}}{\sqrt{E\{(X - \mu_X)^2\}\,E\{(Y - \mu_Y)^2\}}},$$
where $\mu_X$ and $\mu_Y$ are the expectations of X and Y. The ordinal association between two random variables might be decomposed into a linear component and a nonlinear component. The linear component refers to the tendency of the random variables to deviate from their means in a proportional way. The correlation only detects and measures the linear component of the ordinal association. When the nonlinear component is not negligible, the information conveyed by the correlation needs to be integrated with information from other measures of ordinal association.
Spearman's rho, also known as Spearman's correlation, between the random variables X and Y is the correlation between the two variables after being transformed according to their marginal cumulative distribution functions:
$$\rho_S(X, Y) = \rho\{F_X(X), F_Y(Y)\},$$
where $F_X(\cdot)$ and $F_Y(\cdot)$ are the marginal cumulative distribution functions of X and Y. Its sample counterpart is the correlation between the observed ranks. Spearman's rho is a measure of ordinal association detecting both linear and nonlinear dependence. It is also more robust to outliers than Pearson's correlation.
Kendall's tau, also known as Kendall's correlation, between two random variables is the difference between their probability of concordance and their probability of discordance. The former (latter) is the probability that the difference between the first components of two independent outcomes from a bivariate distribution has the same sign as (a different sign than) the difference between the second components of the same pairs. More formally, Kendall's tau between the random variables X and Y is
$$\tau(X, Y) = \Pr\{(X_1 - X_2)(Y_1 - Y_2) > 0\} - \Pr\{(X_1 - X_2)(Y_1 - Y_2) < 0\},$$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are two independent outcomes from the bivariate random vector (X, Y). Just like Spearman's rho, Kendall's tau is an ordinal measure of association detecting linear as well as nonlinear dependence and is more robust to outliers than Pearson's correlation.
Unfortunately, Pearson's correlation, Spearman's rho and Kendall's tau might take different signs, thus making it difficult to measure ordinal association. In order to prevent this from happening, it is convenient to impose some constraints on the bivariate distribution. The distribution of a bivariate random vector (X, Y) is said to be positively quadrant dependent (PQD) if its joint cdf is greater than or equal to the product of the marginal cdfs:
$$\Pr(X \le x, Y \le y) \ge \Pr(X \le x)\,\Pr(Y \le y)$$
for any two real values x and y. Similarly, the distribution of a bivariate random vector (X, Y) is said to be negatively quadrant dependent (NQD) if its joint cdf is smaller than or equal to the product of the marginal cdfs:
$$\Pr(X \le x, Y \le y) \le \Pr(X \le x)\,\Pr(Y \le y)$$
for any two real values x and y. Pearson's correlation, Spearman's rho and Kendall's tau of PQD (NQD) distributions are either null or have positive (negative) signs.
Independent random variables are special cases of PQD and NQD random variables. In order to rule this case out, the PQD and NQD conditions can be made more restrictive by requiring the above inequalities to be strict for measurable sets of x and y values. For example, a strictly positive quadrant dependent pair of random variables satisfies the inequality
$$\Pr(X \le x, Y \le y) > \Pr(X \le x)\,\Pr(Y \le y)$$
for any two real values x and y belonging to given intervals of positive length. Pearson's correlation, Spearman's rho and Kendall's tau of strictly positive (negative) quadrant dependent distributions have positive (negative) signs. As shown in the following theorem, a bivariate perturbed independence model is strictly positive (negative) quadrant dependent if the perturbing function is a cumulative distribution function and the association parameter is a positive (negative) scalar.
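The sample counterparts of the three measures can be written from scratch in a few lines. The following sketch assumes continuous data without ties, so that ranks are unambiguous; the function names are illustrative, and the quadratic-time loop for Kendall's tau is chosen for clarity rather than speed.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order):
        result[index] = position + 1.0
    return result

def spearman(x, y):
    """Pearson correlation between the observed ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Proportion of concordant pairs minus proportion of discordant pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# A monotone but nonlinear relation: y = x**3.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), spearman(x, y), kendall(x, y))
```

In the cubic example the two rank-based measures equal one, while Pearson's correlation stays below one, which illustrates how the correlation captures only the linear component of a monotone association.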

Theorem 4.
Let the joint distribution of the random variables X and Y be perturbed independent with components h(·) and k(·), perturbing function π(·) and association parameter λ: f(x, y) = 2h(x)k(y)π(λxy). Further, let π(·) be the cumulative distribution function of a symmetric distribution. Then the random variables X and Y are strictly positive (negative) quadrant dependent when λ is positive (negative).
The joint distribution 2φ(x)φ(y)Φ(λxy) of the bivariate random vector (X, Y) introduced in the previous section fulfills the assumptions in Theorem 4. In particular, if the association parameter λ is positive, the random variables X and Y are strictly positive quadrant dependent:
$$\Pr(X \le a, Y \le b) > \Pr(X \le a)\,\Pr(Y \le b)$$
for any two real values a and b. As a direct consequence, their Pearson's correlation ρ(X, Y), their Spearman's rho $\rho_S(X, Y)$ and their Kendall's tau τ(X, Y) are positive.
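Positive quadrant dependence can be examined empirically by comparing the empirical joint cdf with the product of the empirical marginal cdfs. The sketch below does so on a small grid for a sample simulated from 2φ(x)φ(y)Φ(λxy); the value λ = 2, the sample size, the grid and the helper names are assumptions made for illustration.

```python
import random

def sample_pair(lam, rng):
    """One draw (X, Y) from 2*phi(x)*phi(y)*Phi(lam*x*y) via sign flipping."""
    u, w, z = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (u if lam * u * w > z else -u), w

rng = random.Random(2)
n = 20000
sample = [sample_pair(2.0, rng) for _ in range(n)]

def gap(a, b):
    """Empirical Pr(X <= a, Y <= b) minus Pr(X <= a) * Pr(Y <= b)."""
    joint = sum(1 for x, y in sample if x <= a and y <= b) / n
    px = sum(1 for x, _ in sample if x <= a) / n
    py = sum(1 for _, y in sample if y <= b) / n
    return joint - px * py

grid = [-1.0, 0.0, 1.0]
gaps = {(a, b): gap(a, b) for a in grid for b in grid}
for point in sorted(gaps):
    print(point, round(gaps[point], 3))
```

Under strict positive quadrant dependence the gaps should be positive up to sampling error, with the largest values expected near the origin.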
Pearson's correlation between the components of a canonical pair is nonnegative. However, within a nonparametric framework, their Spearman's rho and their Kendall's tau can take any sign. When Pearson's correlation between the components of a canonical pair is positive but their Spearman's rho and their Kendall's tau are negative, the former ordinal association measure becomes quite unreliable and canonical correlation analysis provides little insight into the dependence structure. This problem does not occur under a perturbed independence model satisfying the assumptions stated in the following theorem.
Theorem 5. Let $(Z_1, W_1), \ldots, (Z_r, W_r)$, with r = min(p, q), be the canonical pairs obtained from a perturbed independence distribution, and let their density be
$$f(z_1, \ldots, z_r, w_1, \ldots, w_r) = 2\prod_{i=1}^{r} h_i(z_i)\,k_i(w_i)\;\pi\Big(\sum_{i=1}^{r}\lambda_i z_i w_i\Big),$$
where π(·) is a strictly increasing perturbing function. Then the joint distribution of the i-th canonical pair is a bivariate perturbed independence model:
$$f(z_i, w_i) = 2\,h_i(z_i)\,k_i(w_i)\,G(\lambda_i z_i w_i),$$
where G(·) is a strictly increasing perturbing function.
We illustrate the above theorem with the perturbed independence distribution
$$f(x, y) = 2\,\phi_p(x; \Sigma)\,\phi_q(y; \Omega)\,F(y^{\top}\Psi x),$$
where $\phi_q(\cdot; \Omega)$ is the q-dimensional normal density with null mean vector and covariance matrix Ω, F(·) is the cdf of a continuous distribution symmetric at the origin and Ψ is a symmetric p × p matrix. The distribution of the canonical variates $z = Ax = (Z_1, \ldots, Z_p)^{\top}$ and $w = By = (W_1, \ldots, W_q)^{\top}$ is
$$f(z, w) = 2\prod_{i=1}^{p}\phi(z_i)\prod_{j=1}^{q}\phi(w_j)\;F\Big(\sum_{i=1}^{r}\lambda_i z_i w_i\Big),$$
which fulfills the assumptions in Theorem 5. Then the joint distribution of the i-th canonical pair is
$$f(z_i, w_i) = 2\,\phi(z_i)\,\phi(w_i)\,G_i(\lambda_i z_i w_i),$$
where φ(·) is the pdf of a univariate, standard normal distribution and $G_i(\cdot)$ is the cdf of a continuous distribution symmetric at the origin. By Theorem 4 and since the i-th canonical

Nonlinearity
As a desirable property, CCA decomposes the covariance matrix between the p-dimensional random vector x and the q-dimensional random vector y into linear combinations of the covariances between uncorrelated linear functions of x and y. Ref. [31] thoroughly investigates the interpretation of CCA within the framework of linear dependence. The first output of CCA is the pair of linear combinations of x and y that are maximally correlated:
$$\max_{a \in R^p,\; b \in R^q} \rho(a^{\top}x, b^{\top}y).$$
A nonlinear generalization looks for the transformed linear combinations that are maximally correlated:
$$\max_{a \in R^p,\; b \in R^q,\; g_1, g_2 \in G} \rho\{g_1(a^{\top}x), g_2(b^{\top}y)\},$$
where G is the set of all real valued monotonic functions. In the general case, the maximization needs to be performed simultaneously with respect to the nonlinear functions $g_1(\cdot)$, $g_2(\cdot)$ and the real vectors a, b, thus being difficult to compute and difficult to interpret. Ref. [1] addressed the problem by proposing the Gaussian copula model, where the components of x and y have a joint distribution that is multivariate normal, after being transformed according to monotonic and nonlinear functions. However, these monotonic transformations do not have a clear interpretation and they are not guaranteed to increase the correlations.
Perturbed independence models do not suffer from these limitations. Firstly, the monotonic transformations have a simple interpretation, being the expectations of one variable conditioned with respect to the other. Secondly, the same transformations are guaranteed to increase the correlations, under mild assumptions. These statements are made more precise in the following theorem.
Theorem 6. Let the joint distribution of the random variables X and Y be perturbed independent with null location parameters, nonnull association parameter and increasing perturbing function. Finally, let X and Y have finite second moments. Then the conditional expectation g(X) = E(Y|X) is a monotone, odd and nonlinear function, while the correlation between Y and X is smaller than the correlation between Y and g(X).
We illustrate the above theorem with the distribution 2φ(x)φ(y)Φ(λxy) of the bivariate random vector (X, Y) introduced in Section 2. The conditional expectations of Y and X with respect to the outcomes x of X and y of Y are
$$E(Y \mid X = x) = \sqrt{\frac{2}{\pi}}\,\frac{\lambda x}{\sqrt{1 + \lambda^2 x^2}}, \qquad E(X \mid Y = y) = \sqrt{\frac{2}{\pi}}\,\frac{\lambda y}{\sqrt{1 + \lambda^2 y^2}},$$
so that the nonlinear functions of X and Y maximally correlated with Y and X are proportional to
$$\frac{X}{\sqrt{1 + \lambda^2 X^2}} \quad \text{and} \quad \frac{Y}{\sqrt{1 + \lambda^2 Y^2}}.$$
The above theorem does not guarantee that E(Y|X) is the nonlinear transformation of one component that is maximally correlated with Y, nor that such correlation is smaller than the correlation between E(X|Y) and E(Y|X). We empirically address this point by simulating n = 10,000 bivariate data from 2φ(x)φ(y)Φ(λxy), where λ ∈ {1, 2, 3, 4, 5, 6}. The left-hand scatterplots in Figure 1 clearly hint at positive dependence: more points lie in the first and in the third quadrants as the association parameter increases, despite the absence of the ellipsoidal shapes associated with bivariate normality.
For each simulated sample, we computed Kendall's tau, Spearman's rho and Pearson's correlation and report their values in Table 1. The three measures of ordinal association are positive and they increase with the association parameter, consistently with the theoretical results in Section 3. More surprisingly, Spearman's rho is always greater than Kendall's tau and Pearson's correlation, unlike the bivariate normal distribution, where Pearson's correlation is always greater than Kendall's tau and Spearman's rho.
Finally, for each simulated sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we computed Pearson's correlation between $Z_1, \ldots, Z_n$ and $W_1, \ldots, W_n$, where
$$Z_i = \frac{X_i}{\sqrt{1 + \lambda^2 X_i^2}}, \qquad W_i = \frac{Y_i}{\sqrt{1 + \lambda^2 Y_i^2}}$$
are proportional to the sample counterpart of the expectation of Y given X = x and X given Y = y under the model 2φ(x)φ(y)Φ(λxy). For each simulated sample, these correlations are always greater than the correlations between the original data, consistently with Theorem 6. Moreover, Pearson's correlations between $W_1, \ldots, W_n$ and $Z_1, \ldots, Z_n$ are always greater than their Spearman's correlations.
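A small simulation in the spirit of the experiment described above can be sketched as follows. It relies on the known mean of a skew-normal SN(α) variable, √(2/π)·α/√(1 + α²), so that E(Y|X = x) is proportional to x/√(1 + λ²x²). The association parameter is treated as known, and the value λ = 2, the seed, the sample size and the helper names are assumptions rather than the paper's settings.

```python
import math
import random

def sample_pair(lam, rng):
    """One draw (X, Y) from 2*phi(x)*phi(y)*Phi(lam*x*y) via sign flipping."""
    u, w, z = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (u if lam * u * w > z else -u), w

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def g(value, lam):
    """Proportional to the skew-normal conditional mean, lam treated as known."""
    return value / math.sqrt(1.0 + (lam * value) ** 2)

rng = random.Random(3)
lam = 2.0
sample = [sample_pair(lam, rng) for _ in range(20000)]
xs = [x for x, _ in sample]
ys = [y for _, y in sample]

corr_xy = pearson(xs, ys)                       # original data
corr_gx_y = pearson([g(x, lam) for x in xs], ys)  # g(X) against Y
corr_gx_gy = pearson([g(x, lam) for x in xs], [g(y, lam) for y in ys])
print(round(corr_xy, 3), round(corr_gx_y, 3), round(corr_gx_gy, 3))
```

The first comparison, between the correlation of the original data and the correlation of g(X) with Y, is the one covered by Theorem 6; the correlation between the two transformed variables is reported only for comparison with the empirical findings.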
As shown in the right-hand scatterplots of Figures 1 and 2, the transformed data lie at the lower left corner and at the upper right corner of a square. This pattern becomes more evident as the association parameter increases. The histograms of $Z_1, \ldots, Z_n$ in Figure 2 are symmetric and bimodal, with both modes at the ends of the observed range. Bimodality becomes more evident as the association parameter increases. The behavior of the transformed data $W_1, \ldots, W_n$ is virtually identical and therefore is not reported.
We conclude that perturbed independence distributions, by modeling the nonlinear association between random variables, might help in finding the nonlinear transformations that are maximally correlated with each other. A positive Pearson's correlation much lower than Spearman's rho and Kendall's tau hints at the presence of nonlinear association, whose analytical form might be estimated by looking for the maximally correlated nonlinear transformations of the random variables. This approach is particularly appropriate for the single index regression model Y = g(X) + ε, where the response variable Y is the sum of a smooth function g(·) of the predictor X and the error term ε. When g(·) is monotone, its analytical form might be estimated by looking for the transformation g(X) that is maximally correlated with Y.
As remarked in the Introduction, kernel canonical correlation analysis (KCCA) and distance canonical correlation analysis (DCCA) are the two most popular generalizations of CCA aimed at dealing with nonlinear dependencies. A formal description of KCCA, based on Hilbert spaces and their inner products, might be found in the seminal papers by [32,33]. For most practical purposes, KCCA might be defined as the statistical method searching for linear projections of nonlinear functions of a random vector that are maximally correlated with linear projections of nonlinear functions of another random vector.
Let F be a class of p-dimensional random vectors whose i-th components are nonlinear functions of the p-dimensional random vector x. Similarly, let G be a class of q-dimensional random vectors whose i-th components are nonlinear functions of the q-dimensional random vector y. Then KCCA looks for the random vectors f ∈ F, g ∈ G and for the real vectors a ∈ R^p, b ∈ R^q such that $a^{\top}f$ and $b^{\top}g$ are maximally correlated with each other.
In a nonparametric framework, the choice of the nonlinear functions may not be straightforward. On the other hand, in the perturbed independence framework, the theoretical and empirical results in this section suggest setting them equal to the conditional expectations: f = E(x|y) and g = E(y|x). In particular, for the perturbed independence model the suggested nonlinear functions of x and y are .
DCCA looks for two projections whose joint distribution differs the most from the product of their marginal distributions, where difference is measured by distance correlation. The distance correlation between the random variables X and Y with respect to the weight function w is where $V^2(X, Y, w)$ is the distance covariance between X and Y with respect to w, as defined in the previous section. Hence the first distance canonical correlation between the p-dimensional random vector x and the q-dimensional random vector y is For other distance canonical correlations, the distance canonical pairs and the distance canonical transformations are defined similarly to their CCA analogues.
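A natural question to ask is whether CCA and DCCA lead to identical projections, under the assumption of perturbed independence. At present, we are unable to either prove or disprove this statement, which we conjecture to be true under the assumptions of Theorem 6: increasing perturbing functions that increase more steeply are more likely to imply both higher Pearson and distance correlations. We plan to investigate this conjecture by means of both theoretical arguments and simulation studies.

Appendix A
Proof of Theorem 1. Without loss of generality we assume that the location vectors are null vectors. Let $A^{+}$ and $A^{-}$ be the sets of q-dimensional real vectors whose first nonnull component is nonnegative and negative, respectively, so that $A^{+} \cup A^{-} = R^q$, $A^{+} \cap A^{-} = \emptyset$ and
$$\int_{R^q} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy = \int_{A^{+}} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy + \int_{A^{-}} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy.$$
If a nonnull q-dimensional real vector y belongs to $A^{+}$ then −y belongs to $A^{-}$. By making the change of variable u = −y in the second integral we have
$$\int_{R^q} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy = \int_{A^{+}} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy + \int_{A^{+}} 2h(x)k(-u)\pi(-u^{\top}\Psi x)\,du.$$
By assumption, the functions k(·) and π(·) satisfy the identities k(v) = k(−v) and π(a) = 1 − π(−a):
$$\int_{R^q} 2h(x)k(y)\pi(y^{\top}\Psi x)\,dy = 2h(x)\int_{A^{+}} k(y)\left\{\pi(y^{\top}\Psi x) + 1 - \pi(y^{\top}\Psi x)\right\}dy.$$
The marginal density of x is then
$$f_x(x) = 2h(x)\int_{A^{+}} k(y)\,dy = h(x).$$
The last identity follows from k(·) being a symmetric probability density function. In a similar way it can be proved that the marginal probability density function of y is k(·). The conditional probability density function of x given y is
$$f(x \mid y) = \frac{2h(x)k(y)\pi(y^{\top}\Psi x)}{k(y)} = 2h(x)\pi(y^{\top}\Psi x),$$
that is, skew-symmetric with symmetric kernel h(·), null location vector, skewing function π(·) and shape parameter $\Psi^{\top}y$. In a similar way it is possible to prove that the conditional probability density function of y given x is skew-symmetric with symmetric kernel k(·), null location vector, skewing function π(·) and shape parameter Ψx.
Proof of Theorem 2. Let the joint distribution of the p-dimensional random vector x and the q-dimensional random vector y be
$$f(x, y) = 2\,h(x - \mu)\,k(y - \nu)\,\pi\{(y - \nu)^{\top}\Psi(x - \mu)\}.$$
Further, let w(u) = w(−u) be an even function of the p-dimensional real vector u. We prove the theorem for an even function of x − µ only, the proof for an even function of y − ν being very similar. Without loss of generality we assume that µ and ν are null real vectors, that f(x, y) is an absolutely continuous probability density function and that w = w(x) is a k-dimensional random vector. The characteristic function of w in the k-dimensional real vector t is
$$\varphi_w(t) = E\left[\exp\{i\,w(x)^{\top}t\}\right] = \int_{R^p}\int_{R^q} \exp\{i\,w(x)^{\top}t\}\,2h(x)k(y)\pi(y^{\top}\Psi x)\,dy\,dx,$$
where i denotes the imaginary unit. Let $S^{+}$ and $S^{-}$ be the sets of p-dimensional real vectors whose first nonnull component is nonnegative and negative, respectively, so that $S^{+} \cup S^{-} = R^p$, $S^{+} \cap S^{-} = \emptyset$ and
$$\varphi_w(t) = \int_{S^{+}}\int_{R^q} \exp\{i\,w(x)^{\top}t\}\,2h(x)k(y)\pi(y^{\top}\Psi x)\,dx\,dy + \int_{S^{-}}\int_{R^q} \exp\{i\,w(x)^{\top}t\}\,2h(x)k(y)\pi(y^{\top}\Psi x)\,dx\,dy.$$
If a nonnull vector x belongs to $S^{+}$ then −x belongs to $S^{-}$. By making the change of variable u = −x in the second integral we have
$$\varphi_w(t) = \int_{S^{+}}\int_{R^q} \exp\{i\,w(x)^{\top}t\}\,2h(x)k(y)\pi(y^{\top}\Psi x)\,dx\,dy + \int_{S^{+}}\int_{R^q} \exp\{i\,w(-u)^{\top}t\}\,2h(-u)k(y)\pi(-y^{\top}\Psi u)\,du\,dy.$$
By assumption, the functions h(·), π(·) and w(·) satisfy the identities h(v) = h(−v), π(a) = 1 − π(−a) and w(u) = w(−u), so that
$$\varphi_w(t) = \int_{S^{+}}\int_{R^q} \exp\{i\,w(x)^{\top}t\}\,2h(x)k(y)\left\{\pi(y^{\top}\Psi x) + 1 - \pi(y^{\top}\Psi x)\right\}dx\,dy.$$
The characteristic function of w is then
$$\varphi_w(t) = 2\int_{S^{+}} \exp\{i\,w(x)^{\top}t\}\,h(x)\,dx.$$
The definitions of $S^{+}$ and $S^{-}$, together with the identities h(x) = h(−x) and w(x) = w(−x), imply that the integrals of $\exp\{i\,w(x)^{\top}t\}h(x)$ over $S^{+}$ and over $S^{-}$ coincide. The characteristic function of w is then
$$\varphi_w(t) = \int_{R^p} \exp\{i\,w(x)^{\top}t\}\,h(x)\,dx,$$
which depends neither on the association matrix Ψ nor on the perturbing function π(·). In order to prove that w and y are stochastically independent we consider their joint characteristic function
$$\varphi_{w,y}(t, r) = E\left[\exp\{i\,w(x)^{\top}t + i\,y^{\top}r\}\right],$$
where r is a q-dimensional real vector. An argument very similar to the one in the first part of the proof yields
$$\varphi_{w,y}(t, r) = E\left[\exp\{i\,w(x)^{\top}t\}\right]\cdot E\left[\exp\{i\,y^{\top}r\}\right] = \varphi_w(t)\cdot\varphi_y(r),$$
thus implying that w and y are stochastically independent.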
Proof of Theorem 3. Without loss of generality we assume that the location vectors are null vectors. We prove the theorem in the special case of both $x$ and $y$ being bivariate vectors, the proof for the general case being very similar but much more cumbersome in the notation. Let the joint distribution of $z$ and $w$ be
$$f(z_1, z_2, w_1, w_2) = 2h(z_1, z_2)k(w_1, w_2)\pi(\lambda_1 z_1 w_1 + \lambda_2 z_2 w_2).$$
The change of variables $u_1 = -z_1$ and $u_2 = -w_2$ yields The change of variable $u_1 = -z_1$ yields We can show that the random variables $Z_1$ and $W_2$ are independent:

Proof of Theorem 4. For the sake of simplicity and without loss of generality we assume that the perturbing function $\pi(\cdot)$ is the cumulative distribution function of a continuous distribution. The random variables $X$ and $Y$ are either positively or negatively dependent if they satisfy the inequalities for all real values $a$ and $b$, with strict inequalities holding for at least some $a$ and $b$. We only prove the theorem for positive values of $a$, $b$ and $\lambda$: the proofs of the remaining cases are very similar. The integral representation of the joint probability that $X$ and $Y$ are smaller than $a$ and $b$ is
$$\Pr(X < a, Y < b) = \int_{-\infty}^{b}\int_{-\infty}^{a} 2h(x)k(y)\pi(\lambda xy)\,dx\,dy.$$
By making the transformations $z = -x$ and $w = -y$ we obtain $2h(x)k(y)\pi(\lambda xy)\,dx\,dy$ The above integral identity leads to the probabilistic identity The probabilities $\Pr(X < 0)$, $\Pr(0 < X < a)$, $\Pr(Y < 0)$ and $\Pr(0 < Y < b)$ do not depend on the perturbing function $\pi(\cdot)$, while the probability $\Pr(X > a, Y > b)$ does. By assumption, $\pi(\cdot)$ is the cdf of a symmetric, continuous distribution, so that $\pi(q) > 0.5$ for any positive real value $q$. It follows that Logical steps similar to the above ones lead to the inequality for any two negative values $-a$ and $-b$. The joint distribution of $X$ and $Y$ is centrally symmetric, so that the random variables $-X$ and $-Y$ have the same joint distribution as $X$ and $Y$:

Proof of Theorem 5. Let $h_i(\cdot)$ and $k_i(\cdot)$ be the probability density functions of the random variables $U_i$ and $V_i$, for $i \in \{1, \ldots, n\}$. Further, let the random variables $U_1, \ldots, U_n$ and $V_1, \ldots, V_n$ be mutually independent, so that their joint probability density function is
$$\prod_{i=1}^{n} h_i(u_i)k_i(v_i).$$
By assumption $\pi(\cdot)$ is a strictly increasing perturbing function, so that we can define a random variable $Q$ which is symmetric about the origin and whose cumulative distribution function is $\pi(\cdot)$: $\Pr(Q \le a) = F_Q(a) = \pi(a) = 1 - \pi(-a)$, where $a \in \mathbb{R}$.
By further application of the generating argument described in Section 2 we conclude that the joint distribution of $Z_j$ and $W_j$ is perturbed independent:
$$f_j(z, w) = 2h_j(z)k_j(w)\pi_j(\lambda_j zw).$$
Proof of Theorem 6. By assumption, the association parameter is nonnull and the perturbing function is increasing. Therefore, by Theorem 5, $X$ and $Y$ are either positively dependent (if the association parameter is positive) or negatively dependent (if the association parameter is negative). By ordinary properties of monotone association, the conditional expectation $g(x) = E(Y \mid X = x)$ of $Y$ given $X = x$ is an increasing and nonconstant function of $x$ if the association parameter is positive and a decreasing function of $x$ if the association parameter is negative. In either case, $g(\cdot)$ is a monotone and nonconstant function. By definition, a real valued function $q(\cdot)$ is odd if it satisfies the identity $q(-a) = -q(a)$ for any real value $a$ belonging to the domain of $q(\cdot)$. As remarked at the end of Section 2, a perturbed independence distribution is centrally symmetric: the bivariate random vectors $(X, Y)$ and $(-X, -Y)$ are identically distributed, so that
$$g(x) = E(Y \mid X = x) = E(-Y \mid -X = x) = -E(Y \mid X = -x) = -g(-x).$$
As a direct consequence, the identity $-g(x) = g(-x)$ holds true for any real value $x$ belonging to the support of $X$.
By assumption, both $X$ and $Y$ have finite second moments and their probability density functions are symmetric about the origin. Hence their expectations equal zero: $E(X) = E(Y) = 0$. Without loss of generality we can assume that their variances equal one: $E(X^2) = E(Y^2) = 1$. Finally, let $\rho$ be the correlation between $X$ and $Y$: $E(XY) = \rho$. Were $g(x)$ a linear function, it would coincide with $\rho x$, that is, the linear regression of $Y$ on $X$. By Theorem 1, the conditional distribution of $Y$ given $X = x$ is skew-symmetric, thus implying the identity $E(Y^2 \mid X = x) = 1$. However, $\rho x$ is an unbounded function but $g(x) = E(Y \mid X = x)$ is not, since it must satisfy the inequality
$$1 = E(Y^2 \mid X = x) \ge E^2(Y \mid X = x) = g^2(x).$$
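We have therefore proved by contradiction that $g(\cdot)$ is a nonlinear function. By the minimizing property of the expected value we have
$$E[\{Y - g(X)\}^2 \mid X = x] = \min_{a \in A} E[\{Y - a(X)\}^2 \mid X = x],$$
where $A$ is the set of all real valued functions defined on the support of $X$. As a direct consequence, we have
$$E[\{Y - g(X)\}^2 \mid X = x] \le E\{(Y - X)^2 \mid X = x\}.$$
By taking expectation with respect to $X$ we have
$$E\{Y - g(X)\}^2 \le E(Y - X)^2 = E(Y^2) - 2E(XY) + E(X^2).$$
By recalling that $E(X) = E(Y) = 0$ and $E(X^2) = E(Y^2) = 1$ we therefore obtain the inequality
$$E\{Y - g(X)\}^2 \le 2 - 2\rho.$$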
By expanding the squares we obtain which in turn leads to the inequality $E\{Yg(X)\} \ge \rho$ by noticing that $E\{g^2(X)\} < 1 = E(Y^2)$, by ordinary properties of variance decomposition. We further apply the same inequality to obtain
$$\mathrm{corr}\{g(X), Y\} = \frac{E\{Yg(X)\}}{\sqrt{E(Y^2)E\{g^2(X)\}}} > E\{Yg(X)\} \ge \rho,$$
which implies the strict inequality $\mathrm{corr}\{g(X), Y\} > \rho$, thus completing the proof.
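The conclusion of Theorem 6 can be illustrated numerically for one particular model. The sketch below is an illustration under stated assumptions, not part of the proof: it takes $h$ and $k$ to be standard normal densities and $\pi$ to be the standard logistic cdf, and approximates the integrals by Riemann sums on a truncated grid. The function name `theorem6_check` and its parameters are hypothetical. It uses the identity $E\{Yg(X)\} = E\{g^2(X)\}$, which follows from conditioning on $X$.

```python
import numpy as np

def theorem6_check(lam, half_width=8.0, m=2001):
    """Return (rho, corr{g(X), Y}) for f(x, y) = 2 h(x) k(y) pi(lam x y).

    Illustrative assumptions: h = k = N(0, 1) density, pi = logistic cdf.
    Integrals are approximated by Riemann sums on [-half_width, half_width].
    """
    grid = np.linspace(-half_width, half_width, m)
    dx = grid[1] - grid[0]
    phi = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
    # pi(lam * x * y): rows index x, columns index y.
    pert = 1.0 / (1.0 + np.exp(-lam * np.outer(grid, grid)))
    # g(x) = E(Y | X = x) = integral over y of y * 2 k(y) pi(lam x y).
    g = (2.0 * grid * phi * pert).sum(axis=1) * dx
    rho = (grid * g * phi).sum() * dx    # E(XY) = E{X g(X)}
    e_g2 = (g**2 * phi).sum() * dx       # E{g(X)^2} = E{Y g(X)}
    corr_g = e_g2 / np.sqrt(e_g2)        # corr{g(X), Y}, using E(Y^2) = 1
    return rho, corr_g

if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 5.0):
        rho, corr_g = theorem6_check(lam)
        print(f"lambda={lam}: rho={rho:.4f}, corr(g(X), Y)={corr_g:.4f}")
```

For each positive association parameter tried, the correlation between $g(X)$ and $Y$ should exceed the Pearson correlation $\rho$, in line with the strict inequality stated at the end of the proof.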