Association Factor for Identifying Linear and Nonlinear Correlations in Noisy Conditions

Background: In data analysis and machine learning, we often need to identify and quantify the correlation between variables. Although Pearson’s correlation coefficient has been widely used, its value is reliable only for linear relationships, and Distance correlation was introduced to address this shortcoming. Methods: Distance correlation can identify linear and nonlinear correlations; however, its performance drops in noisy conditions. In this paper, we introduce the Association Factor (AF) as a robust method for identification and quantification of linear and nonlinear associations in noisy conditions. Results: To test the performance of the proposed Association Factor, we modeled several simulations of linear and nonlinear relationships in different noise conditions and computed Pearson’s correlation, Distance correlation, and the proposed Association Factor. Conclusion: Our results show that the proposed method is robust in two ways. First, it can identify both linear and nonlinear associations. Second, the proposed Association Factor is reliable in both noiseless and noisy conditions.


Introduction
Analyzing large datasets is becoming central in science, engineering, and technology. In data mining and statistical analysis, it is essential to detect relationships between different variables [1]. Different correlation factors have been introduced to identify and quantify the relationship between variables. Pearson's correlation coefficient has been broadly used to identify and measure the strength and direction of a linear relationship between two variables.
Pearson's correlation can effectively detect linear relationships; however, it is not reliable for identifying nonlinear relationships between two variables. To address this shortcoming of Pearson's correlation, Distance correlation was introduced by Gábor J. Székely [2,3] to find both linear and nonlinear relationships between two variables. Regardless of the relationship type, Distance correlation quantifies the degree of correlation by a value between zero and one. Values close to one represent strong correlation, while values close to zero suggest weak correlation between two variables. It has been demonstrated that Distance correlation is superior to Pearson's correlation for identifying nonlinear relationships.
Although Distance correlation can identify and quantify nonlinear correlations, it does not necessarily obtain the same or comparable values for different nonlinear relationships. For example, the Distance correlation of an exponential relationship could be higher than that of a quadratic relationship. Moreover, Distance correlation values drop in noisy conditions and may not robustly reflect the strength of the correlation; low correlation values may then lead to the wrong conclusion about the strength of the relationship between two variables.
To address these shortcomings and improve the performance of Distance correlation, in this paper we propose the Association Factor (AF). The proposed AF performs robustly with regard to identifying both linear and nonlinear relationships. Moreover, we show that AF performs robustly in noisy conditions and outperforms Distance correlation in identifying noisy linear and nonlinear relationships. An overview of Pearson's correlation and Distance correlation is provided in the next section. The proposed Association Factor is presented in Section 2. Simulation models, Results, and Conclusions are presented in Sections 3-5 respectively.

Pearson's Correlation
Pearson's correlation is a measure of the strength and direction of the linear relationship between two variables. Its score ranges between −1 and one, and it describes the degree to which one variable is linearly related to another. Pearson's correlation between two variables X and Y is defined by:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where cov(X, Y) is the covariance between X and Y, $\sigma_X$ is the standard deviation of X, and $\sigma_Y$ is the standard deviation of Y. Pearson's correlation is essentially the covariance of X and Y normalized by the product of the standard deviations of X and Y [17].
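As a minimal sketch of this definition (the function name `pearson_corr` and the two test relationships are illustrative assumptions, not taken from the paper), the coefficient can be computed directly from centered samples. The example also illustrates why the coefficient is unreliable for a symmetric nonlinear relationship.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson's correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center X
    yc = y - y.mean()  # center Y
    # The 1/n factors of the covariance and standard deviations cancel.
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Hypothetical data: an exact linear relation and a symmetric parabola.
x = np.linspace(-1.0, 1.0, 201)
r_linear = pearson_corr(x, 2.0 + 3.0 * x)
r_parabola = pearson_corr(x, x**2)
```

In this sketch the linear relation gives a coefficient of one (up to rounding), while the parabola gives a coefficient of essentially zero even though Y is fully determined by X.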

Distance Correlation
Distance correlation is a measure of the correlation between two random vectors X and Y, and its value ranges from zero to one. Analogous to product-moment correlation (Pearson's correlation), Distance correlation can identify linear and nonlinear correlations using Euclidean distance. The empirical Distance correlation [2] is computed by:
$$R^2(X, Y) = \begin{cases} \dfrac{\nu_n^2(X, Y)}{\sqrt{\nu_n^2(X)\,\nu_n^2(Y)}}, & \nu_n^2(X)\,\nu_n^2(Y) > 0 \\ 0, & \nu_n^2(X)\,\nu_n^2(Y) = 0 \end{cases}$$
where R(X, Y) is the empirical Distance correlation, $\nu_n(X, Y)$ is the empirical Distance covariance of X and Y, $\nu_n(X)$ and $\nu_n(Y)$ are the empirical Distance variances of X and Y respectively, n is the sample size, and $\nu_n(\cdot, \cdot)$ is a scalar. R(X, Y) is zero if and only if X and Y are independent.
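The empirical Distance correlation can be sketched as follows. This is an illustrative implementation of the standard double-centering estimator from [2], not the authors' code; the function names and the example data are assumptions.

```python
import numpy as np

def centered_distances(v):
    """Pairwise Euclidean distances, double-centered (rows, columns, grand mean)."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(len(v), -1)  # treat 1-D input as n samples of dimension 1
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Empirical Distance correlation R(X, Y) in [0, 1]."""
    A = centered_distances(x)
    B = centered_distances(y)
    dcov2 = (A * B).mean()    # squared Distance covariance
    dvar2_x = (A * A).mean()  # squared Distance variance of X
    dvar2_y = (B * B).mean()  # squared Distance variance of Y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom <= 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# Hypothetical data: exact linear relation and a symmetric parabola.
x = np.linspace(-1.0, 1.0, 201)
d_linear = distance_correlation(x, 2.0 + 3.0 * x)
d_parabola = distance_correlation(x, x**2)
```

For the exact linear relation the value is one, while the parabola yields a clearly positive value that is well below one, which is the behavior the later sections set out to improve.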

Proposed Association Factor
In this paper, we introduce the Association Factor (AF), the Distance correlation of Optimal Transformations of variables X and Y:
$$R_{AF}(X, Y) = R(h_1(X), h_2(Y))$$
where $R_{AF}(X, Y)$ is the proposed AF, and $h_1 : \mathrm{dom}(X) \to B$ and $h_2 : \mathrm{dom}(Y) \to C$ are measurable mean zero transformations where $B, C \subseteq \mathbb{R}$, and for $\nu_n^2(h_1(X))\,\nu_n^2(h_2(Y)) > 0$, we have:
$$R^2(h_1(X), h_2(Y)) = \frac{\nu_n^2(h_1(X), h_2(Y))}{\sqrt{\nu_n^2(h_1(X))\,\nu_n^2(h_2(Y))}}$$
where $\nu_n(h_1(X), h_2(Y))$ is the empirical Distance covariance of $h_1(X)$ and $h_2(Y)$, $\nu_n(h_1(X))$ and $\nu_n(h_2(Y))$ are the empirical Distance variances of $h_1(X)$ and $h_2(Y)$ respectively, n is the sample size, and $\nu_n(\cdot, \cdot)$ is a scalar:
$$\nu_n^2(h_1(X), h_2(Y)) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl}$$
where:
$$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, \quad a_{kl} = |h_1(X_k) - h_1(X_l)|$$
$$B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot}, \quad b_{kl} = |h_2(Y_k) - h_2(Y_l)|$$
To quantify the degree of association between X and Y, we discuss a bivariate case of a response variable Y and a predictor X. Regardless of the relationship type between X and Y, we assume there are transforming functions $h_1(X)$ and $h_2(Y)$ that can transform the relationship between X and Y to a linear relation between $h_1(X)$ and $h_2(Y)$:
$$h_2(Y) = h_1(X) + \varepsilon$$
where $\varepsilon$ has a Gaussian distribution with zero mean and standard deviation $\sigma$. We can find $h_1(X)$ and $h_2(Y)$ by minimizing the Sum of Squared Errors (SSE):
$$\mathrm{SSE} = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} \left( h_2(y_i) - h_1(x_i) \right)^2$$
To minimize $\sum_{i=1}^{n} \hat{\varepsilon}_i^2$ with regard to $h_1(X)$ and $h_2(Y)$, we use a simplified optimal transformation [18] by an iterative estimation: for a given $h_2(Y)$, $h_1(X)$ is estimated, and for a given $h_1(X)$, $h_2(Y)$ is estimated in a similar way. In each iteration, $h_1(X)$ and $h_2(Y)$ will be estimated, and the iterative optimization continues until the estimate of error $\sum_{i=1}^{n} \hat{\varepsilon}_i^2$ does not decrease in iteration T, where $h_2(Y)^{(T)}$ and $h_1(X)^{(T)}$ are optimal estimates with regard to unexplained variance.
The estimated transforming functions $h_1(X)$ and $h_2(Y)$ are optimal and linear for a joint normal distribution [19], where the marginal distributions of X and Y are normal. If the joint distribution of X and Y is not normal, the estimated transforming functions $h_1(X)$ and $h_2(Y)$ are not optimal, but they are close to optimal linear transformations [18]. AF has the following property: it vanishes if and only if the two vectors are not associated, that is, $R_{AF}(X, Y) = 0$ for unassociated X and Y.
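The iterative estimation can be illustrated with a rough sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the transformations are estimated by alternating running-mean smoothers (a crude stand-in for conditional expectations), $h_2$ is rescaled to zero mean and unit variance in each iteration, and the names `running_mean_smoother` and `association_factor`, the window size, and the iteration cap are hypothetical.

```python
import numpy as np

def distance_correlation(u, v):
    """Empirical Distance correlation of two 1-D samples."""
    def centered(w):
        d = np.abs(w[:, None] - w[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    A = centered(np.asarray(u, dtype=float))
    B = centered(np.asarray(v, dtype=float))
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom)) if denom > 0 else 0.0

def running_mean_smoother(target, by, window=11):
    """Average `target` over neighbours in the sort order of `by` (assumed smoother)."""
    order = np.argsort(by, kind="stable")
    t_sorted = target[order]
    n = len(t_sorted)
    half = window // 2
    csum = np.concatenate(([0.0], np.cumsum(t_sorted)))
    lo = np.clip(np.arange(n) - half, 0, n)
    hi = np.clip(np.arange(n) + half + 1, 0, n)
    out = np.empty(n)
    out[order] = (csum[hi] - csum[lo]) / (hi - lo)
    return out

def association_factor(x, y, window=11, max_iter=20):
    """Sketch: alternate estimates of h1 and h2, then return dCor(h1(X), h2(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h2 = (y - y.mean()) / y.std()  # initial guess for h2(Y)
    best_sse, best_h1, best_h2 = np.inf, None, None
    for _ in range(max_iter):
        h1 = running_mean_smoother(h2, x, window)      # update h1 for the given h2
        h2_new = running_mean_smoother(h1, y, window)  # update h2 for the given h1
        h2_new = (h2_new - h2_new.mean()) / h2_new.std()
        sse = float(np.sum((h2_new - h1) ** 2))
        if sse >= best_sse:                            # stop once the error stops decreasing
            break
        best_sse, best_h1, best_h2 = sse, h1, h2_new
        h2 = h2_new
    return distance_correlation(best_h1, best_h2), best_h1, best_h2

# Hypothetical noiseless parabola.
x = np.linspace(-1.0, 1.0, 200)
y = x**2
af, h1, h2 = association_factor(x, y)
dcor_raw = distance_correlation(x, y)
```

Under these assumptions, the sketch gives a value close to one for the noiseless parabola, whereas the Distance correlation of the untransformed variables stays well below one. A different smoother or normalization would change the numerical values.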

Simulation Models
To test the performance of the proposed AF, we modeled several simulations and computed Pearson's correlation, Distance correlation, and the proposed AF. Because Distance correlation and the proposed AF take values between zero and one, we calculate the absolute value of Pearson's correlation to provide a fair comparison between these methods. The aforementioned correlation coefficients are quantified for linear and nonlinear correlations. We have also obtained these correlation coefficients for random relation (no relationship) as follows.

Linear and Nonlinear Relationships in Noiseless Conditions
We simulated the following relationships:
• Linear: $Y = \beta_0 + \beta_1 X$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.
• Fourth order polynomial, where $\beta_0$ and $\beta_1$ are coefficients.
The simulation steps are summarized below.

1. Let $\Omega_D$ be a set of D relationship types $l_1$ to $l_D$. Generate pairwise variables using the relationships in the relationship set $\Omega_D$ so that D different datasets $\Gamma_1, \Gamma_2, \ldots, \Gamma_D$ representing the relationship types $l_1$ to $l_D$ are obtained.
2. For each generated dataset $\Gamma_d$, compute Pearson's correlation (absolute value) $\rho_d$, Distance correlation $R_d$, and Association Factor $R^d_{AF}$.

Linear and Nonlinear Relationships in Noisy Conditions
To test the performance of the proposed AF in noisy conditions, we corrupt the true relationships with low, medium, and high noise, where the added noise is white (Gaussian) noise and the noise level is specified by the standard deviation of the Gaussian distribution ($\sigma$). We then quantify the linear and nonlinear correlations in noisy conditions using Pearson's correlation, Distance correlation, and Association Factor. In the noisy conditions, we calculate the Monte Carlo average of each correlation coefficient over T = 100 instances (trials) of the same noise level. The simulation steps are summarized below.
1. Let $\Omega_D$ be a set of D relationship types $l_1$ to $l_D$. Generate pairwise variables using the relationships in the relationship set $\Omega_D$ so that D different datasets $\Gamma_1, \Gamma_2, \ldots, \Gamma_D$ representing the relationship types $l_1$ to $l_D$ are obtained.
2. Set t = 1.
3. Generate noisy datasets $\Phi^1_t, \Phi^2_t, \ldots, \Phi^D_t$ by adding Gaussian noise (with noise level $\sigma$) to the datasets $\Gamma_d$ generated using the true relationships ($l_d$).
4. Compute and save Pearson's correlation (absolute value) $\rho^d_t$, Distance correlation $R^d_t$, and Association Factor $R^d_{AF,t}$ for each noisy dataset $\Phi^d_t$.
5. Increase t by one (t = t + 1).
6. Repeat Steps 3 to 5 while t ≤ T.
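The Monte Carlo procedure can be sketched as below. For brevity the sketch averages only the absolute Pearson's correlation; any other measure with the same call signature could be passed in. The linear coefficients, noise levels, sample grid, and function names are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def abs_pearson(x, y):
    """Absolute value of Pearson's correlation."""
    return abs(float(np.corrcoef(x, y)[0, 1]))

def monte_carlo_average(relationship, x, sigma, measure, n_trials=100, seed=0):
    """Average `measure` over n_trials noisy copies of the true relationship."""
    rng = np.random.default_rng(seed)
    y_true = relationship(x)                 # dataset from the true relationship
    scores = []
    for _ in range(n_trials):                # loop over trials t = 1, ..., T
        noise = rng.normal(0.0, sigma, size=x.shape)
        y_noisy = y_true + noise             # noisy dataset for this trial
        scores.append(measure(x, y_noisy))   # compute and save the coefficient
    return float(np.mean(scores))

# Hypothetical linear relationship and noise levels.
x = np.linspace(0.0, 1.0, 200)

def linear(v):
    return 1.0 + 2.0 * v

avg_low = monte_carlo_average(linear, x, sigma=0.05, measure=abs_pearson)
avg_high = monte_carlo_average(linear, x, sigma=2.0, measure=abs_pearson)
```

With these assumed settings the average coefficient stays near one at the low noise level and drops substantially at the high noise level, which mirrors the qualitative behavior reported for Pearson's correlation on linear data.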

No Relationship
We also investigated whether functions $h_1$ and $h_2$ may introduce a spurious relationship between X and Y. To address this, we obtained Pearson's correlation, Distance correlation, and AF for no relationship (random noise).

Symmetry Regarding Sample Size, Missing Data, and Noise Level
Next, we study the symmetry of AF regarding the response and factor. The goal here is to investigate whether the Association Factor quantifies the relationship between X and Y regardless of their order, that is, whether $R^2_{AF}(X, Y)$ is equal to $R^2_{AF}(Y, X)$. For both the true relationship without noise and the noisy relationship, the two values are computed by exchanging the roles of the response and the factor. We study the symmetry of AF with regard to the sample size and noise level for nonlinear relationships. We will show that with a small sample size, the underlying relationship cannot be visually identified even in the noiseless case due to the missing data.

Entropic Distance
We also compute Entropic Distance (ED) and compare it with AF. Entropic Distance, also called "relative entropy", is the difference between entropies with and without a prior condition [20]. The conditional entropy of two variables X and Y taking values x and y, respectively, is defined by:
$$H(Y \mid X) = -\sum_{x, y} p(x, y) \log_b \frac{p(x, y)}{p(x)}$$
where p(x, y) is the joint probability of x and y, p(x) is the marginal probability of x, and b is the logarithm base. ED has the following properties [21]: 1. ED is symmetric. 2. ED is zero for comparing a distribution with itself. 3. ED is positive for two different distributions.
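As a small illustration of the conditional entropy only (not of the full Entropic Distance computation), the standard definition $H(Y \mid X) = -\sum p(x, y) \log_b [p(x, y) / p(x)]$ can be evaluated on a discrete joint probability table. The function name and the two example tables are assumptions.

```python
import numpy as np

def conditional_entropy(joint, base=2.0):
    """H(Y|X) for a joint table with rows indexed by x and columns by y."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()                            # normalize to a probability table
    px = p.sum(axis=1, keepdims=True)          # marginal p(x)
    px_full = np.broadcast_to(px, p.shape)
    mask = p > 0                               # terms with p(x, y) = 0 contribute 0
    ratio = p[mask] / px_full[mask]
    return float(-np.sum(p[mask] * np.log(ratio)) / np.log(base))

# Hypothetical tables: independent fair bits, and Y fully determined by X.
h_independent = conditional_entropy([[0.25, 0.25], [0.25, 0.25]])
h_determined = conditional_entropy([[0.5, 0.0], [0.0, 0.5]])
```

With base b = 2, the independent table gives one bit of remaining uncertainty about Y, while the deterministic table gives zero.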
AF values are bounded between zero and one, but ED does not have an upper bound and can take any positive value. Therefore, the interpretation of ED is subjective, while AF can objectively represent the strength of the underlying relationship. For this reason, rather than comparing the ED and AF values directly, we computed the AF ratio and the ED ratio for different noise conditions. Let $R^2_{AF_L}$ and $R^2_{AF_H}$ be the AF for a relationship corrupted with different noise levels. The AF ratio, $I_{AF}$, is computed from these two values.

Detrended Fluctuation Analysis (DFA)
Peng et al. introduced Detrended Fluctuation Analysis (DFA), which is commonly used in time series analysis and stochastic processes [22]. It is an alternative method in comparison with the autocorrelation function and is often used for determining the statistical self-similarity of a signal. It can detect long-range correlations in a patchy signal. The computation of DFA [22,23] is summarized below.
For a time series of total length N:
• Integrate the time series:
$$y(k) = \sum_{i=1}^{k} \left( B_i - B_{avg} \right)$$
where $B_i$ is the ith interval and $B_{avg}$ is the average interval.
• Remove the trend (detrend) from the integrated time series y(k) by subtracting the local trend $y_n(k)$ in each box.
• Calculate the root-mean-squared fluctuation, F(n), of the obtained detrended time series by:
$$F(n) = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left[ y(k) - y_n(k) \right]^2}$$
• Repeat this computation over all time scales (box size n) to provide a relationship between F(n) and the box size (n).
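The DFA steps above can be sketched as follows. This is an illustrative version that assumes non-overlapping boxes, a least-squares line as the local trend, and truncation of the series to a whole number of boxes; the function names, box sizes, and test signal are assumptions.

```python
import numpy as np

def dfa_fluctuation(series, box_size):
    """Root-mean-squared fluctuation F(n) for one box size n."""
    b = np.asarray(series, dtype=float)
    y = np.cumsum(b - b.mean())                    # integrate the series
    n_boxes = len(y) // box_size
    boxes = y[: n_boxes * box_size].reshape(n_boxes, box_size)
    k = np.arange(box_size)
    resid_sq = 0.0
    for box in boxes:
        slope, intercept = np.polyfit(k, box, 1)   # local linear trend y_n(k)
        resid_sq += np.sum((box - (slope * k + intercept)) ** 2)
    return float(np.sqrt(resid_sq / (n_boxes * box_size)))

def dfa_exponent(series, box_sizes):
    """Slope of log F(n) versus log n over the given box sizes."""
    f = [dfa_fluctuation(series, n) for n in box_sizes]
    slope, _ = np.polyfit(np.log(box_sizes), np.log(f), 1)
    return float(slope)

# Hypothetical test signal: uncorrelated Gaussian noise.
rng = np.random.default_rng(0)
white = rng.normal(size=4096)
alpha = dfa_exponent(white, [16, 32, 64, 128, 256])
```

For uncorrelated noise the log-log slope is expected to be near 0.5, and larger slopes indicate long-range correlations in the signal.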

Simulation Results and Discussion
We compared the performance of Pearson's correlation, Distance correlation, and the Association Factor for the relationships that were explained in detail in the previous section. Noiseless linear and nonlinear relationships are depicted in Figure 1. The noisy relationships with low, moderate, and high noise are shown in Figures 2-4, respectively. The performance of Pearson's correlation, Distance correlation, and Association Factor in identifying these linear and nonlinear relationships is depicted in Figures 5-8. The performance of these correlation factors at different noise levels is discussed in the following sections.

True Signal (No Noise)
Noiseless linear, exponential, parabolic, and fourth order polynomial relationships are shown in Figure 1. The correlations quantified by Pearson's correlation, Distance correlation, and Association Factor are summarized in Table 1. As expected, Pearson's correlation obtained a value of one for the noiseless linear relationship, but its value was not reliable for nonlinear relationships such as exponential and polynomial. Distance correlation identified both linear and nonlinear relationships, but as we can see in Table 1, its performance was not robust with regard to the underlying relationship type between two variables. It scored one for the noiseless linear relationship, while it scored 0.47 for the fourth order polynomial, 0.91 for exponential, and 0.5 for parabolic. In contrast, the proposed AF could robustly identify the underlying relationship, and its value was one regardless of the relationship type (linear, exponential, or polynomial).

Noisy Relationships
Linear, exponential, parabolic, and fourth order polynomial relationships corrupted with low, moderate, and high noise are shown in Figures 2-4, respectively. Pearson's correlation, Distance correlation, and Association Factor are summarized for low, moderate, and high noise in Tables 2-4, respectively. For the linear relationship, the absolute value of Pearson's correlation dropped from one (noiseless) to 0.98, 0.82, and 0.58 for low, moderate, and high noise. We can observe that its value is not reliable for nonlinear relationships. For the exponential relationship, its absolute value dropped from 0.86 (noiseless) to 0.58 (high noise). For the fourth order polynomial, its absolute value increased from 0.06 (noiseless) to 0.23 (low noise) and then dropped to 0.14 (high noise). For the parabolic relationship, its absolute value increased from 0.05 (noiseless) to 0.21 (low noise) and then dropped to 0.16 (high noise).
Distance correlation had steady performance for the linear relationship, and its value decreased from one (noiseless) to 0.56 (high noise). However, its performance for nonlinear relationships was not consistent. Its score in identifying the exponential relationship was comparable with its score for the linear relationship. Its value for the exponential relationship was 0.98 (noiseless) and decreased to 0.56 (high noise). Its score for the parabolic relationship was 0.50 (noiseless) and dropped to 0.37 (high noise). In identifying the fourth order polynomial, Distance correlation scored 0.47 for noiseless and decreased to 0.29 for high noise.
In contrast, as we can see, the proposed AF had robust performance regardless of the relationship type (Figures 5-8). Moreover, it had robust performance in noiseless and noisy conditions. Its value for noiseless relationships (linear, exponential, parabolic, and fourth order polynomial) was steady and equal to one. Its value in low noise was still about one (0.99) regardless of the relationship type. In the moderate noise condition, AF consistently identified the underlying relationship with scores from 0.82 (linear) to 0.95 (parabolic). Even in the high noise condition, where the underlying relationship was substantially corrupted with noise (Figure 4), AF was able to identify the underlying correlations with scores from 0.58 (linear) to 0.72 (parabolic).
AF had comparable performance with Pearson's correlation in identifying the linear relationship. It outperformed Distance correlation in identifying noiseless nonlinear relationships. Moreover, it outperformed Distance correlation in identifying nonlinear relationships in noisy conditions. Its score (0.69) was more than twice that of Distance correlation (0.29) in identifying nonlinear correlations in high noise.

No Relationship
We computed Pearson's correlation, Distance correlation, and AF for no relationship (random noise). The results are summarized in Table 5. As we can see, all correlation factors, including Pearson's correlation, Distance correlation, and AF, obtained values close to zero, indicating there was no relationship between X and Y. This also clarifies that functions $h_1$ and $h_2$ did not introduce a spurious relationship between X and Y.
To study the symmetry of AF in quantifying the relationship between response Y and factor X regardless of their order, we computed $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$ and compared them. We computed AF with regard to sample size and noise level for different relationships. Figure 9 from top to bottom shows a randomly sampled true (noiseless) circular relationship of size 100, 50, and 30, respectively. The second and third columns of Figure 9 show $h_1$ and $h_2$ obtained by $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$, respectively. As we can observe, the transform functions were symmetric and linear regardless of the order of the response and factor, even for a small sample size.
The first and second rows of Figure 10 show two instances of the randomly sampled true (noiseless) circular relationship of size 10. The third and fourth rows of Figure 10 show two instances of the randomly sampled true (noiseless) circular relationship of size 30. As we can observe in this figure, because of the small sample size, the true underlying relationship is not visible due to the missing data points. The second and third columns of Figure 10 show $h_1$ and $h_2$ obtained by $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$, respectively. As we can observe, regardless of the order of the response and factor, and despite missing data, transform functions $h_1$ and $h_2$ were symmetric and almost linear even for a dataset with a very small sample size.
Distance correlation, Maximal Information Coefficient (MIC) [24], $R^2_{AF}(X, Y)$, and $R^2_{AF}(Y, X)$ were obtained for the randomly sampled true circular relation of different sample sizes and are summarized in Table 6. As we can see, regardless of the sample size, AF could quantify the relationship even for a very small sample size of 10 (with missing data points). Moreover, AF was symmetric even for a small sample size of 30 and was almost symmetric for a very small sample size of 10. AF slightly decreased as the sample size was reduced. AF outperformed MIC, and MIC performed better than Distance correlation. MIC also decreased as the sample size was reduced. Distance correlation was in a range between 0.22 and 0.25 for sample sizes from 30 to 100. Its performance for a sample size of 10 was sporadic. For example, it scored 0.56 for a typical example of the randomly sampled true circular relationship of size 10 depicted in Figure 10, second row. This could potentially be due to the arrangement of data points in this random sample of a circle, which rather represents a linear relationship.
Figure 11 from top to bottom shows a randomly sampled circular relationship of size 10, 30, 50, and 100, respectively, corrupted with Gaussian noise. The second and third columns of Figure 11 show $h_1$ and $h_2$ obtained by $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$, respectively. As we can observe, regardless of the order of response and factor, the transform functions were symmetric and almost linear even for a very small sample of size 10.
Distance correlation, the Maximal Information Coefficient (MIC), $R^2_{AF}(X, Y)$, and $R^2_{AF}(Y, X)$ were obtained for the randomly sampled circular relation of different sample sizes corrupted with high noise and are summarized in Table 7. As we can see, regardless of the sample size, AF could quantify the relationship even for a very small sample size of 10 (with missing data points). Moreover, AF was symmetric for moderate sample sizes (from 50 to 100) and was almost symmetric for small and very small sample sizes of 30 and 10, respectively. Similar to the noiseless scenario, AF slightly decreased as the sample size was reduced. AF outperformed MIC, and MIC performed better than Distance correlation. MIC values were in a range from 0.28 to 0.31 for sample sizes from 30 to 100, but MIC was higher for the noisy relationship with a sample size of 10. The Distance correlation values were in a range from 0.19 to 0.25 for sample sizes from 30 to 100; however, it was 0.36 for a typical random sample of size 10 from the circular relationship corrupted with noise (depicted in Figure 11).
Next, we investigated the distribution of Distance correlation, Maximal Information Coefficient (MIC), $R^2_{AF}(X, Y)$, and $R^2_{AF}(Y, X)$. Figure 12 shows the Monte Carlo empirical distribution of these correlation measures for the randomly sampled true circular relationship of size 30. Distributions were estimated by 1000 Monte Carlo samples. Distance correlation had a positively skewed distribution with a mode at about 0.3. MIC had a multimodal distribution with modes at about 0.3, 0.4, 0.5, and 0.6, with the highest mode at about 0.4. AF was negatively skewed with a mode at about 0.9975. We can also observe that $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$ have similar distributions.
Figure 13 shows the Monte Carlo empirical distribution of these correlation measures for the randomly sampled circular relationship of size 30 corrupted with high Gaussian noise. Similar to the previous simulation, distributions were estimated by 1000 Monte Carlo samples. Distance correlation had a positively skewed distribution with a mode at about 0.27. MIC had a multimodal distribution with the highest mode at about 0.25. AF was negatively skewed with a mode at about 0.7. Again, we can see that $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$ have similar distributions.
To study AF in a different noise condition, we corrupted the randomly sampled circular distribution with exponential noise and obtained the values of Distance correlation, Maximal Information Coefficient, $R^2_{AF}(X, Y)$, and $R^2_{AF}(Y, X)$. Figure 14 shows the Monte Carlo empirical distribution of these correlation measures for the randomly sampled circular relationship of size 30 corrupted with high exponential noise. Distributions were estimated by 1000 Monte Carlo samples. We see again that Distance correlation had a positively skewed distribution with a mode at about 0.27. MIC had a multimodal distribution with the highest mode at about 0.3. AF was negatively skewed with a mode at about 0.8. As we can see, $R^2_{AF}(X, Y)$ and $R^2_{AF}(Y, X)$ have similar distributions.

Entropic Distance
We compared the performance of Entropic Distance (ED) and the Association Factor (AF) for linear, polynomial, exponential, and parabolic relationships. Their values for noiseless, low, moderate, and high noise are summarized in Table 8. AF performed consistently with a value of one for true relationships regardless of the relationship type. ED ranged from 1.1 to about two for different true relationships. The highest value obtained by ED was for the true linear relationship. In low, moderate, and high noise, the lowest value obtained by ED was for the parabolic relationship (0.848, 0.825, and 0.812, respectively). ED did not have an upper bound and could take any positive value, while the AF values were bounded between zero and one. Therefore, rather than comparing AF and ED, we computed the AF ratio and the ED ratio in different noise conditions. Table 9 shows the ratios as a percentage for both metrics where the noise level was increased from (a) noiseless to low noise, (b) low noise to moderate noise, and (c) moderate noise to high noise. ED ratios indicated that ED decreased by increasing the noise level. Similarly, the AF ratio decreased by increasing the noise level (Table 9). For polynomial and parabolic relationships, ED had a substantial decrease from noiseless to low noise, and it almost stabilized and had slight changes afterward by increasing the noise level. In contrast, AF had a consistent response to noise, and it decreased gradually, while its values for low noise were almost the same as the noiseless case. In comparison with ED: 1. AF is bounded; 2. AF obtains the same value regardless of the relationship type in noiseless condition; 3. AF can better quantify the correlation in noisy conditions.

Detrended Fluctuation Analysis
Pearson's correlation, AF, and Detrended Fluctuation Analysis (DFA) [22] were obtained for different relationships and are summarized in Table 10. The Pearson's correlation coefficient was computed before and after detrending the data. As we can see in Table 10, Pearson's correlation could identify a strong correlation even for nonlinear relationships after detrending the data. Interestingly, the DFA values were almost identical to the Pearson's correlation values obtained for detrended data. This could be explained by visualizing the detrended data for a nonlinear relationship. As we can see in Figure 15, a polynomial relationship (Figure 15, left) was transformed to an approximately linear relationship after detrending the data (Figure 15, right). Hence, after detrending the data, Pearson's correlation could detect the nonlinear relation. We can conclude that AF and DFA are both hybrid methods. Both methods transform the data first and then quantify the relationship of the transformed data.

Conclusions and Future Work
We introduced a new method to identify and quantify correlation between two variables. The proposed coefficient, Association Factor (AF), is a robust method for the identification and quantification of both linear and nonlinear associations. We applied the proposed method to several different relationships including linear, exponential, parabolic, polynomial, and circular. The results demonstrated that AF could identify both linear and nonlinear relationships. Its value was equal to one in noiseless conditions regardless of the relationship type. Moreover, we tested AF in noisy conditions where the true relationships were corrupted with noise. AF could successfully identify the correlations in low, moderate, and high noise conditions. We also tested AF under different noise distributions, Gaussian and exponential. Regardless of the noise distribution, AF could successfully quantify the correlation.
We studied the distribution of AF and compared it with the distributions of Distance correlation and MIC. We also investigated the AF values for a very small sample size where the relationship was severely under-sampled. Despite the fact that a substantial amount of data was missing due to the very small sample size, AF could still quantify the underlying correlation. We compared AF with ED and discussed its advantages over ED. AF had similar performance to Pearson's correlation in identifying the linear relationship in noiseless and noisy conditions, and its value was equal to one for the noiseless linear relationship. AF outperformed Distance correlation and MIC in noiseless linear and nonlinear relationships. It also outperformed Distance correlation and MIC in noisy linear and nonlinear relationships. The results demonstrated that AF was robust with regard to the relationship type, as well as the noise condition. Although we studied the bivariate case in this work, AF could be extended to quantify the relationship between several factors and a response, and our future work is focused on implementing the Multivariate Association Factor (MAF), with a potential iterative model defined for a k × 1 vector of factors $X_k$ and a response Y.

Conflicts of Interest:
The authors declare no conflict of interest.