Using Ramsey theory to measure unavoidable spurious correlations in Big Data

Given a dataset we quantify how many patterns must always exist in the dataset. Formally this is done through the lens of Ramsey theory of graphs, and a quantitative bound known as Goodman's theorem. Combining statistical tools with Ramsey theory of graphs gives a nuanced understanding of how far away a dataset is from random, and what qualifies as a meaningful pattern. This method is applied to a dataset of repeated voters in the 1984 US congress, to quantify how homogeneous a subset of congressional voters is. We also measure how transitive a subset of voters is. Statistical Ramsey theory is also used with global economic trading data to provide evidence that global markets are quite transitive.


Introduction
In the realm of data science, the conventional wisdom is that "more data is always better", but is this the case? As a dataset $D$ becomes larger, Ramsey theory describes the mathematical conditions by which disorder becomes impossible. The impossibility of disorder is analogous to the existence of unavoidable and spurious correlations in large datasets. This makes understanding and measuring the extent of these spurious correlations essential in any attempt to glean meaningful information about $D$. In 2016 [2], Calude asked the question, how can Ramsey theory be used to understand spurious and unavoidable correlations in data science?
For example, the pigeonhole principle is an extreme, basic version of the Ramsey statement, "if a given person wears 8 different shirts in a given week, then there must have been a day where they wore at least 2 shirts." Here the dataset is the collection of shirts, with each shirt assigned a day. The unavoidable spurious correlation is that (at least) two shirts are assigned to the same day. In this case, there is no meaningful conclusion we can draw, despite the natural human desire to attribute meaning to a pattern that is observed.
However, we might try to draw meaningful conclusions if we identify a day where the person wore 3 shirts on the same day, or multiple days where they wore multiple shirts, because the pigeonhole principle on its own cannot guarantee these beyond the base requirement that there is a single day where two shirts must be worn.
Goodman's formula [5] provides a way to calculate the required number of certain relationships in a relational database. We use Goodman's formula to quantify how many correlations must be observed to ensure that some of the correlations are not spurious. Put another way, we use Goodman's formula to test the null hypothesis $H_0$ that a graph representing the relationships in a dataset is random.
In section 2 we present the relevant definitions and mathematical framework. In section 3 we introduce the needed Ramsey technology of Goodman's formula. In section 4 we apply this to two real life models: (1) similarity of voting records for the members of the 1984 US congress, and (2) economic trading data between countries. In section 5 we give an application of Goodman's formula to measuring the transitivity of a graph. Finally, in section 6 we discuss further directions for research.

Mathematical framework
Our main model is a graph $G$, which is a collection of data points $V$, called the vertices, and a collection of connected (unordered) pairs of vertices $E$, called the edges, such that $G = (V, E)$. An edge $a$ between vertices $v_1$ and $v_2$ represents that $v_1$ and $v_2$ are related (in an abstract sense). This edge relationship will be intrinsic to each dataset and what it is trying to measure. For example, if the vertices are points in a metric space we might assign an edge when the distance between two points is $\le 1$. Another example is when the vertices are people in a room, and we put an edge between two people if they are friends with each other.
We insist that a vertex cannot be related to itself (a so-called loop). A graph can then be described as an adjacency matrix by explicitly listing out which vertices have an edge between them: a matrix $A = (a_{ij})_{1 \le i,j \le N}$ is an adjacency matrix if it is symmetric with entries of 0, 1 and with 0s along the diagonal. An adjacency matrix can be thought of as a graph on vertices $\{1, \ldots, N\}$ where there is an edge between $i$ and $j$ iff $a_{ij} = 1$. This perspective is useful for the following reason:
Lemma 1. Let $A$ be an $N \times N$ adjacency matrix, and $k \ge 1$. In the matrix $A^k$, the $ij$-th entry is the number of paths of length $k$ from vertex $i$ to vertex $j$ (where a path is allowed to revisit vertices).
In other words, if the first power ($k = 1$) of the adjacency matrix $A$ represents an edge (path length = 1) between two vertices $v_1$ and $v_2$, higher powers of the adjacency matrix give us insight into the number of paths between $v_1$ and $v_2$ of length $k$. A graph with $N$ vertices where all $\binom{N}{2}$ (pairwise) possible edges are included is called a complete graph, and is denoted by $K_N$. In the case $N = 3$, we call $K_3$ a triangle.

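Corollary 2. Let $A$ be an $N \times N$ adjacency matrix. The $i$-th diagonal entry of $A^3$ is twice the number of triangles in $A$ containing the vertex $i$, since each such triangle can be traversed in two directions. The number of triangles in $A$ is $\mathrm{Trace}(A^3)/6$, where $\mathrm{Trace}(A^3)$ is the sum of the diagonal entries of $A^3$ and the factor of 6 takes into account overcounting (three starting vertices, two directions).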
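Corollary 2 is easy to check numerically. The following is a minimal sketch in plain Python; the helper names `mat_mul`, `count_triangles` and `complete_graph` are illustrative assumptions and do not come from the original analysis.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(A):
    """Number of triangles in the graph with adjacency matrix A,
    computed as Trace(A^3) / 6."""
    A3 = mat_mul(mat_mul(A, A), A)
    trace = sum(A3[i][i] for i in range(len(A)))
    return trace // 6


def complete_graph(n):
    """Adjacency matrix of K_n: ones everywhere except the diagonal."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]


# A path on 4 vertices (0-1-2-3) contains no triangle.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]

triangles_in_k6 = count_triangles(complete_graph(6))
triangles_in_path = count_triangles(path)
```

For $K_6$ the count agrees with $\binom{6}{3} = 20$, and for the path it is 0.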
Example of Corollary 2: Suppose we have a dataset with size $N = 6$. The number of triangles that exist in the complete graph $K_6$ is $\binom{6}{3} = 20$, where each triplet $(e_i, e_j, e_k)$ is a triplet of edges that create a triangle ($K_3$). Whether each edge has a value of 1 or 0 in the adjacency matrix $A$ determines whether these triangles exist. In the graph of Figure 1, no triangles exist because when we replace the edges $a$, $b$, $c$, and $d$ with 1 and everything else with 0, no triplet of edges is complete.
In this framework, if a triangle exists in the adjacency matrix $A$, then all three points $(v_i, v_j, v_k)$ are connected to each other based on how the predetermined relationship is defined (whether it be geographic distance or some measurement of friendship, for example). In this way, a $K_3$ represents the simplest non-trivial emergent "pattern" that can be observed in a graph connecting data points in $D$, so it's the natural starting point for asking the question, "Which patterns are forced to exist in $D$ given how we've connected its data points in the adjacency matrix $A$?".
This framework is good in black-and-white, binary situations where any pair of vertices is either (strongly) related or not related (at all). In non-binary relationships, it can be useful to think about graphs whose edges are classified by multiple colors. This can be represented as a partition of the edge set $E$ into $r$-many disjoint sets $E = E_{c_1} \sqcup E_{c_2} \sqcup \ldots \sqcup E_{c_r}$, where $c_1, \ldots, c_r$ represent a total of $r$ colors or classifications.
In the case of two colors, we will often just refer to red ($R$) and blue ($B$) edges. In the framework of adjacency matrices, a complete graph $A$ with an edge-coloring using two colors is represented by two adjacency matrices: $B$, indicating the pairs for which a relationship exists, and $R$, indicating the pairs for which it does not. Take the edge $a$ between $v_1$ and $v_2$ in $R$ and set it equal to $a = 1$. Since the edge is colored red, it necessarily has to have an entry equal to zero ($1 - a = 1 - 1 = 0$) in the blue adjacency matrix $B$. In this case $R + B$ must be the matrix of all ones, except on the diagonal where it has zeros. Counting monochromatic triangles in $A$ is particularly simple:
Corollary 3. Let $A$ be an $N \times N$ adjacency matrix whose edges are colored using two colors. The number of monochromatic triangles in $A$ is $(\mathrm{Trace}(B^3) + \mathrm{Trace}(R^3))/6$.
Therefore, the total number of monochromatic triangles in the dataset $D$ is equal to the sum of the red and blue triangles present in the adjacency matrices $R$ and $B$.
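As an illustration of counting monochromatic triangles with two adjacency matrices, the sketch below builds the blue matrix as the complement of the red one and evaluates $(\mathrm{Trace}(R^3) + \mathrm{Trace}(B^3))/6$. The function names and the pentagon example are assumptions made for illustration only.

```python
def trace_cube(A):
    """Trace(A^3), computed directly as the sum of A[i][j]*A[j][k]*A[k][i]."""
    n = len(A)
    return sum(A[i][j] * A[j][k] * A[k][i]
               for i in range(n) for j in range(n) for k in range(n))


def blue_from_red(R):
    """Blue matrix B such that R + B is all ones off the diagonal."""
    n = len(R)
    return [[0 if i == j else 1 - R[i][j] for j in range(n)] for i in range(n)]


def mono_triangles(R):
    """Number of monochromatic triangles in the two-colored complete graph
    whose red edges are given by the adjacency matrix R."""
    B = blue_from_red(R)
    return (trace_cube(R) + trace_cube(B)) // 6


# Two-coloring of K_5: red edges form a 5-cycle, blue edges form the
# complementary 5-cycle, so no triangle is monochromatic.
pentagon = [[1 if (i - j) % 5 in (1, 4) else 0 for j in range(5)]
            for i in range(5)]

# An all-red K_6: every one of the 20 triangles is monochromatic.
all_red_k6 = [[0 if i == j else 1 for j in range(6)] for i in range(6)]
```

The pentagon coloring shows that five vertices are not enough to force a monochromatic triangle.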

The Ramsey perspective
Classical Ramsey theory asks: "Fix $m$, $r$. Does every edge coloring of a complete graph $K_N$ with $r$ colors contain a complete subgraph $K_m$, all of whose edges have the same color?" In other words, how big does a multi-colored, complete graph need to be to force the existence of a smaller single-colored, complete graph?
In 1929, Ramsey [9] showed that if the size of the dataset $D$ was $N \ge 6$, and the number of ways the data points could be related to each other was $r = 2$ (either related or unrelated), then subgraphs of three mutually related or three mutually unrelated data points are forced to exist.
In 1959, Goodman quantified how many single-colored (monochromatic) triangles must be present in a two-colored $K_N$. Because a triangle ($K_3$) represents the simplest object that describes how data points relate to each other beyond a simple edge, it will form the basis of our application of Ramsey theory.
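For very small $N$ the Ramsey statement can be verified by exhaustive search. The sketch below enumerates every red/blue coloring of $K_N$ and records the smallest number of monochromatic triangles that occurs; the function name `min_mono_triangles` is an assumption, and the search is only feasible for small $N$ since there are $2^{\binom{N}{2}}$ colorings.

```python
from itertools import combinations, product


def min_mono_triangles(n):
    """Smallest number of monochromatic triangles over all 2-colorings of K_n."""
    pairs = list(combinations(range(n), 2))
    index = {p: i for i, p in enumerate(pairs)}
    # Each triangle is stored as the indices of its three edges.
    triangles = [(index[(a, b)], index[(a, c)], index[(b, c)])
                 for a, b, c in combinations(range(n), 3)]
    best = len(triangles)
    for coloring in product((0, 1), repeat=len(pairs)):
        mono = sum(coloring[x] == coloring[y] == coloring[z]
                   for x, y, z in triangles)
        best = min(best, mono)
    return best


forced_at_5 = min_mono_triangles(5)  # some coloring avoids all of them
forced_at_6 = min_mono_triangles(6)  # every coloring contains at least this many
```

With five vertices the minimum is 0, while with six vertices it is 2, so a monochromatic triangle becomes unavoidable exactly at $N = 6$.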
Theorem 4 (Goodman 1959, [5]). Let $G$ be a graph with $N$ vertices and edge-colored with red and blue. The quantity of monochromatic triangles in $G$ is at least:
$$\begin{cases} \dfrac{u(u-1)(u-2)}{3} & \text{if } N = 2u, \\ \dfrac{2u(u-1)(4u+1)}{3} & \text{if } N = 4u+1, \\ \dfrac{2u(u+1)(4u-1)}{3} & \text{if } N = 4u+3. \end{cases}$$
Since the total number of triangles in $K_N$ is $\binom{N}{3} = \frac{N(N-1)(N-2)}{6}$, Goodman's formula may be reinterpreted as a percentage.
Corollary 5 (Goodman 1959, [5]). Let $G$ be a graph with $N$ vertices and edge-colored with red and blue. The percentage of triangles in $G$ that are monochromatic is asymptotically at least $\frac{N-3}{4N} \to \frac{1}{4}$.
This can be shown directly by dividing the quantities in Theorem 4 by $\binom{N}{3}$. Alternatively, by applying Schwenk's reformulation of Goodman's formula [10], we can easily prove this:
Proof. For $N$ data points, the forced number of monochromatic red ($R$) and blue ($B$) triangles is
$$\binom{N}{3} - \left\lfloor \frac{N}{2} \left\lfloor \left( \frac{N-1}{2} \right)^2 \right\rfloor \right\rfloor,$$
and since the number of triangles present in any complete graph is $\binom{N}{3}$, the following ratio describes the percentage of triangles in $G$ that are monochromatic:
$$\frac{\binom{N}{3} - \left\lfloor \frac{N}{2} \left\lfloor \left( \frac{N-1}{2} \right)^2 \right\rfloor \right\rfloor}{\binom{N}{3}}.$$
The floor functions change the numerator by an amount that is at most linear in $N$, while $\binom{N}{3}$ is cubic in $N$, so the asymptotic nature of the above ratio can be described without taking the floor functions into consideration:
$$\frac{\binom{N}{3} - \frac{N(N-1)^2}{8}}{\binom{N}{3}} = 1 - \frac{3(N-1)}{4(N-2)} = \frac{N-5}{4(N-2)} \to \frac{1}{4},$$
which agrees with $\frac{N-3}{4N}$ up to terms of order $1/N^2$.
From this we can establish a threshold for when a two-colored graph can be interpreted to have meaningful correlations.
Definition 6. Let $G$ be a graph with $N$ vertices and edge-colored with red and blue. Let $\mathrm{Mono}(G)$ be the percentage of triangles in $G$ that are monochromatic, among all possible $\binom{N}{3}$ triangles in $G$. Let $\mathrm{Goodman}(N)$ be the minimum percentage of monochromatic triangles in $G$ guaranteed by Corollary 5, which has been shown to approach 0.25 as $N \to \infty$. If $\mathrm{Mono}(G) > \mathrm{Goodman}(N)$ then we say that $G$ has potentially meaningful correlations, which we explore further in section 4.3.
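To make the threshold $\mathrm{Goodman}(N)$ concrete, the sketch below evaluates the forced number of monochromatic triangles in two ways and compares them. It assumes the standard statements of the bound: the closed form $\binom{N}{3} - \lfloor \frac{N}{2} \lfloor (\frac{N-1}{2})^2 \rfloor \rfloor$ attributed to Schwenk, and Goodman's three-case formula. The function names are illustrative assumptions.

```python
from math import comb


def goodman_min(n):
    """Forced number of monochromatic triangles in a 2-colored K_n,
    using the closed form C(n,3) - floor(n/2 * floor(((n-1)/2)^2))."""
    inner = (n - 1) ** 2 // 4      # floor(((n-1)/2)^2)
    return comb(n, 3) - (n * inner) // 2


def goodman_cases(n):
    """The same quantity from the three-case formula (n = 2u, 4u+1, 4u+3)."""
    if n % 2 == 0:
        u = n // 2
        return u * (u - 1) * (u - 2) // 3
    if n % 4 == 1:
        u = (n - 1) // 4
        return 2 * u * (u - 1) * (4 * u + 1) // 3
    u = (n - 3) // 4
    return 2 * u * (u + 1) * (4 * u - 1) // 3


def goodman_fraction(n):
    """Goodman(N): forced fraction of monochromatic triangles (needs n >= 3)."""
    return goodman_min(n) / comb(n, 3)


fraction_214 = goodman_fraction(214)   # roughly 0.2465
```

The two forms agree for every $N$ tried, and the fraction climbs toward $\frac{1}{4}$ from below as $N$ grows.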
If Mono(G) is much larger than Goodman(N ), then we might say that G obeys a triangle dichotomy, which means that we expect a lot of triangles to be either completely one color, or completely the other.This is a relative term, and the larger Mono(G) is the more that this resembles a true dichotomy.If one color is more heavily represented in G than another, then we might say that G has a triangle bias.When triangle bias exists, this is at odds with the expectation, in a randomly colored graph, that the ratio of the number of color R triangles to color B triangles should be 1 : 1, and is therefore a further indication that the correlations in G are meaningful.
How can this be used in a dataset? In section 4.3, we discuss a best-fit approach in order to test the null hypothesis that a dataset is indeed random.

Models
We will focus our analysis on two datasets: (1) voting records of members of the US congress in 1984, and (2) economic partnership among countries.
4.1. Similarity in voting records. We now look at a set of people that have voted multiple times, specifically the 1984 United States Congressional Voting Records [6].
Goodman's formula can quantify how strong the triangle dichotomy and triangle bias are; that is, the percentage of three person cliques ($B$) and independent triples ($R$) and their deviation from the expected 1 : 1 color ratio. We will use the Hamming distance to measure how similar two voting records are for the 435 congress members of the 1984 US congress.
Definition 7. The Hamming distance of two strings of the same length is the total number of positions where the entries are different.
For example, the Hamming distance between 00010 and 01001 is 3. These strings differ in the second, fourth and fifth spots.
In this session there were 16 separate votes, and to each voter we assign the string of length 16 with entries 'N' (voted nay), 'Y' (voted yea) or 'A' (some other action, such as abstaining). The minimum Hamming distance is 0, which indicates two identical voting records, and the maximum distance is 16, meaning the two voters always voted differently.
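The Hamming distance of Definition 7 takes only a few lines to compute. This is a minimal sketch; the function name `hamming` and the two 16-character voting strings are made up for illustration and are not taken from the dataset.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The worked example from the text.
example = hamming("00010", "01001")

# Two hypothetical voting strings of length 16 over {'Y', 'N', 'A'}.
voter_1 = "YYNNYAYNYYNNYNYA"
voter_2 = "YYNNYAYNNNYYNYNA"
distance = hamming(voter_1, voter_2)
```

Here `example` reproduces the distance of 3 computed above.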
Table 1. As a small example, the first 6 voters have the following strings associated with them. Columns: Voter, Party, Voting string.
Applying this notion of distance to the data from Table 1 gives an adjacency matrix and graph.
How do we turn this into a binary adjacency matrix for classification purposes? In other words, how do we decide what constitutes "similar" voting and "dissimilar" voting, and where do we make that cut off?
Definition 8. Let $(M, d)$ be a metric space with vertex set $M$ and distance $d$. Let $t \ge 0$. Define the two-colored threshold graph $G(t)$ in the space $(M, d)$ by coloring an edge between two vertices $v_i, v_j$ blue ($B$) iff $d(v_i, v_j) > t$, and red ($R$) iff $d(v_i, v_j) \le t$.
Therefore, when for example $t = 5$, the graph $G(t = 5)$ maps to the binary case. For example, taking the sample Congressional voting data from Table 1, we get the threshold graphs in Figures 2, 3 and 4.
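The threshold graph of Definition 8 can be sketched as follows, with the Hamming distance playing the role of $d$. The function names and the three short voting strings are hypothetical and serve only to show the red/blue rule.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def threshold_graph(records, t):
    """Two-colored threshold graph G(t): the pair (i, j) is red ('R')
    iff the Hamming distance is <= t, and blue ('B') otherwise."""
    return {(i, j): ("R" if hamming(records[i], records[j]) <= t else "B")
            for i, j in combinations(range(len(records)), 2)}


# Hypothetical voting strings, not taken from the dataset.
records = ["YYNN", "YYNY", "NNYY"]
coloring = threshold_graph(records, t=1)
```

Raising $t$ turns blue edges red one distance level at a time, until every edge is red once $t$ reaches the maximum distance.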

Consider the complete graph $G(t)$, the subgraph composed entirely of Democrat Congress votes $D(t)$, and the subgraph composed entirely of Republican Congress votes $R(t)$. Applied to the total voting records available in the dataset at various thresholds $t$, Table 2 shows $\mathrm{Mono}(G(t))$, the percentage of monochromatic triangles among all triangles in $K_N$.
In section 5 we give another natural interpretation of these results by giving a measure of how transitive these graphs are. This is perhaps a more intuitive interpretation of the data since it gives us a direct measurement of cooperation and independence.
4.2. How to measure the deviation away from a random graph. Goodman's formula tells us how many monochromatic triangles are forced to exist for a dataset $D$ of size $N$, but what would the threshold graph of a truly random coloring of $G_N$ look like?
4.2.1. Theoretical Construction. Suppose we have a random graph $G(N, t)$ on the $N$ data points in $D$, in which an edge between two vertices $n_i, n_j$ exists with probability $t$. The Erdős-Rényi model tells us that the expected number of edges in $G(N, t)$ is $\binom{N}{2} t$. The parameter $t$ can be thought of as the threshold parameter introduced in section 4.1 as it ranges from 0 to 1 (assuming $t_{min}$ and $t_{max}$ have been normalized to $[0, 1]$). Therefore, the expected numbers of red ($R$) and blue ($B$) triangles $T$ in $D$ are:
$$E[T_R] = \binom{N}{3} t^3, \qquad E[T_B] = \binom{N}{3} (1 - t)^3,$$
where $\binom{N}{3}$ represents the total possible number of triangles, and $t^3$ represents the probability that all three edges are red ($R$); likewise $(1 - t)^3$ represents the probability of all three edges being blue ($B$). This information can be used to calculate the number of monochromatic triangles.
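The expected triangle counts in the random model follow directly from the description above: $\binom{N}{3} t^3$ red and $\binom{N}{3} (1 - t)^3$ blue. A minimal sketch, with function names chosen here for illustration:

```python
from math import comb


def expected_triangles(n, t):
    """Expected (red, blue) triangle counts when each edge is red with
    probability t and blue with probability 1 - t, independently."""
    total = comb(n, 3)
    return total * t ** 3, total * (1 - t) ** 3


def expected_mono_fraction(t):
    """Expected fraction of triangles that are monochromatic."""
    return t ** 3 + (1 - t) ** 3


red, blue = expected_triangles(6, 0.5)   # 2.5 red and 2.5 blue out of 20
```

The expected monochromatic fraction is smallest at $t = \frac{1}{2}$, where it equals $\frac{1}{4}$, and rises to 1 as $t$ approaches 0 or 1.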

Corollary 9. The expected number of monochromatic triangles in $G(N, \frac{1}{2})$ is $2 \binom{N}{3} \left( \frac{1}{2} \right)^3 = \frac{1}{4} \binom{N}{3}$.
Proof. The probability that the 3 edges of a triangle are all one given color in a 2-colored graph is $\left( \frac{1}{2} \right)^3$, there are $\binom{N}{3}$ triangles, and we multiply by 2 to account for the symmetry of how the edges can be colored with equal probability.
This creates a threshold plot for any randomly colored graph $G(N, t)$. Deviation away from the expected distribution allows us to decide whether the null hypothesis $H_0$ (that $G_N$ is actually random) should be rejected. This can be done with a simple $\chi^2$ test. The average of $\chi^2_R$, $\chi^2_B$ and their resulting $p$-value can be used to determine, at some significance level, whether or not to reject $H_0$.
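As a rough illustration of such a test, the sketch below computes a Pearson-style $\chi^2$ statistic, $\sum (O - E)^2 / E$, separately for red and blue triangle counts and then averages the two. The exact form of the statistic is an assumption here, and the observed counts are invented for the example.

```python
from math import comb


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = 20
thresholds = [0.25, 0.5, 0.75]          # normalized thresholds t
total = comb(n, 3)                      # 1140 triangles in K_20

# Expected counts under the random model.
expected_red = [total * t ** 3 for t in thresholds]
expected_blue = [total * (1 - t) ** 3 for t in thresholds]

# Hypothetical observed counts, for illustration only.
observed_red = [30, 160, 500]
observed_blue = [470, 150, 25]

chi2_red = chi_square(observed_red, expected_red)
chi2_blue = chi_square(observed_blue, expected_blue)
chi2_average = (chi2_red + chi2_blue) / 2
```

A large averaged statistic indicates that the observed triangle counts sit far from what a random coloring would produce.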
While the expected value is a good benchmark, it still doesn't answer the more fundamental question of how many monochromatic triangles are present in $G_N$ versus how many are required by Ramsey theory. Comparing against the Ramsey requirement creates a stricter $\chi^2$ calculation, but one that's better suited to our needs and is a measurement of the triangle dichotomy and triangle bias.
4.2.3. Applied to voting threshold graphs. We now apply the $\chi^2$ method from 4.2.2 to the Congressional voting threshold graphs. What is the likelihood that these are random, or equivalently, what is the likelihood that there is a bias in the congressional voting record? We say a deviation is significant if the observed $\chi^2$ is sufficiently different from the $\chi^2$ that measures how the expected value differs from what is required by Ramsey theory. At a significance level of 0.01, the non-significant deviations are underlined. The furthest deviation can be attributed to the Republican congressional voters and is an indication that a strong bias exists in their voting records.
4.3. Collaboration model. Suppose we have a collection of people $V$, working together on a communal project.
As an example we look at economic trading data [13]. Every country is represented by a node, and we add a blue edge from a country to its 5 largest importers and exporters by volume. In this way, two countries are connected by a blue edge if they are historically economically connected and by a red edge if they are smaller trading partners. There is an asymmetry in the way edges are added: for example, China adds at most 10 blue edges to other countries, but many countries add blue edges to China. In this way it is possible for a country to have blue degree much higher than 10. This graph is best described as an Interaction Graph similar to the "friends at a party".
For $N = 214$ countries, the number of monochromatic triangles equals 85.0% of the total number $\binom{N}{3}$ of triangles in a $K_{214}$. These monochromatic triangles are almost entirely red, representing a lack of strong trade relations. This is significantly more than the required number of triangles given by Goodman's formula, which at $N = 214$ is 24.7%. Since this graph has a threshold of only the top 5 trading partners for each country, it can be seen as a discrete sample of the threshold graph that would exist on the scale [$t_{min}$ = top trading partner to $t_{max}$ = all trading partners].
In order to determine if the percentage of monochromatic triangles in this graph can be interpreted as meaningful evidence that the global economy is connected with a strong dichotomy, we need to measure its $p$-value. For $N = 214$ countries, a threshold of $t = 5$ corresponds to $t = 0.0234$ on a normalized scale of $[0, 1]$. When $t = 0.0234$, the expected deviation for the total number of monochromatic triangles from those required by Ramsey theory has a $\chi^2 = 4.236$, whereas the trading graph has a $\chi^2 = 2.907$. The difference between these is 1.329, which corresponds to a $p$-value of 0.248983, which is not statistically significant. We therefore cannot reject the null hypothesis that this trade graph is random.
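The quoted $p$-value can be reproduced under the assumption that the $\chi^2$ difference is compared against a $\chi^2$ distribution with one degree of freedom, whose upper tail is $\mathrm{erfc}(\sqrt{x/2})$. The degrees of freedom are not stated above, so this is a sketch under that assumption, and the function name is illustrative.

```python
from math import erfc, sqrt


def p_value_one_dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom: P(X > chi2) = erfc(sqrt(chi2 / 2))."""
    return erfc(sqrt(chi2 / 2))


expected_chi2 = 4.236      # expected value versus the Ramsey requirement
observed_chi2 = 2.907      # trading graph versus the Ramsey requirement
difference = expected_chi2 - observed_chi2
p = p_value_one_dof(difference)

significant_at_1_percent = p < 0.01
```

With these inputs the tail probability is about 0.249, well above the 0.01 significance level used for the voting data.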
While we cannot reject $H_0$ based on the number of superfluous monochromatic $K_3$'s in the trading data, the presence of higher dimensional complete subgraphs might provide sufficient evidence.
We can compute the percentage of monochromatic $K_4$, and the percentage of monochromatic $K_5$. This is computationally complex, so we computed these percentages for only small $N$.
Table 9. Data for the country graph in section 4.
It is natural to then ask what happens when we consider larger substructures, that is $K_4, K_5, \ldots, K_N$ instead of triangles.
4.3.1. $\chi^2$ for Higher Dimensions. For higher dimensions, there is no analogue of Goodman's formula, which we would expect to give us a percentage of $\frac{1}{32}$ for $K_4$, $\frac{1}{512}$ for $K_5$, etc., using the same methods described in Corollary 9. In [12], Thomason has shown that an upper bound for the corresponding percentage of monochromatic $K_4$ is $\frac{1}{33}$, although it is not known if this is tight. In the same work he gave an upper bound on the percentage of monochromatic $K_m$, as $0.936 \cdot 2^{1 - \binom{m}{2}}$. For the $\chi^2$'s related to larger substructures, Thomason's upper bound can be used in the same way that Goodman's is used for $K_3$, with the understanding that this will give us an upper bound on a graph's deviation from what's required by Ramsey theory. Our new $\chi^2$ is the average of the $\chi^2$ values associated with each $K_m$, for $m = 3, \ldots, N$, and so can include up to $N$-dimensional substructures.
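The quantities used for larger substructures are simple to evaluate. In a uniformly random two-coloring, a given $K_m$ is monochromatic with probability $2 \cdot (\frac{1}{2})^{\binom{m}{2}} = 2^{1 - \binom{m}{2}}$, and the Thomason-type bound scales this by 0.936. The function names below are assumptions made for this sketch.

```python
from math import comb


def random_mono_fraction(m):
    """Probability that a fixed K_m is monochromatic when each edge is
    colored red or blue independently with probability 1/2."""
    return 2.0 ** (1 - comb(m, 2))


def thomason_upper_bound(m):
    """The bound 0.936 * 2^(1 - C(m, 2)) quoted from [12]."""
    return 0.936 * random_mono_fraction(m)


k3 = random_mono_fraction(3)        # 1/4, matching the triangle case
k4 = random_mono_fraction(4)        # 1/32
k5_bound = thomason_upper_bound(5)  # about 0.00183, as in Figure 7
```

Because $\binom{m}{2}$ grows quadratically, these fractions shrink very quickly with $m$, which is one reason counting large monochromatic substructures gives a sharp test.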
If instead we increase the number of colors and therefore allow for more than two classifications, a perfect answer for three colors and triangles is given by [4].

Applications to transitivity
When we have sufficient evidence to reject $H_0$, we define a non-random graph in terms of its transitivity. Transitivity can be thought of as the likelihood that a relationship in a dataset is meaningful and therefore not spurious. Let's again consider the model for the party problem: the nodes are people at a party and we assign a blue ($B$) edge between two people if they are friends (and red ($R$) if they are not friends).

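Definition 11 (Transitivity). A relation $\sim$ on a set $V$ is transitive if, for all $a, b, c \in V$, $a \sim b$ and $b \sim c$ together imply $a \sim c$.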
In this setting, we first remark that the blue "friend" relation is not by-default transitive, and neither is the red "not friend" relation.For example, I am friends with someone who does not know my brother.
It is easy to see that the only way for the red relation to be transitive is if all edges are red in a particular subgraph.Similarly, the blue relation is transitive only if all edges are blue.Typically, such a graph will not be transitive in both relations.
Transitivity can be described in terms of monochromatic triangles: specifically, three vertices $v_i, v_j, v_k$ form a non-transitive triple when the edges between them are not monochromatic. In this way, the percentage of monochromatic triangles in a graph is a measure of how transitive a graph is. In the context of uncolored graphs this has been studied as the clustering coefficient. However, by looking at two-colored graphs, Goodman's formula implies that there is a limit on how non-transitive a graph can be. We know that, asymptotically, at least 0.25 of its triangles must be monochromatic in the case of a 2-colored graph. The higher the observed percentage is than 0.25, the more transitive the graph is, and this can be measured in terms of $\chi^2$.
Let's use this to interpret the results from section 4.1. Suppose we have three democrats $v_i, v_j, v_k$ and we know that $v_i R v_j$ iff $v_j R v_k$; that is, the relationship between $v_i$ and $v_j$ is exactly the same as the one between $v_j$ and $v_k$ (although we don't necessarily know whether both pairs have an edge or neither does).
We ask: how likely is it that the relationship between $v_i$ and $v_j$ is the same as the one between $v_i$ and $v_k$, i.e. that the triangle is transitive?
Theorem 12. Let $G$ be a complete graph on $N$ vertices whose edges are colored red ($R$) or blue ($B$). The percentage of monochromatic paths of length 2 that complete to a monochromatic triangle is measured by
$$\frac{3 f(G)}{\binom{N}{3} + 2 f(G)},$$
where $f(G)$ is the number of monochromatic triangles in $G$.
Proof. This quantity comes from the observation that every monochromatic triangle contains three monochromatic paths of length 2, but each non-monochromatic triangle contains precisely one monochromatic path of length 2. For ease of computation we use that (the number of non-monochromatic triangles) + 3×(the number of monochromatic triangles) is $\left( \binom{N}{3} - f(G) \right) + 3 f(G) = \binom{N}{3} + 2 f(G)$, since $\binom{N}{3}$ is the total number of triangles. Thus, $\binom{N}{3} + 2 f(G)$ is the total number of monochromatic paths of length 2 in $G$, since this counts every non-monochromatic triangle once and counts every monochromatic triangle three times. Of these paths, $3 f(G)$ lie in a monochromatic triangle.
By using Goodman's formula, this observation above translates to the following (completely expected) result:
Proposition 13. Let $G$ be a graph with $N$ vertices and edge-colored with red and blue. The ratio of monochromatic paths of length 2 in $G$ that are part of a monochromatic triangle is asymptotically at least 0.5.
The observation above provides an efficient way to compute the ratio of monochromatic paths in G that are part of a monochromatic triangle.We, for example, don't need to count the number of monochromatic paths directly.
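The counting argument in the proof gives the ratio $3 f(G) / (\binom{N}{3} + 2 f(G))$ directly from the number of monochromatic triangles. The sketch below evaluates it; the function name is an assumption, and the 85.0% input is the monochromatic percentage reported for the trading graph in section 4.3.

```python
from math import comb


def transitivity_ratio(n, mono_triangles):
    """Fraction of monochromatic paths of length 2 that lie in a
    monochromatic triangle: 3f / (C(n,3) + 2f)."""
    total = comb(n, 3)
    return 3 * mono_triangles / (total + 2 * mono_triangles)


n = 214
total = comb(n, 3)

# Trading graph: 85.0% of all triangles are monochromatic.
trade_ratio = transitivity_ratio(n, 0.85 * total)

# A graph with exactly one quarter of its triangles monochromatic.
quarter_ratio = transitivity_ratio(n, 0.25 * total)
```

An input of 85.0% gives a ratio of about 94.4%, and an input of exactly one quarter gives 0.5, the asymptotic floor of Proposition 13.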

Application to previous examples.
5.1.1. Application to voting records. In the case of the threshold graphs from section 4.1, the threshold graph $G(t)$ with the minimum "transitivity" percentage is precisely the threshold graph with the minimum number of monochromatic triangles, namely $t = 9$ (52.7%). Analogously, for $D(t)$ this occurs at $t = 7$ (66.6%) and for $R(t)$ this occurs at $t = 5$ (72.0%).
For the trading graph of section 4.3, using all countries, 94.4% of all monochromatic paths of length 2 are completed by an edge of the same color. This is well above the 50% guaranteed by Proposition 13. Again, a complication is introduced by only looking at one threshold level rather than calculating the entire $\chi^2$.

Conclusions and questions
We now make two major calls to use these methods: applications and development of related theory.
6.1. Theory building. This use of Goodman's formula suggests the need for other quantitative Ramsey statements. For higher dimensional objects, we mention a couple that already exist and some that have yet to be developed.
A recent survey of Ramsey bounds for hypergraphs is a useful place to see the current best known bounds for various Ramsey numbers [8]. This survey also goes through proof sketches, many of which contain a weak Goodman-style lower bound. These bounds typically come from a use of the probabilistic method (see for example [1]).
In general, the probabilistic bounds provide a first non-trivial upper bound on the percentage of monochromatic structures, and improving them can be difficult. In order to use Ramsey theory in a generalized way, a closed form analogous to Goodman's formula needs to be developed for all $K_n$ subgraphs and all $C_n$-colored graphs.
6.2. Further applications. The case of triangles is simple, but still captures the quantitative notion of transitivity of a relation. Additionally, counting the number of monochromatic triangles in a graph is computationally efficient.
Further progress could be motivated by finding interpretations for other quantitative Ramsey statements. One example is a quantitative version of Van der Waerden's theorem for a fixed length. That is, given a 2-coloring of the points $\{1, 2, \ldots, 9\}$ it is known that there must be at least one arithmetic progression of length 3 (i.e. $a_0, a_0 + m, a_0 + 2m$) where all points are the same color. The following question has a reasonable answer in [11], which has serious mathematical content:
Question 14. For $N$ sufficiently large, give reasonable lower bounds and upper bounds on the percentage of monochromatic 3-term progressions that must exist for any 2-coloring of $\{0, 1, 2, \ldots, N\}$.
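The statement about 2-colorings of $\{1, 2, \ldots, 9\}$ is small enough to confirm by exhaustive search. The following sketch checks every coloring; the function names are illustrative assumptions, and the approach only scales to short intervals since it visits all $2^n$ colorings.

```python
from itertools import product


def has_mono_3ap(coloring):
    """True if some 3-term arithmetic progression a, a+d, a+2d (d >= 1)
    has all three positions the same color."""
    n = len(coloring)
    for a in range(n):
        for d in range(1, (n - 1 - a) // 2 + 1):
            if coloring[a] == coloring[a + d] == coloring[a + 2 * d]:
                return True
    return False


def every_coloring_has_mono_3ap(n):
    """True if every 2-coloring of n consecutive integers contains a
    monochromatic 3-term arithmetic progression."""
    return all(has_mono_3ap(c) for c in product((0, 1), repeat=n))


forced_at_9 = every_coloring_has_mono_3ap(9)
forced_at_8 = every_coloring_has_mono_3ap(8)

# A coloring of 8 points with no monochromatic 3-term progression.
witness = (0, 0, 1, 1, 0, 0, 1, 1)
```

Nine points force a monochromatic progression while eight do not, which mirrors the role that $N = 6$ plays for triangles.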
For 4-term progressions, see [14] and the strengthening [7].Both of these are non-trivial results.
The next step is to interpret 3-term progressions (or 4-term progressions) in a dataset in a meaningful, physical way.
6.3. Closing remarks. We believe that the connections between data science and Ramsey theory are still largely unmade and will prove to be profound. We have shown that Ramsey theory can be used to rigorously define spurious correlations in datasets, and how deviations from the number of required spurious correlations might be meaningful in terms of transitivity.
On behalf of all authors, the corresponding author states that there is no conflict of interest.

Figure 1. A graph where only the edges $a$, $b$, $c$ and $d$ exist. The dashed lines are used to indicate a lack of edge.

Figure 6. Countries are arranged alphabetically starting at the top and going counterclockwise. The green nodes are the G7 and G20 countries. The graph has 214 vertices, 1363 blue edges, the average blue degree is 12.7, the five highest blue degrees are 162 (China), 125 (United States), 96 (Germany), 66 (France) and 61 (Italy). The largest complete subgraph has 8 vertices: Algeria, China, France, Germany, Italy, Spain, United Kingdom, United States, forming a $K_8$. The largest independent set has 70 vertices, forming an $I_{70}$.

Figure 7. The expected number of monochromatic $K_4$s and $K_5$s as a function of $t$. The Goodman-type upper bound for $K_4$ is 0.0295, and 0.00183 for $K_5$.

Table 2. The percentage of monochromatic triangles for various threshold graphs. The minimum values are boxed.

Table 3. The $\chi^2$ fit for the overall voting record $G(t)$, Democrats $D(t)$,

Table 4. The $\chi^2$ fit for the overall voting record $G(t)$, Democrats $D(t)$, and Republicans $R(t)$ by color ($R$, $B$). This demonstrates the degree of the triangle bias for each pre-defined classification. These $\chi^2$ values have $p$-values that are extremely small. A way to place these in context is to compare them to the expected value's deviation from what's required by Ramsey theory:

Table 5. The $\chi^2$ fit for the overall expected value of forced monochromatic triangles.

Table 6. The deviation of $\chi^2$ of $G(t)$, $D(t)$, and $R(t)$ from their respective expected $\chi^2$ values.

Table 10. Transitivity numbers for the threshold graphs.