# Assessing Spurious Correlations in Big Search Data

## Abstract


## 1. Introduction

## 2. Materials and Methods

## 3. Results

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

**Figure 2.** Screenshot of identified spurious correlation from Google Correlate. Gamma1 is a randomly generated cross-state distribution resulting from draws from the gamma (1, 1) distribution.

**Figure 3.** Screenshot of identified spurious correlation from Google Correlate. RandWalkWK9 is a random walk variable, built by cumulatively adding successive standard normal draws to form a weekly time series. The weekly time series runs from January 2004 through January 2016.
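The two generated variables described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function names, the choice of 50 states, and the Sunday week-start date are assumptions introduced here, and only the gamma (1, 1) draws, the standard normal increments, and the January 2004 to January 2016 window come from the captions.

```python
import random
from datetime import date, timedelta


def gamma_cross_state(n_states=50, rng=None):
    """One gamma(1, 1) draw per state (n_states=50 is an assumption)."""
    rng = rng or random.Random()
    # gammavariate(alpha, beta): shape alpha = 1, scale beta = 1
    return [rng.gammavariate(1.0, 1.0) for _ in range(n_states)]


def gaussian_random_walk(start=date(2004, 1, 4), end=date(2016, 1, 31), rng=None):
    """Weekly random walk: each week adds one standard normal draw.

    The exact start and end dates are hypothetical; the caption only
    gives January 2004 through January 2016.
    """
    rng = rng or random.Random()
    series = []
    level = 0.0
    week = start
    while week <= end:
        level += rng.gauss(0.0, 1.0)  # successive standard normal increment
        series.append((week, level))
        week += timedelta(days=7)
    return series


if __name__ == "__main__":
    rng = random.Random(9)  # arbitrary seed, for reproducibility only
    states = gamma_cross_state(rng=rng)
    walk = gaussian_random_walk(rng=rng)
    print(len(states), "state values;", len(walk), "weekly values")
```

Because the walk accumulates its increments, neighbouring weeks are strongly dependent, which is the property that distinguishes it from the independent cross-state draws.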

**Figure 4.** Density of maximum spurious correlations: gamma (1, 1), spatially correlated, Gaussian random walk, and mean-reverting normal distributions. Note: each distribution is censored at r = 0.60, which excludes some experiments from the figure (see Table 3 for the portion of experimental runs that were censored).

Gamma Run (top 90 results with maximum correlation 0.72 to minimum correlation 0.65): whistling, ron jones, red ticking, purdy, james alan, auburn golf, city of mount vernon, maximilien, weather mount vernon, eastgate park, tucker park, pine box, richard pope, nancy stewart, auburn theater, liquid lime, rock orchestra, state abortion laws, hunter tree, elma,, amazon grocery, burger master, state adoption, foley library, diagnoser, stanley and, lynnwood apartments, state congress, college running, baker lab, motor trucks, state polls, mount vernon zip, the rainier, scan tv, callison, hope place, ivan the gorilla, hooverville, auburn medical center, weight loss for life, pignataro, funtasia, ballard, gates hall, days inn auburn, elma, pi, weather in mount vernon, ken hutcherson, 5 tv, genealogy search engine, state congressional districts, state rivers, 1077, ups university, capital mall, mill creek, phinney, idiot pilot, lakewood cinema, center laser, narrows bridge, white center, the airlock, emerald ridge, bainbridge high, bainbridge high school, avacyn restored spoilers, healthfinder, small works, the mural, state pta, the other coast, the patriarchs, mount vernon police department, row to hoe, home lodge, bonney, evergreen medical center, treehouse for kids, bellevue high school, three dollar bill, james g, reid realty, the family pet, figgy, bellevue high, teneriffe, egg nest.

Random Walk Run (top 90 results with maximum correlation 0.9532 to minimum correlation 0.9353): inmate locator, chase, period calculator, best wordpress, 26 weeks pregnant, 14 weeks pregnant, 29 weeks pregnant, view text messages, jail inmates, 15 weeks pregnant, nyc midtown, 18 weeks pregnant, wordpress page, landers mclarty, chances of getting pregnant, 33 weeks pregnant, pain during pregnancy, hard to get, on a mac, wordpress admin, mucus in stool, weeks pregnant, miami dade inmate, franklin tn, 32 weeks pregnant, clip in hair, how to text, madison heights mi, email to text, 25 weeks pregnant, do girls like, skype history, 33165, clip in hair extensions, find my cell, songs like, what song goes, could i be pregnant, what is a good, like a guy, lansing mi, macbook pro screen, 33186, m and t, find my cell phone, best pdf, 23 weeks pregnant, clip in, plugin for wordpress, dade inmate search, your high, canton mi, fg, 19 weeks pregnant, allen tx, girl you like, miami dade inmate search, okemos mi, gluten free?, uitableview, what does te, acoustic chords, county jail inmates, chase bank in, birds barbershop, in charlotte nc, chico ca, what is the easiest, pregnant symptoms, xps to pdf, altamonte springs fl, dream mean, during pregnancy, good name, how far along am i, 31 weeks pregnant, how far along am, lls, livonia mi, chase on, restore to factory, a pregnancy test, move wordpress, in memphis tn, artists like, how do you tell, grand rapids mi, jquery scroll, kp.org, frederick md.

**Table 2.** Maximum spurious correlations in simulations testing 1,000,000 pairs of random variables following specified distributions.

| | Uniform | Normal | Gamma (1, 1) | Spatial | Random Walk | Mean Reverting |
|---|---|---|---|---|---|---|
| Uniform | 0.66 | | | | | |
| Normal | 0.62 | 0.63 | | | | |
| Gamma (1, 1) | 0.59 | 0.61 | 0.80 | | | |
| Spatial | 0.68 | 0.62 | 0.62 | 0.73 | | |
| Random Walk | 0.59 | 0.64 | 0.62 | 0.59 | 0.98 | |
| Mean Reverting | 0.61 | 0.63 | 0.66 | 0.61 | 0.88 | 0.82 |
| Max Overall | 0.68 | 0.64 | 0.80 | 0.73 | 0.98 | 0.88 |
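A simulation of the kind summarized in Table 2 might look like the sketch below: draw many independent pairs of random vectors, compute the Pearson correlation of each pair, and keep the largest value. The helper names, the vector length of 50, and the use of the signed (rather than absolute) correlation are assumptions made for illustration; the source states only that 1,000,000 pairs were tested for each combination of distributions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def normal_vector(rng, n=50):
    """n independent standard normal draws (n=50 is an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def uniform_vector(rng, n=50):
    """n independent uniform(0, 1) draws (n=50 is an assumption)."""
    return [rng.random() for _ in range(n)]


def max_spurious_correlation(draw_x, draw_y, n_pairs, rng):
    """Largest correlation seen across n_pairs independent random pairs."""
    best = -1.0
    for _ in range(n_pairs):
        r = pearson(draw_x(rng), draw_y(rng))
        best = max(best, r)
    return best


if __name__ == "__main__":
    rng = random.Random(2016)  # arbitrary seed, for reproducibility only
    # 10,000 pairs keeps the pure-Python run short; Table 2 used 1,000,000.
    print(max_spurious_correlation(normal_vector, uniform_vector, 10_000, rng))
```

Because the two vectors in every pair are independent by construction, any large value returned is spurious. The maximum can only grow as more pairs are tested, which is why the number of comparisons matters when reading the table.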

**Table 3.** Frequency of spurious correlations identified by Google Correlate by distribution of random variable.

| Probability Distribution from Which Random Variables Were Drawn | N of RVs | Portion of Random Variables with Spurious Correlation > 0.6 | Mean Largest Correlation per Variable (Standard Deviation) | 95th Percentile of Largest Spurious Correlation across Variables | Largest Correlation Found across Variables |
|---|---|---|---|---|---|
| Spatial | 500 | 68% | 0.66 (0.04) | 0.72 | 0.78 |
| Gamma (1, 1) | 600 | 97% | 0.71 (0.06) | 0.82 | 0.91 |
| Std. Normal | 499 | 33% | 0.63 (0.02) | 0.66 | 0.71 |
| Uniform | 500 | 22% | 0.63 (0.02) | 0.64 | 0.70 |
| Mean-Reverting | 500 | 76% | 0.69 (0.06) | 0.78 | 0.85 |
| Random Walk | 500 | 99% | 0.87 (0.08) | 0.97 | 0.98 |

