3.1. Correlated Data: Matched K-Tuples, K ≥ 2
Empirical evidence indicates that pairs (i.e., k = 2) are historically the most common matched k-tuple. In addition, the workhorse of classical statistics is linear regression, which furnishes a technique for estimating means and, for example, differences between two means. Consider the regression equation:
Y = μ1 + β(G_1 − G_2) + ε,

where 1 is an n-by-1 vector of ones, G_j is an n-by-1 indicator variable whose cell entries are 1 if an observation is in group j (j = 1, 2) and 0 otherwise, μ_1 = μ + β, μ_2 = μ − β, the entries in vector Y are sequential pairs (i.e., y_i and y_{i+1} are matched, where i is an odd consecutive integer index), and ε ~ N(0, Iσ²). Here the observations covariance structure matrix V is given by

V = σ² I_{n/2} ⊗ [[1, ρ], [ρ, 1]],

which is an n-by-n block-diagonal matrix with n/2 2-by-2 matrices on its diagonal, where ⊗ denotes the Kronecker product matrix operation, and I_{n/2} denotes an (n/2)-by-(n/2) identity matrix. Because the quantity of interest involves the variance of the difference between (rather than the sum of) two random variables, matrix V is of the form

V = σ² I_{n/2} ⊗ [[1, −ρ], [−ρ, 1]].

Exploiting the block-diagonal form to write this latter matrix avoids Kronecker products:

V = σ² diag(B, B, …, B),

where the diagonal blocks B are of dimension k-by-k. The generalized least squares estimate of β is b, the corresponding element of (Xᵀ V⁻¹ X)⁻¹ Xᵀ V⁻¹ Y, with a variance, σ_b², given by the corresponding diagonal element of σ²(Xᵀ V⁻¹ X)⁻¹. The resulting t-test statistic is the customary t = b/s_b, an outcome corroborated by the preceding mathematical statistics theorem for the difference between two random variables. The number of zeroes in both the inverse covariance matrix and its square root is n(n − 2); in other words, its network density is 100/(n − 1)%; this quantity generalizes to 100(k − 1)/(n − 1)%, where k ≥ 2 is the number of matched observations (e.g., repeated measures).
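The matched-pairs GLS calculation can be sketched numerically; the sample size, ρ, and coefficient values below are illustrative assumptions, not quantities from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho, sigma = 40, 0.6, 1.0   # n/2 = 20 matched pairs; all values assumed for illustration

# Block-diagonal covariance V = sigma^2 * I_{n/2} (Kronecker) [[1, rho], [rho, 1]]
block = np.array([[1.0, rho], [rho, 1.0]])
V = sigma**2 * np.kron(np.eye(n // 2), block)

# Design for Y = mu*1 + beta*(G1 - G2) + eps, with pairs stored consecutively
g = np.tile([1.0, -1.0], n // 2)            # +1 in group 1, -1 in group 2
X = np.column_stack([np.ones(n), g])

# Simulate correlated pairs, then estimate (mu, beta) by generalized least squares
y = 5.0 + 0.5 * g + np.linalg.cholesky(V) @ rng.standard_normal(n)
V_inv = np.linalg.inv(V)
XtVX = X.T @ V_inv @ X
b = np.linalg.solve(XtVX, X.T @ V_inv @ y)  # GLS estimates (mu, beta)
var_beta = np.linalg.inv(XtVX)[1, 1]        # sampling variance of the beta estimate

# For this balanced design, the GLS beta estimate is half the mean pair difference,
# and its variance reduces to sigma^2 * (1 - rho) / n
assert np.isclose(b[1], 0.5 * (y[g == 1].mean() - y[g == -1].mean()))
assert np.isclose(var_beta, sigma**2 * (1 - rho) / n)
```

The closing assertions illustrate why the resulting t statistic coincides with the customary paired t-test: the GLS slope is a rescaled mean pair difference with variance shrunk by the factor (1 − ρ).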
For this correlated observations case (the Liu-Liang exchangeability category), the effective sample size is

n* = n/[1 + (k − 1)ρ],

with the positive definite restriction of −1/(k − 1) < ρ < 1/(k − 1); this interval is (−1, 1) for k = 2, (−1/2, 1/2) for k = 3, and continues to shrink toward (0, 0) as k increases. In other words, a sufficiently large number of repeated measures, k, results in n* ≈ n/k = k·n_o/k = n_o, the number of independent observations, often assumed in this situation by default. Regardless, if ρ = 0, then n* = n, the case for the difference of two means for two independent samples with a common variance. If ρ = 1, then, for matched pairs, n* = n/2, the case for the difference of two means for two matched/paired samples with a common variance. Because the ideal k = 2 study involves the same n/2 observations with a pair of repeated measures (e.g., before-after), presumably ρ never should be negative. Andrews and Herzberg ([17]; https://www.york.ac.uk/depts/maths/histstat/pml1/r/andrews.htm) and the R Project (https://app.quadstat.net/dataset/r-dataset-package-kmsurv-twins) furnish empirical examples of this type of data with their respective Table 13.1 and twins dataset. The R Project twins dataset records female-male twins' ages at death (n = 24, r = 0.83). Table 13.1 furnishes same-day measures of wind direction made in the morning and in the evening at a given location (n = 210, r = 0.38, density = 0.5%). The former t-test statistic increases from 0.74 (ignoring observational correlation; i.e., pseudo-replications [2]) to 1.67 (accounting for observational correlation), and the latter increases from 2.15 (ignoring observational correlation) to 2.74 (accounting for observational correlation).
Figure 1 portrays these two cases. The respective n* values here are roughly 13 (rather than 24) and 152 (rather than 210). As an important aside, the variances in Figure 1a suggest the potential for nonconstant variance; however, Pitman's correlated variance t-test confirms that the two sample variances (i.e., 103.06 and 61.72; 66.11 pooled variance) are not significantly different (t = 1.47, df = 10).
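The two n* values quoted for Figure 1 can be reproduced from the reported sample correlations; a minimal check, assuming the design-effect form n* = n/[1 + (k − 1)ρ], which for matched pairs (k = 2) reduces to n/(1 + ρ):

```python
def effective_sample_size(n: int, rho: float, k: int = 2) -> float:
    """Equicorrelation effective sample size n* = n / [1 + (k - 1) * rho] (assumed form)."""
    return n / (1 + (k - 1) * rho)

# Twins data: n = 24, r = 0.83 -> n* is roughly 13 rather than 24
assert round(effective_sample_size(24, 0.83)) == 13

# Wind-direction data: n = 210, r = 0.38 -> n* is roughly 152 rather than 210
assert round(effective_sample_size(210, 0.38)) == 152
```

Both reported effective sample sizes fall out of the reported correlations directly.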
This preceding conceptualization resembles that of repeated measures analysis of variance (ANOVA), which underscores how correlated observations affect variance calculations. That approach assumes a compound symmetric observations covariance matrix, which is consistent with positing a constant ρ. This implementation produces the following ANOVA tabulation for the twins dataset (assuming an equicorrelation covariance matrix):
| Source | df | Mean Squared Error | Correlated Data F-Ratio | Correlated Data Probability | Pseudo-Replication F-Ratio | Pseudo-Replication Probability |
| Between | 1 | 45.38 | 2.79 | 0.12 | 0.55 | 0.47 |
| Within | 22 | 82.39 | | | | |
| Error | 11 | 16.28 | | | | |
| Twins | 11 | 66.11 | | | | |
| Total | 23 | | | | | |
The F-ratios here are the respective squared values of their preceding paired t-test statistics. This perspective asserts a sample size of n = 24, paralleling the preceding linear regression conceptualization. The network density in this case study is 4.3%.
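The claim that these F-ratios are the squared paired t statistics can be verified directly from the values reported above:

```python
# Paired t statistics from the twins example above
t_correlated, t_pseudo = 1.67, 0.74

# Squaring them recovers the two F-ratios in the ANOVA tabulation
assert round(t_correlated**2, 2) == 2.79   # "Correlated Data" F-ratio
assert round(t_pseudo**2, 2) == 0.55       # "Pseudo-Replication" F-ratio
```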
One popular practice of observations matching involves monozygotic (i.e., identical) twins, especially those raised in the same shared environment rather than apart; in this situation, the correlation between, for example, IQ scores is approximately 0.86 [18]. Another familiar correlated data situation of this type is family members constituting a household, because this assemblage of people engages in many activities as a group. A weaker matching practice pairs treatment and control observations, where similarity on only a relatively small set of selected attributes sanctions the creation of artificial pairs. Andrews and Herzberg [17] include correlated data examples, such as artificial pairings (Tables 21.1, 27.2, 33.1, and 39.1) and same-observation repeated measures (Tables 24.1, 25.1, 28.3, 35.1, and 41.1). Hand et al. ([19]; https://www2.stat.duke.edu/courses/Spring03/sta113/Data/Hand/Hand.html) include dozens of correlated data examples, ranging from the strengths of chemical pastes (Table 16), cork deposits by the compass direction sides of trees (Table 55), before-after patient treatment (Tables 72, 202, and 285), house insulation contrasts (Tables 88 and 93), linear and road distance separating locations (Table 115), and chemical and magnetic measurements of iron in slag specimens (Table 132), to brothers' head sizes (Table 111) and heights of husbands and their wives (Table 231).
Repeated measures ANOVA extends the paired t-test to more than two means; in other words, k > 2. The utility of this correlated sample ANOVA technique is that it effectively removes extraneous variability attributable to pre-existing observational differences. The simplest block-diagonal version of the inter-observations correlation matrix for three means is as follows:
I_{n/3} ⊗ [[1, ρ, ρ], [ρ, 1, ρ], [ρ, ρ, 1]],

where n always is a multiple of three, and which, as noted in the preceding discussion, is consistent with assuming a compound symmetric observations covariance matrix. For illustrative purposes, consider the distinct (i.e., triplet order is unimportant) primitive (i.e., the greatest common divisor is 1) Pythagorean quadruplets of Number Theory, each of which is a sum of a triplet (i.e., a, b, and c) of squared integers that equals another squared integer (i.e., d): a² + b² + c² = d², 0 < a ≤ b ≤ c, where the count of such numbers is 347 for a ≤ b ≤ c ≤ d ≤ 100 ([20]; the dataset utilized here was extracted from https://pastebin.com/FGwtqsrs and then screened to delete non-distinct and non-primitive entries). This set of integers {a, b, c} constitutes matched triplets. Box-Cox transformations for normality, coupled with division by appropriate quantities to adjust for unequal variances attributable to the ordering of and constraints on a, b, and c, yield the analysis variables tra, trb = (b − 2)^0.75/3.2, and trc = (c − 2)^1.25/43.1. The resulting correlated sample variances are not statistically significantly different: s²_tra ≈ 1.61², s²_trb ≈ 1.61², and s²_trc ≈ 1.60². Furthermore, these three transformed integers' pairwise correlations are r_tra,trb = 0.52, r_tra,trc = 0.22, and r_trb,trc = 0.45. The null hypothesis of interest here is:

H₀: μ_tra = μ_trb = μ_trc.
The associated ANOVA tabulation is as follows, assuming equicorrelation with ρ̂ = 0.40 and a common variance (i.e., an assumption implied by the lack of three significantly different sample variances):
| Source | df | Mean Squared Error | Correlated Data F-Ratio | Correlated Data Probability | Pseudo-Replication F-Ratio | Pseudo-Replication Probability |
| Between | 2 | 19.53 | 7.55 | <0.001 | 12.51 | <0.001 |
| Within | 1038 | 2.59 | | | | |
| Error | 692 | 1.56 | | | | |
| Triplets | 346 | 1.03 | | | | |
| Total | 1040 | | | | | |
These findings imply that the means are different in the population, based upon the sample means of x̄_tra ≈ 3.30, x̄_trb ≈ 3.76, and x̄_trc ≈ 3.43; this outcome highlights the current concern about distinctions between statistical and substantive differences [21]. The network density for this case study is 0.2%, and n* = 450; as noted previously, to ensure a positive-definite covariance matrix, the equicorrelation assumption restricts ρ in this repeated measures triplets case (i.e., k = 3) to the interval (−1/2, 1/2).
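The distinct primitive Pythagorean quadruplets underlying this example can be enumerated by brute force; a minimal sketch (the function name is illustrative, and the screening mirrors the distinct/primitive criteria stated above):

```python
from math import gcd, isqrt

def primitive_quadruplets(limit):
    """Distinct solutions of a^2 + b^2 + c^2 = d^2 with 0 < a <= b <= c and
    d <= limit, screened so that gcd(a, b, c, d) = 1 (i.e., primitive)."""
    found = []
    for a in range(1, limit + 1):
        for b in range(a, limit + 1):
            for c in range(b, limit + 1):
                s = a * a + b * b + c * c
                d = isqrt(s)
                if d <= limit and d * d == s and gcd(gcd(a, b), gcd(c, d)) == 1:
                    found.append((a, b, c, d))
    return found

quads = primitive_quadruplets(100)
assert (1, 2, 2, 3) in quads                                 # 1 + 4 + 4 = 9
assert all(a*a + b*b + c*c == d*d for a, b, c, d in quads)   # every tuple checks out
```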
In summary, the main point of this section is that repeated measures constitute a genre of correlated data whose regression residual specification is exactly the same in form as those of the other categories of correlated data. Another is that, for sufficiently large k, this genre disappears from the Liu-Liang classification.
3.2. Correlated Data: Temporal Autocorrelation
Here, for a response variable vector Y whose elements are organized in ascending time order, matrix V includes the time structure matrix C_T, where n = T, and I_T is a T-by-T identity matrix. Now the effective sample size is

n* = n²/[n + 2Σ_{j=1}^{n−1}(n − j)ρʲ],

and the number of zeroes in the basic observations covariance structure matrix C_T is (n − 1)². Again, if ρ = 0, then n* = n, whereas if ρ = 1, then n* = 1. Because a time series observation only interacts with its preceding observation(s), one-dimensional and one-directional temporal autocorrelation often is very strong and usually positive. The number of zeroes in the inverse covariance matrix is (n − 1)²; in other words, its network density is 100/n%.
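The time-series effective sample size can be evaluated numerically; assuming the closed form n* = n²/[n + 2Σ_{j=1}^{n−1}(n − j)ρʲ], the two limiting cases behave as stated:

```python
def effective_sample_size_ts(n: int, rho: float) -> float:
    """Time-series effective sample size n* = n^2 / [n + 2 * sum_{j=1}^{n-1} (n - j) rho^j]
    (an assumed closed form, consistent with the rho = 0 and rho = 1 limits in the text)."""
    s = sum((n - j) * rho**j for j in range(1, n))
    return n**2 / (n + 2 * s)

assert effective_sample_size_ts(35, 0.0) == 35          # rho = 0: n* = n
assert round(effective_sample_size_ts(35, 1.0)) == 1    # rho = 1: n* = 1
# Strong positive autocorrelation collapses the information content
assert round(effective_sample_size_ts(35, 0.95)) == 2
```

With n = 35 and ρ = 0.95 (the whooping crane values reported below), the formula rounds to an effective sample size of only 2.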
Hand et al. [19] include Table 121, reporting annual whooping crane counts for 35 years (1938–1972; Figure 2a; n = 35, ρ̂ = 0.95, density = 2.9%). For this example, the sample variance estimate decreases from 12.3² (ignoring observational correlation) to 4.6² (accounting for observational correlation), and n* = 2. Andrews and Herzberg [17] furnish Table 62.1, time series data tabulated for the quarterly size of the pig herd in the United Kingdom between 1967 and 1978 (Figure 2c; n = 48, ρ̂ = 0.94, density = 2.1%). For this example, again n* = 2. Both of these publicly available collections of relatively small datasets furnish numerous additional examples of time series, some with weaker temporal autocorrelation (e.g., Table 8.1 in Andrews and Herzberg [17], days between coal mining disasters, ρ̂ = 0.31), and some with the relatively rare case of negative temporal autocorrelation (e.g., Table 280 in Hand et al. [19], geyser eruption timings, ρ̂ = −0.70). Because these specimen dataset sources are dated, many of their time series can be augmented with additional, more recent observations (e.g., see https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/whooping-crane-2010.html for an extension of the whooping crane time series to 2007; this data source contains a few discrepancies vis-à-vis Table 121).
Today, numerous time series datasets exist, some dating back several decades. Both Andrews and Herzberg [17] and Hand et al. [19] include annual lynx trapping counts (respectively, Tables 3.1 and 109) and monthly sunspots (respectively, Tables 11.1 and 112). Andrews and Herzberg [17] also include annual lynx pelt counts and prices (Table 3.2), body temperature measurements (Tables 48.1 and 48.4), monthly ozone thickness measures (Table 12.1), the Earth's annual rotation change angles (Table 20.1), hourly, daily, and weekly rainfall quantities (Tables 13.2, 14.1, and 15.1), and monthly employment (Tables 65.1–65.4) and unemployment (Table 64.1) figures. Meanwhile, Hand et al. [19] also include weekly and monthly sales figures (Tables 245 and 107), monthly airline passenger counts (Table 113), university enrollment figures (Table 116), daily rainfall quantities (Table 157), monthly temperature measures (Table 341), monthly lung cancer death counts (Table 326), and various sequences of time intervals (Tables 160, 234, and 255).
In summary, the main point of this section is a reinforcement of the notion that time series constitute correlated data.
3.3. Correlated Data: Spatial Autocorrelation
Analogous to time series, spatial series also constitute a category of correlated data [22]; Cressie [23] furnishes an excellent overview of the common models used to describe this variety of data. A spatial data series does not necessarily have a discernible pattern in its observations' covariance matrix (see Reference [24], reprinted in Hand et al. [19], Table 270; [25]). If a landscape is linear, then this covariance matrix includes the factor

I_n − ρC_s,

where I_n is an n-by-n identity matrix, and the right-hand matrix is the spatial weights matrix C_s; this matrix has (n − 1)(n − 2) zeroes, and is reminiscent of the time series covariance matrix structure. A P-by-Q (i.e., n = PQ for a complete rectangular grid with P rows and Q columns of polygons) regular square tessellation, such as that associated with a remotely sensed or other digital image whose data form a pixel mesh (i.e., raster data), has as its basic structure matrix

I_P ⊗ I_Q − ρ(I_P ⊗ C_Q + C_P ⊗ I_Q),

where a right-hand side matrix subscript denotes the dimension of the square matrix to which it is attached; matrices C_Q and C_P are of the preceding linear case form [26]. For this specification, the number of zeroes is n² − 5n + 2(P + Q). Meanwhile, for a planar surface partitioned into n mutually exclusive and collectively exhaustive polygons, the maximum number of zeroes is n² − 13n + 24; its density has a minimum of 200/n% and a maximum of 600(n − 2)/[n(n − 1)]%, with its calculation given by 100·1ᵀC_s1/[n(n − 1)]%.
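The zero count n² − 5n + 2(P + Q) can be checked numerically for a small grid, assuming the basic structure matrix combines the identity with the rook-adjacency Kronecker sum:

```python
import numpy as np

def path_adjacency(m: int) -> np.ndarray:
    """Adjacency matrix of m sites along a line (the linear-landscape case)."""
    C = np.zeros((m, m), dtype=int)
    i = np.arange(m - 1)
    C[i, i + 1] = C[i + 1, i] = 1
    return C

P, Q = 4, 5
n = P * Q
# Structure-matrix pattern: identity plus the rook Kronecker sum I_P (x) C_Q + C_P (x) I_Q
S = (np.eye(n, dtype=int)
     + np.kron(np.eye(P, dtype=int), path_adjacency(Q))
     + np.kron(path_adjacency(P), np.eye(Q, dtype=int)))

zeroes = int((S == 0).sum())
assert zeroes == n**2 - 5 * n + 2 * (P + Q)   # 318 zeroes for a 4-by-5 grid
```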
The preceding time series discussion reveals that the popular description of temporal autocorrelation involves a second-order covariance matrix for observations. This also is the case for spatial series (e.g., the simultaneous autoregressive [27] and autoregressive response specifications). Frequently, the basic correlation structure matrix for geospatial data is row-standardized (i.e., a rescaling such that each row sum of its cell entries, which all are non-negative, is one), converting it from matrix C_s to matrix W_s. One intuitively appealing feature of this modification is that the maximum positive spatial autocorrelation parameter ρ value always is 1. Furthermore, for the examples presented in this section, the definition of geographic neighbor is an areal unit adjacency based upon a shared non-zero length polygon boundary (i.e., the rook definition, utilizing an analogy with chess piece moves). In general, because the operating correlation mechanisms are two/three-dimensional and multi-directional, socio-economic/demographic georeferenced data tend to contain a moderate degree of positive spatial autocorrelation described by 0.4 < ρ < 0.6, whereas remotely sensed georeferenced data tend to contain substantial positive spatial autocorrelation described by 0.90 < ρ < 0.95; in contrast, many time series contain autocorrelation values in the interval (0.95, 1.00) (e.g., the average daily temperature in Honolulu for a 58-year period [28]).
Andrews and Herzberg [17] furnish Table 67.1, a spatial series (based upon zip code zone areal units) dataset for Chicago insurance provision (Figure 3a, which is based upon their map); zip code zones without data, as well as two zip code zones creating an isolated island in the northeastern part of the city, are excluded from this illustrative analysis, whereas one included zip code zone constitutes two geographically separated areal units (n = 44, ρ̂ = 0.68, density = 8.5%). For this example, the sample variance (without adjusting for covariates) estimate decreases from 1.4² (ignoring observational correlation) to 0.8² (accounting for observational correlation), and n* = 8. Meanwhile, most meaningful remotely sensed images have at least hundreds-of-thousands, if not millions, of pixels. The following are several available relatively small (but substantively rather meaningless) specimen images often used for illustrative purposes: Getis and Ord (16-by-16, with a single remotely sensed image variable [29]); High Peak Landsat 7 (30-by-30, with seven spectral bands [30]); and Houston pre- and post-Hurricane Harvey paired pixel NDVI differences from Landsat 8 (41-by-41; Griffith, Chun, and Li, https://personal.utdallas.edu/~yxc070300/sra_esf/ or https://github.com/ywchun/sp_esf; also see https://giscrack.com/list-of-spectral-indices-for-sentinel-and-landsat/). The example presented here employs this last dataset because the two smaller ones require spatial covariance matrices of an order higher than two (presumably because of amplified edge/boundary effects attributable to small sample sizes). The spatial series (pixel areal units) data for the difference between pre- and post-Harvey (6 April 2017, and 3 January 2018) NDVI values appear in Figure 3c (n = 1681, ρ̂ = 0.93, density = 0.2%). For this example, n* = 48.
Today, numerous spatial series datasets exist, mostly because of the widespread popularity and dissemination of geographic information systems (i.e., GISs); a map needs to accompany each of these series for an adequate description, which occurs only for Tables 5.1, 6.1–6.2, 16.1, 18.1, 49.1, 52.1, and 67.1 in Andrews and Herzberg [17], with Hand et al. ([19], p. ix) arguing that their deliberate exclusion of maps maintains dataset presentation simplicity. Both Andrews and Herzberg [17] and Hand et al. [19] include agricultural field plot yields (respectively, Tables 5.1 and 320). Andrews and Herzberg [17] also include additional agricultural field plot yields (Tables 6.1 and 58.1–58.2), groundwater chemicals and soil assay findings (Tables 16.1, 17.1, and 18.1), locational accuracy repeated measures (Table 10.1, a dataset also relating to Section 3.1 and Section 3.4), species distribution counts (Table 49.1), and earthworm biomass density by constructed quadrats (Table 52.1). Meanwhile, Hand et al. [19] additionally include United States city and state crime statistics (Tables 134, 262, and 356), yeast counts and sand sedge presence/absence by constructed quadrats (Tables 163 and 260), pine trees by stands (Table 250), temperature measures by cities (Table 262), industrial employment figures by countries (Table 363), and village dialect similarities (Table 145).
In summary, the main point of this section is a reinforcement of the notion that space series constitute correlated data.
3.4. Correlated Data: Space-Time Autocorrelation
Combining space and time series expands the size of an observations covariance matrix to nT-by-nT, where n denotes the number of geographic areal units, and T denotes the number of time periods; Cressie and Wikle [
31] furnish an excellent overview of the common models used to describe this variety of data. In this context, the simplest version of an inverse covariance matrix includes a factor that may be written in terms of the following two distinctive specifications:
and contemporaneous:
where
InT =
InIT is an nT-by-nT identity matrix, matrices
Cs and
CT respectively are n-by-n and T-by-T and denote the spatial and temporal observations linkage structures (i.e., equivalent to those appearing in expressions (9) and (10)), and ρ
s and ρ
T respectively denote the spatial and temporal autocorrelation parameters. The lagged description furnished by expression (11) states: what occurs at location i at time t is a function of what occurred at location i at time t − 1, as well as what occurred at the neighbors of location i at time t − 1; geographic observation correlation requires time to materialize as inertia accumulates in data. The contemporaneous description furnished by expression (12) states: what occurs at location i at time t is a function of what occurred at location i at time t − 1, as well as what occurs at the neighbors of location i at time t; geographic observation correlation almost instantaneously materializes as inertia accumulates in data. Matrix expressions (11) and (12) tend to have a higher density than a pure time series, but a lower density than a pure space series, observations covariance matrix; the calculation corresponding to expression (11) is given by 100[T
1TCs1 + n(T − 1)]/[nT(nT − 1)]%, with the sum of geographic neighbors term
1TCs1 typically not having a closed-form expression.
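The two verbal descriptions can be turned into a small simulation; the landscape, parameter values, and series length below are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 500                    # five locations on a line, 500 time steps (assumed)
rho_T, rho_s = 0.5, 0.2          # temporal and spatial autocorrelation (assumed)

# Linear-landscape spatial linkage matrix C_s
C_s = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
I = np.eye(n)
A = np.linalg.inv(I - rho_s * C_s)   # resolves the contemporaneous spatial feedback

y_lag = np.zeros(n)
y_con = np.zeros(n)
lag_path, con_path = [], []
for _ in range(T):
    eps = rng.standard_normal(n)
    # Lagged: location i at time t responds to itself and its neighbors at t - 1
    y_lag = rho_T * y_lag + rho_s * C_s @ y_lag + eps
    # Contemporaneous: location i responds to itself at t - 1 and its neighbors at t
    y_con = A @ (rho_T * y_con + eps)
    lag_path.append(y_lag.copy())
    con_path.append(y_con.copy())

lag_path, con_path = np.array(lag_path), np.array(con_path)
x = lag_path[:, 2]                             # middle location's time series
assert np.corrcoef(x[:-1], x[1:])[0, 1] > 0    # positive temporal autocorrelation
assert np.isfinite(con_path).all()             # stable under these parameter values
```

Each iteration realizes one of the two recursions verbatim: the lagged process only ever reads the previous period, whereas the contemporaneous process solves a within-period spatial system at every step.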
By the late 1980s and early 1990s, numerous time series and spatial series datasets had become readily available. In contrast, relatively few space-time datasets had become accessible, although this situation changed considerably by the dawn of the new millennium. Andrews and Herzberg [17] and Hand et al. [19] highlight this former paucity of space-time datasets: the former scholars include two (i.e., Tables 5.1 and 50.2), whereas the latter scholars include none, in their books. One noticeable characteristic of most space-time datasets is that temporal autocorrelation dominates correlation among observations, being much more pronounced than spatial autocorrelation because of its one-directional single-linkage accumulation of inertia in a system. However, Andrews and Herzberg's Table 5.1, a tabulation of 74 years (1852–1924) of agricultural wheat and straw yields for a linear arrangement of 17 experimental field plots (the Broadbalk field at Rothamstead Experimental Station), fails to display this feature, most likely because conducting annual randomized crop experiments would tend to prevent an accumulation of inertia in a time series (Grondona and Cressie [32] note that one goal of experimental design randomization involving areal units is to neutralize spatial autocorrelation; sequential years' randomizations also tend to neutralize temporal autocorrelation). The space-time inverse covariance structure matrix here has a density of 11.8%. The 17 wheat and straw time series produce sets of ρ̂_T values spanning the respective intervals (−0.50, 0.62) and (−0.63, 0.61); these two quite wide ranges defy a simplifying assumption of a constant ρ_T value for this pair of space-time series. Meanwhile, the 74 wheat and straw space series produce sets of ρ̂_s values spanning the respective intervals (−0.26, 0.33) and (−0.19, 0.44); these two rather wide ranges also defy a simplifying assumption of a constant ρ_s value for this pair of space-time series, although their small sample size most likely also contributes to this degree of dispersion. Consequently, this particular space-time series dataset seems suitable for a mixed model analysis, in which a time-invariant random effects (RE) term is estimated, exploiting the repeated measures nature of the time series for the given set of areal units. This RE term functions as a common factor across the 74 time series, and can be decomposed into spatially structured (SSRE) and spatially unstructured (SURE) components. Figure 4a,b portray scatterplots visualizing the observed-predicted pairs of values based upon estimated REs. Figure 4c,d respectively visualize the SSRE and SURE components with tertile maps describing the Broadbalk field; ρ̂ = −0.30 (which is not significantly different from zero; its H₀ probability is 0.20).
Meanwhile, Andrews and Herzberg's Table 50.2, a tabulation of 96 months (1963–1971) of Southern Germany fox rabies cases for a 32-by-32 grid of constructed quadrats, is another space-time dataset, with Table 50.1 reporting its cumulative counts over time by these quadrats in map form. Unfortunately, the counts reported for the 96 individual maps have a sum that is 13 greater than that for the collective map; furthermore, the cumulative maximum count for the former is 50, whereas it is 20 for the latter. In addition, 11 monthly quadrat maps have missing entries somewhere in their recordings; three of these maps have two, and one has four, missing entries. Consequently, this specific publicly available dataset is too corrupt to allow a proper space-time analysis, one that should display an accumulation of inertia. The importance of noting these mistakes here is that so many courses and textbooks, as well as databases such as that for the R Project, include the Andrews and Herzberg tables for example data analyses (a contention confirmed by a cursory internet search); almost certainly, these inaccuracies compromise pedagogy. Fortunately, the United States Census Bureau furnishes numerous easily accessible space-time datasets (see https://www.census.gov/), one of which is decennial population counts for Florida counties since 1930; although some earlier data dating back to 1821 also are available, the last of the 67 contemporary Florida counties was not established until 1925.
For illustrative purposes, this section utilizes a Florida population density dataset covering nine decennial census periods, which is too few time series observations for sensibly assessing and estimating temporal autocorrelation; the simplest of analyses requires a time series with a minimum of about 50 observations [33]. Therefore, selected preliminary exploratory data analyses were undertaken using the annual estimated Florida county population counts from 1970 to 2019 (see http://edr.state.fl.us/Content/population-demographics/data/CountyPopulation_2016.pdf), which provide 67 time series with 50 observations each. The space-time inverse covariance structure matrix here has a density of 1.1%. Initial findings include that population density should be subjected to a logarithmic transformation to better mimic a bell-shaped curve, and that log-transformed population density time series should be differenced, converting the sequences from simply log-density to change-in-log-density time series (this differencing dramatically degrades conformity with normality). These latter sequences display strong temporal autocorrelation described by a second-order autoregressive specification, suggesting that the lagged space-time specification would be more suitable than the contemporaneous specification. Spatial autocorrelation across the 50 maps is moderate-to-strong, with ρ̂_s spanning the interval (0.49, 0.69); spatial autocorrelation across the 49 differenced maps is more diverse, with ρ̂_s spanning the interval (−0.16, 0.66). One principal consequence of this differencing is that the illustrative space-time dataset now has 8 × 67, rather than 9 × 67, observations; employing the space-time lagged specification further reduces this number to 7 × 67 = 469 observations. The resulting autocorrelation parameter estimates are ρ̂_s = 0.17 and ρ̂_T = 0.45; again, temporal autocorrelation overshadows spatial autocorrelation. For this example, the sample variance (without adjusting for covariates) estimate decreases from 1.4² (ignoring observational correlation) to 0.2² (accounting for observational correlation with a random effects linear regression specification), and n* = 66 (based upon a STAR model specification).
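The observation-count bookkeeping for the decennial series can be sketched with hypothetical numbers (the density values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical population densities for one county over nine decennial censuses
density = np.array([12.0, 15.1, 21.4, 30.2, 44.8, 60.3, 81.9, 101.2, 130.5])

log_density = np.log(density)   # logarithmic transform toward a bell-shaped curve
change = np.diff(log_density)   # differencing: 9 values become 8 changes
assert change.shape[0] == density.shape[0] - 1

# A lagged space-time specification conditions on the previous period,
# costing one further observation per series: 8 usable values become 7
usable = change[1:]
assert usable.shape[0] == 7
```

Multiplying by the 67 counties reproduces the 9 × 67 to 8 × 67 to 7 × 67 = 469 reduction described above.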
In summary, the main point of this section is a reinforcement of the notion that space-time series constitute correlated data.
3.5. Correlated Data: Social Network Autocorrelation
The undirected graph-theoretic representation of the preceding inter-observations correlation structures renders sparse adjacency matrices; although these tend to increase in density across the preceding categories, albeit only rather modestly, they nevertheless constitute a small percentage of all possible adjacency matrices (e.g., see Reference [34]). For example, many spatial series relate to connected planar graphs (e.g., Reference [35]), whereas some spatial series, which employ the queen adjacency definition (another chess move analogy) or a particular set of geographic nearest neighbors or areal units within buffer zone geographic distances, relate to connected nearly-planar graphs; these low-density correlation structure articulations continue to embrace a very small percentage of all possible graphs when quantified by matrix density. More recently, correlated data analysis attention has turned to social network observations correlation structures, which relate to higher density inverse covariance structure matrices. The dataset collections compiled by Andrews and Herzberg [17] and Hand et al. [19] fail to include these types of specimen datasets because few sources for them existed prior to the dawn of contemporary network science [14]; Facebook and Flickr (launched in 2004), Twitter (launched in 2006), Instagram (launched in 2010), and Snapchat (launched in 2011), to name a few, were unknown in the 1980s and 1990s.
Hashmi et al. [36] tabulate densities for the following selected 500-node social networks: Epinions, 27.5%; Wikipedia, 23.3%; Twitter, 6.2%; e-mail, 4.8%; and authors, 4.8%. Comparable geographic planar graphs would have densities less than 0.2%. Faust [37] summarizes descriptive statistics for 51 social networks, reporting that their densities range from 2% to 86%. In addition, the KONECT project (http://konect.cc/) makes 1326 networks available to the public. Of these, 1197 have their density (which, employing KONECT terminology, equals 100 × Fill) tabulated; many appear to have a very low density (186 are nearly zero), similar to the preceding correlated data categories, with the largest density being 79.5% (only 10 networks have a density of at least 50%). Nevertheless, a number of these densities are substantially greater than any of those for the preceding inter-observation correlation structure C matrices.
Gatewood and Price [38] analyze a social network of jazz musicians (Reference [39]; n = 198, density = 14.1%) that is available in the KONECT dataset collection. Their response variable is the degree of centrality within this network, which is the principal eigenvector of the n-by-n matrix C (Figure 5a). This specimen network furnishes one illustrative example here. Similarly, the centrality index for a University Rovira i Virgili (Tarragona) e-mail network (Reference [40]; n = 1133, density = 0.9%) furnishes a second social network illustration (Figure 5c). These two graphs, whose Cartesian coordinates in their portrayals here are the first two eigenvectors of the doubly-centered matrix (I_n − 11ᵀ/n)C(I_n − 11ᵀ/n) appearing in the numerator of the adapted Moran [41] spatial autocorrelation coefficient index, conspicuously differ from the preceding correlated data structure graphs. Not surprisingly, the existing spatial statistical techniques, which originally evolved from time series techniques, can be exploited to analyze social network autocorrelated data (e.g., the Moran scatterplot [42]; Figure 5b,d). For the jazz musicians social network, ρ̂ = 0.82 and n* = 7, whereas for the university e-mail social network, ρ̂ = 0.46 and n* = 317. Respectively, after a conventional standardizing of the two principal eigenvectors (i.e., dividing each element by its vector's maximum, and then multiplying by 100), the sample variance estimates (without adjusting for covariates) decrease from 20.9² and 10.4² (ignoring observational correlation) to 15.7² and 9.1² (accounting for observational correlation).
In summary, the main point of this section is confirmation of the notion that social network series constitute correlated data. In doing so, it sets the stage for social network autocorrelation being a fundamental item for big data discussions.