## Appendix A. Articles in Our Analysis

The number of edits in each page’s time series is listed in parentheses after the article title, along with a simple classification: George_W._Bush (45,448; biography, politician), United_States (33,725; geography), Wikipedia (32,592; technology), Michael_Jackson (27,587; biography, entertainment), Catholic_Church (24,813; religion), Barack_Obama (23,889; biography, politician), World_War_II (23,173; event, political), Global_warming (20,003; science), 2006_Lebanon_War (19,972; event, political), Islam (18,523; religion), Canada (18,150; geography), Eminem (18,066; biography, entertainment), September_11_attacks (17,564; event), Paul_McCartney (16,973; biography, entertainment), Israel (16,790; geography), Hurricane_Katrina (16,753; event), Xbox_360 (16,753; technology), Pink_Floyd (16,037; biography, entertainment), Iraq_War (15,891; event), Blackout_(Britney_Spears_album) (15,832; entertainment), Turkey (15,663; geography), Super_Smash_Bros._Brawl (15,432; technology), World_War_I (15,292; event), Gaza_War (14,920; event), Lost_(TV_series) (14,897; entertainment), Blink-182 (14,789; entertainment), Scientology (14,727; religion), John_Kerry (14,307; biography, politician), Heroes_(TV_series) (14,223; entertainment), Australia (14,186; geography), China (14,023; geography), Bob_Dylan (13,916; biography, entertainment), Neighbors (13,547; entertainment), The_Holocaust (13,346; event), Atheism (13,295; religion), Hilary_Duff (13,222; biography, entertainment), Mexico (13,213; geography), The_Dark_Knight_(film) (13,025; entertainment), France (12,800; geography), John_F._Kennedy (12,788; biography, politician), Lindsay_Lohan (12,757; biography, entertainment), Girls’_Generation (12,746; entertainment), Argentina (12,745; geography), Virginia_Tech_massacre (12,682; event), RMS_Titanic (12,451; event), Russo-Georgian_War (12,365; event), Homosexuality (12,170; science), Circumcision (12,149; religion, science), Hillary_Rodham_Clinton (11,981; biography, politician), Star_Trek (11,919; entertainment), Shakira (11,712; biography, entertainment), Sweden (11,666; geography), New_Zealand (11,639; geography), Paris_Hilton (11,635; biography, entertainment), Wizards_of_Waverly_Place (11,520; entertainment), Genghis_Khan (11,410; biography, politician), Cuba (11,390; geography), Linux (11,316; technology), Che_Guevara (11,250; biography, politician), Golf (11,141; entertainment), iPhone (11,085; technology), God (10,731; religion).

## Appendix B. Choosing the Number of States in an HMM


**Table B1.**
Using AIC and BIC to choose the number of states in a hidden Markov model fit. Here, we take an actual model from our data (the 8-state best fit model for the `God` page), use that model to generate a new time series of equal length (10,731 samples) and attempt to fit a new model, using either AIC or BIC to select the preferred number of states in a manner similar to Refs. [92,93]. The table lists the fraction of the time this process led to a preferred machine of each size, for the two different penalties. Both AIC and BIC tend to underestimate model complexity; in general, BIC performs worse, significantly underestimating the true number of states. AIC performs better, recovering the correct number of states nearly half the time.
| **Number of States** | **1** | **2** | **3** | **4** | **5** | **6** | **7** | **8** (Truth) | **9** | **10** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIC | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 37.5% | 7.2% | 45.8% | 9.3% | 0.0% |
| BIC | 0.0% | 0.0% | 0.0% | 20.8% | 54.1% | 23.9% | 1.0% | 0.0% | 0.0% | 0.0% |

Choosing the number of states to include in an HMM is an example of a model selection problem; one is selecting between models with different numbers of states. In general, the larger the HMM, the better the data can be fit: when does one stop improving the fit because the model is becoming “too complex”? A generic solution to model selection is cross-validation: one fits the model using a subset of the data (the “training set”), and sees how well the model performs when predicting out of sample (the “test set” or “hold-out” set).

Cross-validation uses the phenomenon of over-fitting to determine when a model is too complex: if a model has too many parameters, it will overfit to the training set, finding “patterns” that are really due to coincidence. These patterns will fail to hold in the test set, degrading out-of-sample performance; this degradation can be measured, and one stops increasing model complexity when performance on the test set first starts to decline. Standard cross-validation techniques work best when the data are independently sampled, i.e., when it is possible to construct a test set that is uncorrelated with the training set conditional on the underlying model. As increasing levels of correlation appear in the data, the construction of such a test set becomes difficult.

When, as is the case for HMMs, the model is Bayesian, another method is possible: the likelihood penalty. To use a likelihood penalty, you fit the model to all of the data and note the posterior log-likelihood. You then apply a penalty, reducing the log-likelihood depending on features related to the complexity of the model, including, usually, the total number of parameters. After applying this penalty, it is usually the case that one particular model, often not the most complex one, maximizes the penalized log-likelihood, and this is the one considered preferred. Numerous likelihood penalties exist, including the Bayesian Information Criterion (BIC [36]), the Bayesian Evidence (introduced in [94]; used on Wikipedia data in [1,16]) and the Akaike Information Criterion (AIC; introduced in [35]).
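As a concrete sketch of how such a penalty is applied (not the paper's actual fitting code), the two criteria can be computed from each candidate model's maximum log-likelihood and parameter count. The parameter-counting convention for a discrete-output HMM below is one common choice and an assumption here, not necessarily the convention used in the main text:

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln(n) - 2 lnL (smaller is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

def hmm_n_params(n_states, n_symbols):
    """Free parameters of a discrete-output HMM: each row of the transition
    matrix, emission matrix, and initial distribution sums to one, so each
    row loses one degree of freedom."""
    return (n_states * (n_states - 1)      # transition matrix
            + n_states * (n_symbols - 1)   # emission matrix
            + (n_states - 1))              # initial distribution

def select_n_states(fits, n_symbols, n_samples, use_bic=False):
    """fits maps candidate state counts to their maximum log-likelihoods;
    return the count that minimizes the chosen criterion."""
    best, best_score = None, math.inf
    for n_states, ll in fits.items():
        k = hmm_n_params(n_states, n_symbols)
        score = bic(ll, k, n_samples) if use_bic else aic(ll, k)
        if score < best_score:
            best, best_score = n_states, score
    return best
```

For example, with hypothetical log-likelihoods {7: −5000, 8: −4980} on 10,731 binary samples, AIC prefers eight states while BIC's stronger ln(n) penalty prefers seven, mirroring BIC's tendency to choose smaller machines.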

The existence of both long- and short-range correlations in data fit using an HMM makes the use of cross-validation and hold-out techniques difficult. The work in [92] proposes two methods for cross-validation on HMMs that attempt to compensate for the failure of independence. Both methods have difficulties, and work well only on a subset of HMMs, where correlations are not particularly long-range and the transition matrix is not too sparse. Because we find that both conditions are violated in our data, we do not attempt cross-validation tests. The work in [92] also considers the AIC and BIC methods; they find that both work well in recovering the true size of an HMM used to generate simulated data. We are not aware of work that has tested the use of Bayesian Evidence on simulated data and defer this interesting, but involved, question to later work.

The authors of Ref. [92] find that, when it fails, BIC tends to slightly underestimate, and AIC to slightly overestimate, the true number of states, although both criteria work well. Work by [93] confirms this result. However, both papers consider regimes that do not directly apply here, and both found cases where (for example) BIC significantly underestimated model complexity. In addition, both papers consider problems with an order of magnitude less data than we have and true HMM sizes much smaller than ten. In the main text, we use the AIC penalty, a common choice across the biological and signal-processing communities, and strongly argued for on general grounds by [95] when the goal is minimizing prediction loss.

To validate our choice and, in particular, to determine whether AIC or BIC produces more valid results in the regimes relevant here, we did an in-depth test with a particular model: the eight-state HMM associated with the `God` page. We took the derived HMM for this page and used it to generate 96 simulated datasets of equal length to the original. We then ran our fitting code on each of these datasets and compared what happened when we used the AIC and BIC criteria to select the preferred number of states. Consistent with [92,93], we found that BIC tended to underestimate model complexity, choosing machines significantly smaller than the true size and, in fact, never recovering the true system size. We found, by contrast, that AIC worked better; like BIC, it still often underestimated model complexity, but did so by smaller amounts. Conversely, a small fraction of the time (less than 10%), it preferred a model that was one state more complex than reality.
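The data-generation step of such a test can be sketched generically. The sampler below simulates a discrete-output HMM specified by an initial distribution `pi`, a row-stochastic transition matrix `A`, and an emission matrix `B`; the names and structure are illustrative assumptions, not the paper's actual code:

```python
import random

def sample_hmm(pi, A, B, length, seed=None):
    """Generate one simulated observation sequence from a discrete-output HMM.
    pi: initial state distribution; A: row-stochastic transition matrix;
    B: emission matrix, B[state][symbol]; length: number of samples."""
    rng = random.Random(seed)

    def draw(dist):
        # Inverse-CDF sampling from a discrete distribution.
        r, cum = rng.random(), 0.0
        for i, p in enumerate(dist):
            cum += p
            if r < cum:
                return i
        return len(dist) - 1  # guard against floating-point rounding

    state = draw(pi)
    observations = []
    for _ in range(length):
        observations.append(draw(B[state]))  # emit a symbol, then transition
        state = draw(A[state])
    return observations
```

Repeating this call 96 times (with different seeds) and refitting each resulting series reproduces the structure of the test described above.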

Because our main concern is to minimize overfitting and the introduction of fictitious structure, without losing too much of the actual structure in the process, AIC’s slight tendency to underestimate model complexity is not a major concern. A small fraction of the time (roughly 10%), AIC preferred a machine that was one state larger; however, robustness tests show that none of our main conclusions depend on the exact number of states being that chosen by the AIC criterion. We recover the same conclusions, for example, if we arbitrarily fix the number of states to twelve. It is worth noting that we do not believe the “true” model of Wikipedian conflict is itself a finite-state machine [1]; i.e., the fundamental problem is finding the best approximation, rather than locating the correct model within a known class.

## Appendix C. Relaxation Time, Mixing Time, Decay Time, Trapping Time

By the Levin–Peres–Wilmer theorem [96], the relaxation time, τ, defined in Equation (1) above, provides an upper bound on the mixing time, ${\tau}^{\prime}\left(\epsilon\right)$, the maximum time it takes an arbitrary initial condition to come within a small distance ϵ of the stationary distribution, where distance is defined as the maximum absolute difference in any state-occupation probability. In particular, we have

$${\tau}^{\prime}\left(\epsilon\right)\le \tau \log\left(\frac{1}{\epsilon {\pi}_{\mathrm{min}}}\right),$$

where ${\pi}_{\mathrm{min}}$ is the smallest value in the stationary probability distribution, of characteristic order ${10}^{-1}$ in the HMMs considered in this paper. These relationships hold when the HMM is “reversible” and “irreducible”; in empirical work, such as that presented here, those conditions are almost always satisfied. For example, they hold when every state in the HMM has a nonzero self-loop probability, no matter how small, and a directed path exists between any two states, both of which are true by default when using a Dirichlet prior and are true by inspection in our EM fits. This justifies a useful and intuitive interpretation of τ as a measure of how quickly an arbitrary initial condition converges to the average, as well as suggesting more sophisticated measures that take into account the allowable level of deviation (ϵ) and inhomogeneities in the stationary distribution itself (${\pi}_{\mathrm{min}}$).
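As an illustration, both quantities can be computed directly from a transition matrix. The snippet below assumes Equation (1) takes the standard spectral-gap form τ = 1/(1 − λ₂), with λ₂ taken as the second-largest eigenvalue modulus; these are assumptions for the sketch, not a transcription of the main text:

```python
import numpy as np

def relaxation_time(A):
    """Relaxation time 1/(1 - |lambda_2|) of a row-stochastic matrix A,
    where lambda_2 is the second-largest eigenvalue modulus."""
    moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return 1.0 / (1.0 - moduli[1])

def mixing_time_bound(A, eps=0.01):
    """Upper bound tau * log(1 / (eps * pi_min)) on the mixing time."""
    # Stationary distribution: left eigenvector of A with eigenvalue 1.
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()  # normalize (also fixes an overall sign flip)
    return relaxation_time(A) * np.log(1.0 / (eps * pi.min()))
```

For a symmetric two-state chain with stay-probability 0.9, the eigenvalues are 1 and 0.8, giving τ = 5 and, for ϵ = 0.01 with π_min = 0.5, a bound of 5 ln 200 ≈ 26.5 steps.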

The relaxation time τ is related to another natural quantity, the decay time of the second eigenvector. The time constant for decay, ${\tau}_{d}$, is related to ${\lambda}_{2}$ as

$${\lambda}_{2}^{t}={e}^{-t/{\tau}_{d}},$$

which implies that

$${\tau}_{d}=-\frac{1}{\mathrm{ln}\phantom{\rule{0.1em}{0ex}}{\lambda}_{2}}.$$

On doing a Laurent series expansion around ${\lambda}_{2}$ equal to unity, we find

$${\tau}_{d}=\frac{1}{1-{\lambda}_{2}}-\frac{1}{2}+O\left(1-{\lambda}_{2}\right),$$

which implies that, to zeroth order in ${\lambda}_{2}-1$ (i.e., when relaxation times are long), ${\tau}_{d}\approx \tau -1/2$, as used in Equation (2). All of the machines in this paper are in the regime where the relaxation time and decay time are nearly equivalent.
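A quick numerical check of the approximation ${\tau}_{d}\approx \tau -1/2$, assuming Equation (1) takes the standard form τ = 1/(1 − λ₂):

```python
import math

def decay_time(lambda2):
    """Decay time tau_d defined by lambda2**t = exp(-t / tau_d)."""
    return -1.0 / math.log(lambda2)

def relaxation_time(lambda2):
    """Relaxation time, assuming the standard form 1/(1 - lambda2)."""
    return 1.0 / (1.0 - lambda2)

# As lambda2 approaches 1 (long relaxation times), the difference
# tau - tau_d approaches exactly 1/2.
differences = [relaxation_time(lam) - decay_time(lam)
               for lam in (0.9, 0.99, 0.999)]
```

The differences shrink monotonically toward 1/2 as λ₂ → 1, consistent with the expansion above.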

In contrast to relaxation, mixing, and decay time, trapping time is an empirically-measured quantity. To compute trapping time, you define a set of internal states of interest, and then use the Viterbi algorithm to reconstruct the maximum-likelihood path through a particular time series. Trapping time is then defined as the average length of time the system spends in the set of interest before it leaves. In this paper, we track trapping time for the two main subspaces as defined by the sign structure of the second eigenvector.
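The averaging step can be sketched as follows; the helper below is a hypothetical illustration (not the paper's code) that takes an already Viterbi-decoded state path and averages the lengths of its contiguous runs inside the set of interest:

```python
def trapping_time(state_path, trap_states):
    """Average length of the contiguous runs that state_path spends inside
    trap_states before leaving. state_path: sequence of state labels
    (e.g., a Viterbi-decoded path); trap_states: the set of interest."""
    trap = set(trap_states)
    runs, current = [], 0
    for state in state_path:
        if state in trap:
            current += 1          # still inside the set: extend the run
        elif current > 0:
            runs.append(current)  # just left the set: close the run
            current = 0
    if current > 0:               # path ends while still inside the set
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0
```

For example, the path `[0, 0, 1, 2, 0, 0, 0, 1]` with trap set `{0}` has runs of length 2 and 3, giving a trapping time of 2.5.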