The World Within Wikipedia: An Ecology of Mind

Human beings inherit an informational culture transmitted through spoken and written language. A growing body of empirical work supports the mutual influence between language and categorization, suggesting that our cognitive-linguistic environment both reflects and shapes our understanding. By implication, artifacts that manifest this cognitive-linguistic environment, such as Wikipedia, should represent language structure and conceptual categorization in a way consistent with human behavior. We use this intuition to guide the construction of a computational cognitive model, situated in Wikipedia, that generates semantic association judgments. Our unsupervised model combines information at the language structure and conceptual categorization levels to achieve state-of-the-art correlation with human ratings on semantic association tasks including WordSimilarity-353, semantic feature production norms, word association, and false memory.


Introduction
Miller [1] offered the term informavore to capture our tendencies as cognitive agents to devour the information that we encounter in our environment. Miller's notion places emphasis on the agent as recipient of this information. Yet human beings also actively construct this information. In the past few millennia, the cultural institutions of human beings have produced a vast wealth of artifacts, and since the onset of written language, much of this creation has been linguistic in nature. Humans both devour and construct the informational culture around them, producing a dynamic that some have termed niche construction in biological systems [2]. In niche construction, the behavior of an organism transforms its environment in a manner that can then facilitate the survival of the organism itself, producing a feedback loop. Just as a beaver's dam construction modifies the immediate ecology in which it lives, or particular species of trees alter the nutrient content of the forest floor around them, human cognition and its external linguistic products affect each other through mutual influence: Over a short time scale, a single human extracts information from and adds information to this environment (e.g., linguistic input, reading, etc.), and over a longer time scale, the cumulative impact of the linguistic behavior of many humans produces change in that environment itself [3], creating an inherited linguistic and cognitive ecosystem.
To some cognitive scientists who focus solely on internal mechanisms, this may seem like a strange theoretical agenda. It may be useful to note, however, that there are a multitude of explanatory goals in the cognitive sciences, and these goals lie at different timescales [4]. If a cognitive scientist is interested in the immediate influences on language behavior, such as communicative goals and lexical knowledge, it may make sense to focus on cognitive theories that best explain the rapid deployment of such knowledge and processes. However, if a cognitive scientist is interested in understanding longer-timescale phenomena, such as cultural or linguistic change, or language evolution and origins, then a broader set of variables becomes relevant. Many theorists have argued that an understanding of longer-timescale biological phenomena is incomplete without attending to the ecological conditions in which these phenomena function (e.g., recently [5]). This notion of the cognitive agent and its environment suggests a mutuality that defies the conventional internalist/externalist dichotomy sometimes framed in the philosophy of mind and cognitive sciences. The linguistic environment and the cognitive agent are, from this perspective, parts of the same system, mutually constraining and shaping each other over a range of time scales [6].
Some effects of our linguistic environment are enormous and unfold gradually. For example, a child raised in an English-speaking community will learn to speak English natively and will not spontaneously start speaking Chinese. However, many effects of our linguistic environment are quite subtle. It is well known that exposure to certain kinds of sentence structure will temporarily make participants more likely to produce language with the same structure, a phenomenon called structural priming [7]. In addition, prior exposure to particular sentence structures can also lead to grammaticality judgments of greater acceptability for new sentences with the same structure [8]. This effect persists at least seven days after exposure and increases when participants read for comprehension. These research efforts indicate that subtle sentence structure effects are long enough in duration to be an influential part of our cognitive-linguistic environment.
Perhaps the most striking evidence of a cognitive-linguistic ecosystem comes from developmental studies. The common experimental paradigm in these studies is to situate a child and adult in a play session with a new toy. The adult then produces a novel name for the toy multiple times in the session, and some time later the adult uses the novel name to ask the child to get the toy. Children at 13 months of age will correctly respond to a paired novel non-word as well as a paired novel word, but at 20 months children lose this ability and can only correctly respond when labels are novel words [9]. Similarly, 20-26-month-old children respond to words as labels, but only when they are produced by the mouth, rather than by a tape recorder held by the adult [10]. During development, attention is increasingly focused on words and child-directed words as a cue to naming objects.
Related work in named category learning builds on these effects. In this paradigm, multiple objects/toys belonging to the same category are presented with a word label. When 17-month-old children are presented with a label for two toys that are different in all respects except shape, not only do they correctly learn that the label corresponds to shape and generalize it to new objects, but when presented with a new label and new objects with a novel shape, children are able to correctly generalize that the new label refers to the novel shape in a single trial [11]. In addition, children who participated in the 8-week experiment showed a roughly 250% increase in object name vocabulary growth during this time compared to a control group that was exposed to the same objects without corresponding word labels. Only children exposed to categories and word labels were able to generalize the property of shape to new objects in a single trial. In a related study with 13-month-olds, not only were word labels found to increase attention to novel objects of the same category, but word labels were also found to increase attention to the superordinate category (cow-animal), relative to a non-word-label condition [12]. These studies demonstrate the mutual influence between language and cognition during development: Word labeling focuses attention on category features, attention to discriminating features improves category structure, and improved category structure facilitates the learning of more word labels.
Although the growing body of empirical work above indicates that our cognitive-linguistic environment affects language structure and categorization, it also highlights the difficulty of long-duration experiments with human participants. An alternative approach is to provide a comparable cognitive-linguistic environment to a computational cognitive model and observe the similarities between that model's behavior and human behavior. There is an extensive literature using this approach to model human semantic behavior. One popular approach, known as latent semantic analysis (LSA), represents text meaning as the spatial relationships between words in a vector space [13,14]. LSA has been used to model a variety of semantic effects including approximating vocabulary acquisition in children [13], cohesion detection [15], grading essays [16], understanding student contributions in tutorial dialogue [17,18], entailment detection [19], and dialogue segmentation [20], amongst many others. LSA is part of a larger family of distributional models. The underlying assumption of distributional models is that the context of use determines the meaning of a word [21]. Thus doctor and nurse would have a similar meaning, because these words (as well as their referents) typically occur in the same context. In the case of LSA, the contexts associated with a word are represented as vector components, such that the jth component of the word vector is the number of times that word appeared in the jth document in the text collection. Other distributional models vary according to how they define, represent, and learn contexts [22].
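The distributional assumption can be illustrated with a minimal sketch. The toy "documents" below are invented for illustration; each word is represented by its per-document occurrence counts, as in the LSA description above, and words are compared by vector cosine.

```python
import math

# Toy illustration of the distributional assumption; the "documents" are
# invented for this sketch, not drawn from a real corpus.
docs = [
    "the doctor and the nurse examined the patient".split(),
    "the nurse helped the doctor in the hospital".split(),
    "the driver parked the car in the garage".split(),
]

def doc_vector(word):
    # jth component = number of times `word` appears in the jth document.
    return [d.count(word) for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Words that occur in the same contexts come out similar.
print(cosine(doc_vector("doctor"), doc_vector("nurse")))  # high (~1.0)
print(cosine(doc_vector("doctor"), doc_vector("car")))    # 0.0, no shared contexts
```

In a realistic corpus, doctor and nurse would rarely have identical context vectors; LSA additionally applies singular value decomposition so that words sharing similar (not identical) contexts also come out similar.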
However, traditional models such as LSA are based solely in language structure, and so they do not model the mutual influence between cognition and language. This is partly because the available environments for such models have been entirely linguistic, e.g., text dumps of books, newspapers, and other abundant sources of text. In contrast, the advance of the Internet has given rise to data sets that are created and organized in novel ways that reflect human conceptual/categorical organization. Wikipedia is the prototypical example of this new breed of cognitive-linguistic environment. It is read and edited daily by millions of users [23]. As an online encyclopedia, Wikipedia is structured around articles pertaining to concept-specific entries. Additionally, Wikipedia's structure is augmented by hyperlinks between articles and other kinds of pages, such as category pages, which provide loose hierarchical structure, and disambiguation pages, which disambiguate entries with exact or highly similar names. Using Wikipedia as a cognitive-linguistic environment, a computational model that incorporates both the mutual influences of conceptual/categorical organization and the structure of language should produce behavior closer to human behavior than a model without such mutual influence.
Several researchers have already used Wikipedia's structure in models that emulate human semantic comparisons [24-26]. In this paper we extend their work in two significant ways. First, rather than focus on a single type of structure, e.g., link structure or concept structure, we present a model that utilizes three levels of structure: word-word, word-concept, and concept-concept (W3C3), to more fully represent the cognitive-linguistic environment of Wikipedia. As we will show in the following sections, each of these levels independently contributes to an explanation of human semantic behavior. Secondly, in addition to the common dataset considered by previous researchers using Wikipedia, the WordSimilarity-353 [27] dataset, we apply the W3C3 model to a wider array of behavioral data, including word association norms [28], semantic feature production norms [29], and false memory formation [30]. Studies 1 to 4 examine how the W3C3 model manifests language structure and categorization effects across this wide array of behavioral data. Our analysis suggests that, at multiple levels of structure, Wikipedia reflects the aspects of meaning that drive semantic associations. More specifically, meaning is reflected in the structure of language, the organization of concepts/categories, and the linkages between them. Our results inform the internalist/externalist debate by showing just how much of the internal cognitive-linguistic structure used in these tasks is preserved externally in Wikipedia.

Semantic Models
In the following sections we present three approaches that, when applied to Wikipedia, extract models of semantic association at three different levels. The first model, the Correlated Occurrence Analogue to Lexical Semantics [31], operates at a word-word level. The second model, Explicit Semantic Analysis [24,25], operates at a word-concept level. The third and final model, the Wikipedia Link Measure [26], operates at a concept-concept level. We then describe a joint model (W3C3) that trivially combines these three models.

Correlated Occurrence Analogue to Lexical Semantics
The Correlated Occurrence Analogue to Lexical Semantics (COALS) model implements a sliding window strategy to build a word-by-word matrix of normalized co-occurrences [31]. Because the meaning representation of each word is defined by its co-occurrence with other words, COALS can be considered to be a word-word level model. The COALS matrix is constructed using the following procedure. For each word in the corpus, the four words preceding and following that word are considered as context. The word at the center of the window is identified with a respective row of a matrix, and each of the eight context words in the co-occurrence window is identified with a respective column. Thus for a particular window of nine words, eight matrix cells can be identified, corresponding to the row of the center word and the columns of the eight context words. These eight cells are incremented using a ramped window, such that the immediate neighbors of the center word are assigned a value of 4, the next neighbors a value of 3, and so on, such that the outermost context words are assigned a value of 1. Thus the eight cells are incremented according to a weighted co-occurrence, where the weight is determined by the distance of the context word from the center word. After the matrix is updated with all the context windows for the words in the corpus, the entire matrix is normalized using Pearson correlation [31]. However, since the correlation is calculated over the joint occurrence of row and column words (a binary variable), this procedure is equivalent to calculating the phi coefficient, which we present as a simpler description of the normalization process. Let v be the value of a cell in the co-occurrence matrix, c be the column sum of the column containing v, r be the row sum of the row containing v, and T be the sum of all cells in the matrix. Table 1 summarizes the entries for calculating the phi coefficient.
Table 1. Per-cell calculation of the phi coefficient.

                     Word B present    Word B absent    Total
    Word A present   v                 r - v            r
    Word A absent    c - v             T - r - c + v    T - r
    Total            c                 T - c            T

The corresponding phi coefficient [32] is

    φ = (T·v - r·c) / sqrt(r·c·(T - r)·(T - c))

In addition to transforming each cell value into its corresponding phi value, COALS "sparsifies" the matrix by replacing all non-positive cells with zero, such that each normalized cell value becomes

    v' = max(φ, 0)

Thus the final representation for a given word is its associated row vector, whose only non-zero components are positive correlations between that word and context words. The semantic similarity between two such words may be compared by locating their corresponding row vectors and calculating the correlation between them.
It is worth noting that the original COALS article proposes several variants based around the above process. One such variant removes 157 stop words before processing the corpus; another restricts the columns of context words to the most frequent words only. Although the usefulness of these variations has been disputed in other frameworks [33], we nevertheless follow the original process for replication purposes. A third variant of COALS transforms the co-occurrence matrix using singular value decomposition (SVD) [34]. SVD is the key step in LSA and is used in COALS in a similar way: To eliminate noise in the matrix. In this variant, called COALS-SVD, the matrix A is first reduced to its most common 15,000 rows and 14,000 columns, forming the submatrix B. Phi-normalization as described above is applied to B, and finally B is factored using SVD into three matrices,

    B = UΣV^T

where U and V are orthonormal matrices and Σ = diag(σ1, . . ., σn) with σ1 ≥ . . . ≥ σn ≥ 0. The σi are the singular values of the matrix B. The desired matrix for word-word comparisons is U, whose row vectors are the SVD-denoised versions of B's row vectors. Observe that right multiplying B by VΣ⁻¹ yields

    BVΣ⁻¹ = UΣV^T VΣ⁻¹ = U

By this identity, the full vocabulary of the original A matrix may be projected into the SVD solution for B, as long as the column dimensions of A and B match (e.g., 14,000). To do this, simply right multiply A by VΣ⁻¹:

    AVΣ⁻¹ = U_A

U_A's row vectors are SVD-denoised versions of A's row vectors, defined by the SVD solution of B.
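The core COALS procedure (ramped window counting, phi normalization, and sparsification) can be sketched as follows. This is an illustrative toy implementation over an invented corpus, not the full 15,000 × 14,000 matrix or the SVD variant.

```python
import math
from collections import defaultdict

# Minimal sketch of the COALS procedure described above; the tiny corpus
# is invented for illustration.
tokens = "the quick brown fox jumps over the lazy dog again and again".split()

counts = defaultdict(float)  # (center, context) -> ramped co-occurrence weight
for i, center in enumerate(tokens):
    for offset in range(1, 5):                # four words on each side
        weight = 5 - offset                   # ramp: 4, 3, 2, 1
        for j in (i - offset, i + offset):
            if 0 <= j < len(tokens):
                counts[(center, tokens[j])] += weight

words = sorted({w for pair in counts for w in pair})
row = {w: sum(v for (r, _), v in counts.items() if r == w) for w in words}
col = {w: sum(v for (_, c), v in counts.items() if c == w) for w in words}
T = sum(counts.values())

def phi(center, context):
    # Phi coefficient from cell value v, row sum r, column sum c, total T.
    v = counts[(center, context)]
    r, c = row[center], col[context]
    denom = math.sqrt(r * c * (T - r) * (T - c))
    return (T * v - r * c) / denom if denom else 0.0

def coals_vector(word):
    # Sparsified row vector: non-positive correlations are set to zero.
    return [max(phi(word, ctx), 0.0) for ctx in words]
```

Two such vectors would then be compared by correlation, as described above; the SVD variant would further factor the sparsified matrix and use the denoised row vectors instead.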

Explicit Semantic Analysis
Explicit Semantic Analysis (ESA) uses the article structure of Wikipedia without considering the link structure [24,25]. The intuition behind ESA is that while traditional corpora are arranged in paragraphs, which might contain a mixture of latent topics, the topics in Wikipedia are explicit: Each article is a topic, or correspondingly a concept. ESA defines term vectors in terms of their occurrence in Wikipedia articles. Because the meaning representation of each word is defined by its co-occurrence with article concepts, ESA can be considered as a word-concept level model. ESA vectors are based on frequency counts weighted by a variation of term frequency-inverse document frequency (tf-idf) [35]:

    tf_ij = v_ij

where v_ij is the number of occurrences of a term i in an article j. Correspondingly:

    idf_i = log(|A| / |{j : v_ij > 0}|)

where |A| is the total number of Wikipedia articles and the denominator is the number of articles in Wikipedia that contain a given term. An ESA vector for term i is defined by:

    v_i = (tf_i1 · idf_i, . . ., tf_i|A| · idf_i) / ||tf_i · idf_i||

where the jth component of vector v_i is the tf-idf weight of term i in article j, normalized by the length of the tf-idf vector. As a result, all ESA vectors have a length of 1. Similarity between terms is computed as the vector cosine between term vectors. Because Wikipedia is a rather large corpus with many unique terms, ESA approaches have used pruning heuristics to reduce the space to a more manageable size [24,25]. Pruning heuristics include removing articles with fewer than 100 words, removing articles with fewer than five total links (inlinks and outlinks), removing high frequency words (stop words), removing rare words, and transforming the remaining words into their approximate uninflected root form (a process called stemming).
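A minimal sketch of ESA-style term vectors follows. The handful of "articles" below are invented stand-ins for Wikipedia concept entries; the tf-idf weighting and unit-length normalization follow the description above, while the pruning heuristics are omitted.

```python
import math

# Minimal sketch of ESA-style concept vectors; the "articles" here are
# invented stand-ins for Wikipedia concept entries.
articles = {
    "Automobile": "car engine wheel road driver".split(),
    "Medicine":   "doctor nurse hospital patient".split(),
    "Transport":  "car road bus driver traffic".split(),
}

names = sorted(articles)
A = len(articles)

def esa_vector(term):
    # tf-idf weight of the term in each article, normalized to unit length.
    df = sum(1 for a in names if term in articles[a])
    if df == 0:
        return [0.0] * A
    idf = math.log(A / df)
    v = [articles[a].count(term) * idf for a in names]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def esa_similarity(t1, t2):
    # Cosine of unit vectors reduces to their dot product.
    return sum(a * b for a, b in zip(esa_vector(t1), esa_vector(t2)))

print(esa_similarity("car", "driver"))  # high: share Automobile and Transport
print(esa_similarity("car", "doctor"))  # 0.0: no shared concepts
```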

Wikipedia Link Measure
Previous work has used the link structure of Wikipedia to derive a semantic similarity measure using the Wikipedia Miner toolkit [26]. The basic premise of this approach is that every Wikipedia page has pages that link to it (inlinks) as well as pages it links to (outlinks). The links are inline with the text, meaning that a word or phrase in the text has been hyperlinked. The words/phrases themselves are called anchors, and they can be viewed as synonyms for the target page; e.g., motor car, car, and automobile link to the Wikipedia page Automobile. The Wikipedia Link Measure (WLM) uses anchors to map input text to Wikipedia articles and then uses the inlinks and outlinks of these articles to derive the similarity between words [26]. Because the meaning representation of each word is defined by the links between its associated concept and other concepts, WLM matches our definition of a concept-concept level model.
Consider two articles, Football and Sport. For a particular link type, say inlinks, and these two articles, we can place all other articles in Wikipedia into one of four categories as shown in Table 2. The frequencies in Table 2 are hypothetical, but they serve to illustrate a common structure. First, the number of links shared by two articles is likely to be relatively small compared to the number of links they possess individually. Secondly, the number of links that neither has is likely to be very large and relatively close to the total number of articles. Intuitively, two articles that share inlinks and outlinks are likely to be similar; however, to the extent that some links may be common across many articles, they should be weighted less. This intuition is captured in the WLM outlink metric, which weights each outlink o by log(|A|/|O|), the log of the number of total articles in Wikipedia |A| divided by the number of articles that link to that page |O|. Weighted outlink vectors are constructed based on the union of outlinks for two articles, and the outlink similarity is the cosine between these two vectors. The inlink metric is modeled after Normalized Google Distance [36] and measures the extent to which the inlinks X of article x intersect the inlinks Y of article y. If the intersection is inclusive, X = Y, the metric is zero:

    d(x, y) = (log(max(|X|, |Y|)) - log(|X ∩ Y|)) / (log(|A|) - log(min(|X|, |Y|)))

Inlink and outlink metrics are averaged to produce a composite score. Since each anchor defines a set of possible articles, the computations above produce a list of scored pairs of articles for a given pair of anchors. For example, the anchor bill links to bill-law and bill-beak, and the anchor board links to board-directors and board-game, leading to four similarity scores for each possible pairing. WLM selects a particular pair by applying the following heuristics. First, only articles that receive at least 1% of the anchor's links are considered. Secondly, WLM accumulates the most related pairs (within 40% of the maximum related pair) and selects from this list the most related pair. It is not clear from the discussion in Milne & Witten [26] whether this efficiency heuristic differs from simply selecting the most probable pair except in total search time.
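The inlink distance described above can be sketched as follows. The link sets and the article count are invented for illustration, and details of the full WLM implementation (the outlink cosine, the inlink/outlink averaging, and the anchor-selection heuristics) are not modeled.

```python
import math

# Sketch of the WLM inlink metric, modeled after Normalized Google
# Distance; the link sets and article count below are invented.
TOTAL_ARTICLES = 1000

def inlink_distance(X, Y, total=TOTAL_ARTICLES):
    # Distance is 0 when the inlink sets coincide and grows as they diverge.
    X, Y = set(X), set(Y)
    inter = len(X & Y)
    if inter == 0:
        return float("inf")  # no shared inlinks: maximally distant
    num = math.log(max(len(X), len(Y))) - math.log(inter)
    den = math.log(total) - math.log(min(len(X), len(Y)))
    return num / den

football = {"Goal", "Stadium", "FIFA", "Ball"}
sport = {"Goal", "Stadium", "Olympics", "Ball", "Race"}

print(inlink_distance(football, sport))     # small: heavily overlapping inlinks
print(inlink_distance(football, football))  # identical sets -> 0.0
```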

W3C3: Combined Model
In this section we present our combined model using implementations of the models described above. We call this model W3C3 because it combines information at the word-word, word-concept, and concept-concept levels. For each model except COALS, reference implementations were chosen that are freely available on the web.
To implement Wikipedia Miner's WLM, we downloaded version 1.1 from SourceForge [37] and an XML dump of Wikipedia from October 2010 [38]. ESA does not have a reference implementation provided by its creators. However, Gabrilovich recommends another implementation with specific settings to reproduce his results [39]. Following these instructions, we installed a specific build of Wikiprep-ESA [40] and used a preprocessed Wikipedia dump made available by Gabrilovich [41]. We created our own implementation of COALS and built a COALS-SVD-500 matrix using the same XML dump of Wikipedia from October 2010 as was used for WLM above.
One intuition that motivates combining all three techniques into a single model is that each represents a different kind of meaning at a different level: word-word, word-concept, and concept-concept. This intuition is the basis for our unsupervised W3C3 model, which simply averages the relatedness scores given by the three techniques. Two relevant properties of the W3C3 model are worth noting. First, it has not been trained on any part of the data. Secondly, it has no parameters for combining the three constituent models; rather, their three outputs are simply averaged to yield a single output score.
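Since the combination has no parameters, it reduces to a one-line average; the constituent scores below are hypothetical.

```python
# Minimal sketch of the W3C3 combination: the three constituent scores
# are simply averaged, with no training and no weighting parameters.
def w3c3_score(coals, esa, wlm):
    """Average relatedness over word-word, word-concept, concept-concept."""
    return (coals + esa + wlm) / 3.0

# Hypothetical constituent scores for one word pair.
print(w3c3_score(0.62, 0.48, 0.55))  # about 0.55
```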

Study 1: WordSimilarity-353
The WordSimilarity-353 [27,42] collection is a standard dataset widely used in semantic relatedness research [24-26,43-47]. It was developed as a means of assessing similarity metrics by comparing their output to human ratings. WordSimilarity-353 contains 353 pairs of nouns and their corresponding judgments of semantic association. The nouns range in frequency from low (Arafat) to high (love) and from concrete (car) to abstract (psychology). Judgments are made on a scale of 0 to 10, with 0 representing no relatedness and 10 representing maximal relatedness. The data is divided into two sets. The first set contains ratings by thirteen judges on 153 word pairs. The second set consists of ratings by sixteen judges on 200 word pairs. We assessed inter-rater reliability for both sets using Cronbach's α and found a high level of agreement, α = 0.97. The following analyses present results on all 353 pairs using the average rating across judges on each pair. Previous work on this dataset has reported results in terms of a non-parametric Spearman correlation; this metric of performance is also adopted here.
The highest previously reported correlation with the WordSimilarity-353 data set comes from ESA, r(351) = 0.75. It should be noted that an ESA model using additional link information has been attempted, but it yielded no improvements over this basic model [25]. The ESA implementation we used also correlated with the human ratings, r(351) = 0.67, p < 0.001, which is significantly lower than the originally reported correlation, p = 0.02. A plausible explanation of this difference is that some tweaks crucial for high performance, such as inverted index pruning [25], which are not implemented in the reference implementation we used, are necessary to achieve the originally reported correlation.
Milne & Witten [26] found that WLM was highly correlated with human judgments in WordSimilarity-353, r(351) = 0.69. The WLM reference implementation we used also correlated with human ratings in the data set, r(351) = 0.66, p < 0.001, which is lower than the originally reported correlation. The difference in correlations was not significant, z = −0.73, p = 0.23. The reason for this discrepancy is unclear, but it may be attributed to the differences in versions of Wikipedia used here and in the initial reported research.
The W3C3 model has state-of-the-art correlation with the WordSimilarity-353 data set, r(351) = 0.78, p < 0.001. Correlations for all models are presented in Table 3. The W3C3 model's correlation is significantly higher than all correlations in the replicated results, p ≤ 0.03, but not significantly higher than the best previously published ESA result, p = 0.17. Previous work has found that distributional models and graphical (WordNet-based) models have differing performance on WordSimilarity-353 depending on whether the word pairs in question have a similarity relationship or a more general relatedness relationship [43]. To test this hypothesis with the COALS, ESA, WLM, and W3C3 models, we used the same partitioning of the dataset into similarity and relatedness pairs. The similar pairs are synonyms, antonyms, identical, or hyponym-hypernym, and the related pairs are meronym-holonym or other relations. Inter-rater agreement in the coding of the pairs was high, Cohen's kappa = 0.77. The similarity and relatedness subsets contained the similar and related pairs described above and shared the same unrelated pairs, yielding 203 pairs for similarity and 252 pairs for relatedness. Correlations for all models on these subsets are presented in Table 4. The difference in correlations between the W3C3 model and the previous best Agirre model [43] is significant for both similarity, p = 0.0465, and relatedness, p = 0.02. For both similarity and relatedness subsets, the W3C3 model performed significantly better than its constituent models, and each model performed significantly better on the similarity set than on the relatedness set, p < 0.05, except for ESA, p = 0.06. However, these are rather coarse sets: As mentioned above, the similarity set is an aggregation of common semantic relationships. In order to better understand the relative performance of each model on these subtypes, we used the labeled semantic categories for each WordSimilarity-353 pair provided in the similarity/relatedness subsets. These are antonym, hypernym (first word is hypernym of second word or vice versa), identical, part-of (first word is a part of the second or vice versa), siblings (share a parent category, e.g., dalmatian and collie are children of dog), synonyms, and topical (some relationship other than the previous relationships, e.g., ladder and lightbulb). Grouping pairs by semantic category, we calculated the average distance between the predicted rank and the human rank for each model. The results are shown in Table 5. Since the ideal distance to the human ranking is zero, lower scores are better. The lowest score in each row is in boldface, and the second lowest score is italicized.
The most striking pattern in Table 5 is that two-thirds of the best scores per category belong to the W3C3 model. Moreover, for every category save one, the W3C3 model has either the best score or the second best score. Thus breaking down the WordSimilarity-353 pairs by semantic categories produces the same pattern of results seen in Tables 3 and 4: The three constituent models are providing different kinds of information, and averaging their outputs creates a more human-like measure of semantic comparison than any of them individually. A linear regression was conducted to explore this possibility. The scores given by COALS, ESA, WLM, and human judges were converted to ranks, and then a linear regression on the ranks was performed [49], using COALS, ESA, and WLM ranks to predict the human judgment ranks. The results of the linear regression are presented in Table 6. Tolerance analyses were conducted to test for multicollinearity of COALS, ESA, and WLM by regressing each on the other two. The obtained tolerances, all between 0.49 and 0.60, suggest that the three models are not collinear. The explanation that each model is contributing substantially and equitably to the prediction is further supported by the similar magnitudes of β in Table 6. To address the question of the maximum potential of the COALS, ESA, WLM, and W3C3 models for correlation with human ratings, an oracle analysis was undertaken [43]. The oracle first converts the output of each model to ranks. Then for each word pair, the oracle selects the output of the model whose rank most closely matches the rank of the human rating. This procedure generates the best possible correlation with the human ratings, based on the assumption that the oracle chooses the closest model output every time. Using this methodology with all four models, the oracle correlation is r(351) = 0.93. Using only the three constituent models, the oracle correlation is r(351) = 0.92, which is equivalent to the previous best reported oracle correlation, which used roughly an order of magnitude more data than the present study [43]. So the maximum potential correlation of the three constituent models matches the previous best result, with a minor improvement due to including the W3C3 model in the oracle.
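The oracle procedure can be sketched as follows: convert every model's scores to ranks and, for each item, keep the model rank closest to the human rank. All scores here are invented, and this simple ranking scheme breaks ties by position rather than averaging, which is adequate for a sketch.

```python
# Sketch of the oracle analysis: each model's scores are converted to
# ranks, and for every item the oracle keeps the model rank closest to
# the human rank. All scores below are invented for illustration.

def to_ranks(scores):
    # Rank 1 = highest score; ties broken by position (adequate for a sketch).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def oracle_ranks(human_scores, model_scores_list):
    human = to_ranks(human_scores)
    model_ranks = [to_ranks(m) for m in model_scores_list]
    return [
        min((ranks[i] for ranks in model_ranks), key=lambda r: abs(r - human[i]))
        for i in range(len(human))
    ]

human = [9.1, 7.4, 3.2, 1.0]
models = [[0.9, 0.5, 0.6, 0.1],   # hypothetical model-1 scores
          [0.7, 0.8, 0.2, 0.3]]   # hypothetical model-2 scores

print(oracle_ranks(human, models))  # -> [1, 3, 2, 4]
```

The oracle ranks would then be correlated with the human ranks, yielding the best correlation attainable by choosing among the models on a per-pair basis.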
The preceding analyses provide fairly strong evidence for the reasons behind the W3C3 model's efficacy. The W3C3 model has significantly higher correlations than the constituent models on the entire dataset, this improvement is preserved across most semantic categories, the regression shows equal contribution by the constituent models, and the oracle analysis shows that the upside potential of these models is consistent with, or perhaps slightly better than, the previous best. All of these results support the conclusion that each constituent model's semantic level, i.e., word-word, word-concept, and concept-concept, contributes positively to increasing the correlation with human semantic comparisons.
Our final analysis tests whether an artifact of the similarity scores might be responsible for this difference. It has been previously noted that models can perform better on WordSimilarity-353 when word pairs that lack a semantic representation are removed from the model [43]. Because of missing representations, these defective word pairs always have a zero similarity score (the default score). By averaging the three constituent model scores, the W3C3 model mitigates this deficiency: At least one of the models is likely to have a representation or otherwise produce a non-zero score. For example, due to missing or extremely sparse semantic representations for WordSimilarity-353 words, WLM yielded 73 zero relatedness scores, ESA yielded 81, and COALS yielded 0. Thus one explanation for the improved correlation of the W3C3 model over the individual models is that the W3C3 model minimizes the effect of missing/sparse word representations.
To explore the effect of zero relatedness scores on the performance of the individual and W3C3 models, we created a subset of WordSimilarity-353 for which none of the three models had a zero relatedness score, consisting of 226 word pairs. For the three individual models, higher correlations on this subset than on the whole set would support the missing/sparse representation explanation. Similarly, for the W3C3 model, a correlation similar to the other three models (with zeros removed) would also support the missing/sparse representation explanation. Correlations for all models on this subset are presented in Table 7. The pattern of correlations in Table 7 does not support the missing/sparse representation explanation. First, each model's correlation is lower than its counterpart on the whole data set given in Table 3, indicating that eliminating pairs with zero scores does not improve the performance of the individual models. Secondly, the W3C3 model in Table 7 has a higher correlation than any of the individual models by 0.08 or more, which is similar to the pattern on the whole dataset in Table 3, though all correlations were lower on this subset of data. Thus, the W3C3 model yields an improvement in correlation regardless of whether the words with missing/sparse representations are removed.

Study 2: Semantic Feature Production Norms
Although the WordSimilarity-353 data set has been widely used, it does have one notable weakness as a measure of human semantic behavior, namely its size. Given its limited size, it is possible that the results from Study 1 might not generalize. In order to assess the ability of the W3C3 model to generalize to new data sets, we applied the constituent models and W3C3 model from Study 1 to a large set of semantic feature production norms [29]. For the sake of exposition we will refer to these norms using the first initials of the last names of the authors presenting the norms, MCSM.
The MCSM norms consist of 541 nouns, or concepts. Participants were asked to list features for a concept, such as the physical, functional, or encyclopedic aspects of the concept. The generated features were regularized, e.g., usually has a tail and tail would be coded as has tail. After regularization, 2,526 features remained. Thus the data can be represented as a 541 × 2,526 matrix whose cell values v_ij are the number of production occurrences for a feature j given a concept i. From this matrix, a 541 × 541 matrix was created in which the value at each cell is the cosine of the two 2,526-dimension row vectors associated with the respective concepts. The 541 × 541 matrix has 95,437 non-zero pairs representing similarities between concepts. Although the collection methodology for the feature norms is an associative production task, this 541 × 541 matrix represents the feature overlap between concepts, which is more comparative in nature.
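The construction of the concept-by-concept cosine matrix can be sketched with a toy example. A tiny 3-concept, 4-feature count matrix stands in for the real 541 × 2,526 MCSM data; the values are invented for illustration.

```python
import numpy as np

# Toy concept-by-feature production counts (rows: concepts, cols: features),
# standing in for the 541 x 2,526 MCSM matrix.
counts = np.array([
    [2.0, 0.0, 1.0, 0.0],   # concept A
    [2.0, 0.0, 1.0, 0.0],   # concept B: identical feature profile to A
    [0.0, 3.0, 0.0, 1.0],   # concept C: disjoint feature profile
])

# Normalize each row to unit length, then the matrix product gives the
# cosine of every pair of row vectors, as in the 541 x 541 MCSM matrix.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit = counts / norms
sim = unit @ unit.T   # sim[i, j] is the cosine of concepts i and j
```

Identical feature profiles yield a cosine of 1, disjoint profiles a cosine of 0, mirroring the feature-overlap interpretation of the matrix.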
Table 8 presents the Pearson correlations between the similarities from the 541 × 541 similarity matrix and the predictions from both the constituent models and the W3C3 model. The overall pattern of correlations in Table 8 has a striking similarity to those of Table 3. First, the correlations of ESA and WLM are almost identical in both cases. Secondly, the correlation for COALS is significantly greater than that of both ESA and WLM. And finally, the correlation of the W3C3 model is significantly greater than all of the constituent models. That the pattern of correlations is the same in both cases, especially when the MCSM set is so large, suggests that the properties of the models observed in Study 1 generalize in a systematic way. In order to assess the relative contributions of each constituent model to MCSM performance, a linear regression was conducted. The regression used COALS, ESA, and WLM raw scores to predict the MCSM similarity scores. The results of the linear regression are presented in Table 9. Tolerance analyses were conducted to test for multicollinearity of COALS, ESA, and WLM by regressing each on the other two. The obtained tolerances, all between 0.55 and 0.68, suggest that the three models are not collinear. In contrast to the WordSimilarity-353 regression presented in Table 6, the constituent models are not equally weighted. The magnitudes of β in Table 9 show that COALS, ESA, and WLM may be rank ordered in terms of their contribution to the overall model. However, the difference between the correlation produced by the W3C3 model in Table 8 and the correlation from the regression equation in Table 9 is extremely small (0.01), suggesting that equal weighting of the three constituent models is fairly robust.
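The tolerance analysis used here and in later studies can be sketched as follows: each predictor is regressed on the other two, and tolerance is 1 − R² of that regression (low tolerance signals collinearity). This is an illustrative reimplementation with synthetic scores, not the authors' analysis code.

```python
import numpy as np

# Synthetic model scores standing in for COALS, ESA, and WLM raw scores.
rng = np.random.default_rng(0)
coals = rng.normal(size=200)
esa = 0.5 * coals + rng.normal(size=200)   # moderately related to COALS
wlm = rng.normal(size=200)

def tolerance(y, X):
    """Tolerance = 1 - R^2 from an OLS regression of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 - r2

# Regress each predictor on the other two, as in the reported analyses.
tol_coals = tolerance(coals, np.column_stack([esa, wlm]))
tol_esa = tolerance(esa, np.column_stack([coals, wlm]))
tol_wlm = tolerance(wlm, np.column_stack([coals, esa]))
```

Tolerances near 1 indicate that a predictor is largely independent of the others; values approaching 0 would indicate multicollinearity.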

Study 3: Word Association Norms
Word association norms represent a qualitatively different type of task than WordSimilarity-353 or MCSM. Typically in word association tasks, a human participant is presented with a word and asked to produce the first word that comes to mind. This production task is quite different from the raw data of MCSM, where subjects have the task constraints of listing physical, functional, or encyclopedic features and are asked to list 10 such features for a given concept. Transforming the MCSM data into a similarity matrix in Study 2 further removes the data from a stimulus-response production task and more squarely situates it with a semantic comparison task.
The relationship between semantic and associative relations has been the subject of recent discussion in the literature [50][51][52][53][54]. In particular, some have argued that framing word association and semantic relatedness as separate and distinct is a false dichotomy [53], whereas others have argued that word association and semantic feature overlap measure different kinds of information [51]. Study 1 examined semantic relatedness quite generally; Study 2 examined similarity based on semantic feature overlap. The present study examines word association as a stimulus-response production task without the task constraints of explicitly comparing two concepts.
Perhaps the most widely known word association norms have been collected by Nelson and colleagues over several decades [28]. The data used in the present study consist of 5,019 stimulus words and their associated 72,176 responses. Each stimulus-response pair is annotated with the proportion of participants who produced it, which we refer to as forward associative strength. We refer to these 72,176 triples (stimulus word/response word/forward associative strength) as NMS after the last names of the authors.
Previous work used a type of distributional model called a topic model (also known as latent Dirichlet allocation), trained on the TASA corpus [13], to evaluate the NMS dataset on two tasks [55]. The first task examined the central tendency of the model predictions via the median rank of the first five predicted responses. This task first ranks the human responses for a stimulus word by their associated production probabilities and then compares these to the model's predicted ranking. If the first five predicted responses for all stimulus words are some ordering of the ranks 1-5, then the median rank will be 3. The second task is simply the probability that the model's first response matches the human first response. The results from this previous work as well as the results from our three constituent and W3C3 models on this task are presented in Table 10.
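The median-rank evaluation can be sketched concretely. For each stimulus, the model's top five predicted responses are looked up in the human response list (ranked by forward associative strength), and the median of those ranks is taken; perfect top-five agreement yields a median of 3. The stimulus, responses, and ranks below are hypothetical.

```python
import statistics

# Hypothetical human norms: response word -> rank by forward associative
# strength for the stimulus "sleep".
human_ranks = {
    "sleep": {"bed": 1, "dream": 2, "rest": 3, "nap": 4, "night": 5, "tired": 6},
}

# Hypothetical model output: top five predicted responses for the stimulus.
model_top5 = {"sleep": ["dream", "bed", "night", "tired", "rest"]}

# Look up the human rank of each predicted response and take the median.
ranks = [human_ranks["sleep"][r] for r in model_top5["sleep"]]
median_rank = statistics.median(ranks)
```

Here the model's five predictions receive human ranks [2, 1, 5, 6, 3], whose median is 3, the value that perfect top-five agreement would also produce.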
Table 10. Median rank of each model's first five associates compared to human associates and proportion of model first associates that are the human first associate (N = 5,019).

As in Studies 1 and 2, the W3C3 model has higher agreement with the human data than its three constituent models and the previously reported models. It should be noted that the previous results reported in Table 10 are based on models that used a corpus two orders of magnitude smaller than Wikipedia and that were tested against only about 90% of the NMS forward associative strength data. Thus the present study used more training data but was also tested against 100% of the NMS forward strength data.
The results from Table 10 consider the NMS dataset as a collection of lists: Each stimulus word is matched to a list of response words ranked by forward associative strength. However, it is also informative to consider each triple (stimulus word/response word/forward associative strength) individually, as was done in Study 2. Accordingly, Table 11 presents the correlations of model scores for each pair with the associated forward associative strength. The W3C3 model has a significantly higher correlation with the human data than its three constituent models, p < 0.001. However, these correlations are less than half as high for the NMS dataset as they were for the MCSM dataset presented in Table 8. This difference might imply that the underlying assumptions of these models are not well aligned with the constraints of the word association task.
These findings are consistent with previous work suggesting that measures of word association (e.g., NMS), semantic feature overlap (e.g., MCSM), and text-based distributional similarity are in fact measuring something different in each case. Maki and Buchanan [51] conducted a factor analysis in which these three types of measure loaded onto separate factors that were coherently either associative, semantic, or distributional in nature. One interpretation of these findings is that the observed separation is due to three separate cognitive representations aligned with these three measures. Alternatively, it could be the case that task-adaptive processing is acting on the same representation yet manifesting three different measures (cf. [52,53]). In either case, the work of Maki and Buchanan [51] suggests that modeling both semantic relatedness and word association data with a single representation and procedure is unlikely to be successful. The low correlations in Table 11 lend additional evidence to this claim.
As in the previous study, a linear regression was conducted to assess the relative contributions of each constituent model to NMS performance. The regression used COALS, ESA, and WLM raw scores to predict the NMS forward associative strength. The results of the linear regression are presented in Table 12. Tolerance analyses were conducted to test for multicollinearity of COALS, ESA, and WLM by regressing each on the other two. The obtained tolerances, all between 0.78 and 0.82, suggest that the three models are not collinear. As was previously found in Study 2, the constituent models are not equally weighted. The magnitudes of β in Table 12 show that COALS, WLM, and ESA may be rank ordered in terms of their contribution to the overall model. Thus compared to Table 9, the relative order of WLM and ESA is reversed. Interestingly, the correlation produced by the W3C3 model and the correlation from the regression equation in Table 12 are identical, again supporting the robustness of equally weighting the three constituent models.

Study 4: False Memory
Perhaps the most striking evidence of semantic relatedness' influence on cognitive processing can be found in the Deese-Roediger-McDermott (DRM) paradigm [56]. In this paradigm, participants are presented with a list of words highly associated with a target word in previous word association norms experiments. For example, a list containing bed, rest, and dream will likely lead to false recall of sleep. Participants in the DRM paradigm are highly likely to recall the non-presented target word, in some cases even when they are warned about such false memory illusions [57]. These effects led Gallo et al. [57] to conclude that the influence of semantic relatedness on retrieval is intrinsic and beyond the participant's conscious control. Because word association norms are asymmetric, e.g., bed may evoke sleep with high probability but not the reverse, Roediger et al. [30] conducted a multiple regression analysis to determine whether forward association strength (target word evoking a list member), backward association strength (list member evoking a target word), or other features were most predictive of false memory. That study found that backward association strength was strongly correlated with false recall, r(53) = 0.73. It is generally believed that properties of the word lists themselves are only part of the explanation: The major theories of false memory, activation/monitoring [30] and fuzzy trace theory [50], both allow for cognitive processes of monitoring or strategies that intervene in the process of rejecting false memories.
Several researchers have proposed computational models to implement gist-type semantic representations that are consistent with, but not tightly integrated with, fuzzy-trace theory [13,55]. In general, any distributional model, by pooling over many contexts, creates a gist-type representation. LSA has been proposed to create a gist-type semantic representation [13]. The intuition is that LSA abstracts meaning across many different contexts and averages them together. Although a single word may have multiple senses, e.g., river bank and bank deposit, LSA has one vector representation for each word, pooled across all the documents in which that word occurs. More recently, latent Dirichlet allocation (LDA, also known as a topic model) has been proposed as an alternative to LSA for representing gist [55]. One notable advantage of LDA is that the conditional probabilities used to compare two words are inherently asymmetric and therefore consistent with the asymmetries in human similarity judgments [58]. Another advantage of LDA is that words have probabilistic representations over latent variables corresponding roughly to word senses. Thus the two senses for bank above could be preserved and represented in probability distributions over two distinct latent variables.
In this study we investigated the relationship between the W3C3 model and constituent models on backward associative strength in the DRM paradigm. Following previous research, we chose to focus on backward associative strength because it is highly correlated with false recall and may be considered independently of the monitoring processes that mediate false recall. In order to investigate the W3C3 model in this context, several of the constituent models had to be amended to produce gist representations for DRM lists. To create COALS gist vectors, the raw unnormalized vectors for each word were summed, then normalized using correlation, and then projected into a 500-dimensional SVD solution as described in Section 2.1. This operation ensured that each element of the gist vector was a correlation, just as is the case in a normal COALS vector. ESA naturally creates gist vectors from multi-word strings, so no additional algorithm needed to be developed. For WLM, we created a synthetic article using the inlinks/outlinks of the most likely sense for each word. The most likely sense for each word was determined by considering only the previous word in the DRM list: it is the sense with the highest similarity to the previous word. For example, if the previous word is baseball and the current word is bat, then the club sense is more similar than the flying mammal sense according to WLM. The inlinks/outlinks for these most likely senses were then aggregated to create a synthetic gist article with the union of inlinks and union of outlinks. This gist article was then compared to the non-presented target article in the standard way. As before, the W3C3 model was an unweighted average of these three scores. We applied the above methods for calculating gist to the standard set of 55 DRM lists [30]. For each list, we computed the gist representation and then compared it to the representation for the non-presented target word, yielding a similarity score. Table 13 presents the correlation between the similarity scores for each model and backward associative strength, including correlations previously reported for the LSA and LDA models described above [55].
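The gist-style comparison can be sketched in vector terms: sum the list-word vectors into a single gist vector and take its cosine with the non-presented target's vector. This toy version omits the correlation renormalization and 500-dimensional SVD projection used for the actual COALS gist vectors, and the vectors themselves are synthetic.

```python
import numpy as np

# Synthetic low-dimensional word vectors for a DRM list and its target.
vectors = {
    "bed":   np.array([0.9, 0.1, 0.0]),
    "rest":  np.array([0.8, 0.2, 0.1]),
    "dream": np.array([0.7, 0.0, 0.2]),
    "sleep": np.array([1.0, 0.1, 0.1]),   # non-presented target word
}

# Gist vector: sum of the list-word vectors (COALS-style, before the
# renormalization and SVD projection described in the text).
gist = vectors["bed"] + vectors["rest"] + vectors["dream"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of the list's gist to the non-presented target.
similarity = cosine(gist, vectors["sleep"])
```

Because the list words share the target's dominant context, the pooled gist vector lands close to the target vector, which is the intuition behind gist-based accounts of false memory.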
Although none of the comparisons in Table 13 are statistically significant, the W3C3 model has a higher correlation with the human data, as reflected by backward associative strength, than the constituent models. Since previous results with LDA or LSA used a different corpus (TASA), a direct comparison is not warranted. Nevertheless, the correlation of the W3C3 model is not as strong as has been previously reported for LDA. This result was surprising, particularly with regard to the low correlation for WLM. We undertook a qualitative analysis to determine if there is a better correspondence between Wikipedia's link structure and the associative strength behind the DRM paradigm than is reflected by the WLM metric. Table 14 provides some suggestion that the raw link structure of Wikipedia might be more strongly related to backward associative strength than the gist-like WLM metric reveals. Each word in Table 14 is from the DRM list for sleep [30]. As shown in the table, most words (11/15) have sleep as an outlink or are used equivalently to mean sleep. In other words, this pattern of links is consistent with the backward association strength found in [30]. In order to more rigorously assess the possibility that raw Wikipedia link structure might better reflect backward associative strength, we recomputed the correlations from Table 13 with separate measures for Wikipedia inlinks and outlinks. Recall from Section 2.3 that WLM has two separate measures for inlinks and outlinks, which are averaged together to compute the WLM metric. Table 15 presents correlations of these inlink/outlink measures separately, with the standard WLM measure, and with the corresponding W3C3 models.
By treating inlinks and outlinks separately, the correlation to backward associative strength increases markedly. What is perhaps most interesting about the pattern of correlations in Table 15 is that the traditional WLM metric performs worse than the two individual metrics, as though averaging somehow cancels them out. The implication is that list words and non-presented target words may share inlinks (the same pages link to them), and they may share outlinks (they link to the same pages), but they do not tend to share both at the same time. Thus there is an implicit asymmetry to the associative relationship that is lost if the gist-like representation considers both inlinks and outlinks. This finding is consistent with asymmetries in human similarity judgments [58] and may also explain why LDA performs so well at this task: It, unlike most distributional methods, is inherently asymmetric in the way it calculates gist. We conducted a linear regression on ranks to evaluate the relative contributions of each constituent model. The regression used COALS, ESA, and WLM outlink scores converted to ranks to predict the DRM backward associative strength. In this first model, COALS was not a significant predictor and so was removed. The results of the linear regression are presented in Table 16. Tolerance analyses were conducted to test for multicollinearity of ESA and WLM outlink by regressing each on the other. The tolerances were both 0.98, strongly indicating a lack of multicollinearity. Consistent with previous regressions, the fit of the model is very close, in this case identical, to the correlation of the W3C3 inlink/outlink models given in Table 15. Therefore it appears that even though COALS is not a significant predictor in this task, it does not detract from the overall performance of the W3C3 model. Although the results in Table 15 are informative, they do not completely capture the qualitative data and intuition behind Table 14, the outlink structure for sleep. What that table describes is much further away from a gist-like representation and much closer to an activation-based representation. Assuming that these articles are activated by list words, one can imagine activation flowing from them, through their outlinks and redirect links, to the article for sleep. Put another way, instead of calculating the difference between entire vectors for the list words and the target word, only the vector element of each list word that corresponds to the target word is considered. We performed this analysis for Wikipedia outlinks on the DRM lists, using WLM's model of outlink structure. For each word on the list, we collected all outlinks from all possible senses of that word and counted those pointing to the target article. Since the list words converge on a particular sense (the target word's sense), taking the most frequent linked-to sense provided a strong predictor of backward associative strength, r(53) = 0.53.
We applied this same strategy to COALS and ESA. For COALS, we first identified the dimension associated with the non-presented target word. Then for each word in the list, we retrieved the associated vector and added up the value at that target dimension. No normalization or SVD projection was used. Likewise, for ESA we found the target dimension and summed the word list vectors on that dimension. In the case of ESA, the standard normalization was used. The obtained correlations using these more activation-aligned models are presented in Table 17, along with the correlation for the WLM outlink-based activation measure above. As shown in Table 17, framing the model more in terms of activation rather than gist improves correlations considerably for the ESA and WLM Activation models, such that the WLM Activation model has a non-significantly higher correlation with backward associative strength than the gist-type LDA model in Table 13 and the W3C3 model presented in Table 15. For WLM this is perhaps not surprising because of how the outlinks on Wikipedia pages are generated in the first place: by people. Wikipedia's guidelines on linking center on the likelihood that a reader will want to read the linked article [59]. It is up to the authors of Wikipedia pages to consider the association between one page's topic and another before linking them. It seems only natural that some level of backward association strength would manifest in this process. By the same token, one might argue that WLM Activation only provides a circular definition of backward associative strength, since similar word associative processes are at work when people link to Wikipedia pages as are in word association tasks. While this is likely true, it also is evidence that the internal cognitive-linguistic processes involved in word association and DRM are externally represented in Wikipedia's link structure.
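The contrast between the gist-style and activation-style operations can be sketched directly: the activation measure ignores the rest of the vector and sums only the element at the target word's dimension across the list words. Dimensions and values below are hypothetical.

```python
import numpy as np

# Hypothetical vector space whose dimensions correspond to target words
# (as in the COALS/ESA activation analysis in the text).
dims = ["sleep", "bed", "dream"]
target_dim = dims.index("sleep")   # dimension of the non-presented target

# One synthetic vector per DRM list word (rows) over those dimensions.
list_word_vectors = np.array([
    [0.6, 0.9, 0.1],   # "bed"
    [0.5, 0.2, 0.0],   # "rest"
    [0.7, 0.1, 0.8],   # "dream"
])

# Activation-style measure: sum only the target-dimension element of
# each list-word vector; the rest of each vector is ignored.
activation = list_word_vectors[:, target_dim].sum()
```

Because each list word contributes its own loading on the target dimension, activation accumulates over the whole list, which is why a full DRM list, rather than a single stimulus word, is needed to lift the target above the noise of the other dimensions.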
As in previous work, these results seem to give some level of support to both activation/monitoring theory and fuzzy-trace theory. Activation/monitoring theory explains false memory largely in terms of spreading activation in an associative network between words [56], which is consistent with the large predictive role of backward associative strength. Fuzzy-trace theory explains false memory in terms of gist, and gist traces, which are fuzzy semantic representations hypothesized to be more durable than verbatim traces, consistent with the finding that recall of the non-presented target words is greater than recall of list words and that the recall for list words decays more rapidly than the recall of target words [60]. The results in this section suggest that these two accounts might be different perspectives on the same underlying mental representation. Since a vector may be compared to another vector using all elements and therefore all contexts, a vector can be used to represent gist. Likewise, since the vector elements can be treated individually and the rest of the vector ignored, a vector element can be treated as an associative strength in a given dimension. However, an important caveat is that in our models, an entire list of words is required to raise the activation level of the target dimension above the noise of the other dimensions. Thus this approach does not work for simple stimulus-response word association like the NMS task in Study 3.

Discussion
We believe that the results of Studies 1 through 4 substantiate the claim that humans and Wikipedia are part of the same cognitive-linguistic ecosystem. The literature described in Section 1 demonstrates how our cognitive-linguistic environment affects our language structure and categorization. If Wikipedia's structure is an externalization of internal cognitive and linguistic processes, then there is strong reason to believe in the cognitive-linguistic influence of Wikipedia's past authors on future readers. In other words, Wikipedia would appropriately be described by the process of niche construction.
There are good reasons for taking this niche-construction concept seriously. It is perhaps trivially true that reading a book or similar work will have some effect on an individual, e.g., through learning. However, the argument being made here is stronger: that the influence of Wikipedia derives from both its language structure and its network of concepts/categories. Analogous to developmental studies [11,12], one prediction would be that reading Wikipedia would affect a participant's language structure and category structure. By creating a computational cognitive model based on Wikipedia and applying it to multiple semantic tasks, we indirectly tested this hypothesis and found support for it.
First, the unsupervised W3C3 model produced state of the art correlations with human data in Studies 1 to 3. We claim the model is unsupervised because in all cases the three constituent predictors were evenly weighted by taking their average, or equivalently, not weighted at all. Studies 1 and 2 are best characterized as semantic comparison tasks. Study 1's comparison task included a mixture of semantic relations, e.g., synonym, antonym, part-of, whereas Study 2's task involved overlap of semantic features. The high correlations in these studies, between 0.67 and 0.78, indicate that the information necessary to make semantic comparisons is well represented in the structure of Wikipedia.
Second, the unsupervised W3C3 model had a higher correlation than any of its constituent models in Studies 1 to 4. Regressions conducted in these studies indicate that each constituent model explains a unique portion of the variance in the human data, except COALS in Study 4, and that while their relative weights change slightly for each task, equal weighting is nearly identical to the regression-derived weights in all cases. This finding directly supports our stronger claim: By allowing both the language structure and category structure of Wikipedia to guide the W3C3 model, we achieved a higher correspondence to human semantic behavior than if we had used either separately. Furthermore, we allowed these two dimensions to influence each other by incorporating word-word, word-concept, and concept-concept levels into the model.
Our approach is consistent with a recently proposed new model visualization for language [61]. In this work, traditional box-and-arrow models are characterized as modular, stage based, and rank ordered. In contrast, our W3C3 model does not have autonomous modules, but overlapping ones (word-word, word-concept, and concept-concept). These constituent models operate in parallel, without stages, and have no rank ordering or dominance. We differ from this new model visualization in that although our constituent models exist in multidimensional spaces (vector spaces), they do not occupy a single state space subject to dynamical state space processes.
We propose that by using vector spaces as a conceptualization, we have moved closer to unifying accounts of activation-based models and gist-based models, which are closely aligned with association and comparison-based tasks, respectively. Word association (Study 3) is a quite straightforward case of retrieval in the absence of a larger discourse context: Given a stimulus word, retrieve a response word. In contrast, semantic relatedness (Studies 1 and 2) is much more closely related to a comparison task. Intuitively, it asks the question: Given a pair of words, how does their meaning compare? We argue that distributional models by definition are much more aligned with semantic relatedness than with association. The reason is that semantic relatedness is a holistic judgment that considers many possible contexts. Distributional models are well aligned with holistic judgments because they are defined in terms of many contexts. Word association, on the other hand, can and does operate on a single dimension of a context. For example, in the NMS dataset, tumor has an association with kindergarten cop. This stimulus-response pair is a clear reference to the film Kindergarten Cop, in which Arnold Schwarzenegger says in reference to his headache, "It's not a tumor!" This line in this one film is quite possibly the only way in which the stimulus-response pair tumor-kindergarten cop is associated. This example illustrates that word associations need not be guided by many converging contexts but rather may be solely determined by a single context. Study 4 in particular illustrates that gist and activation accounts are complementary views of the same underlying vector space structure. Both can have the same knowledge representation but differ in the operation performed on that structure.
Gist-like operations are inherently holistic, as in comparison tasks, and use the entire vector representation. When we applied the standard W3C3 model using gist-like measures, the correlation to backward associative strength was 0.34. In contrast, activation-like operations are inherently local, as in word association tasks, and can use only a single element of the vector. Incorporating activation into the gist-like W3C3 model by removing WLM inlink vectors increased the correlation to 0.42. Further creating a completely activation-based measure using single WLM outlink vector elements increased the correlation to 0.53. Since both gist-like and activation-like measures correlated positively with backward associative strength, these results explain why other studies may have found evidence for either a semantic-based or association-based explanation of false memory [30,54]. However, rather than requiring different cognitive representations for word association or semantic comparison, we argue that word association and semantic comparison are more productively viewed as task-driven operations on the same vector-based cognitive representation (cf. [51]). That the same underlying structure can be accessed differently according to different task demands is intuitive and matches observed behavior. Colunga and Smith [10] note that when children are asked to group a carrot, tomato, and rabbit, children will group the rabbit and carrot together. However, if children are told the carrot is a dax and are asked to find another dax, they will choose the tomato. In the same way, comparison tasks evoke category structure and holistic judgments, whereas raw association as evoked by grouping or production tasks may be based on a single strong point of association, e.g., rabbit-carrot.

Conclusions
In summary, the internal cognitive-linguistic processes engaged in constructing Wikipedia have created a cognitive-linguistic environment that can be exploited by computational cognitive models. The crowd-sourcing process of creating, merging, and deleting article pages establishes a common view of shared concepts and topics for discussion. The words used within each page are a collaborative minimal summary of that concept, and the links between pages represent relevant associations between concepts. Wikipedia is perhaps unique in that it provides moderately clean structural relationships in natural language. As a product of the human mind, Wikipedia reflects aspects of human semantic memory in its structure. Our W3C3 model capitalizes on the cognitive-linguistic structure at different resolutions just as theories of memory purport access at different resolutions: COALS represents words in word contexts, ESA represents words in concepts, and WLM represents the links between concepts. The work we have presented in these studies suggests that these three resolutions may contribute to a more complete model of semantic association. Thus in creating Wikipedia to describe the world, we have created a resource that may reveal the subtleties of the human mind.

Table 2. Hypothetical inlink overlap for Football and Sport.

Table 17. Spearman rank correlations with backward associative strength for DRM lists, using an activation-type metric (N = 55).