Article Distribution of “Characteristic” Terms in MEDLINE Literatures

Given the occurrence frequency of any term within any set of articles within MEDLINE, we define “characteristic” terms as words and phrases that occur in that literature more frequently than expected by chance (at p < 0.001 or better). In this report, we studied how the cut-off criterion varied as a function of literature size and term frequency in MEDLINE as a whole, and have compared the distribution of characteristic terms within a number of journal-defined, affiliation-defined and random literatures. We also investigated how the characteristic terms were distributed among MEDLINE titles, abstracts, and last sentence of abstracts, including “regularized” terms that appear both in the title and abstract of the same paper for at least one paper in the literature. For a set of 10 disciplinary journals, the characteristic terms comprised 18% of the total terms on average. Characteristic terms are utilized in several of our web-based services (Anne O’Tate and Arrowsmith), and should be useful for a variety of other information-processing tasks designed to improve text mining in MEDLINE.


Introduction
Terms occurring in a given set of articles (i.e., a literature) more than expected by chance form a literature-specific vocabulary that is similar to the concept of a domain “sublanguage” [1,2,3]. They differ from the keywords extracted from a particular literature [4,5], insofar as keywords occur frequently relative to other terms in that literature, whereas a literature-specific term may occur only a few times (as long as it is more frequent in that literature than in MEDLINE as a whole).
In the present paper, we have computed empirical occurrence frequencies of terms within a number of journal-defined, affiliation-defined and random literatures. We derived statistical criteria for asserting that a single term occurs more often within any given literature than expected by chance, and denote the set of terms that occur more than expected by chance (at p < 0.001) as the “characteristic” terms for that literature. Finally, we have studied their distribution across MEDLINE titles, abstracts, and last sentences of abstracts, including “regularized” characteristic terms that appear both in the title and abstract of the same paper for at least one paper in the literature. These studies set the stage for utilizing characteristic terms as features in text mining models, and in creating thumbnail annotations of the literatures.

Delineating Characteristic Terms
We examined 10 different disciplinary journals published in English, containing abstracts, which comprised 2,000-10,000 papers each (average 5,132 papers), and characterized the distribution of term frequencies within the journal set vs. within MEDLINE as a whole. This distribution was compared with the distribution of terms in an affiliation-defined literature consisting of all articles published in 2000 having the word “California” in the affiliation field, and with a set of 5,000 articles chosen at random within MEDLINE. In each case, the Poisson approximation was used to define the distribution of term occurrence that would be expected by chance.
Figure 1 shows the raw distribution of term occurrence frequencies in the text fields (i.e., title or abstract) for Journal of Biomedical Materials Research compared to a random literature of similar size. Term occurrence frequency was almost exactly linear for the journal when plotted on a log-log scale, indicating that frequencies followed a regular Zipf distribution. The frequencies for the random set followed a parallel curve and were significantly different from those of the journal.
To identify individual terms that were significantly more frequent than expected by chance, we computed p-value scores for each term across 10 disciplinary journals and plotted the average p-value scores in comparison to the California set and to a random set of 5,000 articles (Figure 2). Although terms associated with p-values < 0.05 are nominally significant, that does not take into account the fact that multiple tests are carried out. Figure 2 emphasizes that the difference between journal-defined sets and random sets is most striking at p-values below 0.001. Thus, we have chosen p < 0.001 as our preferred cut-off value, and the set of terms in a literature with p-values below 0.001 will be called the characteristic terms of that literature.
For the set of 10 disciplinary journals, the characteristic terms comprise, on average, 18% of the total terms in that literature. The cut-off criteria for deeming a term “characteristic” vary systematically as functions both of literature size and term frequency within MEDLINE (Figure 3). Among the entire set of characteristic terms for the 10 disciplinary journals, average term occurrence is 23 times within the set of journal articles, which is 88 times more frequent in that literature than in MEDLINE.
To illustrate the types of terms that are characteristic for a specific literature, we show results from International Journal of Food Microbiology. Table 1 shows the 10 characteristic terms with the lowest p-values, 10 having moderate p-values and 10 having p-values near 0.001. Clearly, the top ten terms are closely related to the journal topic (food, listeria, meat, etc.), as are the moderate set (ethanol, shigella, mold, etc.), whereas those at the margin of significance are still relevant but less specific (tbg, gene coding, fever vomiting, sandwich, etc.).
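The dependence of the cut-off on literature size and MEDLINE term frequency (Figure 3a) can be illustrated with a minimal sketch. It assumes the Poisson model described in the Experimental Methods, with expected count λ = F × (literature size) / N. The function name `min_occurrences` and the MEDLINE size N = 20,000,000 are illustrative assumptions, not values taken from this paper.

```python
import math


def min_occurrences(F, lit_size, N, alpha=0.001):
    """Smallest k such that P(X >= k) < alpha for X ~ Poisson(F * lit_size / N).

    F        : number of MEDLINE papers containing the term
    lit_size : number of papers in the literature
    N        : total number of papers in MEDLINE (assumed value in the example)
    Note: exp(-lam) underflows for very large lam (roughly lam > 700),
    so this sketch is only meant for moderate expected counts.
    """
    lam = F * lit_size / N
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = 0.0             # invariant at each loop test: cdf = P(X <= k - 1)
    k = 0
    while 1.0 - cdf >= alpha:
        cdf += pmf
        k += 1
        pmf *= lam / k    # P(X = k) from P(X = k - 1)
    return k


N_ASSUMED = 20_000_000  # hypothetical MEDLINE size, for illustration only
for F in (3, 1_000, 100_000):
    print(F, min_occurrences(F, 5_000, N_ASSUMED))
```

Under this model, a rare term needs only a few occurrences to pass the threshold, whereas a term that is common in MEDLINE needs many more, which is the qualitative pattern the text attributes to Figure 3.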

Distribution of Characteristic Terms within Individual Article Records
Several previous studies have emphasized that specific terms or MeSH concepts may be enriched in particular sections of scientific papers [6,7]. We examined how the set of characteristic terms is distributed among 8 different sections of papers encoded in MEDLINE fields for each of 10 disciplinary journals, the California literature and the random literature: text (comprising title and abstract fields); ti (title); ab (abstract); lastsen (last sentence of the abstract); ti + ab (present both in the title and in the abstract of at least one paper in the literature, though not necessarily the same paper); tiab (in the title and the abstract of the same paper, for at least one paper in the literature); ti + lastsen (in title and last sentence of the abstract, for at least one paper in the literature, though not necessarily the same paper); and tiab + lastsen (in tiab and in last sentence of the abstract for at least one paper in the literature).
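The difference between ti + ab (any papers) and tiab (the same paper) is easy to miss, so here is a small sketch of how the eight section sets could be derived. It assumes each record has already been reduced to three sets of terms; the record layout and the function name `section_sets` are hypothetical.

```python
def section_sets(records):
    """Build the eight section-level term sets for one literature.

    records: iterable of dicts with keys 'ti', 'ab', 'lastsen',
             each holding the set of terms found in that field of one paper.
    """
    ti, ab, lastsen, tiab = set(), set(), set(), set()
    for rec in records:
        ti |= rec["ti"]
        ab |= rec["ab"]
        lastsen |= rec["lastsen"]
        tiab |= rec["ti"] & rec["ab"]  # title and abstract of the SAME paper
    return {
        "text": ti | ab,
        "ti": ti,
        "ab": ab,
        "lastsen": lastsen,
        "ti + ab": ti & ab,            # possibly in different papers
        "tiab": tiab,
        "ti + lastsen": ti & lastsen,  # possibly in different papers
        "tiab + lastsen": tiab & lastsen,
    }


# Toy literature of two papers (made-up data).
papers = [
    {"ti": {"listeria"}, "ab": {"meat", "food"}, "lastsen": {"food"}},
    {"ti": {"meat", "food"}, "ab": {"food", "listeria"}, "lastsen": {"listeria"}},
]
sets = section_sets(papers)
print(sorted(sets["ti + ab"]), sorted(sets["tiab"]))
```

In the toy data, "listeria" and "meat" each appear in a title and in an abstract, but never in both fields of one paper, so they fall in ti + ab but not in tiab.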
One basic measure is the “density”: the percentage of all terms in each section that are characteristic terms. Those sections that are high in density are relatively rich in characteristic terms. Another measure is the “coverage”, defined as the number of characteristic terms found in each section, as a percentage of the total characteristic terms for that journal. Those sections that are high in coverage have the most characteristic terms overall.
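The two measures can be written down directly from these definitions. The sketch below assumes that both are computed over distinct terms (the text does not say whether repeated occurrences are weighted), and the function name is illustrative.

```python
def density_and_coverage(section_terms, characteristic_terms):
    """Return (density, coverage) in percent for one section.

    density  = share of the section's distinct terms that are characteristic
    coverage = share of all characteristic terms that occur in the section
    Assumption: counts are over distinct terms, not term occurrences.
    """
    hits = section_terms & characteristic_terms
    density = 100.0 * len(hits) / len(section_terms) if section_terms else 0.0
    coverage = (
        100.0 * len(hits) / len(characteristic_terms) if characteristic_terms else 0.0
    )
    return density, coverage


# Made-up example: 4 terms in the section, 3 characteristic terms overall.
section = {"food", "listeria", "study", "result"}
characteristic = {"food", "listeria", "meat"}
density, coverage = density_and_coverage(section, characteristic)
print(f"density={density:.1f}%  coverage={coverage:.1f}%")
```

A short section such as the title can have high density while still having low coverage, simply because it contains few terms in total.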
The average density value varied significantly from journal to journal within our set of 10 disciplinary journals (Table 2), presumably due to different journal policies such as limits on abstract length and structured vs. unstructured abstracts. However, after normalizing the density values for each journal, one could readily observe systematic section-related differences in density and coverage that were similar across journals (Figure 4). Title and last sentence of the abstract had significantly higher density than the abstract field, whereas terms that appeared in multiple sections had significantly higher density than those appearing in a single section (Figure 4). Not only were these fields progressively richer in characteristic terms, but the characteristic terms that they contained had higher average frequency of occurrence than the overall set of characteristic terms, and were more specific insofar as they had lower average p-values (Figure 5). Interestingly, the set of “regularized” characteristic terms (tiab) appearing in the title and abstract of the same paper had significantly higher average frequency and lower p-values than terms which appeared in titles and abstracts of different papers (ti + ab) (Figure 5) (each parameter significantly different at p < 0.00001, using paired t-test).
Regularized terms (tiab) that also appeared in the last sentence of at least one paper (tiab + lastsen) had the highest average frequency and lowest average p-value of all (Figure 5), suggesting that this subset of characteristic terms comprises, in some sense, the most important terms associated with the journal. Of the 20 characteristic terms having the lowest p-values overall in one journal, International Journal of Food Microbiology, all were found in the tiab + lastsen set as well. Thus, two independent methods (lowest p-value vs. presence in multiple sections of papers) agree in giving the most “important” characteristic terms. We also considered whether, given two characteristic terms with equal p-values, the term appearing in the greater number of papers in the literature should be considered the more important. For the characteristic terms in International Journal of Food Microbiology, we calculated a “corrected” p-value score by dividing the raw p-value by the fraction of papers in the journal containing the term; however, this correction did not alter the top 20 characteristic terms and had only a very minor effect on their relative ranking (Table 3). Thus, at least for the task of choosing the few most important characteristic terms, it does not seem to be necessary to take this factor into account as a separate variable.
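The “corrected” p-value score described above amounts to a one-line calculation. The following sketch applies it to made-up numbers; the term statistics and the journal size are hypothetical and are not the values reported in Table 3.

```python
def corrected_p_value(raw_p, papers_with_term, papers_in_journal):
    """Divide the raw p-value by the fraction of journal papers containing the term."""
    fraction = papers_with_term / papers_in_journal
    return raw_p / fraction


# Hypothetical statistics: term -> (raw p-value, number of journal papers with the term).
journal_size = 5_000
stats = {
    "food": (1e-50, 2_000),
    "listeria": (1e-40, 900),
    "meat": (1e-30, 700),
}

by_raw = sorted(stats, key=lambda t: stats[t][0])
by_corrected = sorted(
    stats, key=lambda t: corrected_p_value(stats[t][0], stats[t][1], journal_size)
)
print(by_raw)
print(by_corrected)
```

Because the fraction of papers containing a term lies between 0 and 1 and typically spans only a few orders of magnitude, while p-values among the top terms differ by many more, the correction rarely reorders the top of the list. This is consistent with the minor effect reported in the text.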

Experimental Methods
The universe of terms was defined in the following manner, consistent with the larger aims of the Arrowsmith Project [8]. Specifically, the titles of all papers in MEDLINE were extracted, stemmed and stoplisted using the short PubMed 364-word stoplist [9]. Words were kept only if they appeared in the abstract of at least three papers in MEDLINE, and phrases of up to three words were kept only if they appeared in at least 10 abstracts. Finally, terms were mapped through the NIH MetaMap program, keeping only those terms that mapped to at least one UMLS semantic category. (This removes most of the nonsensical phrases but includes many that do not correspond exactly to UMLS concepts.) After filtering, the total number of words = 52,997, two-word phrases = 747,484, and three-word phrases = 429,566. (Note that if a term occurred at all within an abstract, it was scored as 1 occurrence regardless of how many times the term occurred within the same abstract.) For each occurrence of a term within a MEDLINE record, we noted its location within title, abstract, or last sentence in abstract. Sentence boundaries were identified using the Sentence Splitter [10].
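The counting and frequency filtering steps can be sketched as follows. This is a simplified illustration only: it tokenizes on whitespace and omits stemming, the PubMed stoplist, MetaMap filtering and sentence splitting. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield all word n-grams of length 1..max_n as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def document_frequencies(abstracts, max_n=3):
    """Count, for each term, the number of abstracts that contain it."""
    df = Counter()
    for text in abstracts:
        tokens = text.lower().split()
        # set(): a term is scored once per abstract, however often it repeats
        df.update(set(ngrams(tokens, max_n)))
    return df


def filter_terms(df, min_word_df=3, min_phrase_df=10):
    """Keep words seen in >= 3 abstracts and phrases seen in >= 10 abstracts."""
    kept = {}
    for term, count in df.items():
        threshold = min_phrase_df if " " in term else min_word_df
        if count >= threshold:
            kept[term] = count
    return kept


# Made-up corpus: 10 identical abstracts plus 2 others.
corpus = ["listeria in meat"] * 10 + ["meat meat products"] * 2
df = document_frequencies(corpus)
terms = filter_terms(df)
print(sorted(terms))
```

Note that "meat" is counted in 12 abstracts, not 14, because the repeated word in "meat meat products" contributes only one occurrence per abstract.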
Modeling the expected term occurrence in a literature: Think of all the $N$ papers in MEDLINE as a collection of $N$ balls in an urn, where $f_1$ black balls correspond to papers that contain a certain term, and the remaining $N - f_1$ balls are white (do not contain the term). In constructing a random literature of $f_2$ papers, we randomly select $f_2$ distinct balls from the urn. The number of black balls selected, $X$, is a random variable that follows a hypergeometric distribution defined by:
$$P(X = x) = \frac{\binom{f_1}{x}\binom{N - f_1}{f_2 - x}}{\binom{N}{f_2}}, \quad \max(0,\, f_1 + f_2 - N) \le x \le \min(f_1, f_2)$$
In other words, if a literature and a given term are independent of each other, then the number of papers within that literature that contain the term should follow the hypergeometric distribution.
The Poisson distribution is a good approximation when $N$ is large relative to $f_1$ and $f_2$:
$$P(X = x) \approx \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad \text{for } x = 0, 1, 2, \ldots$$
where $\lambda = f_1 f_2 / N$ is the expected value of $X$. We have verified that the Poisson distribution is an extremely close approximation for the hypergeometric distribution in the full range of literature sizes and term frequencies considered in this paper.
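As a rough numerical check of this approximation, the sketch below evaluates both distributions with the Python standard library and computes the upper-tail p-value P(X ≥ k) used to define characteristic terms. The sizes N, f1 and f2 in the example are hypothetical and chosen only for illustration.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_pmf(x, N, f1, f2):
    """P(X = x) when f2 papers are drawn from N, of which f1 contain the term."""
    if x < max(0, f1 + f2 - N) or x > min(f1, f2):
        return 0.0
    return math.exp(log_comb(f1, x) + log_comb(N - f1, f2 - x) - log_comb(N, f2))


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam (lam must be > 0)."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def poisson_p_value(k, lam, tol=1e-15):
    """P(X >= k), summed directly over the upper tail so tiny values keep precision."""
    if k <= 0:
        return 1.0
    total, x = 0.0, k
    while True:
        term = poisson_pmf(x, lam)
        total += term
        # past the mode the terms shrink quickly; stop once they no longer matter
        if x > lam and term <= tol * total:
            return min(total, 1.0)
        x += 1


# Hypothetical sizes: MEDLINE, term frequency in MEDLINE, literature size.
N, f1, f2 = 20_000_000, 1_000, 5_000
lam = f1 * f2 / N
for x in range(4):
    print(x, hypergeom_pmf(x, N, f1, f2), poisson_pmf(x, lam))
print("P(X >= 4) =", poisson_p_value(4, lam))
```

Log-gamma is used instead of exact binomial coefficients so that very large N does not produce enormous intermediate integers. The tail is summed upward from k rather than computed as one minus the cumulative probability, since the latter loses precision for very small p-values.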

Conclusions
In the present paper, we have calculated and empirically validated statistical criteria for saying that a term occurs in a given literature more often than by chance, and have analyzed the resulting set of “characteristic” terms (having p-values < 0.001) in some detail. Note that the characteristic terms for a literature are not necessarily the most frequent in that literature. Nor, for topically-defined literatures, do they need to have any semantic relation to the query term that generated the literature.
Characteristic terms of a literature have proven useful for different information-processing tasks. In the Anne O'Tate tool [11] that combines PubMed literature retrieval with additional post-retrieval analyses, the set of characteristic terms gives a thumbnail annotation of any retrieved literature. For example, in the case of papers describing diabetes research, the set of characteristic terms (restricted to the semantic category of gene names) gives a thumbnail annotation of the genes that have been studied in this field. In the Author-ity author name disambiguation tool [12], characteristic terms provide a thumbnail annotation of any given author's research output. Other possible uses for characteristic terms occur in post-processing of a PubMed query, to replace or supplement other language resources such as Medical Subject Headings, UMLS concepts or keyword thesauri, e.g., to expand the query automatically to include highly related papers [13,14], to cluster the retrieved papers by theme [15], or to reformulate the query in a manner that permits cross-disciplinary retrieval [16]. For example, to expand an original query automatically, one could replace the original terms used in the search with a new Boolean query made up of a small number of characteristic terms. These would not necessarily be the terms with the lowest p-values, but rather would be the set of the terms that (when combined with appropriate AND and OR operations) cover the original literature most accurately and with least redundancy.
The characteristic terms with the lowest p-values are likely to be most useful for annotation; this is similar to the log-entropy term weighting approach taken by Homayouni et al. [17]. Other annotation methods are possible; for example, Erkan and Radev [18] used a graph-theoretic approach to obtain the “most important” terms within document sets, but this is far more computationally complex than the method proposed here, and would not scale well to large literatures. The terms with lowest p-values are likely to be the most important as well, especially since these terms appeared in multiple sections of the papers.
Finally, characteristic terms have been useful for assisting in literature-based discovery. In the Arrowsmith two-node search tool [19,20], the user seeks to assess a possible relationship between literatures A and C; the computer interface presents a list of terms (the “B-list”) in common between the literatures to serve as a conceptual bridge. However, not all B-terms are likely to be of equal value in discovering significant implicit links. Characteristic terms expressed in each literature are computed as a feature in the quantitative model that allows us to rank the B-terms in order of predicted relevance to linking the two literatures in a meaningful way [19]. Moreover, B-terms that are not characteristic in either literature A or C are unlikely to indicate important concepts in either literature, whereas B-terms that are characteristic in both A and C may represent concepts that are already well known. Thus, we are currently exploring the hypothesis that the B-terms most likely to point to new discoveries in two-node searches are those that are characteristic in one literature, but not both.
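One simple way to pick a small, non-redundant set of terms for such an expanded query is a greedy cover. The sketch below is only an illustration of that idea under strong assumptions: it considers OR combinations only, ignores papers outside the literature that the terms would also retrieve, and uses hypothetical names throughout. It is not a method described in this paper.

```python
def greedy_cover(term_to_papers, literature, max_terms=5):
    """Greedily choose terms whose OR-combination covers the literature.

    term_to_papers: dict mapping each characteristic term to the set of
                    paper IDs in the literature that contain it
    literature    : set of all paper IDs to be covered
    Returns (chosen terms in order of selection, paper IDs left uncovered).
    """
    uncovered = set(literature)
    chosen = []
    while uncovered and term_to_papers and len(chosen) < max_terms:
        # pick the term that covers the most still-uncovered papers
        best = max(term_to_papers, key=lambda t: len(term_to_papers[t] & uncovered))
        gain = term_to_papers[best] & uncovered
        if not gain:
            break  # no remaining term adds coverage
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Made-up example with five papers and four candidate terms.
index = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}
chosen, uncovered = greedy_cover(index, {1, 2, 3, 4, 5})
print(" OR ".join(chosen), "| uncovered:", sorted(uncovered))
```

In the example, term "b" is skipped in favor of "c" on the second step because "c" covers two uncovered papers while "b" covers only one, which is how the greedy choice limits redundancy.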
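The hypothesis in the last sentence corresponds to selecting B-terms that are characteristic in exactly one of the two literatures. A minimal sketch, with a hypothetical function name and made-up term sets:

```python
def one_sided_b_terms(b_terms, characteristic_a, characteristic_c):
    """B-terms that are characteristic in literature A or C, but not in both."""
    return {
        term
        for term in b_terms
        if (term in characteristic_a) != (term in characteristic_c)
    }


# Made-up term sets for illustration.
b_list = {"blood viscosity", "platelet aggregation", "patients", "vascular reactivity"}
char_a = {"blood viscosity", "platelet aggregation"}
char_c = {"platelet aggregation", "vascular reactivity"}
candidates = one_sided_b_terms(b_list, char_a, char_c)
print(sorted(candidates))
```

Here "patients" is dropped because it is characteristic in neither literature, and "platelet aggregation" is dropped because it is characteristic in both.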

Figure 1. Distribution of term occurrence frequencies in text fields for a journal literature (Journal of Biomedical Materials Research, 1967-2002, 4,824 articles) vs. a randomly selected literature (5,000 articles chosen across MEDLINE). Error bars show 95% confidence intervals around the regression curves. The journal literature contains more highly frequent terms, and therefore its curve extends beyond that of the random curve.

Figure 2. Distribution of p-value scores determined using the Poisson distribution. The p-value score was computed with the formula p-value = P(X ≥ frq-lit), where frq-lit is the number of times a term occurs within the literature. The affiliation-defined literature was chosen as the set of articles published in 2000 having the word “California” in the affiliation field.

Figure 3. (a) The minimum number of occurrences of a term within a literature (for a given term frequency F within MEDLINE and a given literature size) needed to call the term “characteristic” of that literature; (b) The minimum ratio of occurrences in a literature vs. MEDLINE needed to call the term “characteristic”.

Figure 4. Density and coverage of characteristic terms in 8 different sections of articles averaged over 10 disciplinary journals. Ellipses show one standard error around the mean values.
Figure 5.

Table 1. Characteristic terms extracted from the International Journal of Food Microbiology showing those with the 10 lowest p-value scores, 10 having moderate scores (~8.6 × 10^-5) and 10 having p-values near 0.001.

Table 3. Top 20 characteristic terms extracted from the International Journal of Food Microbiology, ranked by raw p-value vs. by corrected p-value (see text for details). F is the number of times the term occurs within text fields in MEDLINE and f is the number of occurrences in the journal.