This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Given the occurrence frequency of any term within any set of articles within MEDLINE, we define “characteristic” terms as words and phrases that occur in that literature more frequently than expected by chance (at p < 0.001 or better). In this report, we studied how the cut-off criterion varied as a function of literature size and term frequency in MEDLINE as a whole, and compared the distribution of characteristic terms within a number of journal-defined, affiliation-defined and random literatures. We also investigated how the characteristic terms were distributed among MEDLINE titles, abstracts, and last sentences of abstracts, including “regularized” terms that appear both in the title and abstract of the same paper for at least one paper in the literature. For a set of 10 disciplinary journals, the characteristic terms comprised 18% of the total terms on average. Characteristic terms are utilized in several of our web-based services (Anne O'Tate and Arrowsmith), and should be useful for a variety of other information-processing tasks designed to improve text mining in MEDLINE.

Terms occurring in a given set of articles (

In the present paper, we have computed empirical occurrence frequencies of terms within a number of journal-defined, affiliation-defined and random literatures. We derived statistical criteria for asserting that a single term occurs more often within any given literature than expected by chance, and denote the set of terms that occur more than expected by chance (at p < 0.001) as the “characteristic” terms for that literature. Finally, we have studied their distribution across MEDLINE titles, abstracts, and last sentences of abstracts, including “regularized” characteristic terms that appear both in the title and abstract of the same paper for at least one paper in the literature. These studies set the stage for utilizing characteristic terms as features in text mining models, and in creating thumbnail annotations of the literatures.
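The definition of “regularized” terms can be illustrated with a minimal sketch. The function name `regularized_terms` and the representation of each paper as a pair of term sets (title terms, abstract terms) are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def regularized_terms(papers):
    """Return the terms that occur in both the title and the abstract
    of the same paper, for at least one paper in the literature.

    `papers` is an iterable of (title_terms, abstract_terms) pairs,
    a hypothetical input format chosen for this sketch.
    """
    regularized = set()
    for title_terms, abstract_terms in papers:
        # A term qualifies only if title and abstract of the SAME paper share it.
        regularized |= set(title_terms) & set(abstract_terms)
    return regularized


# Toy literature of two papers (made-up terms).
papers = [
    ({"listeria", "growth"}, {"listeria", "meat", "storage"}),
    ({"meat", "ph"}, {"growth", "temperature"}),
]
print(sorted(regularized_terms(papers)))  # ['listeria']
```

Note that “growth” and “meat” each appear in a title and in an abstract, but never within the same paper, so neither counts as regularized in this toy example.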

We examined 10 different disciplinary journals published in English, containing abstracts, which comprised 2,000-10,000 papers each (average 5,132 papers), and characterized the distribution of term frequencies within the journal set

To identify individual terms that were significantly more frequent than expected by chance, we computed p-value scores for each term across 10 disciplinary journals and plotted the average p-value scores in comparison to the California set and to a random set of 5,000 articles (

For the set of 10 disciplinary journals, the set of characteristic terms comprises, on average, 18% of the total terms in that literature. The cut-off criteria for deeming a term “characteristic” vary systematically as functions of both literature size and term frequency within MEDLINE (

To illustrate the types of terms that are characteristic for a specific literature, we show results from

Several previous studies have emphasized that specific terms or MeSH concepts may be enriched in particular sections of scientific papers [

One basic measure is the “density”: the percentage of all terms in each section that are characteristic terms. Sections that are high in density are relatively rich in characteristic terms. Another measure is the “coverage”, defined as the number of characteristic terms found in each section, expressed as a percentage of the total characteristic terms for that journal. Sections that are high in coverage contain the largest share of the journal's characteristic terms.
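As a minimal sketch of these two measures, the function below computes density and coverage for one section. It assumes that terms are counted as distinct terms (sets) rather than as total occurrences, which the text does not specify; the function name and the toy data are hypothetical.

```python
def density_and_coverage(section_terms, characteristic_terms):
    """Return (density, coverage) in percent for one article section.

    section_terms:        set of distinct terms found in the section
    characteristic_terms: set of characteristic terms for the journal
    Counting distinct terms is an assumption of this sketch.
    """
    found = section_terms & characteristic_terms
    # Density: share of the section's terms that are characteristic.
    density = 100.0 * len(found) / len(section_terms) if section_terms else 0.0
    # Coverage: share of the journal's characteristic terms seen in the section.
    coverage = (
        100.0 * len(found) / len(characteristic_terms)
        if characteristic_terms
        else 0.0
    )
    return density, coverage


# Toy example with made-up term sets.
characteristic = {"listeria", "meat", "storage", "lactic acid"}
title_terms = {"listeria", "growth", "effect", "meat", "study"}
print(density_and_coverage(title_terms, characteristic))  # (40.0, 50.0)
```

Here two of the five title terms are characteristic (density 40%), and those two make up half of the four characteristic terms (coverage 50%).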

The average density value varied significantly from journal to journal within our set of 10 disciplinary journals (

Regularized terms (tiab) that also appeared in the last sentence of at least one paper (tiab + lastsen) had the highest average frequency and lowest average p-value of all (

We also considered whether, given two characteristic terms with equal p-values, the term appearing in the greater number of papers in the literature should be considered the more important. For the characteristic terms in

The universe of terms was defined in the following manner, consistent with the larger aims of the Arrowsmith Project [

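Think of all the $N$ papers in MEDLINE as balls in an urn: $n_1$ black balls correspond to papers that contain a certain term, and the remaining $N - n_1$ balls are white (they do not contain the term). In constructing a random literature of $n_2$ papers, we randomly select $n_2$ distinct balls from the urn. The number of black balls selected, $X$, then follows the hypergeometric distribution:

$$P(X = k) = \frac{\binom{n_1}{k}\binom{N - n_1}{n_2 - k}}{\binom{N}{n_2}}$$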

In other words, if a literature and a given term are independent of each other, then the number of papers within that literature that contain the term should follow the hypergeometric distribution.
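The urn model can be sketched in a few lines of standard-library Python. The variable names `N` (papers in MEDLINE), `n1` (papers containing the term) and `n2` (papers in the literature) are labels chosen for this sketch, and taking the p-value to be the upper tail P(X ≥ k) is an assumption. Probabilities are computed via log-gamma to avoid enormous binomial coefficients; very small tails (far below about 10^−300) underflow to zero in floating point and would need to be kept in log space.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_pmf(k, N, n1, n2):
    """P(X = k): exactly k of the n2 sampled papers contain the term."""
    if k < max(0, n1 + n2 - N) or k > min(n1, n2):
        return 0.0
    return math.exp(
        log_comb(n1, k) + log_comb(N - n1, n2 - k) - log_comb(N, n2)
    )


def hypergeom_upper_tail(k, N, n1, n2):
    """P(X >= k): chance of seeing k or more term-containing papers."""
    return sum(hypergeom_pmf(i, N, n1, n2) for i in range(k, min(n1, n2) + 1))


# Tiny hypothetical urn: 10 papers, 4 contain the term, sample 3 papers.
p = hypergeom_upper_tail(2, N=10, n1=4, n2=3)
print(f"{p:.4f}")  # 0.3333
```

In the toy urn, P(X ≥ 2) = (6·6 + 4·1)/120 = 1/3, which matches the printed value.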

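The Poisson distribution is a good approximation when $N$ is much larger than both $n_1$ and $n_2$ (with $N$ the number of papers in MEDLINE, $n_1$ the number of those papers that contain the term, and $n_2$ the number of papers in the literature):

$$P(X = k) \approx \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where $\lambda = n_1 n_2 / N$ is the expected number of papers in the literature that contain the term.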
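A minimal sketch of how the Poisson approximation yields a cut-off count is given below. It assumes that the p-value is the upper tail P(X ≥ k) and that the expected count is the product of the term's MEDLINE frequency and the literature size, divided by the size of MEDLINE; the function names and the numbers in the example are hypothetical.

```python
import math


def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam), evaluated in log space."""
    return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))


def poisson_upper_tail(k, lam, rel_tol=1e-15):
    """P(X >= k), summing terms until they no longer change the total."""
    total, i = 0.0, max(k, 0)
    while True:
        term = poisson_pmf(i, lam)
        total += term
        i += 1
        # Past the mode the terms shrink rapidly, so the loop terminates.
        if i > lam and term <= rel_tol * total:
            return total


def min_significant_count(lam, alpha=0.001):
    """Smallest count k whose upper-tail p-value falls below alpha."""
    k = 0
    while poisson_upper_tail(k, lam) >= alpha:
        k += 1
    return k


# Hypothetical numbers: the term occurs in 2,000 of 1,000,000 papers,
# and the literature holds 5,000 papers, so the expected count is 10.
lam = 2_000 * 5_000 / 1_000_000
print(lam, min_significant_count(lam))  # 10.0 22
```

With an expected count of 10, a term would need to appear in at least 22 papers of this hypothetical literature to reach p < 0.001, which illustrates how the cut-off depends on both literature size and term frequency in MEDLINE.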

In the present paper, we have calculated and empirically validated statistical criteria for saying that a term occurs in a given literature more often than by chance, and have analyzed the resulting set of “characteristic” terms (having p-values < 0.001) in some detail. Note that the characteristic terms for a literature are not necessarily the most frequent in that literature. Nor, for topically-defined literatures, do they need to have any semantic relation to the query term that generated the literature.

Characteristic terms of a literature have proven useful for different information-processing tasks. In the Anne O'Tate tool [

The characteristic terms with the lowest p-values are likely to be most useful for annotation; this is similar to the log-entropy term weighting approach taken by Homayouni

Finally, characteristic terms have been useful for assisting in literature-based discovery. In the Arrowsmith two-node search tool [

Distribution of term occurrence frequencies in text fields for a journal literature (

Distribution of p-value scores determined using the Poisson distribution. The p-value score was computed with the formula p-value = P(

Density and coverage of characteristic terms in 8 different sections of articles averaged over 10 disciplinary journals. Ellipses show one standard error around the mean values.

Average frequency and p-value for characteristic terms in 8 different sections of articles averaged over 10 disciplinary journals; (

Characteristic terms extracted from the ^{−5}) and 10 having p-values near 0.001.

Top-ranked terms | Terms with p-values near 10^{−5} | Terms with p-values near 0.001 |
---|---|---|
food | ph ethanol | strain x |
listeria | recurrent neural network | tbg |
strain | shigella yersinia | or h |
listeria monocytogene | disinfection or | mytilus galloprovincialis |
degree c | growth environmental | gene coding |
meat | mold growth | fever vomiting |
l monocytogene | staphylococcal strain isolated | sandwich |
lactic acid | yeast high | growth effect |
lactobacillus | longitudinally | density nm |
lactic acid bacteria | pathogen human | reliable method |

Density and coverage of the characteristic terms in 8 different article fields across 10 disciplinary journals, an affiliation-defined literature and a random literature (see text). Jrn1:

Measure | Jrn1 | Jrn2 | Jrn3 | Jrn4 | Jrn5 | Jrn6 | Jrn7 | Jrn8 | Jrn9 | Jrn10 | Average | Affiliation-defined | Random |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Density (%) | 18.31 | 9.01 | 11.82 | 20.74 | 15.54 | 19.94 | 24.65 | 17.26 | 29.29 | 19.01 | 18.55 | 7.39 | 0.37 |
Density (%) | 19.94 | 9.14 | 12.71 | 21.65 | 16.41 | 20.82 | 25.65 | 18.14 | 30.68 | 20.19 | 19.53 | 7.87 | 0.40 |
Density (%) | 34.92 | 18.61 | 21.79 | 39.69 | 31.68 | 38.21 | 43.99 | 35.26 | 53.27 | 37.27 | 35.46 | 19.71 | 0.48 |
Density (%) | 34.08 | 21.5 | 23.14 | 44.76 | 34.11 | 43.52 | 51.61 | 37.02 | 53.89 | 38.02 | 38.16 | 18.53 | 0.60 |
Density (%) | 48.54 | 33.74 | 34.73 | 55.57 | 45.81 | 51.98 | 62.33 | 48.01 | 63.79 | 50.43 | 49.49 | 24.87 | 0.93 |
Density (%) | 57.11 | 36.94 | 37.51 | 64.49 | 53.84 | 60.45 | 69.06 | 58.76 | 72.19 | 59.82 | 57.01 | 36.51 | 0.65 |
Density (%) | 47.69 | 37.06 | 36.41 | 54.87 | 44.84 | 50.08 | 59.64 | 43.61 | 62.31 | 45.59 | 48.21 | 25.66 | 0.42 |
Density (%) | 57.89 | 42.57 | 41.97 | 65.11 | 55.15 | 59.92 | 68.83 | 57.62 | 72.19 | 57.85 | 57.91 | 37.34 | 0.37 |
Coverage (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Coverage (%) | 95.33 | 85.51 | 89.89 | 98.37 | 98.38 | 98.47 | 98.87 | 98.63 | 98.64 | 98.79 | 96.08 | 99.01 | 93.83 |
Coverage (%) | 48.46 | 41.76 | 41.7 | 49.3 | 50.1 | 55.99 | 49.3 | 50.18 | 52.75 | 47.21 | 48.67 | 59.98 | 27.01 |
Coverage (%) | 66.73 | 77.47 | 75.77 | 57.01 | 53.6 | 68.34 | 55.29 | 52.9 | 61.71 | 52.83 | 62.16 | 66.23 | 53.55 |
Coverage (%) | 62.07 | 62.98 | 65.67 | 55.39 | 51.99 | 66.81 | 62.33 | 48.01 | 63.79 | 50.43 | 58.94 | 65.24 | 47.39 |
Coverage (%) | 34.6 | 33.48 | 33.53 | 33.32 | 33.93 | 42.28 | 30.66 | 31.52 | 36.89 | 29.75 | 33.99 | 45.57 | 15.16 |
Coverage (%) | 38.04 | 27.89 | 38.49 | 39.07 | 36.79 | 44.94 | 39.39 | 35.04 | 39.4 | 34.12 | 37.31 | 50.38 | 16.11 |
Coverage (%) | 25.59 | 19.77 | 24.76 | 26.49 | 28.08 | 31.84 | 25.3 | 24.67 | 28.21 | 22.78 | 25.74 | 38.14 | 7.10 |

Top 20 characteristic terms extracted from the

Term | Freq. in literature | Freq. in MEDLINE | p-value | Term | Freq. in literature | Freq. in MEDLINE | p-value |
---|---|---|---|---|---|---|---|
food | 718 | 102,805 | 1.04 × 10^{−847} | food | 718 | 102,805 | 3.01 × 10^{−847} |
listeria | 384 | 6,879 | 2.11 × 10^{−797} | listeria | 384 | 6,879 | 1.14 × 10^{−796} |
strain | 775 | 246,958 | 6.76 × 10^{−656} | strain | 775 | 246,958 | 1.81 × 10^{−655} |
listeria monocytogene | 310 | 5,648 | 6.90 × 10^{−642} | listeria monocytogene | 310 | 5,648 | 4.62 × 10^{−641} |
degree c | 625 | 139,694 | 3.84 × 10^{−621} | degree c | 625 | 139,694 | 1.28 × 10^{−620} |
meat | 309 | 11,582 | 1.81 × 10^{−543} | meat | 309 | 11,582 | 1.22 × 10^{−542} |
l monocytogene | 231 | 2,514 | 2.05 × 10^{−530} | l monocytogene | 231 | 2,514 | 1.84 × 10^{−529} |
lactic acid | 257 | 7,008 | 1.11 × 10^{−487} | lactic acid | 257 | 7,008 | 9.03 × 10^{−487} |
lactobacillus | 238 | 5,441 | 6.39 × 10^{−470} | lactobacillus | 238 | 5,441 | 5.57 × 10^{−469} |
lactic acid bacteria | 188 | 1,324 | 1.52 × 10^{−467} | lactic acid bacteria | 188 | 1,324 | 1.67 × 10^{−466} |
lactic | 270 | 13,490 | 1.07 × 10^{−441} | lactic | 270 | 13,490 | 8.22 × 10^{−441} |
temperature | 467 | 156,235 | 8.00 × 10^{−387} | temperature | 467 | 156,235 | 3.56 × 10^{−386} |
degree | 661 | 405,979 | 3.17 × 10^{−386} | degree | 661 | 405,979 | 9.95 × 10^{−386} |
salmonella | 294 | 32,199 | 6.87 × 10^{−382} | salmonella | 294 | 32,199 | 4.85 × 10^{−381} |
spp | 245 | 15,673 | 5.88 × 10^{−375} | growth | 662 | 426,410 | 3.91 × 10^{−374} |
growth | 662 | 426,410 | 1.25 × 10^{−374} | spp | 245 | 15,673 | 4.98 × 10^{−374} |
ph | 413 | 157,692 | 3.99 × 10^{−320} | ph | 413 | 157,692 | 2.00 × 10^{−319} |
isolate | 331 | 81,127 | 2.75 × 10^{−317} | isolate | 331 | 81,127 | 1.73 × 10^{−316} |
storage | 265 | 47,338 | 1.63 × 10^{−289} | storage | 265 | 47,338 | 1.28 × 10^{−288} |
foodborne | 113 | 1,198 | 8.98 × 10^{−262} | foodborne | 113 | 1,198 | 1.65 × 10^{−260} |

This Human Brain Project/Neuroinformatics research (LM007292 and LM08364) is funded jointly by the National Library of Medicine and the National Institute of Mental Health. The MEDLINE database and the MMTx program (a Java implementation of the MetaMap algorithm) were graciously provided by the National Library of Medicine.