<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="en" article-type="research-article" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Information</journal-id>
<journal-title>Information</journal-title>
<issn pub-type="epub">2078-2489</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/info2020266</article-id>
<article-id pub-id-type="publisher-id">information-02-00266</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Distribution of &#x0201C;Characteristic&#x0201D; Terms in MEDLINE Literatures</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Smalheiser</surname><given-names>Neil R.</given-names></name><xref ref-type="aff" rid="af1-information-02-00266"><sup>1</sup></xref><xref ref-type="corresp" rid="c1-information-02-00266"><sup>&#x0002A;</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Zhou</surname><given-names>Wei</given-names></name><xref ref-type="aff" rid="af2-information-02-00266"><sup>2</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Torvik</surname><given-names>Vetle I.</given-names></name><xref ref-type="aff" rid="af3-information-02-00266"><sup>3</sup></xref></contrib></contrib-group>
<aff id="af1-information-02-00266">
<label>1</label> Department of Psychiatry, MC912, University of Illinois at Chicago, 1601 W. Taylor Street, Chicago, IL 60612, USA</aff>
<aff id="af2-information-02-00266">
<label>2</label> Ingenuity Systems, Inc., 1700 Seaport Blvd. Third Floor, Redwood City, CA 94063, USA; E-Mail: <email>wzhou@ingenuity.com</email></aff>
<aff id="af3-information-02-00266">
<label>3</label> Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel St., Champaign, IL 61820, USA; E-Mail: <email>vtorvik@illinois.edu</email></aff>
<author-notes>
<corresp id="c1-information-02-00266">
<label>&#x0002A;</label> Author to whom correspondence should be addressed; E-Mail: <email>smalheiser@psych.uic.edu</email>.</corresp></author-notes>
<pub-date pub-type="collection">
<year>2011</year></pub-date>
<pub-date pub-type="epub">
<day>30</day>
<month>03</month>
<year>2011</year></pub-date>
<volume>2</volume>
<issue>2</issue>
<fpage>266</fpage>
<lpage>276</lpage>
<history>
<date date-type="received">
<day>03</day>
<month>03</month>
<year>2011</year></date>
<date date-type="accepted">
<day>28</day>
<month>03</month>
<year>2011</year></date></history>
<permissions>
<copyright-statement>&#x000A9; 2011 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2011</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Given the occurrence frequency of any term within any set of articles within MEDLINE, we define &#x0201C;characteristic&#x0201D; terms as words and phrases that occur in that literature more frequently than expected by chance (at p &#x0003C; 0.001 or better). In this report, we studied how the cut-off criterion varied as a function of literature size and term frequency in MEDLINE as a whole, and compared the distribution of characteristic terms within a number of journal-defined, affiliation-defined and random literatures. We also investigated how the characteristic terms were distributed among MEDLINE titles, abstracts, and last sentences of abstracts, including &#x0201C;regularized&#x0201D; terms that appear both in the title and abstract of the same paper for at least one paper in the literature. For a set of 10 disciplinary journals, the characteristic terms comprised 18&#x00025; of the total terms on average. Characteristic terms are utilized in several of our web-based services (Anne O&#x00027;Tate and Arrowsmith), and should be useful for a variety of other information-processing tasks designed to improve text mining in MEDLINE.</p></abstract>
<kwd-group>
<kwd>information retrieval</kwd>
<kwd>term occurrence</kwd>
<kwd>text mining</kwd>
<kwd>annotation</kwd>
<kwd>literature based discovery</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Terms occurring in a given set of articles (<italic>i.e.</italic>, a literature) more than expected by chance form a literature-specific vocabulary that is similar to the concept of a domain &#x0201C;sublanguage&#x0201D; &#x0005B;<xref ref-type="bibr" rid="b1-information-02-00266">1</xref>-<xref ref-type="bibr" rid="b3-information-02-00266">3</xref>&#x0005D;. They differ from the keywords extracted from a particular literature &#x0005B;<xref ref-type="bibr" rid="b4-information-02-00266">4</xref>,<xref ref-type="bibr" rid="b5-information-02-00266">5</xref>&#x0005D;, insofar as keywords occur frequently relative to other terms in that literature, whereas a literature-specific term may occur only a few times (as long as it is more frequent in that literature than in MEDLINE as a whole).</p>
<p>In the present paper, we have computed empirical occurrence frequencies of terms within a number of journal-defined, affiliation-defined and random literatures. We derived statistical criteria for asserting that a single term occurs more often within any given literature than expected by chance, and denote the set of terms that occur more than expected by chance (at p &#x0003C; 0.001) as the &#x0201C;characteristic&#x0201D; terms for that literature. Finally, we have studied their distribution across MEDLINE titles, abstracts, and last sentences of abstracts, including &#x0201C;regularized&#x0201D; characteristic terms that appear both in the title and abstract of the same paper for at least one paper in the literature. These studies set the stage for utilizing characteristic terms as features in text mining models, and in creating thumbnail annotations of the literatures.</p></sec>
<sec sec-type="results">
<label>2.</label>
<title>Results</title>
<sec>
<label>2.1.</label>
<title>Delineating Characteristic Terms</title>
<p>We examined 10 different disciplinary journals published in English, containing abstracts, which comprised 2,000-10,000 papers each (average 5,132 papers), and characterized the distribution of term frequencies within the journal set <italic>vs.</italic> within MEDLINE as a whole. This distribution was compared with the distribution of terms in an affiliation-defined literature consisting of all articles published in 2000 having the word &#x0201C;California&#x0201D; in the affiliation field, and with a set of 5,000 articles chosen at random within MEDLINE. In each case, the Poisson approximation was used to define the distribution of term occurrence that would be expected by chance.</p>
<p><xref ref-type="fig" rid="f1-information-02-00266">Figure 1</xref> shows the raw distribution of term occurrence frequencies in the text fields (<italic>i.e.</italic>, title or abstract) for <italic>Journal of Biomedical Materials Research</italic> compared to a random literature of similar size. Term occurrence frequency was almost exactly linear for the journal when plotted on a log-log scale, indicating that frequencies followed a regular Zipf distribution. The frequencies for the random set followed a parallel curve and differed significantly from those of the journal.</p>
<p>To identify individual terms that were significantly more frequent than expected by chance, we computed p-value scores for each term across 10 disciplinary journals and plotted the average p-value scores in comparison to the California set and to a random set of 5,000 articles (<xref ref-type="fig" rid="f2-information-02-00266">Figure 2</xref>). Although terms associated with p-values &#x0003C; 0.05 are nominally significant, that does not take into account the fact that multiple tests are carried out. <xref ref-type="fig" rid="f2-information-02-00266">Figure 2</xref> emphasizes that the difference between journal-defined sets and random sets is most striking at p-values below 0.001. Thus, we have chosen p &#x0003C; 0.001 as our preferred cut-off value, and the set of terms in a literature with p-values below 0.001 will be called the characteristic terms of that literature.</p>
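<p>As a minimal sketch of the test described above, the upper-tail Poisson p-value P(<italic>X</italic> &#x02265; <italic>k</italic>) for a term observed in <italic>k</italic> papers of a literature could be computed as follows, with the expected count &#x003BB; &#x0003D; <italic>f</italic><sub>1</sub><italic>f</italic><sub>2</sub>/<italic>N</italic> taken from the Experimental Methods section. The function names, the MEDLINE size and the example counts are illustrative assumptions, not code or values from this study.</p>
<preformat><![CDATA[
```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    log_lam = math.log(lam)
    total = 0.0
    i = k
    while True:
        # Each term is evaluated in log space so small values stay accurate.
        term = math.exp(i * log_lam - lam - math.lgamma(i + 1))
        total += term
        # Past the mode (i > lam) the terms shrink; stop once they are negligible.
        if i > lam and term <= total * 1e-17:
            break
        i += 1
    return min(total, 1.0)


def term_p_value(freq_in_lit, freq_in_medline, lit_size, medline_size):
    """p-value that a term occurs in at least freq_in_lit papers by chance."""
    lam = freq_in_medline * lit_size / medline_size  # expected count f1*f2/N
    return poisson_upper_tail(freq_in_lit, lam)


def is_characteristic(p_value, alpha=0.001):
    """Apply the p < 0.001 cut-off used in the text."""
    return p_value < alpha


# Hypothetical numbers: N = 10,000,000 papers is an assumed MEDLINE size.
p = term_p_value(freq_in_lit=6, freq_in_medline=2000,
                 lit_size=5000, medline_size=10_000_000)
print(f"p-value = {p:.3g}, characteristic = {is_characteristic(p)}")
```
]]></preformat>
<p>Summing the upper tail directly, rather than subtracting the cumulative probability from 1, avoids loss of precision for the very small p-values that matter when ranking terms.</p>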
<p>For the set of 10 disciplinary journals, the set of characteristic terms comprises, on average, 18&#x00025; of the total terms in that literature. The cut-off criteria for deeming a term &#x0201C;characteristic&#x0201D; vary systematically as functions of both literature size and term frequency within MEDLINE (<xref ref-type="fig" rid="f3-information-02-00266">Figure 3</xref>). Across the entire set of characteristic terms for the 10 disciplinary journals, a term occurs on average 23 times within the set of journal articles, making it 88 times more frequent in that literature than in MEDLINE.</p>
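<p>The dependence of the cut-off on literature size and MEDLINE term frequency can be illustrated by searching for the smallest count whose Poisson upper-tail probability falls below 0.001. The sketch below is one possible way to do this; the function name, the guard on large expected counts, and the MEDLINE size of 10,000,000 papers are assumptions made for illustration only.</p>
<preformat><![CDATA[
```python
import math


def min_characteristic_count(freq_in_medline, lit_size, medline_size, alpha=0.001):
    """Smallest k such that P(X >= k) < alpha for X ~ Poisson(f1*f2/N)."""
    lam = freq_in_medline * lit_size / medline_size
    if lam > 700:
        # exp(-lam) would underflow; this simple sketch does not handle that range.
        raise ValueError("expected count too large for this sketch")
    term = math.exp(-lam)  # P(X = 0)
    cdf = term             # P(X <= 0)
    k = 1
    # Invariant: cdf == P(X <= k - 1), so 1 - cdf == P(X >= k).
    while 1.0 - cdf >= alpha:
        term *= lam / k    # P(X = k)
        cdf += term        # P(X <= k)
        k += 1
    return k


# Hypothetical literature sizes for a term found in 2,000 papers overall.
for size in (50, 5000, 50000):
    k_min = min_characteristic_count(2000, size, 10_000_000)
    print(f"literature of {size} papers: at least {k_min} occurrences")
```
]]></preformat>
<p>Under these assumed numbers, a larger literature raises the expected count and therefore the minimum number of occurrences required.</p>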
<p>To illustrate the types of terms that are characteristic for a specific literature, we show results from <italic>International Journal of Food Microbiology</italic>. <xref ref-type="table" rid="t1-information-02-00266">Table 1</xref> shows the 10 characteristic terms with the lowest p-values, 10 having moderate p-values and 10 having p-values near 0.001. Clearly, the top ten terms are closely related to the journal topic (food, listeria, meat, <italic>etc.</italic>), as are the moderate set (ethanol, shigella, mold, <italic>etc.</italic>), whereas those at the margin of significance are still relevant but less specific (tbg, gene coding, fever vomiting, sandwich, <italic>etc.</italic>).</p></sec>
<sec>
<label>2.2.</label>
<title>Distribution of Characteristic Terms within Individual Article Records</title>
<p>Several previous studies have emphasized that specific terms or MeSH concepts may be enriched in particular sections of scientific papers &#x0005B;<xref ref-type="bibr" rid="b6-information-02-00266">6</xref>,<xref ref-type="bibr" rid="b7-information-02-00266">7</xref>&#x0005D;. We examined how the set of characteristic terms is distributed among 8 different sections of papers encoded in MEDLINE fields for each of 10 disciplinary journals, the California literature and the random literature: <bold>text</bold> (comprising title and abstract fields); <bold>ti</bold> (title); <bold>ab</bold> (abstract); <bold>lastsen</bold> (last sentence of the abstract); <bold>ti</bold> &#x0002B; <bold>ab</bold> (present both in the title and in the abstract of at least one paper in the literature, though not necessarily the same paper); <bold>tiab</bold> (in the title and the abstract of the same paper, for at least one paper in the literature); <bold>ti</bold> &#x0002B; <bold>lastsen</bold> (in title and last sentence of the abstract, for at least one paper in the literature, though not necessarily the same paper); and <bold>tiab</bold> &#x0002B; <bold>lastsen</bold> (in tiab and in last sentence of the abstract for at least one paper in the literature).</p>
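<p>The eight section definitions above can be expressed as set operations. The following sketch assumes that each paper has already been reduced to three sets of terms (title, abstract, last sentence); the record layout and function name are hypothetical, and <bold>tiab</bold> &#x0002B; <bold>lastsen</bold> is interpreted here as the intersection of the tiab set with the literature-wide last-sentence set, which is one reading of the definition.</p>
<preformat><![CDATA[
```python
def section_term_sets(papers):
    """Build the eight section-level term sets for one literature.

    `papers` is a list of dicts with keys "ti", "ab" and "lastsen",
    each holding the set of terms found in that part of one record.
    """
    ti, ab, lastsen, tiab = set(), set(), set(), set()
    for paper in papers:
        ti |= paper["ti"]
        ab |= paper["ab"]
        lastsen |= paper["lastsen"]
        # Regularized terms: title and abstract of the SAME paper.
        tiab |= paper["ti"] & paper["ab"]
    return {
        "text": ti | ab,
        "ti": ti,
        "ab": ab,
        "lastsen": lastsen,
        "ti+ab": ti & ab,            # not necessarily the same paper
        "tiab": tiab,
        "ti+lastsen": ti & lastsen,  # not necessarily the same paper
        "tiab+lastsen": tiab & lastsen,
    }


# Two toy records (invented for illustration).
papers = [
    {"ti": {"listeria"}, "ab": {"listeria", "meat"}, "lastsen": {"meat"}},
    {"ti": {"meat"}, "ab": {"food"}, "lastsen": {"food"}},
]
sections = section_term_sets(papers)
print(sorted(sections["ti+ab"]), sorted(sections["tiab"]))
```
]]></preformat>
<p>In this toy example, &#x0201C;meat&#x0201D; falls in ti &#x0002B; ab but not in tiab, because it appears in the title of one paper and the abstract of another.</p>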
<p>One basic measure is the &#x0201C;density&#x0201D;: the percentage of all terms in each section that are characteristic terms. Sections that are high in density are relatively rich in characteristic terms. Another measure is the &#x0201C;coverage&#x0201D;, defined as the number of characteristic terms found in each section, expressed as a percentage of the total characteristic terms for that journal. Sections that are high in coverage contain the most characteristic terms overall.</p>
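<p>As a small worked example of these two measures, the sketch below computes density and coverage for one section. It assumes that terms are counted as distinct types (sets) rather than as individual occurrences; the function name and the toy term lists are hypothetical.</p>
<preformat><![CDATA[
```python
def density_and_coverage(section_terms, characteristic_terms):
    """Return (density %, coverage %) of characteristic terms for one section."""
    found = section_terms & characteristic_terms
    # Density: share of the section's terms that are characteristic.
    density = 100.0 * len(found) / len(section_terms) if section_terms else 0.0
    # Coverage: share of all characteristic terms that the section contains.
    coverage = (100.0 * len(found) / len(characteristic_terms)
                if characteristic_terms else 0.0)
    return density, coverage


title_terms = {"food", "listeria", "study", "method"}
characteristic = {"food", "listeria", "meat", "lactobacillus", "strain"}
density, coverage = density_and_coverage(title_terms, characteristic)
print(density, coverage)  # 50.0 40.0
```
]]></preformat>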
<p>The average density value varied significantly from journal to journal within our set of 10 disciplinary journals (<xref ref-type="table" rid="t2-information-02-00266">Table 2</xref>), presumably due to different journal policies such as limits on abstract length and structured <italic>vs.</italic> unstructured abstracts. However, after normalizing the density values for each journal, one could readily observe systematic section-related differences in density and coverage that were similar across journals (<xref ref-type="fig" rid="f4-information-02-00266">Figure 4</xref>). Title and last sentence of the abstract had significantly higher density than the abstract field, whereas terms that appeared in multiple sections had significantly higher density than those appearing in a single section (<xref ref-type="fig" rid="f4-information-02-00266">Figure 4</xref>). Not only were these fields progressively richer in characteristic terms, but the characteristic terms that they contained had higher average frequency of occurrence than the overall set of characteristic terms, and were more specific insofar as they had lower average p-values (<xref ref-type="fig" rid="f5-information-02-00266">Figure 5</xref>). Interestingly, the set of &#x0201C;regularized&#x0201D; characteristic terms (tiab) appearing in the title and abstract of the same paper had significantly higher average frequency and lower p-values than terms which appeared in titles and abstracts of different papers (ti &#x0002B; ab) (<xref ref-type="fig" rid="f5-information-02-00266">Figure 5</xref>) (each parameter significantly different at p &#x0003C; 0.00001, using paired t-test).</p>
<p>Regularized terms (tiab) that also appeared in the last sentence of at least one paper (tiab &#x0002B; lastsen) had the highest average frequency and lowest average p-value of all (<xref ref-type="fig" rid="f5-information-02-00266">Figure 5</xref>), suggesting that this subset of characteristic terms comprises, in some sense, the most important terms associated with the journal. Of the 20 characteristic terms having the lowest p-values overall in one journal, <italic>International Journal of Food Microbiology</italic>, all were found in the tiab &#x0002B; lastsen set as well. Thus, two independent methods (lowest p-value <italic>vs.</italic> presence in multiple sections of papers) agree in identifying the most &#x0201C;important&#x0201D; characteristic terms.</p>
<p>We also considered whether, given two characteristic terms with equal p-values, the term appearing in the greater number of papers in the literature should be considered the more important. For the characteristic terms in <italic>International Journal of Food Microbiology</italic>, we calculated a &#x0201C;corrected&#x0201D; p-value score by dividing the raw p-value by the fraction of papers in the journal containing the term; however, this correction did not alter the top 20 characteristic terms and had only a very minor effect on their relative ranking (<xref ref-type="table" rid="t3-information-02-00266">Table 3</xref>). Thus, at least for the task of choosing the few most important characteristic terms, it does not seem to be necessary to take this factor into account as a separate variable.</p></sec></sec>
<sec sec-type="methods">
<label>3.</label>
<title>Experimental Methods</title>
<p>The universe of terms was defined in the following manner, consistent with the larger aims of the Arrowsmith Project &#x0005B;<xref ref-type="bibr" rid="b8-information-02-00266">8</xref>&#x0005D;. Specifically, the titles of all papers in MEDLINE were extracted, stemmed and stoplisted using the short PubMed 364-word stoplist &#x0005B;<xref ref-type="bibr" rid="b9-information-02-00266">9</xref>&#x0005D;. Words were kept only if they appeared in the abstracts of at least three papers in MEDLINE, and phrases of up to three words were kept only if they appeared in at least 10 abstracts. Finally, terms were mapped through the NIH MetaMap program, keeping only those terms that mapped to at least one UMLS semantic category. (This removes most of the nonsensical phrases but includes many that do not correspond exactly to UMLS concepts.) After filtering, the term universe comprised 52,997 words, 747,484 two-word phrases, and 429,566 three-word phrases. (Note that if a term occurred at all within an abstract, it was scored as 1 occurrence regardless of how many times the term occurred within the same abstract.) For each occurrence of a term within a MEDLINE record, we noted its location within title, abstract, or last sentence in abstract. Sentence boundaries were identified using the Sentence Splitter &#x0005B;<xref ref-type="bibr" rid="b10-information-02-00266">10</xref>&#x0005D;.</p>
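<p>The counting convention (one occurrence per abstract, regardless of repeats) and the frequency filters (words in at least three abstracts, phrases in at least 10) can be sketched as follows. Stemming, stoplisting and the MetaMap step are omitted; the input is assumed to be pre-extracted term lists, and all names are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter


def document_frequencies(abstracts):
    """Count, for each term, the number of abstracts that contain it."""
    counts = Counter()
    for terms in abstracts:
        # set() collapses repeats so each abstract contributes at most 1.
        counts.update(set(terms))
    return counts


def filter_terms(doc_freq, min_word_docs=3, min_phrase_docs=10):
    """Keep single words and multi-word phrases that meet their thresholds."""
    kept = {}
    for term, n_docs in doc_freq.items():
        is_phrase = len(term.split()) > 1
        threshold = min_phrase_docs if is_phrase else min_word_docs
        if n_docs >= threshold:
            kept[term] = n_docs
    return kept


# Toy input: each inner list holds the terms extracted from one abstract.
abstracts = [
    ["food", "food", "lactic acid"],
    ["food", "meat"],
    ["food", "lactic acid"],
]
freq = document_frequencies(abstracts)
kept = filter_terms(freq)
print(freq["food"], freq["lactic acid"], kept)
```
]]></preformat>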
<sec>
<title>Modeling the expected term occurrence in a literature</title>
<p>Think of all the <italic>N</italic> papers in MEDLINE as a collection of <italic>N</italic> balls in an urn, where <italic>f</italic><sub>1</sub> black balls correspond to papers that contain a certain term, and the remaining <italic>N</italic> &#x02212; <italic>f</italic><sub>1</sub> balls are white (papers that do not contain the term). In constructing a random literature of <italic>f</italic><sub>2</sub> papers, we randomly select <italic>f</italic><sub>2</sub> distinct balls from the urn. The number of black balls selected, <italic>X</italic>, is a random variable that follows a hypergeometric distribution defined by:
<disp-formula id="FD1">
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:mi>Pr</mml:mi>
<mml:mo stretchy="false">&#x0007B;</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>&#x0003D;</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">&#x0007D;</mml:mo>
<mml:mo>&#x0003D;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x00028;</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">x</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>&#x00029;</mml:mo></mml:mrow>
<mml:mrow>
<mml:mo>&#x00028;</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>&#x02212;</mml:mo>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>&#x02212;</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>&#x00029;</mml:mo></mml:mrow></mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x00028;</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">N</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>2</mml:mn></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>&#x00029;</mml:mo></mml:mrow></mml:mrow></mml:mfrac>
<mml:mo>&#x0002C;</mml:mo>
<mml:mtext>for</mml:mtext>
<mml:mspace width="0.2em"/>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>&#x0003D;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mo>&#x02026;</mml:mo>
<mml:mo>&#x0002C;</mml:mo>
<mml:mo>min</mml:mo>
<mml:mo stretchy="false">&#x0007B;</mml:mo>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:mo>&#x0002C;</mml:mo>
<mml:msub>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo stretchy="false">&#x0007D;</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>In other words, if a literature and a given term are independent of each other, then the number of papers within that literature that contain the term should follow the hypergeometric distribution.</p>
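<p>The urn model can be checked numerically. The sketch below evaluates the hypergeometric probability in its standard form, C(<italic>f</italic><sub>1</sub>, <italic>x</italic>) C(<italic>N</italic> &#x02212; <italic>f</italic><sub>1</sub>, <italic>f</italic><sub>2</sub> &#x02212; <italic>x</italic>) / C(<italic>N</italic>, <italic>f</italic><sub>2</sub>), and confirms that the probabilities sum to 1 and that the mean equals <italic>f</italic><sub>1</sub><italic>f</italic><sub>2</sub>/<italic>N</italic>. The function names and the small example sizes are assumptions chosen for illustration.</p>
<preformat><![CDATA[
```python
import math


def hypergeom_pmf(x, N, f1, f2):
    """P(X = x): x of the f2 sampled papers contain the term (f1 of N do)."""
    if x < max(0, f2 - (N - f1)) or x > min(f1, f2):
        return 0.0
    # Exact integer binomials; the final division yields a float.
    return math.comb(f1, x) * math.comb(N - f1, f2 - x) / math.comb(N, f2)


def hypergeom_upper_tail(k, N, f1, f2):
    """P(X >= k) under independence of term and literature."""
    return sum(hypergeom_pmf(x, N, f1, f2) for x in range(max(k, 0), min(f1, f2) + 1))


# Toy urn: 50 papers, 5 contain the term, literature of 10 papers.
N, f1, f2 = 50, 5, 10
probs = [hypergeom_pmf(x, N, f1, f2) for x in range(min(f1, f2) + 1)]
mean = sum(x * p for x, p in enumerate(probs))
print(sum(probs), mean, f1 * f2 / N)
```
]]></preformat>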
<p>The Poisson distribution is a good approximation when <italic>N</italic> is large relative to <italic>f</italic><sub>1</sub> and <italic>f</italic><sub>2</sub>:
<disp-formula id="FD2">
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:mi>Pr</mml:mi>
<mml:mo stretchy="false">&#x0007B;</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>&#x0003D;</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">&#x0007D;</mml:mo>
<mml:mo>&#x02248;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mo>&#x02212;</mml:mo>
<mml:mi mathvariant="italic">&#x003BB;</mml:mi></mml:mrow></mml:msup>
<mml:msup>
<mml:mi mathvariant="italic">&#x003BB;</mml:mi>
<mml:mi mathvariant="italic">x</mml:mi></mml:msup></mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>&#x00021;</mml:mo></mml:mrow></mml:mrow></mml:mfrac>
<mml:mo>&#x0002C;</mml:mo>
<mml:mtext>for</mml:mtext>
<mml:mspace width="0.2em"/>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>&#x0003D;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>&#x0002C;</mml:mo>
<mml:mo>&#x02026;</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>Here &#x003BB; &#x0003D; <italic>f</italic><sub>1</sub><italic>f</italic><sub>2</sub>/<italic>N</italic> is the expected value of <italic>X</italic>. We have verified that the Poisson distribution is an extremely close approximation for the hypergeometric distribution in the full range of literature sizes and term frequencies considered in this paper.</p></sec></sec>
<sec sec-type="conclusions">
<label>4.</label>
<title>Conclusions</title>
<p>In the present paper, we have calculated and empirically validated statistical criteria for saying that a term occurs in a given literature more often than by chance, and have analyzed the resulting set of &#x0201C;characteristic&#x0201D; terms (having p-values &#x0003C; 0.001) in some detail. Note that the characteristic terms for a literature are not necessarily the most frequent in that literature. Nor, for topically-defined literatures, do they need to have any semantic relation to the query term that generated the literature.</p>
<p>Characteristic terms of a literature have proven useful for different information-processing tasks. In the Anne O&#x00027;Tate tool &#x0005B;<xref ref-type="bibr" rid="b11-information-02-00266">11</xref>&#x0005D; that combines PubMed literature retrieval with additional post-retrieval analyses, the set of characteristic terms gives a thumbnail annotation of any retrieved literature. For example, in the case of papers describing diabetes research, the set of characteristic terms (restricted to the semantic category of gene names) gives a thumbnail annotation of the genes that have been studied in this field. In the Author-ity author name disambiguation tool &#x0005B;<xref ref-type="bibr" rid="b12-information-02-00266">12</xref>&#x0005D;, characteristic terms provide a thumbnail annotation of any given author&#x00027;s research output. Other possible uses for characteristic terms occur in post-processing of a PubMed query, to replace or supplement other language resources such as Medical Subject Headings, UMLS concepts or keyword thesauri, e.g., to expand the query automatically to include highly related papers &#x0005B;<xref ref-type="bibr" rid="b13-information-02-00266">13</xref>,<xref ref-type="bibr" rid="b14-information-02-00266">14</xref>&#x0005D;, to cluster the retrieved papers by theme &#x0005B;<xref ref-type="bibr" rid="b15-information-02-00266">15</xref>&#x0005D;, or to reformulate the query in a manner that permits cross-disciplinary retrieval &#x0005B;<xref ref-type="bibr" rid="b16-information-02-00266">16</xref>&#x0005D;. For example, to expand an original query automatically, one could replace the original terms used in the search with a new Boolean query made up of a small number of characteristic terms. These would not necessarily be the terms with the lowest p-values, but rather would be the set of the terms that (when combined with appropriate AND and OR operations) cover the original literature most accurately and with least redundancy.</p>
<p>The characteristic terms with the lowest p-values are likely to be most useful for annotation; this is similar to the log-entropy term weighting approach taken by Homayouni <italic>et al.</italic> &#x0005B;<xref ref-type="bibr" rid="b17-information-02-00266">17</xref>&#x0005D;. Other annotation methods are possible; for example, Erkan and Radev &#x0005B;<xref ref-type="bibr" rid="b18-information-02-00266">18</xref>&#x0005D; used a graph-theoretic approach to obtain the &#x0201C;most important&#x0201D; terms within document sets, but this is far more computationally complex than the method proposed here, and would not scale well to large literatures. The terms with lowest p-values are likely to be the most important as well, especially since these terms appeared in multiple sections of the papers.</p>
<p>Finally, characteristic terms have been useful for assisting in literature-based discovery. In the Arrowsmith two-node search tool &#x0005B;<xref ref-type="bibr" rid="b19-information-02-00266">19</xref>,<xref ref-type="bibr" rid="b20-information-02-00266">20</xref>&#x0005D;, the user seeks to assess a possible relationship between literatures A and C; the computer interface presents a list of terms (the &#x0201C;B-list&#x0201D;) in common between the literatures to serve as a conceptual bridge. However, not all B-terms are likely to be of equal value in discovering significant implicit links. Characteristic terms expressed in each literature are computed as a feature in the quantitative model that allows us to rank the B-terms in order of predicted relevance to linking the two literatures in a meaningful way &#x0005B;<xref ref-type="bibr" rid="b19-information-02-00266">19</xref>&#x0005D;. Moreover, B-terms that are not characteristic in either literature A or C are unlikely to indicate important concepts in either literature, whereas B-terms that are characteristic in both A and C may represent concepts that are already well known. Thus, we are currently exploring the hypothesis that the B-terms most likely to point to new discoveries in two-node searches are those that are characteristic in one literature, but not both.</p></sec></body>
<back>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-information-02-00266" position="float">
<label>Figure 1.</label>
<caption>
<p>Distribution of term occurrence frequencies in text fields for a journal literature (<italic>Journal of Biomedical Materials Research</italic>, 1967&#x02013;2002, 4,824 articles) <italic>vs.</italic> a randomly selected literature (5,000 articles chosen across MEDLINE). Error bars show 95&#x00025; confidence intervals around the regression curves. The journal literature contains more highly frequent terms, and therefore its curve extends beyond that of the random curve.</p></caption>
<graphic xlink:href="information-02-00266f1.gif"/></fig>
<fig id="f2-information-02-00266" position="float">
<label>Figure 2.</label>
<caption>
<p>Distribution of p-value scores determined using the Poisson distribution. The p-value score was computed with the formula p-value &#x0003D; P(<italic>X</italic> &#x02265; <italic>frq-lit</italic>), where <italic>frq-lit</italic> is the number of times a term occurs within the literature. The affiliation-defined literature was chosen as the set of articles published in 2000 having the word &#x0201C;California&#x0201D; in the affiliation field.</p></caption>
<graphic xlink:href="information-02-00266f2.gif"/></fig>
<fig id="f3-information-02-00266" position="float">
<label>Figure 3.</label>
<caption>
<p>(<bold>a</bold>) The minimum number of occurrences of a term within a literature (for a given term frequency <italic>F</italic> within MEDLINE and a given literature size) needed to call the term &#x0201C;characteristic&#x0201D; of that literature; (<bold>b</bold>) The minimum ratio of occurrences in a literature <italic>vs.</italic> MEDLINE needed to call the term &#x0201C;characteristic&#x0201D;.</p></caption>
<graphic xlink:href="information-02-00266f3.gif"/></fig>
<fig id="f4-information-02-00266" position="float">
<label>Figure 4.</label>
<caption>
<p>Density and coverage of characteristic terms in 8 different sections of articles averaged over 10 disciplinary journals. Ellipses show one standard error around the mean values.</p></caption>
<graphic xlink:href="information-02-00266f4.gif"/></fig>
<fig id="f5-information-02-00266" position="float">
<label>Figure 5.</label>
<caption>
<p>Average frequency and p-value for characteristic terms in 8 different sections of articles averaged over 10 disciplinary journals; (<bold>a</bold>) Average frequencies; error bars indicate 1 standard error of the mean; (<bold>b</bold>) Average p-value scores.</p></caption>
<graphic xlink:href="information-02-00266f5.gif"/></fig>
<table-wrap id="t1-information-02-00266" position="float">
<label>Table 1.</label>
<caption>
<p>Characteristic terms extracted from the <italic>International Journal of Food Microbiology</italic> showing those with the 10 lowest p-value scores, 10 having moderate scores (&#x0223C;8.6 &#x000D7; 10<sup>&#x02212;5</sup>) and 10 having p-values near 0.001.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th colspan="2" align="right" valign="top"><bold>Terms (lowest p-value)</bold></th>
<th align="right" valign="top"><bold>Terms (moderate p-value)</bold></th>
<th align="right" valign="top"><bold>Terms (p-value near 0.001)</bold></th></tr></thead>
<tbody>
<tr>
<td align="right" valign="top"><bold>1</bold></td>
<td align="left" valign="top">food</td>
<td align="left" valign="top">ph ethanol</td>
<td align="left" valign="top">strain x</td></tr>
<tr>
<td align="right" valign="top"><bold>2</bold></td>
<td align="left" valign="top">listeria</td>
<td align="left" valign="top">recurrent neural network</td>
<td align="left" valign="top">tbg</td></tr>
<tr>
<td align="right" valign="top"><bold>3</bold></td>
<td align="left" valign="top">strain</td>
<td align="left" valign="top">shigella yersinia</td>
<td align="left" valign="top">or h</td></tr>
<tr>
<td align="right" valign="top"><bold>4</bold></td>
<td align="left" valign="top">listeria monocytogene</td>
<td align="left" valign="top">disinfection or</td>
<td align="left" valign="top">mytilus galloprovincialis</td></tr>
<tr>
<td align="right" valign="top"><bold>5</bold></td>
<td align="left" valign="top">degree c</td>
<td align="left" valign="top">growth environmental</td>
<td align="left" valign="top">gene coding</td></tr>
<tr>
<td align="right" valign="top"><bold>6</bold></td>
<td align="left" valign="top">meat</td>
<td align="left" valign="top">mold growth</td>
<td align="left" valign="top">fever vomiting</td></tr>
<tr>
<td align="right" valign="top"><bold>7</bold></td>
<td align="left" valign="top">l monocytogene</td>
<td align="left" valign="top">staphylococcal strain isolated</td>
<td align="left" valign="top">sandwich</td></tr>
<tr>
<td align="right" valign="top"><bold>8</bold></td>
<td align="left" valign="top">lactic acid</td>
<td align="left" valign="top">yeast high</td>
<td align="left" valign="top">growth effect</td></tr>
<tr>
<td align="right" valign="top"><bold>9</bold></td>
<td align="left" valign="top">lactobacillus</td>
<td align="left" valign="top">longitudinally</td>
<td align="left" valign="top">density nm</td></tr>
<tr>
<td align="right" valign="top"><bold>10</bold></td>
<td align="left" valign="top">lactic acid bacteria</td>
<td align="left" valign="top">pathogen human</td>
<td align="left" valign="top">reliable method</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-information-02-00266" position="float">
<label>Table 2.</label>
<caption>
<p>Density and coverage of the characteristic terms in 8 different article fields across 10 disciplinary journals, an affiliation-defined literature and a random literature (see text). Jrn1: <italic>Acta Physiol. Scand.</italic>; 2: <italic>Clin. Obstet. Gynecol.</italic>; 3: <italic>Int. J. Dermatol.</italic>; 4: <italic>J. Biomed. Mater. Res.</italic>; 5: <italic>JPEN J. Parenter. Enteral Nutr.</italic>; 6: <italic>Am. J. Med. Genet.</italic>; 7: <italic>Int. J. Food Microbiol.</italic>; 8: <italic>Cytometry</italic>; 9: <italic>J. Am. Coll. Cardiol.</italic>; 10: <italic>Int. Arch. Allergy Immunol.</italic></p></caption>
<table frame="hsides" rules="all">
<thead>
<tr>
<th colspan="14" align="center" valign="top"><bold>Density (</bold>&#x00025;<bold>)</bold></th></tr>
<tr>
<th align="center" valign="top"/>
<th align="center" valign="middle"><bold>Jrn1</bold></th>
<th align="center" valign="middle"><bold>Jrn2</bold></th>
<th align="center" valign="middle"><bold>Jrn3</bold></th>
<th align="center" valign="middle"><bold>Jrn4</bold></th>
<th align="center" valign="middle"><bold>Jrn5</bold></th>
<th align="center" valign="middle"><bold>Jrn6</bold></th>
<th align="center" valign="middle"><bold>Jrn7</bold></th>
<th align="center" valign="middle"><bold>Jrn8</bold></th>
<th align="center" valign="middle"><bold>Jrn9</bold></th>
<th align="center" valign="middle"><bold>Jrn10</bold></th>
<th align="center" valign="middle"><bold>Average of 10 Jrns</bold></th>
<th align="center" valign="middle"><bold>California literature</bold></th>
<th align="center" valign="middle"><bold>Random literature</bold></th></tr></thead>
<tbody>
<tr>
<td align="right" valign="bottom"><bold>Text</bold></td>
<td align="right" valign="bottom">18.31</td>
<td align="right" valign="bottom">9.01</td>
<td align="right" valign="bottom">11.82</td>
<td align="right" valign="bottom">20.74</td>
<td align="right" valign="bottom">15.54</td>
<td align="right" valign="bottom">19.94</td>
<td align="right" valign="bottom">24.65</td>
<td align="right" valign="bottom">17.26</td>
<td align="right" valign="bottom">29.29</td>
<td align="right" valign="bottom">19.01</td>
<td align="right" valign="bottom">18.55</td>
<td align="right" valign="bottom">7.39</td>
<td align="right" valign="bottom">0.37</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ab</bold></td>
<td align="right" valign="bottom">19.94</td>
<td align="right" valign="bottom">9.14</td>
<td align="right" valign="bottom">12.71</td>
<td align="right" valign="bottom">21.65</td>
<td align="right" valign="bottom">16.41</td>
<td align="right" valign="bottom">20.82</td>
<td align="right" valign="bottom">25.65</td>
<td align="right" valign="bottom">18.14</td>
<td align="right" valign="bottom">30.68</td>
<td align="right" valign="bottom">20.19</td>
<td align="right" valign="bottom">19.53</td>
<td align="right" valign="bottom">7.87</td>
<td align="right" valign="bottom">0.40</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Lastsen</bold></td>
<td align="right" valign="bottom">34.92</td>
<td align="right" valign="bottom">18.61</td>
<td align="right" valign="bottom">21.79</td>
<td align="right" valign="bottom">39.69</td>
<td align="right" valign="bottom">31.68</td>
<td align="right" valign="bottom">38.21</td>
<td align="right" valign="bottom">43.99</td>
<td align="right" valign="bottom">35.26</td>
<td align="right" valign="bottom">53.27</td>
<td align="right" valign="bottom">37.27</td>
<td align="right" valign="bottom">35.46</td>
<td align="right" valign="bottom">19.71</td>
<td align="right" valign="bottom">0.48</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold></td>
<td align="right" valign="bottom">34.08</td>
<td align="right" valign="bottom">21.5</td>
<td align="right" valign="bottom">23.14</td>
<td align="right" valign="bottom">44.76</td>
<td align="right" valign="bottom">34.11</td>
<td align="right" valign="bottom">43.52</td>
<td align="right" valign="bottom">51.61</td>
<td align="right" valign="bottom">37.02</td>
<td align="right" valign="bottom">53.89</td>
<td align="right" valign="bottom">38.02</td>
<td align="right" valign="bottom">38.16</td>
<td align="right" valign="bottom">18.53</td>
<td align="right" valign="bottom">0.60</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold> &#x0002B; <bold>Ab</bold></td>
<td align="right" valign="bottom">48.54</td>
<td align="right" valign="bottom">33.74</td>
<td align="right" valign="bottom">34.73</td>
<td align="right" valign="bottom">55.57</td>
<td align="right" valign="bottom">45.81</td>
<td align="right" valign="bottom">51.98</td>
<td align="right" valign="bottom">62.33</td>
<td align="right" valign="bottom">48.01</td>
<td align="right" valign="bottom">63.79</td>
<td align="right" valign="bottom">50.43</td>
<td align="right" valign="bottom">49.49</td>
<td align="right" valign="bottom">24.87</td>
<td align="right" valign="bottom">0.93</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold> &#x0002B; <bold>Lastsen</bold></td>
<td align="right" valign="bottom">57.11</td>
<td align="right" valign="bottom">36.94</td>
<td align="right" valign="bottom">37.51</td>
<td align="right" valign="bottom">64.49</td>
<td align="right" valign="bottom">53.84</td>
<td align="right" valign="bottom">60.45</td>
<td align="right" valign="bottom">69.06</td>
<td align="right" valign="bottom">58.76</td>
<td align="right" valign="bottom">72.19</td>
<td align="right" valign="bottom">59.82</td>
<td align="right" valign="bottom">57.01</td>
<td align="right" valign="bottom">36.51</td>
<td align="right" valign="bottom">0.65</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Tiab</bold></td>
<td align="right" valign="bottom">47.69</td>
<td align="right" valign="bottom">37.06</td>
<td align="right" valign="bottom">36.41</td>
<td align="right" valign="bottom">54.87</td>
<td align="right" valign="bottom">44.84</td>
<td align="right" valign="bottom">50.08</td>
<td align="right" valign="bottom">59.64</td>
<td align="right" valign="bottom">43.61</td>
<td align="right" valign="bottom">62.31</td>
<td align="right" valign="bottom">45.59</td>
<td align="right" valign="bottom">48.21</td>
<td align="right" valign="bottom">25.66</td>
<td align="right" valign="bottom">0.42</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Tiab</bold> &#x0002B; <bold>Lastsen</bold></td>
<td align="right" valign="bottom">57.89</td>
<td align="right" valign="bottom">42.57</td>
<td align="right" valign="bottom">41.97</td>
<td align="right" valign="bottom">65.11</td>
<td align="right" valign="bottom">55.15</td>
<td align="right" valign="bottom">59.92</td>
<td align="right" valign="bottom">68.83</td>
<td align="right" valign="bottom">57.62</td>
<td align="right" valign="bottom">72.19</td>
<td align="right" valign="bottom">57.85</td>
<td align="right" valign="bottom">57.91</td>
<td align="right" valign="bottom">37.34</td>
<td align="right" valign="bottom">0.37</td></tr></tbody>
<tbody>
<tr>
<td colspan="14" align="center" valign="top"><bold>Coverage (</bold>&#x00025;<bold>)</bold></td></tr>
<tr>
<td align="right" valign="middle"/>
<td align="right" valign="middle"><bold>Jrn1</bold></td>
<td align="right" valign="middle"><bold>Jrn2</bold></td>
<td align="right" valign="middle"><bold>Jrn3</bold></td>
<td align="right" valign="middle"><bold>Jrn4</bold></td>
<td align="right" valign="middle"><bold>Jrn5</bold></td>
<td align="right" valign="middle"><bold>Jrn6</bold></td>
<td align="right" valign="middle"><bold>Jrn7</bold></td>
<td align="right" valign="middle"><bold>Jrn8</bold></td>
<td align="right" valign="middle"><bold>Jrn9</bold></td>
<td align="right" valign="middle"><bold>Jrn10</bold></td>
<td align="right" valign="middle"><bold>Average of 10 Jrns</bold></td>
<td align="right" valign="middle"><bold>California literature</bold></td>
<td align="right" valign="middle"><bold>Random literature</bold></td></tr>
<tr>
<td align="right" valign="bottom"><bold>Text</bold></td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td>
<td align="right" valign="bottom">100</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ab</bold></td>
<td align="right" valign="bottom">95.33</td>
<td align="right" valign="bottom">85.51</td>
<td align="right" valign="bottom">89.89</td>
<td align="right" valign="bottom">98.37</td>
<td align="right" valign="bottom">98.38</td>
<td align="right" valign="bottom">98.47</td>
<td align="right" valign="bottom">98.87</td>
<td align="right" valign="bottom">98.63</td>
<td align="right" valign="bottom">98.64</td>
<td align="right" valign="bottom">98.79</td>
<td align="right" valign="bottom">96.08</td>
<td align="right" valign="bottom">99.01</td>
<td align="right" valign="bottom">93.83</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Lastsen</bold></td>
<td align="right" valign="bottom">48.46</td>
<td align="right" valign="bottom">41.76</td>
<td align="right" valign="bottom">41.7</td>
<td align="right" valign="bottom">49.3</td>
<td align="right" valign="bottom">50.1</td>
<td align="right" valign="bottom">55.99</td>
<td align="right" valign="bottom">49.3</td>
<td align="right" valign="bottom">50.18</td>
<td align="right" valign="bottom">52.75</td>
<td align="right" valign="bottom">47.21</td>
<td align="right" valign="bottom">48.67</td>
<td align="right" valign="bottom">59.98</td>
<td align="right" valign="bottom">27.01</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold></td>
<td align="right" valign="bottom">66.73</td>
<td align="right" valign="bottom">77.47</td>
<td align="right" valign="bottom">75.77</td>
<td align="right" valign="bottom">57.01</td>
<td align="right" valign="bottom">53.6</td>
<td align="right" valign="bottom">68.34</td>
<td align="right" valign="bottom">55.29</td>
<td align="right" valign="bottom">52.9</td>
<td align="right" valign="bottom">61.71</td>
<td align="right" valign="bottom">52.83</td>
<td align="right" valign="bottom">62.16</td>
<td align="right" valign="bottom">66.23</td>
<td align="right" valign="bottom">53.55</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold> &#x0002B; <bold>Ab</bold></td>
<td align="right" valign="bottom">62.07</td>
<td align="right" valign="bottom">62.98</td>
<td align="right" valign="bottom">65.67</td>
<td align="right" valign="bottom">55.39</td>
<td align="right" valign="bottom">51.99</td>
<td align="right" valign="bottom">66.81</td>
<td align="right" valign="bottom">62.33</td>
<td align="right" valign="bottom">48.01</td>
<td align="right" valign="bottom">63.79</td>
<td align="right" valign="bottom">50.43</td>
<td align="right" valign="bottom">58.94</td>
<td align="right" valign="bottom">65.24</td>
<td align="right" valign="bottom">47.39</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Ti</bold> &#x0002B; <bold>Lastsen</bold></td>
<td align="right" valign="bottom">34.6</td>
<td align="right" valign="bottom">33.48</td>
<td align="right" valign="bottom">33.53</td>
<td align="right" valign="bottom">33.32</td>
<td align="right" valign="bottom">33.93</td>
<td align="right" valign="bottom">42.28</td>
<td align="right" valign="bottom">30.66</td>
<td align="right" valign="bottom">31.52</td>
<td align="right" valign="bottom">36.89</td>
<td align="right" valign="bottom">29.75</td>
<td align="right" valign="bottom">33.99</td>
<td align="right" valign="bottom">45.57</td>
<td align="right" valign="bottom">15.16</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Tiab</bold></td>
<td align="right" valign="bottom">38.04</td>
<td align="right" valign="bottom">27.89</td>
<td align="right" valign="bottom">38.49</td>
<td align="right" valign="bottom">39.07</td>
<td align="right" valign="bottom">36.79</td>
<td align="right" valign="bottom">44.94</td>
<td align="right" valign="bottom">39.39</td>
<td align="right" valign="bottom">35.04</td>
<td align="right" valign="bottom">39.4</td>
<td align="right" valign="bottom">34.12</td>
<td align="right" valign="bottom">37.31</td>
<td align="right" valign="bottom">50.38</td>
<td align="right" valign="bottom">16.11</td></tr>
<tr>
<td align="right" valign="bottom"><bold>Tiab</bold> &#x0002B; <bold>Lastsen</bold></td>
<td align="right" valign="bottom">25.59</td>
<td align="right" valign="bottom">19.77</td>
<td align="right" valign="bottom">24.76</td>
<td align="right" valign="bottom">26.49</td>
<td align="right" valign="bottom">28.08</td>
<td align="right" valign="bottom">31.84</td>
<td align="right" valign="bottom">25.3</td>
<td align="right" valign="bottom">24.67</td>
<td align="right" valign="bottom">28.21</td>
<td align="right" valign="bottom">22.78</td>
<td align="right" valign="bottom">25.74</td>
<td align="right" valign="bottom">38.14</td>
<td align="right" valign="bottom">7.10</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-information-02-00266" position="float">
<label>Table 3.</label>
<caption>
<p>Top 20 characteristic terms extracted from the <italic>International Journal of Food Microbiology</italic>, ranked by raw p-value <italic>vs.</italic> by corrected p-value (see text for details). F is the number of times the term occurs within text fields in MEDLINE and f is the number of occurrences in the journal.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th colspan="4" align="center" valign="top"><bold>Top 20 ranked by raw p-value score</bold></th>
<th colspan="4" align="center" valign="top"><bold>Top 20 ranked by corrected p-value score</bold></th></tr>
<tr>
<th colspan="8" valign="bottom">
<hr/></th></tr>
<tr>
<th align="center" valign="middle"><bold>Term</bold></th>
<th align="left" valign="middle"><bold>f</bold></th>
<th align="left" valign="middle"><bold>F</bold></th>
<th align="center" valign="middle"><bold>raw p-value score</bold></th>
<th align="center" valign="middle"><bold>Term</bold></th>
<th align="left" valign="middle"><bold>f</bold></th>
<th align="left" valign="middle"><bold>F</bold></th>
<th align="center" valign="middle"><bold>corrected p-value score</bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">food</td>
<td align="left" valign="top">718</td>
<td align="right" valign="top">102,805</td>
<td align="center" valign="top">1.04 &#x000D7; 10<sup>&#x02212;847</sup></td>
<td align="left" valign="top">food</td>
<td align="left" valign="top">718</td>
<td align="right" valign="top">102,805</td>
<td align="center" valign="top">3.01 &#x000D7; 10<sup>&#x02212;847</sup></td></tr>
<tr>
<td align="left" valign="top">listeria</td>
<td align="left" valign="top">384</td>
<td align="right" valign="top">6,879</td>
<td align="center" valign="top">2.11 &#x000D7; 10<sup>&#x02212;797</sup></td>
<td align="left" valign="top">listeria</td>
<td align="left" valign="top">384</td>
<td align="right" valign="top">6,879</td>
<td align="center" valign="top">1.14 &#x000D7; 10<sup>&#x02212;796</sup></td></tr>
<tr>
<td align="left" valign="top">strain</td>
<td align="left" valign="top">775</td>
<td align="right" valign="top">246,958</td>
<td align="center" valign="top">6.76 &#x000D7; 10<sup>&#x02212;656</sup></td>
<td align="left" valign="top">strain</td>
<td align="left" valign="top">775</td>
<td align="right" valign="top">246,958</td>
<td align="center" valign="top">1.81 &#x000D7; 10<sup>&#x02212;655</sup></td></tr>
<tr>
<td align="left" valign="top">listeria monocytogene</td>
<td align="left" valign="top">310</td>
<td align="right" valign="top">5,648</td>
<td align="center" valign="top">6.90 &#x000D7; 10<sup>&#x02212;642</sup></td>
<td align="left" valign="top">listeria monocytogene</td>
<td align="left" valign="top">310</td>
<td align="right" valign="top">5,648</td>
<td align="center" valign="top">4.62 &#x000D7; 10<sup>&#x02212;641</sup></td></tr>
<tr>
<td align="left" valign="top">degree c</td>
<td align="left" valign="top">625</td>
<td align="right" valign="top">139,694</td>
<td align="center" valign="top">3.84 &#x000D7; 10<sup>&#x02212;621</sup></td>
<td align="left" valign="top">degree c</td>
<td align="left" valign="top">625</td>
<td align="right" valign="top">139,694</td>
<td align="center" valign="top">1.28 &#x000D7; 10<sup>&#x02212;620</sup></td></tr>
<tr>
<td align="left" valign="top">meat</td>
<td align="left" valign="top">309</td>
<td align="right" valign="top">11,582</td>
<td align="center" valign="top">1.81 &#x000D7; 10<sup>&#x02212;543</sup></td>
<td align="left" valign="top">meat</td>
<td align="left" valign="top">309</td>
<td align="right" valign="top">11,582</td>
<td align="center" valign="top">1.22 &#x000D7; 10<sup>&#x02212;542</sup></td></tr>
<tr>
<td align="left" valign="top">l monocytogene</td>
<td align="left" valign="top">231</td>
<td align="right" valign="top">2,514</td>
<td align="center" valign="top">2.05 &#x000D7; 10<sup>&#x02212;530</sup></td>
<td align="left" valign="top">l monocytogene</td>
<td align="left" valign="top">231</td>
<td align="right" valign="top">2,514</td>
<td align="center" valign="top">1.84 &#x000D7; 10<sup>&#x02212;529</sup></td></tr>
<tr>
<td align="left" valign="top">lactic acid</td>
<td align="left" valign="top">257</td>
<td align="right" valign="top">7,008</td>
<td align="center" valign="top">1.11 &#x000D7; 10<sup>&#x02212;487</sup></td>
<td align="left" valign="top">lactic acid</td>
<td align="left" valign="top">257</td>
<td align="right" valign="top">7,008</td>
<td align="center" valign="top">9.03 &#x000D7; 10<sup>&#x02212;487</sup></td></tr>
<tr>
<td align="left" valign="top">lactobacillus</td>
<td align="left" valign="top">238</td>
<td align="right" valign="top">5,441</td>
<td align="center" valign="top">6.39 &#x000D7; 10<sup>&#x02212;470</sup></td>
<td align="left" valign="top">lactobacillus</td>
<td align="left" valign="top">238</td>
<td align="right" valign="top">5,441</td>
<td align="center" valign="top">5.57 &#x000D7; 10<sup>&#x02212;469</sup></td></tr>
<tr>
<td align="left" valign="top">lactic acid bacteria</td>
<td align="left" valign="top">188</td>
<td align="right" valign="top">1,324</td>
<td align="center" valign="top">1.52 &#x000D7; 10<sup>&#x02212;467</sup></td>
<td align="left" valign="top">lactic acid bacteria</td>
<td align="left" valign="top">188</td>
<td align="right" valign="top">1,324</td>
<td align="center" valign="top">1.67 &#x000D7; 10<sup>&#x02212;466</sup></td></tr>
<tr>
<td align="left" valign="top">lactic</td>
<td align="left" valign="top">270</td>
<td align="right" valign="top">13,490</td>
<td align="center" valign="top">1.07 &#x000D7; 10<sup>&#x02212;441</sup></td>
<td align="left" valign="top">lactic</td>
<td align="left" valign="top">270</td>
<td align="right" valign="top">13,490</td>
<td align="center" valign="top">8.22 &#x000D7; 10<sup>&#x02212;441</sup></td></tr>
<tr>
<td align="left" valign="top">temperature</td>
<td align="left" valign="top">467</td>
<td align="right" valign="top">156,235</td>
<td align="center" valign="top">8.00 &#x000D7; 10<sup>&#x02212;387</sup></td>
<td align="left" valign="top">temperature</td>
<td align="left" valign="top">467</td>
<td align="right" valign="top">156,235</td>
<td align="center" valign="top">3.56 &#x000D7; 10<sup>&#x02212;386</sup></td></tr>
<tr>
<td align="left" valign="top">degree</td>
<td align="left" valign="top">661</td>
<td align="right" valign="top">405,979</td>
<td align="center" valign="top">3.17 &#x000D7; 10<sup>&#x02212;386</sup></td>
<td align="left" valign="top">degree</td>
<td align="left" valign="top">661</td>
<td align="right" valign="top">405,979</td>
<td align="center" valign="top">9.95 &#x000D7; 10<sup>&#x02212;386</sup></td></tr>
<tr>
<td align="left" valign="top">salmonella</td>
<td align="left" valign="top">294</td>
<td align="right" valign="top">32,199</td>
<td align="center" valign="top">6.87 &#x000D7; 10<sup>&#x02212;382</sup></td>
<td align="left" valign="top">salmonella</td>
<td align="left" valign="top">294</td>
<td align="right" valign="top">32,199</td>
<td align="center" valign="top">4.85 &#x000D7; 10<sup>&#x02212;381</sup></td></tr>
<tr>
<td align="left" valign="top">spp</td>
<td align="left" valign="top">245</td>
<td align="right" valign="top">15,673</td>
<td align="center" valign="top">5.88 &#x000D7; 10<sup>&#x02212;375</sup></td>
<td align="left" valign="top">growth</td>
<td align="left" valign="top">662</td>
<td align="right" valign="top">426,410</td>
<td align="center" valign="top">3.91 &#x000D7; 10<sup>&#x02212;374</sup></td></tr>
<tr>
<td align="left" valign="top">growth</td>
<td align="left" valign="top">662</td>
<td align="right" valign="top">426,410</td>
<td align="center" valign="top">1.25 &#x000D7; 10<sup>&#x02212;374</sup></td>
<td align="left" valign="top">spp</td>
<td align="left" valign="top">245</td>
<td align="right" valign="top">15,673</td>
<td align="center" valign="top">4.98 &#x000D7; 10<sup>&#x02212;374</sup></td></tr>
<tr>
<td align="left" valign="top">ph</td>
<td align="left" valign="top">413</td>
<td align="right" valign="top">157,692</td>
<td align="center" valign="top">3.99 &#x000D7; 10<sup>&#x02212;320</sup></td>
<td align="left" valign="top">ph</td>
<td align="left" valign="top">413</td>
<td align="right" valign="top">157,692</td>
<td align="center" valign="top">2.00 &#x000D7; 10<sup>&#x02212;319</sup></td></tr>
<tr>
<td align="left" valign="top">isolate</td>
<td align="left" valign="top">331</td>
<td align="right" valign="top">81,127</td>
<td align="center" valign="top">2.75 &#x000D7; 10<sup>&#x02212;317</sup></td>
<td align="left" valign="top">isolate</td>
<td align="left" valign="top">331</td>
<td align="right" valign="top">81,127</td>
<td align="center" valign="top">1.73 &#x000D7; 10<sup>&#x02212;316</sup></td></tr>
<tr>
<td align="left" valign="top">storage</td>
<td align="left" valign="top">265</td>
<td align="right" valign="top">47,338</td>
<td align="center" valign="top">1.63 &#x000D7; 10<sup>&#x02212;289</sup></td>
<td align="left" valign="top">storage</td>
<td align="left" valign="top">265</td>
<td align="right" valign="top">47,338</td>
<td align="center" valign="top">1.28 &#x000D7; 10<sup>&#x02212;288</sup></td></tr>
<tr>
<td align="left" valign="top">foodborne</td>
<td align="left" valign="top">113</td>
<td align="right" valign="top">1,198</td>
<td align="center" valign="top">8.98 &#x000D7; 10<sup>&#x02212;262</sup></td>
<td align="left" valign="top">foodborne</td>
<td align="left" valign="top">113</td>
<td align="right" valign="top">1,198</td>
<td align="center" valign="top">1.65 &#x000D7; 10<sup>&#x02212;260</sup></td></tr></tbody></table></table-wrap></sec>
<ack>
<p>This Human Brain Project/Neuroinformatics research (LM007292 and LM08364) is funded jointly by the National Library of Medicine and the National Institute of Mental Health. The MEDLINE database and the MMTx program (a Java implementation of the MetaMap algorithm) were graciously provided by the National Library of Medicine.</p></ack>
<ref-list>
<title>References and Notes</title>
<ref id="b1-information-02-00266"><label>1.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Grishman</surname><given-names>R.</given-names></name><name><surname>Kittredge</surname><given-names>R.</given-names></name></person-group><source>Analyzing Language in Restricted Domains: Sublanguage Description and Processing</source><publisher-name>Lawrence Erlbaum Associates</publisher-name><publisher-loc>Mahwah, NJ, USA</publisher-loc><year>1986</year><fpage>19</fpage><lpage>38</lpage></citation></ref>
<ref id="b2-information-02-00266"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Y.</given-names></name><name><surname>Brandon</surname><given-names>M.</given-names></name><name><surname>Navathe</surname><given-names>S.</given-names></name><name><surname>Dingledine</surname><given-names>R.</given-names></name><name><surname>Ciliax</surname><given-names>B.J.</given-names></name></person-group><article-title>Text Mining Functional Keywords Associated with Genes</article-title><source>Medinfo</source><year>2004</year><volume>107</volume><fpage>292</fpage><lpage>296</lpage></citation></ref>
<ref id="b3-information-02-00266"><label>3.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Tudor</surname><given-names>C.O.</given-names></name><name><surname>Vijay-Shanker</surname><given-names>K.</given-names></name><name><surname>Schmidt</surname><given-names>C.J.</given-names></name></person-group><article-title>Mining the Biomedical Literature for Genic Information</article-title><conf-name>Proceedings of BioNLP Workshop in Conjunction with ACL-2008</conf-name><conf-loc>Columbus, OH, USA</conf-loc><conf-date>28-29 June 2008</conf-date></citation></ref>
<ref id="b4-information-02-00266"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Andrade</surname><given-names>M.A.</given-names></name><name><surname>Valencia</surname><given-names>A.</given-names></name></person-group><article-title>Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families</article-title><source>Bioinformatics</source><year>1998</year><volume>14</volume><fpage>600</fpage><lpage>607</lpage></citation></ref>
<ref id="b5-information-02-00266"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kostoff</surname><given-names>R.N.</given-names></name><name><surname>Block</surname><given-names>J.A.</given-names></name><name><surname>Stump</surname><given-names>J.A.</given-names></name><name><surname>Pfeil</surname><given-names>K.M.</given-names></name></person-group><article-title>Information content in Medline record fields</article-title><source>Int. J. Med. Inform.</source><year>2004</year><volume>73</volume><fpage>515</fpage><lpage>527</lpage></citation></ref>
<ref id="b6-information-02-00266"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuemie</surname><given-names>M.J.</given-names></name><name><surname>Weeber</surname><given-names>M.</given-names></name><name><surname>Schijvenaars</surname><given-names>B.J.</given-names></name><name><surname>van Mulligen</surname><given-names>E.M.</given-names></name><name><surname>van der Eijk</surname><given-names>C.C.</given-names></name><name><surname>Jelier</surname><given-names>R.</given-names></name><name><surname>Mons</surname><given-names>B.</given-names></name><name><surname>Kors</surname><given-names>J.A.</given-names></name></person-group><article-title>Distribution of Information in Biomedical Abstracts and Full-text Publications</article-title><source>Bioinformatics</source><year>2004</year><volume>20</volume><fpage>2597</fpage><lpage>2604</lpage></citation></ref>
<ref id="b7-information-02-00266"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shah</surname><given-names>P.K.</given-names></name><name><surname>Perez-Iratxeta</surname><given-names>C.</given-names></name><name><surname>Bork</surname><given-names>P.</given-names></name><name><surname>Andrade</surname><given-names>M.A.</given-names></name></person-group><article-title>Information Extraction from Full Text Scientific Articles: Where Are the Keywords?</article-title><source>BMC Bioinform.</source><year>2003</year><volume>4</volume><fpage>20</fpage></citation></ref>
<ref id="b8-information-02-00266"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smalheiser</surname><given-names>N.R.</given-names></name><name><surname>Torvik</surname><given-names>V.I.</given-names></name><name><surname>Bischoff-Grethe</surname><given-names>A.</given-names></name><name><surname>Burhans</surname><given-names>L.B.</given-names></name><name><surname>Gabriel</surname><given-names>M.</given-names></name><name><surname>Homayouni</surname><given-names>R.</given-names></name><name><surname>Kashef</surname><given-names>A.</given-names></name><name><surname>Martone</surname><given-names>M.E.</given-names></name><name><surname>Perkins</surname><given-names>G.A.</given-names></name><name><surname>Price</surname><given-names>D.L.</given-names></name><name><surname>Talk</surname><given-names>A.C.</given-names></name><name><surname>West</surname><given-names>R.</given-names></name></person-group><article-title>Collaborative Development of the Arrowsmith Two Node Search Interface Designed For Laboratory Investigators</article-title><source>J. Biomed. Discov. Collab.</source><year>2006</year><volume>1</volume><fpage>8</fpage></citation></ref>
<ref id="b9-information-02-00266"><label>9.</label><citation citation-type="web"><comment><ext-link xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T43/" ext-link-type="uri">http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T43/</ext-link>.</comment></citation></ref>
<ref id="b10-information-02-00266"><label>10.</label><citation citation-type="web"><comment><ext-link xlink:href="http://l2r.cs.uiuc.edu/~cogcomp/tools.php" ext-link-type="uri">http://l2r.cs.uiuc.edu/~cogcomp/tools.php</ext-link>.</comment></citation></ref>
<ref id="b11-information-02-00266"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smalheiser</surname><given-names>N.R.</given-names></name><name><surname>Zhou</surname><given-names>W.</given-names></name><name><surname>Torvik</surname><given-names>V.I.</given-names></name></person-group><article-title>Anne O&#x00027;Tate: A Tool to Support User-Driven Summarization, Drill-Down And Browsing Of Pubmed Search Results</article-title><source>J. Biomed. Discov. Collab.</source><year>2008</year><volume>3</volume><fpage>2</fpage></citation></ref>
<ref id="b12-information-02-00266"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Torvik</surname><given-names>V.I.</given-names></name><name><surname>Smalheiser</surname><given-names>N.R.</given-names></name></person-group><article-title>Author Name Disambiguation in MEDLINE</article-title><source>ACM Trans. Knowl. Discov. Data</source><year>2009</year><volume>3</volume><fpage>11</fpage></citation></ref>
<ref id="b13-information-02-00266"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hersh</surname><given-names>W.</given-names></name><name><surname>Price</surname><given-names>S.</given-names></name><name><surname>Donohoe</surname><given-names>L.</given-names></name></person-group><article-title>Assessing Thesaurus-based Query Expansion Using the UMLS Metathesaurus</article-title><source>Proc. AMIA Symp.</source><year>2000</year><fpage>344</fpage><lpage>348</lpage></citation></ref>
<ref id="b14-information-02-00266"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilbur</surname><given-names>W.J.</given-names></name><name><surname>Yang</surname><given-names>Y.</given-names></name></person-group><article-title>An Analysis of Statistical Term Strength and Its Use in the Indexing and Retrieval of Molecular Biology Texts</article-title><source>Comput. Biol. Med.</source><year>1996</year><volume>26</volume><fpage>209</fpage><lpage>222</lpage></citation></ref>
<ref id="b15-information-02-00266"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilbur</surname><given-names>W.J.</given-names></name></person-group><article-title>A Thematic Analysis of the AIDS Literature</article-title><source>Pac. Symp. Biocomput.</source><year>2002</year><fpage>386</fpage><lpage>397</lpage></citation></ref>
<ref id="b16-information-02-00266"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>H.</given-names></name><name><surname>Ng</surname><given-names>T.D.</given-names></name><name><surname>Martinez</surname><given-names>J.</given-names></name><name><surname>Schatz</surname><given-names>B.R.</given-names></name></person-group><article-title>A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System</article-title><source>J. Am. Soc. Inf. Sci.</source><year>1997</year><volume>48</volume><fpage>17</fpage><lpage>31</lpage></citation></ref>
<ref id="b17-information-02-00266"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Homayouni</surname><given-names>R.</given-names></name><name><surname>Heinrich</surname><given-names>K.</given-names></name><name><surname>Wei</surname><given-names>L.</given-names></name><name><surname>Berry</surname><given-names>M.W.</given-names></name></person-group><article-title>Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts</article-title><source>Bioinformatics</source><year>2004</year><volume>73</volume><fpage>515</fpage><lpage>527</lpage></citation></ref>
<ref id="b18-information-02-00266"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Erkan</surname><given-names>G.</given-names></name><name><surname>Radev</surname><given-names>D.R.</given-names></name></person-group><article-title>LexRank: Graph-based Centrality as Salience in Text Summarization</article-title><source>J. Artif. Intell. Res.</source><year>2004</year><volume>22</volume><fpage>457</fpage><lpage>479</lpage></citation></ref>
<ref id="b19-information-02-00266"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Torvik</surname><given-names>V.I.</given-names></name><name><surname>Smalheiser</surname><given-names>N.R.</given-names></name></person-group><article-title>A Quantitative Model for Linking Two Disparate Sets of Articles in MEDLINE</article-title><source>Bioinformatics</source><year>2007</year><volume>23</volume><fpage>1658</fpage><lpage>1665</lpage></citation></ref>
<ref id="b20-information-02-00266"><label>20.</label><citation citation-type="web"><comment><ext-link xlink:href="http://arrowsmith.psych.uic.edu" ext-link-type="uri">http://arrowsmith.psych.uic.edu</ext-link>.</comment></citation></ref></ref-list></back></article>
