<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">101238455</journal-id>
<journal-title>International Journal of Environmental Research and Public Health</journal-title>
<issn pub-type="ppub">1661-7827</issn>
<issn pub-type="epub">1660-4601</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/ijerph7020596</article-id>
<article-id pub-id-type="publisher-id">ijerph-07-00596</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Text and Structural Data Mining of Influenza Mentions in Web and Social Media</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Corley</surname><given-names>Courtney D.</given-names></name><xref ref-type="aff" rid="af1-ijerph-07-00596">1</xref><xref ref-type="corresp" rid="c1-ijerph-07-00596">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Cook</surname><given-names>Diane J.</given-names></name><xref ref-type="aff" rid="af2-ijerph-07-00596">2</xref></contrib>
<contrib contrib-type="author">
<name><surname>Mikler</surname><given-names>Armin R.</given-names></name><xref ref-type="aff" rid="af3-ijerph-07-00596">3</xref></contrib>
<contrib contrib-type="author">
<name><surname>Singh</surname><given-names>Karan P.</given-names></name><xref ref-type="aff" rid="af4-ijerph-07-00596">4</xref></contrib></contrib-group>
<aff id="af1-ijerph-07-00596">
<label>1</label> Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, WA 99352, USA</aff>
<aff id="af2-ijerph-07-00596">
<label>2</label> School of Electrical Engineering and Computer Science, Washington State University, PO Box 642752 Pullman, Washington 99164, USA; E-Mail: 
<email>cook@eecs.wsu.edu</email></aff>
<aff id="af3-ijerph-07-00596">
<label>3</label> Department of Computer Science and Engineering, University of North Texas, 1155 Union Circle #311366 Denton, TX 76203, USA; E-Mail: 
<email>mikler@unt.edu</email></aff>
<aff id="af4-ijerph-07-00596">
<label>4</label> Department of Biostatistics, University of North Texas Health Science Center, 3500 Camp Bowie Blvd. Fort Worth, TX 76107, USA; E-Mail: 
<email>ksingh@hsc.unt.edu</email></aff>
<author-notes>
<corresp id="c1-ijerph-07-00596">
<label>*</label> Author to whom correspondence should be addressed; E-Mail: 
<email>court@pnl.gov</email>; Tel.: +1-509-375-2326; Fax: +1-509-375-2443.</corresp></author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>2</month>
<year>2010</year></pub-date>
<pub-date pub-type="ppub">
<month>2</month>
<year>2010</year></pub-date>
<volume>7</volume>
<issue>2</issue>
<fpage>596</fpage>
<lpage>615</lpage>
<history>
<date date-type="received">
<day>9</day>
<month>11</month>
<year>2009</year></date>
<date date-type="accepted">
<day>10</day>
<month>2</month>
<year>2010</year></date></history>
<permissions>
<copyright-statement>© 2010 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland.</copyright-statement>
<copyright-year>2010</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0">
<p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Text and structural data mining of web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.</p></abstract>
<kwd-group>
<kwd>disease surveillance</kwd>
<kwd>public health epidemiology</kwd>
<kwd>health informatics</kwd>
<kwd>graph-based data mining</kwd>
<kwd>web and social media</kwd>
<kwd>social network analysis</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Influenza diagnosis based solely on the presentation of symptoms is limited because these symptoms may be associated with many other diseases. Serologic and antigen tests require that a patient with influenza-like illness (ILI) be examined by a physician who can either conduct a rapid diagnostic test or take blood samples for laboratory testing. This suggests that many cases of influenza remain undiagnosed. While the presence of influenza in an individual can be confirmed through specific diagnostic tests, the influenza prevalence in the population at any given time is unknown and can only be estimated. In the past, such estimates have relied solely on the extrapolation of diagnosed cases, making it difficult to identify the various phases of seasonal influenza or to identify a more serious manifestation of a flu epidemic.</p>
<p>Web and social media (WSM) provide a resource to detect increases in ILI. This paper evaluates blog posts, a type of WSM, that discuss influenza, and the analyses show a significant correlation with patient reporting of ILI during the US 2008–2009 influenza season. Preliminary experimental results on data covering two months in 2008 have been published in conference proceedings [<xref ref-type="bibr" rid="b1-ijerph-07-00596">1</xref>]. In this article, we present a comprehensive analysis covering 24 weeks of data. A well-defined response strategy to an outbreak may make use of WSM to reduce the population and human impact of the disease. We suggest a possible response that identifies WSM influenza-related communities that share flu-related postings. These community or crowd sources could broker and disseminate important intervention information in the case of an infectious disease outbreak. Our proposed framework, in <xref ref-type="fig" rid="f1-ijerph-07-00596">Figure 1</xref>, visually describes this approach to detecting and responding to influenza epidemics.</p>
<p>We briefly discuss a history of infectious disease outbreaks and recent approaches in online public health surveillance of influenza. We also discuss the value of social community with regard to outbreak responses. Next, the data set used in our analysis is presented and the methodology for information extraction and trend analysis is outlined. Through discovery and verification of trends in influenza-related blogs, we verify a correlation to Centers for Disease Control and Prevention (CDC) ILI patient reporting at sentinel healthcare providers. Additionally, categories, frequency, and influenza-post persistence qualitatively assist ILI trend identification in blogs. Strongly connected communities are evaluated and influential bloggers identified that should be part of a WSM outbreak response. Then we leverage graph-based data mining to further identify structural anomalies in the flu blogosphere that correspond to increases in ILI.</p>
<sec>
<title>Using Web and Social Media for Biosurveillance</title>
<p>The pervasiveness and ubiquity of internet resources provide individuals with access to many information sources that facilitate self-diagnosis and provide means for nontraditional biosurveillance; for example, one can combine specific disease symptoms to form search queries. The results of such search queries often lead to sites that may help diagnose the illness and offer medical advice (e.g., <ext-link xlink:href="PeopleLikeMe.com" ext-link-type="uri">PeopleLikeMe.com</ext-link>, <ext-link xlink:href="WebMD.com" ext-link-type="uri">WebMD.com</ext-link>). Recently, Google™ has addressed this issue by capturing the query keywords and identifying specific searches involving search terms that indicate ILI [<xref ref-type="bibr" rid="b2-ijerph-07-00596">2</xref>]. Published research on influenza internet surveillance also includes search “advertisement click-through” [<xref ref-type="bibr" rid="b3-ijerph-07-00596">3</xref>], using a set of Yahoo search queries containing the words “flu” or “influenza” [<xref ref-type="bibr" rid="b4-ijerph-07-00596">4</xref>], and health website access logs [<xref ref-type="bibr" rid="b5-ijerph-07-00596">5</xref>,<xref ref-type="bibr" rid="b6-ijerph-07-00596">6</xref>]. Other information sources, such as telephone triage services, can be useful for ILI detection. The findings in Yih <italic>et al</italic>. [<xref ref-type="bibr" rid="b7-ijerph-07-00596">7</xref>] show that telephone triage service is not a reliable measure for influenza surveillance due to service coverage; however, it may be beneficial in certain situations where other surveillance measures are inadequate. We envision several applications that leverage automatic open source document analytics for biosurveillance: such a system could provide lagging indicators of a disease outbreak to a component of a US port and border’s biosurveillance system (Personal communication with Dr. 
Andrew Plummer, Centers for Disease Control and Prevention, National Center for Preparedness, Detection, and Control of Infectious Diseases, Division of Global Migration and Quarantine); a second application in development is the recently EU-funded project Medical Ecosystem Personalized Event-Based, and a hypothetical third application could provide workflow in existing global surveillance systems (such as Argus Global at Georgetown University) that must employ linguists to curate bio-event notices.</p></sec></sec>
<sec sec-type="methods">
<label>2.</label>
<title>Data and Methods</title>
<sec sec-type="methods">
<label>2.1.</label>
<title>Data</title>
<p>Spinn3r [<xref ref-type="bibr" rid="b8-ijerph-07-00596">8</xref>] is a WSM indexing service that conducts real-time indexing of all blogs, with a throughput of over 100,000 new blogs indexed per hour. Blog posts are accessed through an open source Java application programming interface (API). Metadata available with this data set (see <xref ref-type="fig" rid="f2-ijerph-07-00596">Figure 2</xref>, an example XML encoding of a social media post that mentions flu) includes the following (if reported by source): blog title, blog URL, post title, post URL, date posted (accurate to seconds), description, full HTML encoded content, subject tags annotated by author, and language.</p>
<p>Data are selected from an arbitrary time period of 24 weeks, beginning 5 October 2008 and ending 21 March 2009. A total of 158,497,700 WSM items were pulled from Spinn3r RSS and ATOM feeds. We identified a significant increase in blog coverage resulting from the success of the Spinn3r service and subsequent expansion of web crawlers in addition to organic growth of WSM publishing, as shown in <xref ref-type="fig" rid="f3-ijerph-07-00596">Figure 3</xref>. It is evident from the average number of blogs posted per day of week summarized in <xref ref-type="fig" rid="f4-ijerph-07-00596">Figure 4b</xref> that most WSM in this data are published during the week and fewer on the weekends. A majority of the articles we analyzed were weblogs (labeled by Spinn3r); mainstream media accounts for 20% of the data and the remaining types include forums and classified ads (see <xref ref-type="fig" rid="f4-ijerph-07-00596">Figure 4a</xref>). In the analysis reported here, we select English language WSM items indexed by Spinn3r when a lexical match exists to the terms <italic>influenza</italic> and <italic>flu</italic> anywhere in their content (misspellings and synonyms are not considered). The blog items are grouped by month, week (Sunday to Saturday), and day of week. The extracted blog items containing influenza keywords are herein termed flu-content posts or <bold>FC-posts</bold>. Missing from our data are more recent evolutions of WSM such as micro-blogs, wikis, and deep-web communities that are often gated and not indexed in shallow web crawls.</p>
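<p>The keyword selection and Sunday-to-Saturday grouping described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names (is_fc_post, week_start, weekly_fc_counts) are hypothetical, the posts are assumed to arrive as (date, content) pairs, and the lexical match is assumed to be a case-insensitive whole-word match.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import re
from collections import Counter
from datetime import date, timedelta

# Assumption: a case-insensitive, whole-word match on the two keywords.
# Misspellings and synonyms are deliberately not handled, as in the text.
FLU_PATTERN = re.compile(r"\b(influenza|flu)\b", re.IGNORECASE)


def is_fc_post(content):
    """Return True when the content mentions 'influenza' or 'flu'."""
    return FLU_PATTERN.search(content) is not None


def week_start(day):
    """Return the Sunday that opens the Sunday-to-Saturday week of `day`."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def weekly_fc_counts(posts):
    """Count FC-posts per week from an iterable of (date, content) pairs."""
    counts = Counter()
    for posted, content in posts:
        if is_fc_post(content):
            counts[week_start(posted)] += 1
    return counts
```
]]></preformat>
<p>Under these assumptions, a post dated Saturday 11 October 2008 falls in the week that begins on Sunday 5 October 2008, the first day of the study period.</p>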
<p>Indexing, parsing, and link extraction code was written in Python, parallelized using pyMPI and openmpi, and executed on an eight-node cluster (2.66 GHz Quad Core Xeon processors) with 64 cores, 256 GB of memory, and 30 TB of network storage [<xref ref-type="bibr" rid="b9-ijerph-07-00596">9</xref>,<xref ref-type="bibr" rid="b10-ijerph-07-00596">10</xref>]. This compute resource is housed at the University of North Texas Center for Computational Epidemiology and Response Analysis.</p></sec>
<sec sec-type="methods">
<label>2.2.</label>
<title>Analysis</title>
<sec>
<label>2.2.1.</label>
<title>Text mining to monitor influenza trends</title>
<p>Text mining is the process of discovering information in large text collections and automatically identifying interesting patterns and relationships in textual data [<xref ref-type="bibr" rid="b11-ijerph-07-00596">11</xref>]. Text mining is particularly related to data mining, an older research area focused on the extraction of significant information from data records. However, text mining has proven to be more difficult than data mining, as the source data consists of unstructured collections of documents rather than structured databases. A large number of applications now utilize text mining, including question-answering applications, automatic construction of databases on job postings, and dictionary construction. Feldman and Sanger [<xref ref-type="bibr" rid="b12-ijerph-07-00596">12</xref>] have recently published a thorough survey of research work in the area of text mining.</p>
<p>Influenza WSM item trends can be monitored using the social media mining methodology presented in this paper. This methodology facilitates identification of outbreaks and increases of influenza infection in the population. We posit a strong correlation exists between the frequency of FC-posts per week and CDC ILI surveillance data. Qualitative assessment of category tags, prevalence of FC-posts on a blog site, and persistent posting of flu-related posts also suggest ILI trends.</p>
<p>We hypothesize that the frequency of blog-world flu posts correlates with patient reporting of ILI and with the US flu season. To verify this hypothesis, we compare our data to CDC surveillance reports from sentinel healthcare providers. The CDC website states that the Outpatient Influenza-like-illness Surveillance Network (ILINet) consists of about 2,400 healthcare providers in 50 states reporting approximately 16 million patient visits each year. Each provider reports data to CDC on the total number of patients seen and the number of those patients with ILI by age group. For this system, ILI is defined as fever (temperature of 100 °F [37.8 °C] or greater) and a cough and/or a sore throat in the absence of a known cause other than influenza [<xref ref-type="bibr" rid="b13-ijerph-07-00596">13</xref>].</p></sec>
<sec>
<label>2.2.2.</label>
<title>Graph-based structure mining to discover blog flu communities and anomaly detection</title>
<p>WSM communities will play a vital role in any public health response to an outbreak. Influential bloggers can disseminate and broker response strategies and interventions in their blog communities. These bloggers could be first responders to a disease outbreak, in an information sense. Their readers will hopefully trigger an information cascade, spreading public health communications (<italic>i.e.</italic>, to vaccinate, quarantine, close schools, <italic>etc</italic>.). Although considerably less costly than a mainstream media campaign, a WSM targeted response must be cost-effective and optimized to achieve maximum strategy penetration. Any blogger participating in a public health campaign needs to have influence in their community and the ability to disseminate information to other WSM. Closeness and betweenness centrality measures and Google’s PageRank (eigenvector centrality) will rank influenza community blog sites in order to target key actors. Additionally, the Girvan-Newman community finding algorithm will identify communities of interest.</p>
<p>Moreover, graph-based algorithms can be leveraged not only to identify communities but also to facilitate bio-event detection by searching for anomalies in the link-structure of WSM. Numerous approaches have been developed for discovering concepts in linear, attribute-value databases. Current data mining research focuses primarily on algorithms to discover sets of attributes that can discriminate data entities into classes, such as shopping or banking trends for a particular demographic group. These approaches are difficult when key concepts involve relationships between the data points. In contrast, we are developing data mining techniques to discover patterns consisting of complex relationships between entities. We have introduced a method for discovering substructures in structural databases implemented in the Subdue system [<xref ref-type="bibr" rid="b14-ijerph-07-00596">14</xref>]. In contrast with alternative approaches, Subdue is devised for general-purpose automated discovery, concept learning, and hierarchical clustering. Hence, the method can be applied to many structural domains. Subdue is leveraged in our analysis to identify non-obvious patterns in blog posts that may serve as lagging-indicators of an influenza outbreak.</p></sec>
<sec>
<label>2.2.3.</label>
<title>Potential problems and associated risks</title>
<p>There are associated risks with using open source documents obtained through WSM primarily due to sample bias because of limited access to technology and truthfulness of blogger statements. People that can afford home access to computers and the internet are usually educated and literate [<xref ref-type="bibr" rid="b15-ijerph-07-00596">15</xref>,<xref ref-type="bibr" rid="b16-ijerph-07-00596">16</xref>]. However, the ubiquity of wireless internet access in public places such as libraries, restaurants, and cafes enables users from a variety of social and educational levels to engage with and contribute to WSM. Second, the validity of self-reported health diagnoses or behaviors such as voluntary quarantine (staying home when sick), vaccination, and increased hygiene is unknown. This situation raises concerns and uncertainty as to whether these self-disclosures reflect intended, false, or actual behavior (diagnoses). Previous studies of self-reported behaviors via the internet have shown that computer use encourages high levels of self-disclosure and uninhibited personal expression. This finding supports the validity of internet self-reporting; however, formal study is needed to verify the accuracy of self-reported diagnoses and behaviors in WSM.</p></sec></sec></sec>
<sec sec-type="results|discussion">
<label>3.</label>
<title>Results and Discussion</title>
<p>The CDC ILINet surveillance and FC-post per week data are plotted as two weekly time series. CDC ILI symptoms per visit at sentinel US healthcare providers label the primary Y-axis. The secondary Y-axis marks the FC-post per week frequency normalized by the total number of posts in the 24-week period. To test our hypothesis that a correlation exists between CDC ILINet reports and mined WSM FC-post frequency, the Pearson correlation coefficient, r, is evaluated between the two data series. The statistic evaluates to unity, r = 1, if the two data series are perfectly linearly related, and to zero, r = 0, if no linear correlation exists between them. In our analysis, the 24 ILI and FC-post data points yield a Pearson statistic of r = 0.545, and the correlation is significant with 95% confidence. Notice the deviation in the time series from 1 February to 21 March 2009. After close inspection of the data provided by Spinn3r, we identified a significant increase in blog coverage resulting from the success of their service and subsequent expansion of web crawlers, thereby biasing the influenza blog presence normalization. Moreover, graph-based data mining discovered a substantial presence of MySpace blogs in the last three weeks of data. We manually inspected the blogs and discovered that many of the MySpace blogs were discussing the health of American Idol contestants, several of whom were sick with the flu.</p>
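<p>The correlation step can be illustrated with a short sketch. It is not the analysis code used for the reported result; the function names are hypothetical, and the significance check shown here assumes a two-sided t-test on r with n − 2 degrees of freedom, which the text does not specify.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def normalize(weekly_counts):
    """Scale weekly FC-post counts by the total over the whole period."""
    total = sum(weekly_counts)
    return [count / total for count in weekly_counts]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom.

    Assumption: a two-sided t-test; the source does not name its test.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```
]]></preformat>
<p>Under that assumption, r = 0.545 with 24 weekly points gives a t value of roughly 3.05, which exceeds the two-sided 5% critical value of about 2.07 for 22 degrees of freedom and is consistent with the significance reported above.</p>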
<p>Each WSM item has rich metadata that can be leveraged for content analysis. A folksonomy is defined from WSM by associated author “tags” extracted from category metadata. Moreover, a folksonomy is a type of classification system for online content, created by an individual user who tags information with freely chosen keywords. The Porter stemming algorithm [<xref ref-type="bibr" rid="b17-ijerph-07-00596">17</xref>] is used to find the morphological root from the author-tagged labels. Duplicate author-tagged labels are only counted once per blogger. <xref ref-type="table" rid="t1-ijerph-07-00596">Table 1</xref> lists the top 45 categories and how often they appear. <xref ref-type="fig" rid="f6-ijerph-07-00596">Figure 6</xref> is a tag-cloud graphic, called a <italic>Wordle</italic> (see <ext-link xlink:href="www.wordle.net" ext-link-type="uri">www.wordle.net</ext-link>), that visually depicts the frequency of categories in the data. The top categories (e.g., flu, health, bird, avian, influenza) are intuitive; however, one could monitor categories that imply self or close-proxy infection such as family, sick, symptom, home, school, and other representative terms.</p>
<p>Monitoring self-identification and secondhand FC-post trends can mark increases in ILI. Bloggers who post often about influenza are more likely either (a) to be an authority on influenza (though perhaps not an expert) whose readers come to the blog for influenza information, or (b) to be frequently sick with influenza. How often or how persistently bloggers author FC-posts indicates trends as well; a blog site that has FC-posts for a limited time is more likely to be a first- or secondhand identification of ILI. The cumulative probability distribution for how many posts a blogger writes about influenza is graphically summarized in <xref ref-type="fig" rid="f7-ijerph-07-00596">Figure 7</xref>. The number of posts with influenza keywords per blogger is plotted on the X-axis and the associated probability of a blogger posting X number of flu posts labels the Y-axis; the data are plotted on a log-log scale. We see from the distribution that over 95% of bloggers post only one “flu” post, whereas the most frequent flu post blogger authored 1,897 posts. The probability of a blogger posting 1,897 flu posts is approximately 0.0000388%. The heavy-tailed distribution is indicative of social processes (e.g., number of intimate partners [<xref ref-type="bibr" rid="b18-ijerph-07-00596">18</xref>]) and is present in the distribution of flu posts per blogger. The cumulative probability distribution also supports the hypothesis that most bloggers do not frequently author influenza content. Content analysis supporting the claim that less frequent posters self-identify ILI is left for future work.</p>
<p>To study the link-based structure of bloggers authoring influenza-related content, we extract the URLs linked in each blog post. These URLs and blog permalinks (829,662 URLs) are truncated to the network location and path, resulting in 694,388 unique URLs. A link graph is then constructed from the blogger source URL and the out-links from the influenza posts. Removing self-references and parallel out-links and retaining the largest weak component produces an aggregate graph of 694,388 nodes (bloggers) and 3,529,362 directed edges (unique blogger to blogger links).</p>
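<p>A minimal sketch of the URL truncation and link graph construction is given below, using only the Python standard library. The function names and the input shape (pairs of a source URL and its list of out-links) are assumptions made for illustration, as is lower-casing the host name; the sketch also assumes every URL carries a scheme such as http://.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from urllib.parse import urlsplit


def truncate_url(url):
    """Keep only the network location and path of a URL.

    The query string and fragment are dropped. Lower-casing the host is
    an assumption of this sketch, not something stated in the text.
    """
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path


def build_link_graph(posts):
    """Build a directed graph from (source_url, out_link_urls) pairs.

    Returns a dict mapping each node to the set of nodes it links to.
    Self-references are skipped, and storing targets in a set collapses
    parallel out-links into a single edge.
    """
    graph = {}
    for source, out_links in posts:
        src = truncate_url(source)
        targets = graph.setdefault(src, set())
        for link in out_links:
            dst = truncate_url(link)
            graph.setdefault(dst, set())  # make sure the target is a node
            if dst != src:
                targets.add(dst)
    return graph
```
]]></preformat>
<p>Extracting the largest weak component is not shown; it would be a further pass over an undirected view of the same dictionary.</p>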
<p><xref ref-type="table" rid="t2-ijerph-07-00596">Table 2</xref> lists the seven most prolific flu bloggers and their degree (In, Out, and Total). The relatively low In degree supports the claim that frequent posters are news- and opinion-oriented and not always the most influential in online communities. Centrality metrics are evaluated on the same most frequent bloggers. The results are listed in <xref ref-type="table" rid="t3-ijerph-07-00596">Table 3</xref>. Three of the top posters have no in-links, implying they are spam blogs and have no influence in the “flu” blogosphere. We verify this statement with a quick hand-check of the URL. The RSS feed BirdFluMonitor has the highest out closeness centrality, but no In degree, implying they are adept at publishing links to popular blogs but are not influential themselves. Three blogs (h5n1, a flu diary, fluwikie2) are interesting hubs of the flu blogosphere. The blog “A Flu Diary” has the largest betweenness centrality (interpersonal influence) with high In and Out degree and demonstrates the capability to broker influential information in the target blogosphere. The blog h5n1 has the greatest PageRank and in-closeness centrality; moreover, it has the most published items of the most frequent bloggers and is influential in disseminating h5n1 information.</p>
<p>Using the most frequent flu bloggers is a naïve approach to finding target WSM communities to be leveraged for public health response. To advance our approach, we target strongly connected components within our flu link graph for community identification (Flake, Lawrence, and Giles definition of community [<xref ref-type="bibr" rid="b19-ijerph-07-00596">19</xref>,<xref ref-type="bibr" rid="b20-ijerph-07-00596">20</xref>]). The link graph’s largest strongly connected component contains over 17,000 unique URLs. However, its nodes are spam bloggers; specifically, they were all LiveJournal blogs, and each post had exactly eight out-links. This uniformity, together with a manual inspection, shows that they were spam blogs written for the purpose of search engine optimization. Therefore, we cluster the second largest strongly connected component, which consists of 2,306 blogs, 26,768 edges, and an average degree of 23. The Girvan-Newman community finding algorithm (which recursively removes the edge with the highest betweenness centrality) identifies 11 communities.</p>
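<p>The strongly connected component step can be sketched as follows. The source does not say which algorithm or library was used, so this illustration substitutes an iterative form of Kosaraju's two-pass method; it assumes the graph is a dictionary in which every node appears as a key mapped to its successors. The Girvan-Newman clustering that follows is not covered here.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def strongly_connected_components(graph):
    """Return the strongly connected components, largest first.

    graph: dict mapping every node to an iterable of its successors.
    """
    # Pass 1: depth-first search, recording nodes in order of completion.
    order, seen = [], set()
    for root in graph:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse = {node: [] for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            reverse[nxt].append(node)

    # Pass 2: search the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```
]]></preformat>
<p>With this return value, the second largest component discussed above would be the element at index 1 of the returned list.</p>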
<p><xref ref-type="table" rid="t4-ijerph-07-00596">Table 4</xref> reports centralities and size for the six largest communities. An interesting finding is that these communities are clustered not only by publisher type but also by parent company. Not surprisingly, the largest community comprises personal blogs and general reporting newspapers; the remaining communities consist of mainstream and local news outlets, international audience media, LiveJournal and the entertainment industry (e.g., Viacom, Reed), large news conglomerates (e.g., News Corp, Disney), and commentary, opinion, and editorial content. A successful WSM public health campaign should have a presence and influence in each of the reported blog communities to ensure wide coverage and dissemination of pertinent information.</p>
<p>Detecting anomalies in various data sets is an important endeavor. We define an anomaly as a surprising or unusual occurrence. Using statistical approaches has led to various successes such as detecting computer and network intrusions. Recent research in graph-based anomaly detection has paved the way for new approaches that not only complement the non-graph-methods but also provide mechanisms for handling data that cannot be easily analyzed with traditional statistical approaches [<xref ref-type="bibr" rid="b21-ijerph-07-00596">21</xref>]. Again, Subdue can be used to address this challenge. Subdue examines an entire graph and reports unusual substructures, or substructures that occur infrequently, within it [<xref ref-type="bibr" rid="b14-ijerph-07-00596">14</xref>]. Subdue also takes into account the regularity of the data to determine how likely it is for a substructure to occur given the predictability of the structural data surrounding the substructure. These ideas have been tested in applications including intrusion detection and terrorist activity analysis.</p>
<p>To facilitate identification of ILI through graph-based data mining of influenza blogs, we base the representation on the link graph. Multiple posts by the same author are aggregated, representing a unique blogger; similarly, multiple tags and out- and in-links are counted only once per blogger. To structurally enrich the link graph, we connect the blogger URL and tags to a node labeled by the publisher type (e.g., blog, forum, mainstream media, external link) as depicted in <xref ref-type="fig" rid="f8-ijerph-07-00596">Figure 8</xref>. Graph structures are created from weekly influenza blog posts to facilitate anomaly detection and correlation to CDC ILI patient reports. <xref ref-type="fig" rid="f8-ijerph-07-00596">Figure 8</xref> demonstrates how URLs are disaggregated from their WSM article, thereby creating a relationship between two entities (the WSM article and the URL). This allows Subdue to find informative subgraphs of blogs with differing content (news, personal blogs) in addition to traditional URL structures. The structurally enriched data and temporal format facilitate anomaly detection by Subdue (this is in contrast to the 24-week aggregate link graph used for community identification).</p>
<p><xref ref-type="table" rid="t5-ijerph-07-00596">Table 5</xref> lists the substructure features discovered by Subdue and identifies whether they correspond to an anomaly for the purpose of outbreak detection. An analyst can then review the reported substructures for outbreak information. The first discovery of interest occurs during the week beginning 7 December 2008 and identifies the UK Yahoo Answers site. During the same time frame, the United Kingdom was in the middle of its worst flu season in eight years. While correlating influenza post frequency to CDC ILINet data was unsuccessful in February and March 2009, Subdue is able to identify novel substructures in personal blogs that mention influenza. The third anomaly discovered by Subdue shows a high number of substructure occurrences, composed of MySpace blog posts discussing several American Idol contestants who contracted influenza and were unable to perform at their best during the weekly performance competition.</p></sec>
<sec sec-type="materials|methods">
<label>4.</label>
<title>Methods and Materials</title>
<p>Wasserman and Faust state that closeness can be productive in communicating information to other actors. It is defined in <xref ref-type="disp-formula" rid="FD1">Equation 1</xref> as the average shortest-path (geodesic) distance from actor <italic>v</italic> to all reachable actors (<italic>t</italic> in <italic>V</italic>\<italic>v</italic>) [<xref ref-type="bibr" rid="b22-ijerph-07-00596">22</xref>]:
<disp-formula id="FD1">
<label>(1)</label>
<mml:math display="block">
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo>\</mml:mo>
<mml:mi>v</mml:mi></mml:mrow></mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>G</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
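<p>Equation 1 can be evaluated with a breadth-first search over out-links, as in the sketch below. This is an illustration rather than the implementation used for the reported tables; the function names are hypothetical, and <italic>n</italic> is assumed to count actor <italic>v</italic> together with the actors reachable from it, so that the sum is divided by the number of reachable actors.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: geodesic distance from source to each
    node reachable by following out-links."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness(graph, v):
    """Average geodesic distance from v to the actors it can reach.

    Assumption: n in Equation 1 counts v plus its reachable actors.
    Returns None when v reaches no other actor.
    """
    dist = shortest_path_lengths(graph, v)
    reachable = len(dist) - 1  # exclude v itself
    if reachable == 0:
        return None
    return sum(dist.values()) / reachable
```
]]></preformat>
<p>Running the same function on the graph with every edge reversed would give the corresponding in-closeness value under the same assumption.</p>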
<p>Betweenness centrality (<xref ref-type="disp-formula" rid="FD2">Equation 2</xref>) measures interpersonal influence. Specifically, a blog is central if it lies between other blogs on their geodesics, that is, the blog is “between” many others. Here <italic>g</italic><sub>jk</sub> is the number of geodesics linking blog <italic>j</italic> and blog <italic>k</italic>, and <italic>g</italic><sub>jk</sub>(<italic>v</italic>) is the number of those geodesics that pass through blog <italic>v</italic> [<xref ref-type="bibr" rid="b22-ijerph-07-00596">22</xref>]:
<disp-formula id="FD2">
<label>(2)</label>
<mml:math display="block">
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:munder>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
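<p>A minimal sketch of the betweenness sum in Equation 2 follows, assuming an undirected, unweighted blog network. It counts geodesics with breadth-first search and, for each pair of other blogs, adds the fraction of their geodesics that pass through <italic>v</italic>. The function names and the toy network are hypothetical, and this direct pairwise form is chosen for readability rather than speed.</p>
<preformat>
```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (dist, sigma): geodesic lengths and geodesic counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # every geodesic reaching u extends to a geodesic reaching w
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph, v):
    """Equation 2 on an undirected graph: sum over pairs of g_jk(v) / g_jk."""
    info = {s: bfs_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[v]
    others = [u for u in graph if u != v]
    total = 0.0
    for j, k in combinations(others, 2):  # each unordered pair exactly once
        dist_j, sigma_j = info[j]
        if k not in dist_j or v not in dist_j:
            continue  # j cannot reach k, or cannot reach v
        if dist_j[v] + dist_v[k] == dist_j[k]:
            # v lies on a geodesic: g_jk(v) = (paths j to v) * (paths v to k)
            total += sigma_j[v] * sigma_v[k] / sigma_j[k]
    return total


# Hypothetical four-blog path network a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```
</preformat>
<p>On this path network the interior blog b scores 2.0, because it lies on the only geodesic from a to c and from a to d, while the end blog a scores 0.0.</p>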
<p>PageRank is an example of eigenvector centrality and measures the importance of a <bold>node</bold> by assuming that links from more central nodes contribute more to its ranking than links from less central nodes [<xref ref-type="bibr" rid="b23-ijerph-07-00596">23</xref>]. Let <italic>d</italic> be a damping factor (usually 0.85), <italic>N</italic> be the total number of nodes, <italic>n</italic> be the index of the node of interest, <italic>p</italic><sub>n</sub> be that node, M(<italic>p</italic><sub>n</sub>) be the set of nodes linking to <italic>p</italic><sub>n</sub>, and L(<italic>p</italic><sub>j</sub>) be the number of out-links on node <italic>p</italic><sub>j</sub>:
<disp-formula id="FD3">
<label>(3)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>R</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>n</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>d</mml:mi></mml:mrow>
<mml:mi>N</mml:mi></mml:mfrac>
<mml:mo>+</mml:mo>
<mml:mi>d</mml:mi>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>M</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>n</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>R</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
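<p>Equation 3 is usually solved by repeated substitution. The sketch below shows one way to do this, assuming that every node has at least one out-link; the function name, the iteration count, and the three-blog link graph are hypothetical and are not taken from the study data.</p>
<preformat>
```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate Equation 3 on a dict mapping each node to the nodes it links to.

    Assumption: every node has at least one out-link, so L(p_j) is never zero.
    Dangling nodes would need extra handling that is not specified here.
    """
    nodes = list(links)
    n_total = len(nodes)
    rank = {p: 1.0 / n_total for p in nodes}
    # M(p_n): the set of nodes that link to p_n
    inbound = {p: [q for q in nodes if p in links[q]] for p in nodes}
    for _ in range(iterations):
        rank = {
            p: (1 - d) / n_total
            + d * sum(rank[q] / len(links[q]) for q in inbound[p])
            for p in nodes
        }
    return rank


# Hypothetical three-blog link graph: a links to b and c, b to c, c to a.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)
```
</preformat>
<p>In this toy graph blog c receives links from both other blogs and ends with the highest rank, followed by a and then b. Because no node is dangling, the ranks continue to sum to one.</p>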
<p>We take an intuitive and simple definition of WSM community and identify possible first-responder bloggers by link analysis. Blog ranking reinforces the idea that these communities can disseminate information as part of a broader public health response triggered by anomalies in ILINet and WSM surveillance. A community is defined here, following Flake, Lawrence, and Giles, as a set of nodes with more edges between member nodes than edges to external nodes. Formally, a community is a vertex subset <italic>C</italic> ⊆ <italic>V</italic> such that every vertex <italic>v</italic> ∈ <italic>C</italic> has at least as many edges connecting to vertices in <italic>C</italic> as it does to vertices in <italic>V</italic>\<italic>C</italic> [<xref ref-type="bibr" rid="b19-ijerph-07-00596">19</xref>,<xref ref-type="bibr" rid="b20-ijerph-07-00596">20</xref>]. Links from a non-FC-post to an FC-post, and vice versa, are not covered by this community definition. The Girvan-Newman algorithm is used to identify communities in our data. The general form of this community-structure-finding algorithm is enumerated below; the components remaining in the graph at the end of each iteration are the communities [<xref ref-type="bibr" rid="b24-ijerph-07-00596">24</xref>]:
<list list-type="order">
<list-item>
<p>Calculate betweenness scores for all edges in the network.</p></list-item>
<list-item>
<p>Find the edge with the highest score and remove it from the network. If two or more edges tie for highest score, choose one of them at random and remove that edge.</p></list-item>
<list-item>
<p>Recalculate betweenness for all remaining edges.</p></list-item>
<list-item>
<p>Repeat from step 2 until the desired number of communities (if known a priori) is reached; otherwise, repeat from step 2 until no edges remain.</p></list-item></list></p>
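<p>The four steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used for the results reported here: the graph is taken to be simple and undirected, edge betweenness is accumulated with a Brandes-style pass, and the function names, the <italic>seed</italic> parameter, and the toy edge list are hypothetical.</p>
<preformat>
```python
import math
import random
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness score for every undirected edge."""
    score = {frozenset((u, w)): 0.0 for u in adj for w in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {u: 0.0 for u in order}
        for w in reversed(order):  # farthest nodes first
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    return score  # each pair is counted from both ends; the ranking is unchanged


def components(adj):
    """Connected components of the current graph, as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman(edges, target, seed=0):
    """Remove highest-betweenness edges until `target` communities remain."""
    rng = random.Random(seed)
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    parts = components(adj)
    while target > len(parts) and any(adj.values()):
        score = edge_betweenness(adj)  # steps 1 and 3
        top = max(score.values())
        ties = sorted(
            tuple(sorted(e)) for e, s in score.items() if math.isclose(s, top)
        )
        u, w = rng.choice(ties)  # step 2, with a random tie-break
        adj[u].discard(w)
        adj[w].discard(u)
        parts = components(adj)  # step 4
    return parts


# Hypothetical network: two triangles joined by the single bridge 3 - 4.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
communities = girvan_newman(toy_edges, target=2)
```
</preformat>
<p>In the toy network the bridge between the two triangles carries every geodesic that crosses between them, so it is removed first and the two triangles remain as communities. Setting <italic>target</italic> to the number of nodes corresponds to the case where the number of communities is not known a priori and edges are removed until none remain.</p>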
<p>Subdue accepts as input directed or undirected graphs with labeled vertices (nodes) and edges (links), and outputs graphs representing the discovered pattern or learned concept. Formally, Subdue uses a labeled graph G = (V,E,L) as both input and output, where V = {v<sub>1</sub>, v<sub>2</sub>, …, v<sub>n</sub>} is a set of vertices, E = {(v<sub>i</sub>, v<sub>j</sub>) | v<sub>i</sub>, v<sub>j</sub> ∈ V} is a set of edges, and L is a set of labels that can appear on vertices and edges. The graph G can contain directed edges, undirected edges, self-edges, and multi-edges. As an unsupervised algorithm, Subdue searches for a substructure, or subgraph of the input graph, that best compresses the input graph. Subdue uses a variant of beam search for its main search algorithm. A substructure in Subdue consists of a subgraph definition and all its occurrences throughout the graph.</p>
<p>Subdue uses a polynomial-time beam search for its discovery algorithm, as summarized in <xref ref-type="fig" rid="f9-ijerph-07-00596">Figure 9</xref>. The initial state of the search is the set of substructures consisting of all uniquely labeled vertices. Search progresses by applying the ExtendSubstructure operator to each substructure in the current state. As its name suggests, it extends a substructure in all possible ways by a single edge and a vertex or by only a single edge if both vertices are already in the subgraph. The resulting new substructures are ordered based on their compression (sometimes referred to as <italic>value</italic>) as calculated using the Minimum Description Length (MDL) [<xref ref-type="bibr" rid="b21-ijerph-07-00596">21</xref>] principle described below, and the top substructures (as determined by the beam) remain on the queue for further expansion.</p>
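<p>The control flow of such a beam search can be sketched as follows. This is not Subdue itself: the callables <italic>extend</italic> and <italic>value</italic> are hypothetical stand-ins for the ExtendSubstructure operator and the MDL-based value, and the toy input is a chain of labeled vertices written as a string, so that a substructure is simply a substring and the value is a crude description-length ratio chosen only for illustration.</p>
<preformat>
```python
def beam_search(initial, extend, value, beam_width=3, limit=100):
    """Beam search over candidate substructures.

    `extend` returns every one-step extension of a candidate, `value` scores
    a candidate (higher is better), and `limit` bounds how many candidates
    are extended. All three are hypothetical stand-ins for this sketch.
    """
    def ranked(items):
        # best value first; repr() gives a deterministic order among ties
        return sorted(items, key=lambda s: (-value(s), repr(s)))

    queue = ranked(set(initial))[:beam_width]
    best = list(queue)
    extended = 0
    while queue and limit > extended:
        children = set()
        for sub in queue:
            if extended >= limit:
                break
            children.update(extend(sub))
            extended += 1
        queue = ranked(children)[:beam_width]  # only the beam survives
        best = ranked(set(best) | set(queue))[:beam_width]
    return best


# Toy input: a chain of twelve vertices labeled a, b, c, a, b, c, ...
TEXT = "abcabcabcabc"


def extend(sub):
    """Grow a substring by one neighboring vertex at every occurrence."""
    grown = set()
    start = TEXT.find(sub)
    while start != -1:
        end = start + len(sub)
        if start > 0:
            grown.add(TEXT[start - 1] + sub)
        if len(TEXT) > end:
            grown.add(sub + TEXT[end])
        start = TEXT.find(sub, start + 1)
    return grown


def value(sub):
    """Crude compression ratio: original size over (pattern + compressed chain)."""
    saved = TEXT.count(sub) * (len(sub) - 1)
    description_length = len(sub) + (len(TEXT) - saved)
    return len(TEXT) / description_length


found = beam_search(set(TEXT), extend, value)
```
</preformat>
<p>The search starts from the three uniquely labeled vertices and, under this illustrative value function, ranks the repeating pattern abc first, since replacing each of its four occurrences with a single vertex shortens the description the most.</p>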
<p>Search terminates upon reaching a limit on the number of substructures extended or upon exhaustion of the search space. Once the search terminates and Subdue returns the list of best substructures, the graph can be compressed using the best substructure. The compression procedure replaces all instances of the substructure in the input graph with single vertices, which represent the substructure definition. Edges that entered or left a replaced instance now point to or originate from the new vertex that represents that instance. The Subdue algorithm can then be invoked again on the compressed graph. As an example, <xref ref-type="fig" rid="f9-ijerph-07-00596">Figure 9</xref> shows the patterns that Subdue discovers in a sample input graph, together with a compressed version of that graph.</p></sec>
<sec>
<label>5.</label>
<title>Future Work</title>
<p>Emerging infectious diseases continue to have an impact on the health, safety, and sustainable growth of our nation, as shown by the 2009 novel influenza A/H1N1 strain. Upon initial identification of widespread H1N1 outbreaks in April 2009, the CDC participated in a concerted global effort to control transmission of influenza A/H1N1 and prevent pandemic outbreaks by issuing public health response recommendations. Future work will quantify the impact and validate the use of WSM to monitor seasonal influenza epidemics and global pandemics. Preliminary influenza blog harvests during this pandemic, not including mentions on micro-blogging platforms (Twitter), are reported in <xref ref-type="table" rid="t6-ijerph-07-00596">Table 6</xref>. Geo-location tagging is now implemented in blog, social network, and micro-blogging platforms, and future research will leverage these new data in the next-generation WSM biosurveillance system; however, geo-location information was not available for the analyses reported here. Research is continuing on both health blogs and health micro-blogs to inform a robust disease surveillance system using open source documents.</p>
<p>Once influenza WSM items have been extracted, one can further monitor influenza outbreaks by evaluating the perspective of blog authors. Bloggers with direct knowledge of influenza infection are more valuable to disease surveillance than those who author objective or opinion items. Identifying the perspective of influenza keyword posts helps determine their contribution to disease surveillance. Three author perspectives are identified in <xref ref-type="fig" rid="f10-ijerph-07-00596">Figure 10</xref>. An FC-post can be (1) a self-identification of having ILI symptoms, (2) a secondhand (or by proxy) post about another individual having ILI, or (3) an opinion or objective article containing ILI keywords. Secondhand knowledge can come from writing about a friend, schoolmate, family member, or co-worker, but a blogger could also post details about a famous individual such as an athlete. The season opening of American football coincides with the data collection period, and many FC-posts identify athletes who are unable to play because of an ILI. Automatic classification of the influenza post author’s perspective is ongoing research.</p></sec>
<sec sec-type="conclusions">
<label>6.</label>
<title>Conclusions</title>
<p>Text and structural data mining of WSM provides a novel disease surveillance resource and a technique for identifying online “flu” topic health information communities. Our proposed framework of complementary data-mining methods supports our hypothesis. We comprehensively evaluate blog posts containing influenza topic keywords through text, link, and structural data mining. Results from the analysis show strong co-occurrence of flu blog posts with the US 2008–2009 flu season. That is, from 5 October 2008 to 21 March 2009, a high correlation exists between the weekly frequency of posts containing influenza keywords and CDC ILI surveillance data. The frequency of flu posts per blogger follows a heavy-tailed distribution, and we show through graph metrics that the most prolific bloggers are not the most influential. Pertinent health information should have a presence in all identified WSM communities. The Girvan-Newman algorithm is leveraged to identify clusters of similar sites as potential target communities for online health information campaigns. The results show distinct WSM communities clustered by publisher and content type, such as News Corp &amp; Disney properties, international audiences, or personal blogs.</p>
<p>Harvesting WSM is a continuing challenge with the explosive growth of internet usage. To complement the text mining approach to ILI monitoring, we apply a graph-based data mining technique, Subdue, to detect anomalies and informative substructures among flu blogs connected by publisher type, links, and user-tags. This technique flags anomalies not discovered with content analysis that correspond to the United Kingdom’s worst influenza season in eight years and the emergence of strong personal blog communications during the U.S. seasonal influenza peak incidence.</p>
<p>Link analysis reveals communities, clustered by content and in many cases by corporate ownership, which should be targeted in a successful public health communications campaign to ensure wide dissemination of pertinent information. Text mining of influenza mentions in WSM is shown to identify trends in flu posts that correlate with real-world ILI patient reporting data. Moreover, graph-based data mining is able to identify significant anomalies in flu blogs that were not identified through text analysis and that can be flagged for further investigation by an analyst.</p></sec></body>
<back>
<ack>
<p>We would like to thank the National Science Foundation (NSF) for partial support under grant NSF IIS-0505819 and the Technosocial Predictive Analytics Initiative, part of the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory (PNNL). PNNL is operated by Battelle Memorial Institute for the U.S. Department of Energy under contract DE-AC05-76RL01830. The contents of this publication are the responsibility of the authors and do not necessarily represent the official views of the NSF.</p></ack>
<ref-list>
<title>References and Notes</title>
<ref id="b1-ijerph-07-00596"><label>1.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Corley</surname><given-names>C</given-names></name><name><surname>Mikler</surname><given-names>A</given-names></name><name><surname>Cook</surname><given-names>D</given-names></name><name><surname>Singh</surname><given-names>K</given-names></name></person-group><article-title>Monitoring Influenza Trends through Mining Social Media</article-title><conf-name>Proceedings of the 2009 International Conference on Bioinformatics and Bioengineering (BIOCOMP09)</conf-name><conf-loc>Las Vegas, NV, USA</conf-loc><conf-date>July 2009</conf-date></citation></ref>
<ref id="b2-ijerph-07-00596"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ginsberg</surname><given-names>J</given-names></name><name><surname>Mohebbi</surname><given-names>M</given-names></name><name><surname>Patel</surname><given-names>R</given-names></name><name><surname>Brammer</surname><given-names>L</given-names></name><name><surname>Smolinski</surname><given-names>M</given-names></name><name><surname>Brilliant</surname><given-names>L</given-names></name></person-group><article-title>Detecting influenza epidemics using search engine query data</article-title><source>Nature</source><year>2009</year><volume>457</volume><fpage>1012</fpage><lpage>1014</lpage><pub-id pub-id-type="doi">10.1038/nature07634</pub-id><pub-id pub-id-type="pmid">19020500</pub-id></citation></ref>
<ref id="b3-ijerph-07-00596"><label>3.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Eysenbach</surname><given-names>G</given-names></name></person-group><conf-name>Proceedings of the AMIA Annual Symposium</conf-name><conf-loc>Washington, DC, USA</conf-loc><year>2005</year><fpage>244</fpage><lpage>248</lpage></citation></ref>
<ref id="b4-ijerph-07-00596"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polgreen</surname><given-names>P</given-names></name><name><surname>Chen</surname><given-names>Y</given-names></name><name><surname>Pennock</surname><given-names>D</given-names></name><name><surname>Nelson</surname><given-names>F</given-names></name></person-group><article-title>Using internet searches for influenza surveillance</article-title><source>Clin. Infect. Dis</source><year>2008</year><volume>47</volume><fpage>1443</fpage><lpage>1448</lpage><pub-id pub-id-type="doi">10.1086/593098</pub-id><pub-id pub-id-type="pmid">18954267</pub-id></citation></ref>
<ref id="b5-ijerph-07-00596"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hulth</surname><given-names>A</given-names></name><name><surname>Rydevik</surname><given-names>G</given-names></name><name><surname>Linde</surname><given-names>A</given-names></name><name><surname>Montgomery</surname><given-names>J</given-names></name></person-group><article-title>Web Queries as a Source for Syndromic Surveillance</article-title><source>PLoS ONE</source><year>2009</year><volume>4</volume><fpage>e4378</fpage><pub-id pub-id-type="doi">10.1371/journal.pone.0004378</pub-id><pub-id pub-id-type="pmid">19197389</pub-id></citation></ref>
<ref id="b6-ijerph-07-00596"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname><given-names>H</given-names></name><name><surname>Wagner</surname><given-names>M</given-names></name><name><surname>Hogan</surname><given-names>W</given-names></name><name><surname>Chapman</surname><given-names>W</given-names></name><name><surname>Olszewski</surname><given-names>R</given-names></name><name><surname>Dowling</surname><given-names>J</given-names></name><name><surname>Barnas</surname><given-names>G</given-names></name></person-group><article-title>Analysis of web access logs for surveillance of influenza</article-title><source>St. Heal. T</source><year>2003</year><volume>107</volume><fpage>1202</fpage><lpage>1206</lpage></citation></ref>
<ref id="b7-ijerph-07-00596"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yih</surname><given-names>W</given-names></name><name><surname>Teates</surname><given-names>K</given-names></name><name><surname>Abrams</surname><given-names>A</given-names></name><name><surname>Kleinman</surname><given-names>K</given-names></name><name><surname>Kulldorff</surname><given-names>M</given-names></name><name><surname>Pinner</surname><given-names>R</given-names></name><name><surname>Harmon</surname><given-names>R</given-names></name><name><surname>Wang</surname><given-names>S</given-names></name><name><surname>Platt</surname><given-names>R</given-names></name><name><surname>Montgomery</surname><given-names>J</given-names></name></person-group><article-title>Telephone triage service data for detection of influenza-like illness</article-title><source>PLoS ONE</source><year>2009</year><volume>4</volume><fpage>e5260</fpage><pub-id pub-id-type="doi">10.1371/journal.pone.0005260</pub-id><pub-id pub-id-type="pmid">19381342</pub-id></citation></ref>
<ref id="b8-ijerph-07-00596"><label>8.</label><citation citation-type="other">Spinn3r Weblog Crawling provided by Spinn3r.</citation></ref>
<ref id="b9-ijerph-07-00596"><label>9.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>van Rossum</surname><given-names>G</given-names></name></person-group><source>Python Language Reference Manual</source><person-group person-group-type="editor"><name><surname>Drake</surname><given-names>FL</given-names><suffix>Jr</suffix></name></person-group><publisher-name>Network Theory Ltd.</publisher-name><publisher-loc>UK</publisher-loc><year>2002</year></citation></ref>
<ref id="b10-ijerph-07-00596"><label>10.</label><citation citation-type="web"><person-group person-group-type="author"><name><surname>Miller</surname><given-names>P</given-names></name></person-group>pyMPI: An introduction to parallel Python using MPI. Available online: <ext-link xlink:href="http://heather.cs.ucdavis.edu/~matloff/145/ParScript/pyMPI.pdf" ext-link-type="uri">http://heather.cs.ucdavis.edu/~matloff/145/ParScript/pyMPI.pdf</ext-link> (accessed on February 2010).</citation></ref>
<ref id="b11-ijerph-07-00596"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mihalcea</surname><given-names>R</given-names></name></person-group><article-title>The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data Ronen Feldman and James Sanger (Bar-Ilan University and ABS Ventures) Cambridge, England: Cambridge University Press, 2007, xii+410</article-title><source>Comput. Linguist</source><year>2008</year><volume>34</volume><fpage>125</fpage><lpage>127</lpage><pub-id pub-id-type="doi">10.1162/coli.2008.34.1.125</pub-id></citation></ref>
<ref id="b12-ijerph-07-00596"><label>12.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Feldman</surname><given-names>R</given-names></name><name><surname>Sanger</surname><given-names>J</given-names></name></person-group><source>The Text Mining Handbook</source><publisher-name>Cambridge University Press</publisher-name><publisher-loc>Cambridge, UK</publisher-loc><year>2007</year></citation></ref>
<ref id="b13-ijerph-07-00596"><label>13.</label><citation citation-type="web">Centers for Disease Control and Prevention Influenza surveillance reports. Available online: <ext-link xlink:href="http://www.cdc.gov/flu/weekly/fluactivity.htm" ext-link-type="uri">http://www.cdc.gov/flu/weekly/fluactivity.htm</ext-link> (accessed in 2009).</citation></ref>
<ref id="b14-ijerph-07-00596"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cook</surname><given-names>DJ</given-names></name><name><surname>Holder</surname><given-names>LB</given-names></name></person-group><article-title>Substructure discovery using minimum description length and background knowledge</article-title><source>J. Artif. Int. Res</source><year>1993</year><volume>1</volume><fpage>231</fpage><lpage>255</lpage></citation></ref>
<ref id="b15-ijerph-07-00596"><label>15.</label><citation citation-type="other">The Current Population Survey (CPS). U.S. Census Bureau, 2008.</citation></ref>
<ref id="b16-ijerph-07-00596"><label>16.</label><citation citation-type="other">Summary Health Statistics for the US Population: National Health Interview Survey (NHIS), 2007 report DHHS Publication No.(PHS) 2009—1564; Series 10, Number 238.</citation></ref>
<ref id="b17-ijerph-07-00596"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Porter</surname><given-names>M</given-names></name></person-group><article-title>An algorithm for suffix stripping</article-title><source>Program</source><year>1980</year><volume>14</volume><fpage>130</fpage><lpage>137</lpage><pub-id pub-id-type="doi">10.1108/eb046814</pub-id></citation></ref>
<ref id="b18-ijerph-07-00596"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liljeros</surname><given-names>F</given-names></name><name><surname>Edling</surname><given-names>C</given-names></name><name><surname>Amaral</surname><given-names>L</given-names></name><name><surname>Stanley</surname><given-names>H</given-names></name><name><surname>Aberg</surname><given-names>Y</given-names></name></person-group><article-title>The web of human sexual contacts</article-title><source>Nature</source><year>2001</year><volume>411</volume><fpage>907</fpage><lpage>908</lpage><pub-id pub-id-type="doi">10.1038/35082140</pub-id><pub-id pub-id-type="pmid">11418846</pub-id></citation></ref>
<ref id="b19-ijerph-07-00596"><label>19.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Flake</surname><given-names>G</given-names></name><name><surname>Lawrence</surname><given-names>S</given-names></name><name><surname>Giles</surname><given-names>C</given-names></name></person-group><article-title>Efficient identification of Web communities</article-title><conf-name>Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</conf-name><conf-loc>Boston, MA, USA</conf-loc><conf-date>20–23 August, 2000</conf-date></citation></ref>
<ref id="b20-ijerph-07-00596"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Flake</surname><given-names>G</given-names></name><name><surname>Lawrence</surname><given-names>S</given-names></name><name><surname>Giles</surname><given-names>C</given-names></name><name><surname>Coetzee</surname><given-names>F</given-names></name></person-group><article-title>Self-organization and identification of Web communities</article-title><source>Computer</source><year>2001</year><volume>35</volume><fpage>66</fpage><lpage>71</lpage></citation></ref>
<ref id="b21-ijerph-07-00596"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eberle</surname><given-names>W</given-names></name><name><surname>Holder</surname><given-names>L</given-names></name></person-group><article-title>Anomaly detection in data represented as graphs</article-title><source>Intell. Data Anal</source><year>2007</year><volume>11</volume><fpage>663</fpage><lpage>689</lpage></citation></ref>
<ref id="b22-ijerph-07-00596"><label>22.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Wasserman</surname><given-names>S</given-names></name><name><surname>Faust</surname><given-names>K</given-names></name></person-group><source>Social Network Analysis: Methods and Applications</source><publisher-name>Cambridge University Press</publisher-name><publisher-loc>Cambridge, UK</publisher-loc><year>1994</year></citation></ref>
<ref id="b23-ijerph-07-00596"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brin</surname><given-names>S</given-names></name><name><surname>Page</surname><given-names>L</given-names></name></person-group><article-title>The anatomy of a large-scale hypertextual Web search engine</article-title><source>Comput. Networks ISDN Syst</source><year>1998</year><volume>30</volume><fpage>107</fpage><lpage>117</lpage><pub-id pub-id-type="doi">10.1016/S0169-7552(98)00110-X</pub-id></citation></ref>
<ref id="b24-ijerph-07-00596"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Girvan</surname><given-names>M</given-names></name><name><surname>Newman</surname><given-names>M</given-names></name></person-group><article-title>Community structure in social and biological networks</article-title><source>Proc. Natl. Acad. Sci. USA</source><year>2002</year><volume>99</volume><fpage>7821</fpage><lpage>7826</lpage><pub-id pub-id-type="doi">10.1073/pnas.122653799</pub-id><pub-id pub-id-type="pmid">12060727</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-ijerph-07-00596" position="float">
<label>Figure 1.</label>
<caption>
<p>Methodology to monitor influenza-like illness in social media and to identify possible web and social media communities to participate in a public health response.</p></caption><graphic xlink:href="ijerph-07-00596f1.gif"/></fig>
<fig id="f2-ijerph-07-00596" position="float">
<label>Figure 2.</label>
<caption>
<p>Example XML encoding of social media post that mentions <italic>flu</italic>.</p></caption><graphic xlink:href="ijerph-07-00596f2.gif"/></fig>
<fig id="f3-ijerph-07-00596" position="float">
<label>Figure 3.</label>
<caption>
<p>Blogs, forums, mainstream media, and other articles harvested using Spinn3r RSS/ATOM feeds, 5 October 2008 to 21 March 2009.</p></caption><graphic xlink:href="ijerph-07-00596f3.gif"/></fig>
<fig id="f4-ijerph-07-00596" position="float">
<label>Figure 4.</label>
<caption>
<p>(a) Web and social media publisher types (%) from 158,497,700 items posted over 24 weeks. (b) Per day of week blogs, forums, mainstream media and other items averaged over 24 weeks, 5 October 2008 to 21 March 2009.</p></caption><graphic xlink:href="ijerph-07-00596f4.gif"/></fig>
<fig id="f5-ijerph-07-00596" position="float">
<label>Figure 5.</label>
<caption>
<p>CDC ILINet <italic>vs.</italic> normalized blog post (with flu keywords) frequency per week. 5 October 2008 to 21 March 2009.</p></caption><graphic xlink:href="ijerph-07-00596f5.gif"/></fig>
<fig id="f6-ijerph-07-00596" position="float">
<label>Figure 6.</label>
<caption>
<p>Wordle of most frequent author-tagged categories (stemmed). [influenza posts: 5 October 2008 to 21 March 2009].</p></caption><graphic xlink:href="ijerph-07-00596f6.gif"/></fig>
<fig id="f7-ijerph-07-00596" position="float">
<label>Figure 7.</label>
<caption>
<p>Cumulative probability distribution of the number of influenza posts, per blogger, 5 October 2008 to 21 March 2009.</p></caption><graphic xlink:href="ijerph-07-00596f7.gif"/></fig>
<fig id="f8-ijerph-07-00596" position="float">
<label>Figure 8.</label>
<caption>
<p>Example graph representation of influenza bloggers used for anomaly detection by Subdue.</p></caption><graphic xlink:href="ijerph-07-00596f8.gif"/></fig>
<fig id="f9-ijerph-07-00596" position="float">
<label>Figure 9.</label>
<caption>
<p>Subdue’s discovery algorithm and an example. The figure shows the discovered pattern (S<sub>1</sub>) from the original graph, the substructure found during the second iteration (S<sub>2</sub>), and the final graph compressed using substructures S<sub>1</sub> and S<sub>2</sub>.</p></caption><graphic xlink:href="ijerph-07-00596f9.gif"/></fig>
<fig id="f10-ijerph-07-00596" position="float">
<label>Figure 10.</label>
<caption>
<p>Three perspective blog posts that mention influenza: self-identification, secondhand, and objective/editorial.</p></caption><graphic xlink:href="ijerph-07-00596f10.gif"/></fig>
<table-wrap id="t1-ijerph-07-00596" position="float">
<label>Table 1.</label>
<caption>
<p>Most frequent author-tagged categories (stemmed) [influenza posts: 5 October 2008 to 21 March 2009].</p></caption>
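<table frame="box" rules="cols">
<thead>
<tr>
<th valign="bottom" align="left"><bold>Category</bold></th>
<th valign="bottom" align="left"><bold>Count</bold></th>
<th valign="bottom" align="left"><bold>Category</bold></th>
<th valign="bottom" align="left"><bold>Count</bold></th>
<th valign="bottom" align="left"><bold>Category</bold></th>
<th valign="bottom" align="left"><bold>Count</bold></th></tr></thead>
<tbody>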
<tr>
<td valign="top" align="left">Flu</td>
<td valign="top" align="left">5605</td>
<td valign="top" align="left">medicin</td>
<td valign="top" align="left">697</td>
<td valign="top" align="left">shot</td>
<td valign="top" align="left">306</td></tr>
<tr>
<td valign="top" align="left">health</td>
<td valign="top" align="left">3946</td>
<td valign="top" align="left">gener</td>
<td valign="top" align="left">652</td>
<td valign="top" align="left">dai</td>
<td valign="top" align="left">297</td></tr>
<tr>
<td valign="top" align="left">bird</td>
<td valign="top" align="left">2030</td>
<td valign="top" align="left">polit</td>
<td valign="top" align="left">591</td>
<td valign="top" align="left">food</td>
<td valign="top" align="left">293</td></tr>
<tr>
<td valign="top" align="left">avian</td>
<td valign="top" align="left">1968</td>
<td valign="top" align="left">scienc</td>
<td valign="top" align="left">512</td>
<td valign="top" align="left">technolog</td>
<td valign="top" align="left">290</td></tr>
<tr>
<td valign="top" align="left">new</td>
<td valign="top" align="left">1903</td>
<td valign="top" align="left">world</td>
<td valign="top" align="left">502</td>
<td valign="top" align="left">random</td>
<td valign="top" align="left">290</td></tr>
<tr>
<td valign="top" align="left">influenza</td>
<td valign="top" align="left">1849</td>
<td valign="top" align="left">googl</td>
<td valign="top" align="left">442</td>
<td valign="top" align="left">infecti</td>
<td valign="top" align="left">277</td></tr>
<tr>
<td valign="top" align="left">relenza</td>
<td valign="top" align="left">1357</td>
<td valign="top" align="left">medic</td>
<td valign="top" align="left">422</td>
<td valign="top" align="left">home</td>
<td valign="top" align="left">265</td></tr>
<tr>
<td valign="top" align="left">pandem</td>
<td valign="top" align="left">1209</td>
<td valign="top" align="left">busi</td>
<td valign="top" align="left">418</td>
<td valign="top" align="left">viru</td>
<td valign="top" align="left">259</td></tr>
<tr>
<td valign="top" align="left">birdflu</td>
<td valign="top" align="left">851</td>
<td valign="top" align="left">symptom</td>
<td valign="top" align="left">409</td>
<td valign="top" align="left">daili</td>
<td valign="top" align="left">257</td></tr>
<tr>
<td valign="top" align="left">diseas</td>
<td valign="top" align="left">792</td>
<td valign="top" align="left">travel</td>
<td valign="top" align="left">385</td>
<td valign="top" align="left">children</td>
<td valign="top" align="left">250</td></tr>
<tr>
<td valign="top" align="left">life</td>
<td valign="top" align="left">789</td>
<td valign="top" align="left">music</td>
<td valign="top" align="left">373</td>
<td valign="top" align="left">care</td>
<td valign="top" align="left">250</td></tr>
<tr>
<td valign="top" align="left">vaccin</td>
<td valign="top" align="left">774</td>
<td valign="top" align="left">public</td>
<td valign="top" align="left">350</td>
<td valign="top" align="left">school</td>
<td valign="top" align="left">245</td></tr>
<tr>
<td valign="top" align="left">famili</td>
<td valign="top" align="left">739</td>
<td valign="top" align="left">person</td>
<td valign="top" align="left">349</td>
<td valign="top" align="left">govern</td>
<td valign="top" align="left">232</td></tr>
<tr>
<td valign="top" align="left">blog</td>
<td valign="top" align="left">739</td>
<td valign="top" align="left">obama</td>
<td valign="top" align="left">324</td>
<td valign="top" align="left">immun</td>
<td valign="top" align="left">230</td></tr>
<tr>
<td valign="top" align="left">cold</td>
<td valign="top" align="left">700</td>
<td valign="top" align="left">media</td>
<td valign="top" align="left">316</td>
<td valign="top" align="left">sport</td>
<td valign="top" align="left">223</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-ijerph-07-00596" position="float">
<label>Table 2.</label>
<caption>
<p>Post count and degree (in, out, and all) of the most frequent flu-post bloggers, 5 October 2008 to 21 March 2009.</p></caption>
<table frame="box" rules="cols">
<thead>
<tr>
<th valign="bottom" align="left" rowspan="2"><bold>Count</bold></th>
<th valign="bottom" align="center" rowspan="2"><bold>Blogger URL</bold></th>
<th valign="bottom" align="center" colspan="3"><bold>Degree</bold></th></tr>
<tr>
<th valign="bottom" align="center"><bold>In</bold></th>
<th valign="bottom" align="center"><bold>Out</bold></th>
<th valign="bottom" align="center"><bold>All</bold></th></tr>
<tr>
<th valign="bottom" align="center" colspan="5"><hr/></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="right"><bold>1,897</bold></td>
<td valign="top" align="left"><ext-link xlink:href="crofsblogs.typepad.com/h5n1/" ext-link-type="uri">crofsblogs.typepad.com/h5n1/</ext-link></td>
<td valign="top" align="right"><bold>64</bold></td>
<td valign="top" align="right">581</td>
<td valign="top" align="right">645</td></tr>
<tr>
<td valign="top" align="right">1,230</td>
<td valign="top" align="left"><ext-link xlink:href="birdcauseflu.com" ext-link-type="uri">birdcauseflu.com</ext-link></td>
<td valign="top" align="right">1</td>
<td valign="top" align="right">1</td>
<td valign="top" align="right">2</td></tr>
<tr>
<td valign="top" align="right">929</td>
<td valign="top" align="left"><ext-link xlink:href="medblogs.org" ext-link-type="uri">medblogs.org</ext-link></td>
<td valign="top" align="right">0</td>
<td valign="top" align="right">6</td>
<td valign="top" align="right">6</td></tr>
<tr>
<td valign="top" align="right">912</td>
<td valign="top" align="left"><ext-link xlink:href="afludiary.blogspot.com" ext-link-type="uri">afludiary.blogspot.com</ext-link></td>
<td valign="top" align="right">30</td>
<td valign="top" align="right">659</td>
<td valign="top" align="right">689</td></tr>
<tr>
<td valign="top" align="right">359</td>
<td valign="top" align="left"><ext-link xlink:href="healthinform3.livejournal.com" ext-link-type="uri">healthinform3.livejournal.com</ext-link></td>
<td valign="top" align="right">0</td>
<td valign="top" align="right">4</td>
<td valign="top" align="right">4</td></tr>
<tr>
<td valign="top" align="right">330</td>
<td valign="top" align="left"><ext-link xlink:href="fluwikie2.com" ext-link-type="uri">fluwikie2.com</ext-link></td>
<td valign="top" align="right">35</td>
<td valign="top" align="right">839</td>
<td valign="top" align="right">874</td></tr>
<tr>
<td valign="top" align="right">204</td>
<td valign="top" align="left"><ext-link xlink:href="birdflumonitor.com" ext-link-type="uri">birdflumonitor.com</ext-link></td>
<td valign="top" align="right">0</td>
<td valign="top" align="right"><bold>1,012</bold></td>
<td valign="top" align="right">1,012</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-ijerph-07-00596" position="float">
<label>Table 3.</label>
<caption>
<p>Closeness, betweenness and eigenvector centrality of the most frequent flu post bloggers, 5 October 2008 to 21 March 2009.</p></caption>
<table frame="box" rules="cols">
<thead>
<tr>
<th valign="bottom" align="center" rowspan="3"><bold>Blogger URL</bold></th>
<th valign="bottom" align="center" colspan="5"><bold>Centrality measures</bold></th></tr>
<tr>
<th valign="top" align="center" colspan="3"><bold>Closeness</bold></th>
<th valign="top" align="center" rowspan="2"><bold>Betweenness</bold></th>
<th valign="top" align="center" rowspan="2"><bold>Eigenvector PageRank</bold></th></tr>
<tr>
<th valign="bottom" align="center"><bold>In</bold></th>
<th valign="bottom" align="center"><bold>Out</bold></th>
<th valign="bottom" align="center"><bold>All</bold></th></tr>
<tr>
<th valign="bottom" align="center" colspan="6"><hr/></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="left"><ext-link xlink:href="crofsblogs.typepad.com/h5n1" ext-link-type="uri">crofsblogs.typepad.com/h5n1</ext-link></td>
<td valign="top" align="left"><bold>0.0001103</bold></td>
<td valign="top" align="left">0.00100130</td>
<td valign="top" align="left">0.00055580</td>
<td valign="top" align="left">0.00003249</td>
<td valign="top" align="left"><bold>0.00000373</bold></td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="birdcauseflu.com" ext-link-type="uri">birdcauseflu.com</ext-link></td>
<td valign="top" align="left">0.0000017</td>
<td valign="top" align="left">0.00000172</td>
<td valign="top" align="left">0.00000172</td>
<td valign="top" align="left">0.00000000</td>
<td valign="top" align="left">0.00000042</td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="medblogs.org" ext-link-type="uri">medblogs.org</ext-link></td>
<td valign="top" align="left">0.0000000</td>
<td valign="top" align="left">0.00001034</td>
<td valign="top" align="left">0.00000517</td>
<td valign="top" align="left">0.00000000</td>
<td valign="top" align="left">0.00000042</td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="afludiary.blogspot.com" ext-link-type="uri">afludiary.blogspot.com</ext-link></td>
<td valign="top" align="left">0.0000517</td>
<td valign="top" align="left">0.00113572</td>
<td valign="top" align="left">0.00059371</td>
<td valign="top" align="left"><bold>0.00005374</bold></td>
<td valign="top" align="left">0.00000149</td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="healthinform3.livejournal.com" ext-link-type="uri">healthinform3.livejournal.com</ext-link></td>
<td valign="top" align="left">0.0000000</td>
<td valign="top" align="left">0.00000689</td>
<td valign="top" align="left">0.00000345</td>
<td valign="top" align="left">0.00000000</td>
<td valign="top" align="left">0.00000042</td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="fluwikie2.com" ext-link-type="uri">fluwikie2.com</ext-link></td>
<td valign="top" align="left">0.0000603</td>
<td valign="top" align="left">0.00144593</td>
<td valign="top" align="left">0.00075313</td>
<td valign="top" align="left">0.00004468</td>
<td valign="top" align="left">0.00000042</td></tr>
<tr>
<td valign="top" align="left"><ext-link xlink:href="birdflumonitor.com" ext-link-type="uri">birdflumonitor.com</ext-link></td>
<td valign="top" align="left">0.0000000</td>
<td valign="top" align="left"><bold>0.00174408</bold></td>
<td valign="top" align="left">0.00087204</td>
<td valign="top" align="left">0.00000000</td>
<td valign="top" align="left">0.00000042</td></tr></tbody></table></table-wrap>
<table-wrap id="t4-ijerph-07-00596" position="float">
<label>Table 4.</label>
<caption>
<p>The six largest communities in <italic>influenza</italic>-content web and social media, discovered by the Girvan-Newman community-finding algorithm.</p></caption>
<table frame="box" rules="none">
<thead>
<tr>
<th valign="top" align="center" rowspan="2"><bold>Community Size</bold></th><th valign="top" align="center"/>
<th valign="top" align="center" colspan="3"><bold>Closeness</bold></th>
<th valign="top" align="center" rowspan="2"><bold>Betweenness</bold></th>
<th valign="top" align="center" rowspan="2"><bold>Eigenvector PageRank</bold></th></tr>
<tr>
<th valign="top" align="center"><bold>URL</bold></th>
<th valign="top" align="center"><bold>In</bold></th>
<th valign="top" align="center"><bold>Out</bold></th>
<th valign="top" align="center"><bold>All</bold></th></tr></thead>
<tbody>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>Personal blogs (Google feeds and FeedBurner) &amp; general-reporting newspapers</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">781</td>
<td valign="middle" align="left"><ext-link xlink:href="feeds.feedburner.com" ext-link-type="uri">feeds.feedburner.com</ext-link></td>
<td valign="middle" align="center">0.4532</td>
<td valign="middle" align="center">0.3802</td>
<td valign="middle" align="center">0.2661</td>
<td valign="middle" align="center">0.0533</td>
<td valign="middle" align="center"><bold>0.0347</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.google.org" ext-link-type="uri">www.google.org</ext-link></td>
<td valign="middle" align="center">0.4446</td>
<td valign="middle" align="center">0.4098</td>
<td valign="middle" align="center">0.2102</td>
<td valign="middle" align="center">0.0127</td>
<td valign="middle" align="center">0.0261</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.nytimes.com" ext-link-type="uri">www.nytimes.com</ext-link></td>
<td valign="middle" align="center"><bold>0.5614</bold></td>
<td valign="middle" align="center"><bold>0.4954</bold></td>
<td valign="middle" align="center"><bold>0.3207</bold></td>
<td valign="middle" align="center"><bold>0.0931</bold></td>
<td valign="middle" align="center">0.0116</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="feedproxy.google.com" ext-link-type="uri">feedproxy.google.com</ext-link></td>
<td valign="middle" align="center">0.4751</td>
<td valign="middle" align="center">0.3857</td>
<td valign="middle" align="center">0.2852</td>
<td valign="middle" align="center">0.0281</td>
<td valign="middle" align="center">0.0089</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.washingtonpost.com" ext-link-type="uri">www.washingtonpost.com</ext-link></td>
<td valign="middle" align="center">0.5346</td>
<td valign="middle" align="center">0.4749</td>
<td valign="middle" align="center">0.3093</td>
<td valign="middle" align="center">0.0321</td>
<td valign="middle" align="center">0.0074</td></tr>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>Mainstream network news, local news outlets</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">599</td>
<td valign="middle" align="left"><ext-link xlink:href="www.reuters.com" ext-link-type="uri">www.reuters.com</ext-link></td>
<td valign="middle" align="center"><bold>0.5606</bold></td>
<td valign="middle" align="center"><bold>0.4887</bold></td>
<td valign="middle" align="center"><bold>0.3252</bold></td>
<td valign="middle" align="center"><bold>0.0767</bold></td>
<td valign="middle" align="center"><bold>0.0113</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="news.xinhuanet.com" ext-link-type="uri">news.xinhuanet.com</ext-link></td>
<td valign="middle" align="center">0.5136</td>
<td valign="middle" align="center">0.4589</td>
<td valign="middle" align="center">0.3103</td>
<td valign="middle" align="center">0.0292</td>
<td valign="middle" align="center">0.0087</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="online.wsj.com" ext-link-type="uri">online.wsj.com</ext-link></td>
<td valign="middle" align="center">0.5315</td>
<td valign="middle" align="center">0.4766</td>
<td valign="middle" align="center">0.3105</td>
<td valign="middle" align="center">0.0306</td>
<td valign="middle" align="center">0.0076</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.bloomberg.com" ext-link-type="uri">www.bloomberg.com</ext-link></td>
<td valign="middle" align="center">0.5278</td>
<td valign="middle" align="center">0.4676</td>
<td valign="middle" align="center">0.3151</td>
<td valign="middle" align="center">0.0310</td>
<td valign="middle" align="center">0.0070</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.foxnews.com" ext-link-type="uri">www.foxnews.com</ext-link></td>
<td valign="middle" align="center">0.4952</td>
<td valign="middle" align="center">0.4566</td>
<td valign="middle" align="center">0.2894</td>
<td valign="middle" align="center">0.0121</td>
<td valign="middle" align="center">0.0061</td></tr>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>Primary audience outside United States</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">397</td>
<td valign="middle" align="left"><ext-link xlink:href="news.bbc.co.uk" ext-link-type="uri">news.bbc.co.uk</ext-link></td>
<td valign="middle" align="center"><bold>0.5440</bold></td>
<td valign="middle" align="center"><bold>0.4855</bold></td>
<td valign="middle" align="center"><bold>0.3064</bold></td>
<td valign="middle" align="center">0.0480</td>
<td valign="middle" align="center"><bold>0.0122</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.guardian.co.uk" ext-link-type="uri">www.guardian.co.uk</ext-link></td>
<td valign="middle" align="center">0.5255</td>
<td valign="middle" align="center">0.4683</td>
<td valign="middle" align="center">0.3034</td>
<td valign="middle" align="center"><bold>0.0758</bold></td>
<td valign="middle" align="center">0.0082</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.telegraph.co.uk" ext-link-type="uri">www.telegraph.co.uk</ext-link></td>
<td valign="middle" align="center">0.5045</td>
<td valign="middle" align="center">0.4571</td>
<td valign="middle" align="center">0.2934</td>
<td valign="middle" align="center">0.0159</td>
<td valign="middle" align="center">0.0070</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="news.google.co.uk" ext-link-type="uri">news.google.co.uk</ext-link></td>
<td valign="middle" align="center">0.4629</td>
<td valign="middle" align="center">0.4072</td>
<td valign="middle" align="center">0.2887</td>
<td valign="middle" align="center">0.0079</td>
<td valign="middle" align="center">0.0052</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.timesonline.co.uk" ext-link-type="uri">www.timesonline.co.uk</ext-link></td>
<td valign="middle" align="center">0.4906</td>
<td valign="middle" align="center">0.4487</td>
<td valign="middle" align="center">0.2855</td>
<td valign="middle" align="center">0.0065</td>
<td valign="middle" align="center">0.0051</td></tr>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>LiveJournal community and entertainment industry (Viacom, Reed)</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">145</td>
<td valign="middle" align="left"><ext-link xlink:href="latimesblogs.latimes.com" ext-link-type="uri">latimesblogs.latimes.com</ext-link></td>
<td valign="middle" align="center">0.4836</td>
<td valign="middle" align="center"><bold>0.4213</bold></td>
<td valign="middle" align="center">0.2775</td>
<td valign="middle" align="center">0.0046</td>
<td valign="middle" align="center"><bold>0.0025</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="community.livejournal.com" ext-link-type="uri">community.livejournal.com</ext-link></td>
<td valign="middle" align="center"><bold>0.4875</bold></td>
<td valign="middle" align="center">0.2454</td>
<td valign="middle" align="center"><bold>0.3224</bold></td>
<td valign="middle" align="center"><bold>0.0415</bold></td>
<td valign="middle" align="center">0.0017</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.people.com" ext-link-type="uri">www.people.com</ext-link></td>
<td valign="middle" align="center">0.4170</td>
<td valign="middle" align="center">0.3381</td>
<td valign="middle" align="center">0.2542</td>
<td valign="middle" align="center">0.0045</td>
<td valign="middle" align="center">0.0010</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.mtv.com" ext-link-type="uri">www.mtv.com</ext-link></td>
<td valign="middle" align="center">0.4491</td>
<td valign="middle" align="center">0.3612</td>
<td valign="middle" align="center">0.2784</td>
<td valign="middle" align="center">0.0034</td>
<td valign="middle" align="center">0.0006</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.variety.com" ext-link-type="uri">www.variety.com</ext-link></td>
<td valign="middle" align="center">0.3528</td>
<td valign="middle" align="center">0.2954</td>
<td valign="middle" align="center">0.2103</td>
<td valign="middle" align="center">0.0006</td>
<td valign="middle" align="center">0.0005</td></tr>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>Large news conglomerates (News Corp and Disney)</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">144</td>
<td valign="middle" align="left"><ext-link xlink:href="www.youtube.com" ext-link-type="uri">www.youtube.com</ext-link></td>
<td valign="middle" align="center"><bold>0.5704</bold></td>
<td valign="middle" align="center"><bold>0.5463</bold></td>
<td valign="middle" align="center">0.1737</td>
<td valign="middle" align="center"><bold>0.0096</bold></td>
<td valign="middle" align="center"><bold>0.0220</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.bloggingstocks.com" ext-link-type="uri">www.bloggingstocks.com</ext-link></td>
<td valign="middle" align="center">0.4023</td>
<td valign="middle" align="center">0.3506</td>
<td valign="middle" align="center"><bold>0.2619</bold></td>
<td valign="middle" align="center">0.0028</td>
<td valign="middle" align="center">0.0005</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="news.aol.com" ext-link-type="uri">news.aol.com</ext-link></td>
<td valign="middle" align="center">0.3379</td>
<td valign="middle" align="center">0.2731</td>
<td valign="middle" align="center">0.2410</td>
<td valign="middle" align="center">0.0034</td>
<td valign="middle" align="center">0.0005</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="sports.espn.go.com" ext-link-type="uri">sports.espn.go.com</ext-link></td>
<td valign="middle" align="center">0.4270</td>
<td valign="middle" align="center">0.3558</td>
<td valign="middle" align="center">0.2610</td>
<td valign="middle" align="center">0.0024</td>
<td valign="middle" align="center">0.0004</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.dailyfinance.com" ext-link-type="uri">www.dailyfinance.com</ext-link></td>
<td valign="middle" align="center">0.3996</td>
<td valign="middle" align="center">0.2959</td>
<td valign="middle" align="center">0.2652</td>
<td valign="middle" align="center">0.0005</td>
<td valign="middle" align="center">0.0004</td></tr>
<tr><td valign="middle" align="left"/>
<td valign="middle" align="center" colspan="6"><bold>Commentary, opinion, editorial</bold></td></tr>
<tr>
<td valign="middle" align="center" rowspan="5">127</td>
<td valign="middle" align="left"><ext-link xlink:href="whatreallyhappened.com" ext-link-type="uri">whatreallyhappened.com</ext-link></td>
<td valign="middle" align="center">0.4505</td>
<td valign="middle" align="center"><bold>0.4257</bold></td>
<td valign="middle" align="center">0.2348</td>
<td valign="middle" align="center">0.0374</td>
<td valign="middle" align="center"><bold>0.0159</bold></td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.prisonplanet.com" ext-link-type="uri">www.prisonplanet.com</ext-link></td>
<td valign="middle" align="center"><bold>0.4837</bold></td>
<td valign="middle" align="center">0.3750</td>
<td valign="middle" align="center"><bold>0.3068</bold></td>
<td valign="middle" align="center"><bold>0.0622</bold></td>
<td valign="middle" align="center">0.0142</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.torontosun.com" ext-link-type="uri">www.torontosun.com</ext-link></td>
<td valign="middle" align="center">0.4081</td>
<td valign="middle" align="center">0.3617</td>
<td valign="middle" align="center">0.2336</td>
<td valign="middle" align="center">0.0001</td>
<td valign="middle" align="center">0.0012</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.legitgov.org" ext-link-type="uri">www.legitgov.org</ext-link></td>
<td valign="middle" align="center">0.4287</td>
<td valign="middle" align="center">0.3857</td>
<td valign="middle" align="center">0.2540</td>
<td valign="middle" align="center">0.0008</td>
<td valign="middle" align="center">0.0010</td></tr>
<tr>
<td valign="middle" align="left"><ext-link xlink:href="www.presstv.ir" ext-link-type="uri">www.presstv.ir</ext-link></td>
<td valign="middle" align="center">0.4186</td>
<td valign="middle" align="center">0.3813</td>
<td valign="middle" align="center">0.1902</td>
<td valign="middle" align="center">0.0004</td>
<td valign="middle" align="center">0.0007</td></tr></tbody></table></table-wrap>
<table-wrap id="t5-ijerph-07-00596" position="float">
<label>Table 5.</label>
<caption>
<p>Graph-based data mining using Subdue to detect structural anomalies that facilitate identification of influenza-like illness.</p></caption>
<table frame="hsides" rules="cols">
<thead>
<tr>
<th valign="top" align="center">Week of:</th>
<th valign="top" align="center">Anomaly Found?</th>
<th valign="top" align="center">Unusual frequent substructures (publisher type, category, or URL)</th></tr>
<tr>
<th valign="top" align="center" colspan="3"><hr/></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="right">5-Oct-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">12-Oct-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">MySpace (URL), Mainstream Media (publisher type), Public Health (category)</td></tr>
<tr>
<td valign="top" align="right">19-Oct-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">Mainstream Media (publisher type), Flickr (URL), prisonplanet (URL)</td></tr>
<tr>
<td valign="top" align="right">26-Oct-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">MySpace (URL), Mainstream News (publisher type)</td></tr>
<tr>
<td valign="top" align="right">2-Nov-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">MySpace (URL), Barack Obama (category)</td></tr>
<tr>
<td valign="top" align="right">9-Nov-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">Mainstream Media (publisher type), Google FluTrends (URL)</td></tr>
<tr>
<td valign="top" align="right">16-Nov-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">Mainstream Media (publisher type), Amazon (URL), Google FluTrends (URL)</td></tr>
<tr>
<td valign="top" align="right">23-Nov-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">30-Nov-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">7-Dec-2008</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Yahoo Answers UK (URL), MySpace (URL), Fox News (URL)</td></tr>
<tr>
<td valign="top" align="right">14-Dec-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">21-Dec-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">Fox News (URL), MySpace (URL), BirdFluMonitor (URL)</td></tr>
<tr>
<td valign="top" align="right">28-Dec-2008</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">4-Jan-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Strong presence of personal blog to blog substructures.</td></tr>
<tr>
<td valign="top" align="right">11-Jan-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Strong presence of personal blog to blog substructures.</td></tr>
<tr>
<td valign="top" align="right">18-Jan-2009</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">25-Jan-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Forums (publisher type)</td></tr>
<tr>
<td valign="top" align="right">1-Feb-2009</td><td valign="top" align="center"/>
<td valign="top" align="left">NA</td></tr>
<tr>
<td valign="top" align="right">8-Feb-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Strong presence of personal blog to blog substructures.</td></tr>
<tr>
<td valign="top" align="right">15-Feb-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">High presence of personal blog to blog substructures.</td></tr>
<tr>
<td valign="top" align="right">22-Feb-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">High presence of personal blog to blog substructures.</td></tr>
<tr>
<td valign="top" align="right">1-Mar-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">MySpace (URL) &gt; 1500 substructures, Mainstream Media (publisher type)</td></tr>
<tr>
<td valign="top" align="right">8-Mar-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">MySpace (URL) &gt; 330 substructures, Mainstream Media (publisher type)</td></tr>
<tr>
<td valign="top" align="right">15-Mar-2009</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="left">Very high presence of personal blog to blog substructures.</td></tr></tbody></table></table-wrap>
<table-wrap id="t6-ijerph-07-00596" position="float">
<label>Table 6.</label>
<caption>
<p>Number of novel influenza A (H1N1) articles posted per week in 2009.</p></caption>
<table frame="box" rules="none">
<thead>
<tr>
<th valign="top" align="left">Week in 2009</th>
<th valign="top" align="center">17<sup>th</sup></th>
<th valign="top" align="center">18<sup>th</sup></th>
<th valign="top" align="center">19<sup>th</sup></th>
<th valign="top" align="center">20<sup>th</sup></th>
<th valign="top" align="center">21<sup>st</sup></th>
<th valign="top" align="center">22<sup>nd</sup></th>
<th valign="top" align="center">23<sup>rd</sup></th>
<th valign="top" align="center">24<sup>th</sup></th></tr>
<tr><th valign="top" align="center"/>
<th valign="top" align="center" colspan="8"><hr/></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="left"># of Articles</td>
<td valign="top" align="center">5,591</td>
<td valign="top" align="center">108,038</td>
<td valign="top" align="center">61,341</td>
<td valign="top" align="center">26,256</td>
<td valign="top" align="center">19,224</td>
<td valign="top" align="center">37,938</td>
<td valign="top" align="center">14,393</td>
<td valign="top" align="center">27,502</td></tr></tbody></table></table-wrap></sec></back></article>
