<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">futureinternet</journal-id>
      <journal-title>Future Internet</journal-title>
      <abbrev-journal-title abbrev-type="publisher">Future Internet</abbrev-journal-title>
      <abbrev-journal-title abbrev-type="pubmed">futureinternet</abbrev-journal-title>
      <issn pub-type="epub">1999-5903</issn>
      <publisher>
        <publisher-name>MDPI</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/fi4010238</article-id>
      <article-id pub-id-type="publisher-id">futureinternet-04-00238</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Readability and the Web</article-title>
      </title-group>
      
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Martin</surname>
            <given-names>Ludger</given-names>
          </name>
          <xref rid="af1-futureinternet-04-00238" ref-type="aff">1</xref>
          <xref rid="c1-futureinternet-04-00238" ref-type="corresp">*</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Gottron</surname>
            <given-names>Thomas</given-names>
          </name>
          <xref rid="af2-futureinternet-04-00238" ref-type="aff">2</xref>
        </contrib>
      </contrib-group>
      <aff id="af1-futureinternet-04-00238"><label>1 </label>Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz 55128, Germany</aff>
      <aff id="af2-futureinternet-04-00238"><label>2 </label>Institute for Web Science and Technologies, Universität Koblenz-Landau, Koblenz 56070, Germany; Email: <email>gottron@uni-koblenz.de</email></aff>
      <author-notes>
        <corresp id="c1-futureinternet-04-00238"><label>*</label> Author to whom correspondence should be addressed; Email: <email>martin@informatik.uni-mainz.de</email>.</corresp>
      </author-notes>
      <pub-date pub-type="epub">
        <day>12</day>
        <month>03</month>
        <year>2012</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>03</month>
        <year>2012</year>
      </pub-date>
      <volume>4</volume>
      <issue>1</issue>
      <fpage>238</fpage>
      <lpage>252</lpage>
      <history>
        <date date-type="received">
          <day>20</day>
          <month>12</month>
          <year>2011</year>
        </date>
        <date date-type="rev-recd">
          <day>07</day>
          <month>02</month>
          <year>2012</year>
        </date>
        <date date-type="accepted">
          <day>05</day>
          <month>03</month>
          <year>2012</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
        <copyright-year>2012</copyright-year>
        <license xmlns:xlink="http://www.w3.org/1999/xlink" license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0/">
          <p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p>
        </license>
      </permissions>
      <abstract>
        <p>Readability indices measure how easy or difficult it is to read and comprehend a text. In this paper we look at the relation between readability indices and web documents from two different perspectives. On the one hand, we analyse how to reliably measure the readability of web documents by applying content extraction techniques and incorporating a bias correction. On the other hand, we investigate how web-based corpus statistics can be used to measure readability in a novel and language-independent way.</p>
      </abstract>
      <kwd-group>
        <kwd>web document readability</kwd>
        <kwd>content extraction</kwd>
        <kwd>corpus statistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="intro">
      <title>1. Introduction</title>
      <p>Analysing a text for its readability, <italic>i.e.</italic>, how easy it is to read and comprehend, has a long tradition in the literature. Since the nineteenth century, researchers in linguistics and literary studies have been concerned with the question of how to measure the readability of a text. In recent years, the findings of this established field of research have been applied to web documents as well. Here, the notion of readability can serve several purposes. Readability metrics have been employed to assess the usability of web sites [<xref ref-type="bibr" rid="B1-futureinternet-04-00238">1</xref>], as a static quality metric for ranking web search results [<xref ref-type="bibr" rid="B2-futureinternet-04-00238">2</xref>] or to filter for documents that match a user’s reading ability [<xref ref-type="bibr" rid="B3-futureinternet-04-00238">3</xref>,<xref ref-type="bibr" rid="B4-futureinternet-04-00238">4</xref>,<xref ref-type="bibr" rid="B5-futureinternet-04-00238">5</xref>,<xref ref-type="bibr" rid="B6-futureinternet-04-00238">6</xref>].</p>
      <p>However, the differences between classical print media and the web are often neglected in this transfer of readability metrics to the web. These differences are twofold. On the one hand, text on the web is presented differently than in a printed book or magazine. This is motivated by technical constraints on either side, <italic>i.e.</italic>, a fixed page size <italic>vs.</italic> a scrollable screen viewport, or navigation via a table of contents and a look-up index with page numbers <italic>vs.</italic> navigation via hyperlink structures and navigation menus. On the other hand, users consume text content on the web differently compared to classical print media. Here too, the differences arise from usage patterns supported by the technologies, e.g., scanning and selective reading <italic>vs.</italic> linear and complete reading. In particular, users do not read all the text in a web document, but concentrate on the actually relevant parts while ignoring additional content, such as navigation menus or advertisements.</p>
      <p>Given these differences in the presentation and consumption of text, the application of readability metrics to web documents needs to be reconsidered. In this paper we address this topic from several angles. First, we consider the problem of text noise in web documents. This noise is additional text in a document that is typically not part of the main content and is perceived differently by a user. Consequently, this noise should not be considered when determining the readability of a web document. One way to eliminate the noise is to provide hand-crafted filters for cleaning the documents of a particular web site, which are all based on a common template. This approach requires tracking changes in the templates [<xref ref-type="bibr" rid="B7-futureinternet-04-00238">7</xref>] and adapting the filters accordingly. Further, implementing hand-crafted filters does not scale to an arbitrarily large number of web sites.</p>
      <p>Thus, we analyse and compare generic methods for removing the text noise in web documents, so-called content extraction (CE) algorithms. CE algorithms are generally applicable to all web documents and use document structure, text density or layout features to identify the main text content. As all state-of-the-art CE algorithms are based on heuristics and are typically not perfect, we look in a second step at the bias these methods introduce in the computation of readability metrics. As this bias is systematic, we then propose corrective measures to counterbalance it. Finally, we investigate the potential of using the web itself to define metrics for document readability.</p>
      <p>Altogether, we make the following contributions in this paper:</p>
      <list list-type="bullet">
        <list-item>
          <p>We bring together work on readability assessment of web documents and content extraction techniques.</p>
        </list-item>
        <list-item>
          <p>We qualitatively evaluate the impact of general purpose content extraction methods on the estimation of the readability of a web document.</p>
        </list-item>
        <list-item>
          <p>We propose a bias correction depending on the CE methods which leads to improvements in readability estimation.</p>
        </list-item>
        <list-item>
          <p>We investigate the potential of designing domain specific readability metrics by incorporating web based reference corpora.</p>
        </list-item>
      </list>
      <p>The rest of the paper is structured as follows: In <xref ref-type="sec" rid="sec2-futureinternet-04-00238">Section 2</xref> we introduce some of the more common readability formulae. We present related work in <xref ref-type="sec" rid="sec3-futureinternet-04-00238">Section 3</xref>, with particular focus on the application of readability metrics to web documents and on content extraction techniques. In <xref ref-type="sec" rid="sec4-futureinternet-04-00238">Section 4</xref> we consider how readability can be determined accurately for web documents. <xref ref-type="sec" rid="sec5-futureinternet-04-00238">Section 5</xref> investigates the potential to use web resources to compute a readability metric. Finally, we conclude the paper with an outlook on future work.</p>
    </sec>
    <sec id="sec2-futureinternet-04-00238">
      <title>2. Readability Formulae</title>
      <p>A readability index is a measure that expresses the complexity of written text. Such indices are often based on simple features, such as sentence and word length, and indicate how easy it is to read and comprehend a text. While a wide variety of readability indices is covered in the related literature, we focus here on three well-established methods: Flesch Reading Ease, the SMOG grading index and the Gunning fog index.</p>
      <p>The <italic>Flesch Reading Ease (FRE)</italic> index [<xref ref-type="bibr" rid="B8-futureinternet-04-00238">8</xref>] is a long established index in this context. The score of FRE typically ranges between 0 and 100 [<xref ref-type="bibr" rid="B9-futureinternet-04-00238">9</xref>]. A higher score indicates a text that is easier to read and comprehend. For instance, a text with a score between 90 and 100 should be understandable for 11-year-old students, while a score lower than 30 requires the reader to be at the level of a college graduate. Let <italic>y</italic> denote the total number of syllables in a text, <italic>w</italic> the number of words and <italic>s</italic> the number of sentences. The FRE index <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i004.tif"/> is then defined by: </p>
      <disp-formula id="futureinternet-04-00238-i005">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i005.tif"/>

<label>(1)</label>
</disp-formula>
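      <p>As a minimal sketch, Equation (1) can be evaluated as follows. The coefficients are the commonly cited ones for Flesch Reading Ease and are an assumption here, since the equation itself is only available as an image; the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
def flesch_reading_ease(syllables: int, words: int, sentences: int) -> float:
    """Flesch Reading Ease from raw counts y (syllables), w (words), s (sentences).

    The coefficients 206.835, 84.6 and 1.015 are the commonly cited values
    (assumed here, not read from the article's equation image).
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    syllables_per_word = syllables / words
    words_per_sentence = words / sentences
    return 206.835 - 84.6 * syllables_per_word - 1.015 * words_per_sentence
```
]]></preformat>
      <p>Under these assumed coefficients, longer sentences and longer words both lower the score, which matches the interpretation that higher values indicate easier text.</p>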
      <p>McLaughlin [<xref ref-type="bibr" rid="B10-futureinternet-04-00238">10</xref>] introduced a different parameter in his readability formula: the number of polysyllables. A polysyllable is a word made of three or more syllables. If we denote the number of polysyllables in a text with <italic>p</italic>, and use <italic>s</italic> again for the number of sentences, then the <italic>SMOG grading</italic> index <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i007.tif"/> is defined as: </p>
      <disp-formula id="futureinternet-04-00238-i008">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i008.tif"/>

<label>(2)</label>
</disp-formula>
      <p>The SMOG grading index indicates the educational level required to comprehend a text, <italic>i.e.</italic>, the years of school education required to understand it. For instance, a SMOG grading index value of 5 indicates that a text can be understood after five years of school education. To calculate the index of a large text, McLaughlin stated that it is sufficient to use three text samples of 10 sentences each.</p>
      <p>Another metric with a similar intention is the <italic>Gunning fog</italic> index [<xref ref-type="bibr" rid="B11-futureinternet-04-00238">11</xref>]. Like SMOG, this index estimates the years of education required to understand a given text. To calculate the Gunning fog index, a passage of around 100 words needs to be analysed. The Gunning fog index also considers polysyllables, but only those which are not proper nouns, compound words, <italic>etc</italic>. The Gunning fog index <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i009.tif"/> is defined as: </p>
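      <p>A minimal sketch of the SMOG computation from the counts <italic>p</italic> and <italic>s</italic> is given below. It assumes the commonly cited form of McLaughlin's formula with the constants 1.043 and 3.1291; the equation image in this article may use a different variant, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade from p (words with three or more syllables) and s (sentences).

    Uses the commonly cited constants (an assumption, see the lead-in).
    The factor 30 normalises the polysyllable count to a 30-sentence sample.
    """
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.043 * math.sqrt(30 * polysyllables / sentences) + 3.1291
```
]]></preformat>
      <p>Because the count is normalised to 30 sentences, the same function can be applied to three samples of 10 sentences or to a complete text.</p>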
      <disp-formula id="futureinternet-04-00238-i010">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i010.tif"/>

<label>(3)</label>
</disp-formula>
      <p>The original Gunning fog formula was based on clauses and not on the number of sentences <italic>s</italic>. This rendered the Gunning fog index too difficult to calculate automatically. Thus, the formulation presented in Equation (3) is now generally recommended.</p>
      <p>Both the Gunning fog and the SMOG grading index use only small samples of a text to calculate a readability score. This was originally motivated simply by practical reasons. Both formulae are relatively old: the Gunning fog index was published in 1952, the SMOG grading index in 1969. At that time the formulae needed to be calculated by hand. With the rise of electronic means to process texts, both formulae have been implemented for automatic computation. Hence, their computation can easily be extended to a full text, even for entire books. One question of interest is whether there is a difference when considering the full text rather than just some parts and samples of a text. The chart in <xref ref-type="fig" rid="futureinternet-04-00238-f001">Figure 1</xref> shows the development of the SMOG grading index across sub-samples of 30 sentences over a full text (in this case a novel for children).</p>
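      <p>The sentence-based formulation can be sketched as follows, assuming the widely used form 0.4 · (<italic>w</italic>/<italic>s</italic> + 100 · <italic>p</italic>/<italic>w</italic>). This form is an assumption, as Equation (3) is only available as an image, and the exclusion of proper nouns and compound words is left to the caller.</p>
      <preformat><![CDATA[
```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning fog index from w (words), s (sentences) and p (complex words).

    `complex_words` is expected to already exclude proper nouns, compound
    words, etc.; detecting those is outside the scope of this sketch.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    average_sentence_length = words / sentences
    percent_complex = 100 * complex_words / words
    return 0.4 * (average_sentence_length + percent_complex)
```
]]></preformat>
      <p>For a 100-word passage with 5 sentences and 10 complex words, this assumed form gives 0.4 · (20 + 10) = 12.</p>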
      <fig id="futureinternet-04-00238-f001" position="anchor">
        <label>Figure 1</label>
        <caption>
          <p>Readability over sub-samples of a longer text.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-g001.tif"/>
      </fig>
      <p>The average of the SMOG grading index over all sub-samples is 4.00 and the SMOG grading index calculated on the whole text is 4.01. This difference can be neglected. The minimal index is 3.73 and the maximal index is 4.35. If one of these sub-samples is selected at random, the estimate may deviate from the full-text value by about a third of a school year (4.35 <italic>vs.</italic> 4.01), and two sub-samples may differ by more than half a school year (4.35 <italic>vs.</italic> 3.73).</p>
      <p>The SMOG grading index and the Gunning fog index both claim to estimate the years of school education required to understand a text. <xref ref-type="table" rid="futureinternet-04-00238-t001">Table 1</xref> shows the SMOG grading and Gunning fog index of four texts. The table also shows the grade differences <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i012.tif"/>. Although both indices are meant to express the same quantity, the difference ranges from 2.33 up to 4.35 years of school education.</p>
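      <p>The comparison between sub-samples and the full text can be sketched as follows. The sketch assumes consecutive, non-overlapping windows of 30 sentences and the commonly cited SMOG constants; neither detail is confirmed by the text, and the per-sentence polysyllable counts are taken as given input.</p>
      <preformat><![CDATA[
```python
import math
from typing import List


def smog(polysyllables: int, sentences: int) -> float:
    # Commonly cited SMOG constants (an assumption, not taken from the article).
    return 1.043 * math.sqrt(30 * polysyllables / sentences) + 3.1291


def smog_over_subsamples(poly_per_sentence: List[int], size: int = 30) -> List[float]:
    """SMOG for consecutive, non-overlapping windows of `size` sentences.

    A trailing window with fewer than `size` sentences is dropped.
    """
    scores = []
    for start in range(0, len(poly_per_sentence) - size + 1, size):
        window = poly_per_sentence[start:start + size]
        scores.append(smog(sum(window), size))
    return scores


def smog_full_text(poly_per_sentence: List[int]) -> float:
    """SMOG computed once over all sentences of the text."""
    return smog(sum(poly_per_sentence), len(poly_per_sentence))
```
]]></preformat>
      <p>Comparing the minimum, maximum and mean of the values returned by <monospace>smog_over_subsamples</monospace> with <monospace>smog_full_text</monospace> reproduces the kind of analysis summarised above.</p>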
      <table-wrap id="futureinternet-04-00238-t001" position="anchor">
        <object-id pub-id-type="pii">futureinternet-04-00238-t001_Table 1</object-id>
        <label>Table 1</label>
        <caption>
          <p>Absolute difference Δ between SMOG and Gunning fog values for selected documents.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th align="left" valign="middle">Document </th>
              <th align="center" valign="middle">SMOG </th>
              <th align="center" valign="middle">Gunning fog </th>
              <th align="center" valign="middle">Δ</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" valign="middle">Enid Blyton: Famous Five—Five Are Together Again </td>
              <td align="center" valign="middle">4.01 </td>
              <td align="center" valign="middle">6.34 </td>
              <td align="center" valign="middle">2.33</td>
            </tr>
            <tr>
              <td align="left" valign="middle">William Shakespeare: Romeo and Juliet </td>
              <td align="center" valign="middle">4.05 </td>
              <td align="center" valign="middle">7.04 </td>
              <td align="center" valign="middle">2.99</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Bible: Genesis 1–50 </td>
              <td align="center" valign="middle">4.08 </td>
              <td align="center" valign="middle">7.39 </td>
              <td align="center" valign="middle">3.31</td>
            </tr>
            <tr>
              <td align="left" valign="middle">Arthur Conan Doyle: The Hound of the Baskervilles </td>
              <td align="center" valign="middle">4.27 </td>
              <td align="center" valign="middle">8.62 </td>
              <td align="center" valign="middle">4.35</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>This fluctuation in the results, as well as the need to avoid a sampling bias, motivated us to always compute readability scores on the full text under consideration.</p>
      <p>While FRE and SMOG were defined for the English language, readability indices have also been developed for other languages. Amstad [<xref ref-type="bibr" rid="B12-futureinternet-04-00238">12</xref>] describes a variation of FRE with adapted weights for German: </p>
      <disp-formula id="futureinternet-04-00238-i015">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i015.tif"/>

<label>(4)</label>
</disp-formula>
      <p>The <italic>Wiener Sachtextformel</italic>, in contrast, was developed directly to describe the readability of German texts [<xref ref-type="bibr" rid="B13-futureinternet-04-00238">13</xref>]. It exists in different versions to cope with discontinuities and non-linear developments in the difficulty of reading German texts.</p>
      <p>For some other languages, there are no readability indices, or at least none that are well established. To overcome this problem, Tanguy and Tulechki [<xref ref-type="bibr" rid="B14-futureinternet-04-00238">14</xref>] looked at an approach for identifying the linguistic characteristics that might be most suitable to describe sentence complexity. Their approach was applied to French texts. However, they neither developed a formula nor investigated whether the features they found actually correlate with readability.</p>
    </sec>
    <sec id="sec3-futureinternet-04-00238">
      <title>3. Related Work</title>
      <p>While the readability indices we recalled in <xref ref-type="sec" rid="sec2-futureinternet-04-00238">Section 2</xref> were originally developed for text documents, they are nowadays applied to web documents, too. Typical applications in this context are to assess how comprehensible legal statements are [<xref ref-type="bibr" rid="B1-futureinternet-04-00238">1</xref>] or to let users select documents most appropriate to their educational level. The latter application can be subdivided further into ranking web search results according to their readability [<xref ref-type="bibr" rid="B2-futureinternet-04-00238">2</xref>] and filtering out documents that do not match the reading ability of a given user [<xref ref-type="bibr" rid="B3-futureinternet-04-00238">3</xref>,<xref ref-type="bibr" rid="B4-futureinternet-04-00238">4</xref>].</p>
      <p>So far, little work has considered the differences in the calculation of readability metrics for web documents. Petersen and Ostendorf [<xref ref-type="bibr" rid="B15-futureinternet-04-00238">15</xref>] focused in particular on identifying web documents that contain so little text that computing a readability score is not suitable at all. This distinction is based on a supervised learning approach using very common words as the feature set for the documents. Few systems address the problem of identifying the actual main content in a web document when computing its readability. Nonaka <italic>et al</italic>. [<xref ref-type="bibr" rid="B2-futureinternet-04-00238">2</xref>] present a domain- and language-dependent approach to detect the particular structure of how-to manuals. The Read-X system [<xref ref-type="bibr" rid="B4-futureinternet-04-00238">4</xref>,<xref ref-type="bibr" rid="B16-futureinternet-04-00238">16</xref>], in contrast, uses a heuristic content extraction tool to improve the estimation of readability. The extraction module in Read-X is based on a screen scraper software library supporting the development of hand-crafted extraction filters for template-based documents. However, so far no system involves state-of-the-art generic content extraction algorithms.</p>
      <p>Generic content extraction algorithms are covered in various publications. Typically the algorithms are based on heuristics and follow some assumptions about the shape, structure and form of the main content. A thorough evaluation of different approaches is presented in [<xref ref-type="bibr" rid="B17-futureinternet-04-00238">17</xref>]. In this comparison the Document Slope Curve filter [<xref ref-type="bibr" rid="B18-futureinternet-04-00238">18</xref>] showed good efficiency and effectiveness. Newer methods have advanced the accuracy (e.g., Content Code Blurring [<xref ref-type="bibr" rid="B19-futureinternet-04-00238">19</xref>]), efficiency (e.g., the Density algorithm [<xref ref-type="bibr" rid="B20-futureinternet-04-00238">20</xref>]) or addressed language-specific scenarios (e.g., the DANA approach for Arabic [<xref ref-type="bibr" rid="B21-futureinternet-04-00238">21</xref>]). All modern CE methods are highly efficient: they can process between 10 and 20 MB of web document data per second on commodity hardware and are thus suitable for on-the-fly removal of noise from web documents. State-of-the-art extraction algorithms achieve an extraction performance with an average F1 score between 0.86 and 0.96, depending on the application scenario. A detailed analysis also showed that some methods are more biased towards a higher precision, while others favour a high recall in extracting the main content [<xref ref-type="bibr" rid="B17-futureinternet-04-00238">17</xref>].</p>
      <p>In [<xref ref-type="bibr" rid="B22-futureinternet-04-00238">22</xref>] we looked at the influence of noise in web documents—such as navigation menus, related links lists, headers, footers, disclaimers—on the automatically determined readability score. We found that content extraction algorithms like the Document Slope Curve filter [<xref ref-type="bibr" rid="B18-futureinternet-04-00238">18</xref>] or Content Code Blurring [<xref ref-type="bibr" rid="B19-futureinternet-04-00238">19</xref>] led to much better estimates of the readability of the document’s actual main content.</p>
      <p>Yan <italic>et al</italic>. [<xref ref-type="bibr" rid="B23-futureinternet-04-00238">23</xref>] investigate domain-specific readability. They explain that domain-specific texts contain technical or professional terms, whose difficulty cannot be captured by traditional syllable-counting metrics such as SMOG or FRE. Moreover, common words can become technical terms in domain-specific texts. They also propose to compute a relative readability score rather than an absolute grade-level metric. They present several complex formulae to calculate their concept-based document readability. These include calculations on the word level, consider a given knowledge base and document scope, and also account for general words which are not part of a common word list. Yan <italic>et al</italic>. state that traditional readability formulae are oversimplified. The drawback of their approach, however, lies in the need for a domain expert to formulate the required knowledge base. A general-purpose formula that automatically determines the characteristics of a domain-specific language would be preferable.</p>
      <p>There are several approaches looking at alternative features for determining the readability of text. For instance, Rosa and Eskenazi [<xref ref-type="bibr" rid="B24-futureinternet-04-00238">24</xref>] consider word complexity. In the context of computing the readability of a text written in a language that is not the reader’s native language, they try to determine factors that make an individual word easier to learn. One such factor is word complexity. It can be measured by the word’s grapheme to phoneme ratio and the number of meanings a word has. Another approach in a similar study for French as a foreign language [<xref ref-type="bibr" rid="B25-futureinternet-04-00238">25</xref>] focusses on multi-word expressions (MWE) as the basis for a formula for measuring readability. The score of the formula is related to the <italic>Common European Framework of Reference for Languages</italic>. The most important variables are the proportion of the nominal MWEs to the number of words and the mean size of the nominal MWEs in the text. The first is significantly related to the difficulty of the text. The option of using word frequencies to determine readability is raised by Weir and Ritchie [<xref ref-type="bibr" rid="B26-futureinternet-04-00238">26</xref>]. In <xref ref-type="sec" rid="sec5dot1-futureinternet-04-00238">Section 5.1</xref> we follow this line of thought, and investigate a similar feature to determine the readability of a text.</p>
    </sec>
    <sec id="sec4-futureinternet-04-00238">
      <title>4. Readability of Web Documents</title>
      <p>Since the rise of the World Wide Web, more and more texts appear online and as part of web sites. Naturally, producers as well as consumers of online texts are interested in the readability of online documents. This has led to the application of readability formulae to HTML documents [<xref ref-type="bibr" rid="B1-futureinternet-04-00238">1</xref>]. The typical approach here was to simply take a full HTML document, strip off all the markup and pass the remaining text to a readability formula. The result of the formula is then interpreted as the readability of the document.</p>
      <p>While this is a straightforward solution to the problem, there is a conceptual mistake in the approach. As mentioned before, web documents are designed and consumed differently than classical printed documents. In particular, users do not read a document entirely, but rather scan the web page first to determine where the main content is located and whether this content is relevant and of interest to them. Once users have identified relevant information, they read the document selectively and focus on those parts comprising the main content. Other, additional contents, such as navigation menus, lists of related links, legal disclaimers, advertisements or header and footer elements with text, are typically ignored when reading a document. So, essentially, as these text contents are not actually read, they should be ignored in the computation of readability metrics.</p>
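      <p>The naive baseline described above can be sketched with Python's standard <monospace>html.parser</monospace> module. How markup was stripped in earlier work is not specified, so this is only one plausible realisation, and the class and function names are hypothetical. The sketch illustrates that text from navigation elements ends up in the input of the readability formula.</p>
      <preformat><![CDATA[
```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects all character data except the content of script/style elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_markup(html: str) -> str:
    """Return all visible text of an HTML document, one text node per line."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)
```
]]></preformat>
      <p>Applied to a page with a navigation list and one article paragraph, the result contains the menu entries as well as the article text, which is exactly the noise that should not enter the readability computation.</p>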
      <sec>
        <title>4.1. Content Extraction</title>
        <p>Content Extraction (CE) is the process of determining those parts of an HTML document which represent its main text content. Hence, it is a suitable solution to the problem described above. A qualitative evaluation of several CE approaches in [<xref ref-type="bibr" rid="B17-futureinternet-04-00238">17</xref>] showed that modern methods demonstrate a very good performance in terms of accuracy. State-of-the-art methods achieve F1 scores of up to 0.96. However, CE methods are typically not capable of perfectly extracting the main content. Some methods tend to be too restrictive and discard some parts of the main content during the extraction process, while others are too lax and also extract additional content. This bias in the methods needs to be taken into account when designing applications incorporating CE.</p>
        <p>In the context of this paper we apply two established and well performing methods: <italic>Adapted Content Code Blurring (ACCB)</italic> [<xref ref-type="bibr" rid="B19-futureinternet-04-00238">19</xref>] and <italic>Document Slope Curves (DSC)</italic> [<xref ref-type="bibr" rid="B18-futureinternet-04-00238">18</xref>]. The details of these algorithms are beyond the scope of this paper and can be found in the original publications.</p>
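        <p>To illustrate the F1 score and the precision/recall bias mentioned above, the following sketch compares an extracted text with a gold-standard main content on the level of word multisets. This bag-of-words view is a simplifying assumption; the evaluation in the cited work may be based on a different notion of overlap, and the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Tuple


def extraction_scores(gold: str, extracted: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of an extracted text against the gold text.

    Both texts are treated as multisets of white-space separated words.
    """
    gold_words = Counter(gold.split())
    extracted_words = Counter(extracted.split())
    # Multiset intersection: words found in both texts, respecting counts.
    overlap = sum((gold_words & extracted_words).values())
    precision = overlap / sum(extracted_words.values()) if extracted_words else 0.0
    recall = overlap / sum(gold_words.values()) if gold_words else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```
]]></preformat>
        <p>A method that is too lax keeps the whole main content plus noise and therefore reaches perfect recall at reduced precision, whereas a method that is too restrictive shows the opposite pattern.</p>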
      </sec>
      <sec>
        <title>4.2. Experimental Setup</title>
        <p>We base our work on data and initial experiments we conducted in [<xref ref-type="bibr" rid="B22-futureinternet-04-00238">22</xref>]. To evaluate and quantify the impact of noise in web documents we crawled 1114 web documents from five different web sources. All sources provided English news articles with a text-based main content. The length of the main content typically ranged from 300 to 1100 words, with very few significantly shorter or longer outliers.</p>
        <p>To provide a gold standard we manually determined the actual main content in each of the documents and calculated FRE and SMOG on these texts. As a baseline, we applied the same readability metrics to the full documents including all the text noise, such as navigation menus, headers, footers, advertisements, legal disclaimers, <italic>etc</italic>. Finally, we automatically removed additional content from the documents using the ACCB and DSC filters and evaluated the readability of the remainder of the document.</p>
        <p>For the computation of the readability metrics we determined sentence boundaries based on end-of-sentence characters and text structure. We paid attention to the context of end-of-sentence characters, e.g., by requiring a subsequent white space character and by checking against a list of common abbreviations to avoid detecting a premature end of a sentence. Additionally, we considered a sentence to start at the beginning of each paragraph and to stop at the end of a paragraph. We tokenized sentences into words at white space characters and further special characters, like colon, comma, quotation marks, <italic>etc</italic>. We did not employ compound splitters or other more sophisticated methods. For the decomposition of words into syllables we relied on the hyphenation of the LaTeX package, which can easily be incorporated into other programs. This provided us with all the features necessary for the computation of SMOG and FRE.</p>
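        <p>The sentence splitting and tokenization rules described above can be sketched as follows. The abbreviation list is a hypothetical, deliberately tiny stand-in for the list actually used, blank lines are assumed to mark paragraph boundaries, and syllable counting via LaTeX hyphenation is not covered.</p>
        <preformat><![CDATA[
```python
import re
from typing import List

# Hypothetical abbreviation list; the list used in the experiments is not given.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "vs.", "mr.", "mrs.", "dr."}


def split_sentences(text: str) -> List[str]:
    """Split text into sentences; blank lines are treated as paragraph breaks."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        current = []
        for token in paragraph.split():
            current.append(token)
            # split() guarantees that every token was followed by white space
            # or by the end of the paragraph.
            if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:  # the end of a paragraph always closes the running sentence
            sentences.append(" ".join(current))
    return sentences


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into words at white space and selected special characters."""
    words = []
    for raw in re.split(r'[\s:,;"()]+', sentence):
        # Keep known abbreviations intact, strip sentence punctuation otherwise.
        word = raw if raw.lower() in ABBREVIATIONS else raw.strip(".!?")
        if word:
            words.append(word)
    return words
```
]]></preformat>
        <p>The counts of sentences and words obtained this way, together with a syllable count per word, are the inputs required by the readability formulae.</p>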
      </sec>
      <sec>
        <title>4.3. Results</title>
        <p><xref ref-type="table" rid="futureinternet-04-00238-t002">Table 2</xref> shows the values we obtained for the SMOG index when calculating it on our gold standard (the actual main content), on the full document and after having cleaned the documents using ACCB and DSC. The values indicate clearly that employing CE (columns ACCB and DSC) in the course of determining readability provides more accurate results. However, the aggregated values do not show the variations and fluctuations of the values on individual documents, which led us to analyse the results in more detail.</p>
        <table-wrap id="futureinternet-04-00238-t002" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t002_Table 2</object-id>
          <label>Table 2</label>
          <caption>
            <p>Average SMOG index value for web documents based on the actual article text, the full text and after cleaning the document using the content extraction methods ACCB and DSC.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th rowspan="2" align="left" valign="middle">Source</th>
                <th rowspan="2" align="center" valign="middle">Number of documents</th>
                <th colspan="4" align="center" valign="middle">SMOG </th>
              </tr>
              <tr style="border-top: solid thin">
                <th valign="middle">Gold standard </th>
                <th valign="middle">Full</th>
                <th valign="middle">ACCB </th>
                <th valign="middle">DSC</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">BBC News</td>
                <td align="center" valign="middle">337</td>
                <td align="center" valign="middle">4.8323</td>
                <td align="center" valign="middle">4.0569 </td>
                <td align="center" valign="middle">4.9360 </td>
                <td align="center" valign="middle">4.8052 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">The Economist </td>
                <td align="center" valign="middle">53</td>
                <td align="center" valign="middle">5.0578</td>
                <td align="center" valign="middle">4.2486 </td>
                <td align="center" valign="middle">5.1433 </td>
                <td align="center" valign="middle">5.0835 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Herald Tribune </td>
                <td align="center" valign="middle">300</td>
                <td align="center" valign="middle">5.0477</td>
                <td align="center" valign="middle">4.0891 </td>
                <td align="center" valign="middle">5.0650 </td>
                <td align="center" valign="middle">5.0412 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">MSNBC News</td>
                <td align="center" valign="middle">197</td>
                <td align="center" valign="middle">4.8949</td>
                <td align="center" valign="middle">4.4675 </td>
                <td align="center" valign="middle">4.9050 </td>
                <td align="center" valign="middle">4.8491 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Yahoo News</td>
                <td align="center" valign="middle">227</td>
                <td align="center" valign="middle">4.9416</td>
                <td align="center" valign="middle">4.2063 </td>
                <td align="center" valign="middle">4.7563 </td>
                <td align="center" valign="middle">4.7670 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Total</td>
                <td align="center" valign="middle">1114</td>
                <td align="center" valign="middle">4.9344</td>
                <td align="center" valign="middle">4.1793 </td>
                <td align="center" valign="middle">4.9385 </td>
                <td align="center" valign="middle">4.8820 </td>
              </tr>
            </tbody>
  </table>
        </table-wrap>
        <p>We measured the correlation of the readability of the actual, hand-cleaned main content with the readability values obtained for the full document and with those obtained for the automatically cleaned documents. As shown in <xref ref-type="table" rid="futureinternet-04-00238-t003">Table 3</xref>, the actual readability of a document and the values computed on the full document are at best weakly correlated, while the cleaned documents show a relatively good correlation. Further, we looked at the mean squared error (MSE) of the data series to measure the local deviations. Here, too, employing ACCB and DSC leads to far better values.</p>
        <table-wrap id="futureinternet-04-00238-t003" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t003_Table 3</object-id>
          <label>Table 3</label>
          <caption>
            <p>Correlation and MSE in measuring readability for different choices of text samples.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th align="left" valign="middle">Choice of text sample </th>
                <th align="center" valign="middle">Correlation </th>
                <th align="center" valign="middle">MSE</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">Full document </td>
                <td align="center" valign="middle">0.3220 </td>
                <td align="center" valign="middle">0.6624 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">ACCB </td>
                <td align="center" valign="middle">0.8406 </td>
                <td align="center" valign="middle">0.0277 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">DSC </td>
                <td align="center" valign="middle">0.8612 </td>
                <td align="center" valign="middle">0.0277 </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
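        <p>For reference, the two quality measures reported in Table 3 can be computed as in the following sketch. It assumes two equally long lists of per-document readability values, one for the gold standard and one for the text sample under evaluation; the function names are chosen for illustration only.</p>
        <preformat><![CDATA[
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(gold, estimate):
    """Mean squared error between gold-standard values and estimates."""
    return sum((g - e) ** 2 for g, e in zip(gold, estimate)) / len(gold)
```
]]></preformat>
        <p>The correlation captures whether the estimates follow the gold standard up to a linear transformation, whereas the MSE also penalizes a constant offset; this is why both values are reported.</p>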
      </sec>
      <sec>
        <title>4.4. Bias Correction in Readability on the Web</title>
        <p>The good correlation values obtained when cleaning the documents with the ACCB and DSC content extraction filters indicate a linear relation between the readability value of the actual content and the index computed on the automatically created extract. However, looking at the individual values in <xref ref-type="table" rid="futureinternet-04-00238-t002">Table 2</xref>, ACCB, for instance, tends to overestimate the level of difficulty.</p>
        <p>Our hypothesis is that these deviations are systematic and are caused by the bias of CE algorithms towards either better precision or better recall when determining the main content. If this hypothesis is correct, we can correct the deviations by learning a functional relation between the true readability values obtained on the gold standard and the estimates obtained via the CE methods.</p>
        <p>To learn such a functional relation, we sampled a random subset of 100 documents from our test corpus and trained a linear regression model on these data. Using the results of this model we can correct the bias in the computation of the readability. To check, furthermore, whether such a bias correction would be sufficient to replace CE methods entirely, we also trained a model on the SMOG index values of the full documents. This led to the following bias-corrected (<italic>bc</italic>) formulae for SMOG on different text samples: </p>
        <disp-formula id="futureinternet-04-00238-i017">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i017.tif"/>

<label>(5)</label>
</disp-formula>
        <disp-formula id="futureinternet-04-00238-i018">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i018.tif"/>

<label>(6)</label>
</disp-formula>
        <disp-formula id="futureinternet-04-00238-i019">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i019.tif"/>

<label>(7)</label>
</disp-formula>
        <p>We evaluated the bias correction by computing the MSE again for the adjusted values on the remaining 1,014 documents. The results in <xref ref-type="table" rid="futureinternet-04-00238-t004">Table 4</xref> show that the bias correction reduces the MSE in all series. The biggest relative improvement for SMOG is obtained on the full documents. However, the best absolute value is achieved for the bias-corrected SMOG index on the DSC filter. For the ACCB content extraction filter the correction also yields an improvement, but the result does not reach the quality of DSC.</p>
        <table-wrap id="futureinternet-04-00238-t004" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t004_Table 4</object-id>
          <label>Table 4</label>
          <caption>
            <p>MSE with and without bias correction.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th align="left" valign="middle">Choice of text sample </th>
                <th align="center" valign="middle">No correction </th>
                <th align="center" valign="middle">Bias correction</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">Full document </td>
                <td align="center" valign="middle">0.6624 </td>
                <td align="center" valign="middle">0.0833</td>
              </tr>
              <tr>
                <td align="left" valign="middle">ACCB </td>
                <td align="center" valign="middle">0.0277 </td>
                <td align="center" valign="middle">0.0249</td>
              </tr>
              <tr>
                <td align="left" valign="middle">DSC </td>
                <td align="center" valign="middle">0.0277 </td>
                <td align="center" valign="middle">0.0234 </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
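        <p>The bias-correction procedure of this section can be sketched as an ordinary least-squares fit of the gold-standard values against the estimates. The sketch assumes a model of the form gold = slope × estimate + intercept, fitted on the training sample and then applied to the remaining documents; the function names are illustrative, and no coefficients from Equations (5)–(7) are reproduced here.</p>
        <preformat><![CDATA[
```python
def fit_linear(estimates, gold):
    """Least-squares fit of gold = slope * estimate + intercept."""
    n = len(estimates)
    mean_x = sum(estimates) / n
    mean_y = sum(gold) / n
    sxx = sum((x - mean_x) ** 2 for x in estimates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estimates, gold))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correct(estimate, slope, intercept):
    """Apply the learned bias correction to one readability estimate."""
    return slope * estimate + intercept


def evaluate(estimates, gold, slope, intercept):
    """MSE of the bias-corrected estimates on held-out documents."""
    errors = [(g - correct(e, slope, intercept)) ** 2
              for e, g in zip(estimates, gold)]
    return sum(errors) / len(errors)
```
]]></preformat>
        <p>In this setting, <monospace>fit_linear</monospace> would be called once per text sample (full document, ACCB, DSC) on the 100 training documents, and <monospace>evaluate</monospace> on the held-out documents, so that the reported MSE is not computed on the data used for fitting.</p>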
      </sec>
    </sec>
    <sec id="sec5-futureinternet-04-00238">
      <title>5. Exploiting Web Resources to Estimate Readability</title>
      <p>The readability metrics SMOG, FRE, and Gunning fog have been designed for English. The adaptation of FRE to German required an adjustment of parameters; for most other languages there are no parameters for this metric. Furthermore, for some languages there are no readability metrics at all.</p>
      <sec id="sec5dot1-futureinternet-04-00238">
        <title>5.1. Web Metrics for Documents</title>
        <p>Fortunately, linguists have observed that other features also correlate with the difficulty of reading and understanding a text. One such feature is how common the words in a text are [<xref ref-type="bibr" rid="B13-futureinternet-04-00238">13</xref>]. A text is easier to comprehend if it contains mainly commonly used words and harder if it uses words that are not part of everyday language.</p>
        <p>Frequency classes are one approach to measure how common a word is. The frequency class of a word quantifies how much less frequent a word is than the most frequent word of a language. Class <italic>c</italic><sub>0</sub> is assigned to the most frequent word, class <italic>c</italic><sub>1</sub> to all words that are at least half as frequent as the one in <italic>c</italic><sub>0</sub>. In turn, class <italic>c</italic><sub>2</sub> contains all words that are at least half as frequent as the words in <italic>c</italic><sub>1</sub>, and so on. The class of a term <italic>t</italic> w.r.t. the most frequent term <italic>t</italic><sub>0</sub> can be computed in a closed form: </p>
        <disp-formula id="futureinternet-04-00238-i025">
<inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i025.tif"/>

<label>(8)</label>
</disp-formula>
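        <p>As an illustration, the frequency class of a term can be computed from raw corpus counts as in the sketch below. It assumes the logarithmic form log<sub>2</sub> of the ratio between the frequency of the most frequent term and the frequency of the term in question; the rounding convention is an assumption of this sketch (rounding down by default, with rounding up available as an alternative), and the function name is illustrative.</p>
        <preformat><![CDATA[
```python
import math


def frequency_class(freq, max_freq, rounding=math.floor):
    """Frequency class of a term with corpus frequency `freq`.

    `max_freq` is the frequency of the most frequent term of the language.
    `rounding` is an assumption of this sketch: math.floor by default,
    math.ceil can be passed if classes are meant to be rounded up.
    """
    if not (freq > 0 and max_freq >= freq):
        raise ValueError("expected 0 < freq <= max_freq")
    return int(rounding(math.log2(max_freq / freq)))
```
]]></preformat>
        <p>With either convention the most frequent term falls into class 0, and each further class corresponds to a halving of the frequency, so rare words receive high class numbers.</p>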
        <p>To compute the term frequencies in a language it is necessary to have statistics over a large corpus. The largest electronically accessible corpus is provided by the web. The Wortschatz project of the University of Leipzig [<xref ref-type="bibr" rid="B27-futureinternet-04-00238">27</xref>] has crawled large amounts of web documents in different languages. The project provides a SOAP interface to its database that allows easy querying of corpus statistics and also direct retrieval of the frequency classes of terms.</p>
      </sec>
      <sec>
        <title>5.2. Analysis of Texts</title>
        <p>We built an application that takes an arbitrary text, tokenizes it into words, and retrieves the frequency class of each word via the web service interface of the Wortschatz project. We then analysed how many distinct words fall into each frequency class and thereby obtained a distribution of terms over frequency classes.</p>
        <p>We created a collection of publicly available texts and assigned each text to one of the following classes: small children’s literature, novels, scientific texts, news, and philosophical manuscripts. For news we used a sub-sample of our dataset described above in Section 4.2; the philosophical texts were obtained from Project Gutenberg [<xref ref-type="bibr" rid="B28-futureinternet-04-00238">28</xref>]. As scientific texts we used the full papers published at the 9th WWW conference, and the children’s literature and novels were text samples taken from recent books or web-published short stories by contemporary authors. <xref ref-type="table" rid="futureinternet-04-00238-t005">Table 5</xref> lists how many documents are contained in each category. </p>
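        <p>The counting step of this analysis can be sketched as follows. To keep the example self-contained, the web service is replaced by a plain dictionary that maps words to frequency classes; this dictionary, the function name and the decision to skip words without a known class are assumptions of the sketch and do not describe the Wortschatz interface.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def class_distribution(words, lookup):
    """Share of distinct words per frequency class.

    `words` is the token list of a text; `lookup` maps a word to its
    frequency class (a stand-in for the web service in this sketch).
    Words without a known class are skipped.
    """
    distinct = set(words)
    counts = Counter(lookup[word] for word in distinct if word in lookup)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}
```
]]></preformat>
        <p>Because the distribution is built over distinct words, a word that is repeated many times in a text contributes only once to its frequency class.</p>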
        <table-wrap id="futureinternet-04-00238-t005" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t005_Table 5</object-id>
          <label>Table 5</label>
          <caption>
            <p>Classes of analysed texts.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th align="left" valign="middle">Text category </th>
                <th align="center" valign="middle">Number of documents</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">Children </td>
                <td align="center" valign="middle">19 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Novels </td>
                <td align="center" valign="middle">14 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Scientific </td>
                <td align="center" valign="middle">57 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">News </td>
                <td align="center" valign="middle">198 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Philosophy </td>
                <td align="center" valign="middle">5 </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The graph in <xref ref-type="fig" rid="futureinternet-04-00238-f002">Figure 2</xref> displays the distribution of terms into the frequency classes in the different text categories. The plots show that the distributions do correspond to the difficulty of the texts. However, there are several interesting aspects:</p>
        
        <list list-type="bullet">
          <list-item>
            <p>Children’s literature as well as news articles use few rare words. On inspection of the texts, the rare words mainly corresponded to names of people or locations, such as towns, rivers, and countries. While at first sight it might seem surprising that news and texts for children should be comparable in terms of readability, our results above also showed that news texts on average have a SMOG index of about 5, which approximately corresponds to children having completed primary school.</p>
          </list-item>
          <list-item>
            <p>Relative to the other categories of texts, the very rare and very common words are under-represented in scientific texts. In the middle-range frequency classes the WWW papers have a higher percentage of words, which can be explained by a scientific community having its particular but not too common language.</p>
          </list-item>
          <list-item>
            <p>Novels contain more rare words. In philosophical texts this observation is even stronger. For the novels this can be explained by a well-elaborated text style, for the philosophical texts by a very high level of writing.</p>
          </list-item>
        </list>
        <fig id="futureinternet-04-00238-f002" position="anchor">
          <label>Figure 2</label>
          <caption>
            <p>Distribution of terms in frequency classes for different types of texts.</p>
          </caption>
          <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-g002.tif"/>
        </fig>
        <p>Given the distribution of the terms we can estimate the expected frequency class <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i027.tif"/> for a randomly chosen term from a text in a given category. The values in <xref ref-type="table" rid="futureinternet-04-00238-t006">Table 6</xref> show this expected frequency class, which corresponds to the intuitive order of perceived difficulty of the texts in a category.</p>
        <table-wrap id="futureinternet-04-00238-t006" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t006_Table 6</object-id>
          <label>Table 6</label>
          <caption>
            <p>Expected frequency classes of analysed texts.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th align="left" valign="middle">Text category </th>
                <th align="center" valign="middle"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i028.tif"/></th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">Children </td>
                <td align="center" valign="middle">10.58</td>
              </tr>
              <tr>
                <td align="left" valign="middle">Novels </td>
                <td align="center" valign="middle">12.79 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Scientific </td>
                <td align="center" valign="middle">11.85</td>
              </tr>
              <tr>
                <td align="left" valign="middle">News </td>
                <td align="center" valign="middle">10.83 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Philosophy </td>
                <td align="center" valign="middle">14.28</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
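        <p>The expected frequency class can be illustrated with the following sketch. It assumes that the distribution is given as a mapping from frequency class to the share of terms in that class and that the expectation is the share-weighted mean of the class numbers; the function name and the normalization step are choices made for the example.</p>
        <preformat><![CDATA[
```python
def expected_frequency_class(distribution):
    """Expected frequency class of a randomly chosen term.

    `distribution` maps a frequency class (int) to the share of terms
    in that class; the shares are normalized in case they do not sum to 1.
    """
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("empty distribution")
    return sum(cls * share for cls, share in distribution.items()) / total
```
]]></preformat>
        <p>A category whose terms concentrate in high classes, i.e., rare words, therefore obtains a higher expected value than a category dominated by common words.</p>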
      </sec>
      <sec>
        <title>5.3. Frequency Classes and Readability</title>
        <p>Given the observations in the last section, the question arises whether it is possible to automatically estimate the readability of a given text based on its frequency classes. In the previous section we observed that the expected frequency class <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i027.tif"/> corresponds to an intuitive ranking of the readability of text categories. While above we looked at entire collections of texts from the same categories, we now change our focus to individual documents to see whether we can observe a similar pattern.</p>
        <p>To gain some first insights we chose four different texts. <xref ref-type="table" rid="futureinternet-04-00238-t007">Table 7</xref> shows the readability of these texts according to SMOG and Gunning fog. Additionally, the table shows the expected frequency class. It can be seen that at the document level the three metrics do not agree: while SMOG and Gunning fog order the four documents in the same way, the expected frequency class implies a different ranking.</p>
        <table-wrap id="futureinternet-04-00238-t007" position="anchor">
          <object-id pub-id-type="pii">futureinternet-04-00238-t007_Table 7</object-id>
          <label>Table 7</label>
          <caption>
            <p>SMOG, Gunning fog and expected frequency class for selected documents.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th align="left" valign="middle">Document</th>
                <th align="center" valign="middle">SMOG</th>
                <th align="center" valign="middle">Gunning fog</th>
                <th align="center" valign="middle"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i028.tif"/></th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" valign="middle">Enid Blyton: Famous Five—Five Are Together Again </td>
                <td align="center" valign="middle">4.01 </td>
                <td align="center" valign="middle">6.34 </td>
                <td align="center" valign="middle">12.86 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">William Shakespeare: Romeo and Juliet </td>
                <td align="center" valign="middle">4.05 </td>
                <td align="center" valign="middle">7.04 </td>
                <td align="center" valign="middle">13.65 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Bible: Genesis 1–50 </td>
                <td align="center" valign="middle">4.08 </td>
                <td align="center" valign="middle">7.39 </td>
                <td align="center" valign="middle">13.29 </td>
              </tr>
              <tr>
                <td align="left" valign="middle">Arthur Conan Doyle: The Hound of the Baskervilles </td>
                <td align="center" valign="middle">4.27 </td>
                <td align="center" valign="middle">8.62 </td>
                <td align="center" valign="middle">12.95 </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
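        <p>The rankings implied by the values in Table 7 can be checked with a few lines of code. The sketch below copies the table values and sorts the documents by each metric in ascending order; the shortened titles and variable names are chosen for the example.</p>
        <preformat><![CDATA[
```python
# Values copied from Table 7: (SMOG, Gunning fog, expected frequency class).
documents = {
    "Five Are Together Again": (4.01, 6.34, 12.86),
    "Romeo and Juliet": (4.05, 7.04, 13.65),
    "Genesis 1-50": (4.08, 7.39, 13.29),
    "The Hound of the Baskervilles": (4.27, 8.62, 12.95),
}


def ranking(metric_index):
    """Document titles ordered from lowest to highest metric value."""
    return sorted(documents, key=lambda title: documents[title][metric_index])


smog_rank = ranking(0)
fog_rank = ranking(1)
class_rank = ranking(2)

print("SMOG and Gunning fog agree:", smog_rank == fog_rank)
print("SMOG and expected class agree:", smog_rank == class_rank)
```
]]></preformat>
        <p>On these four documents, SMOG and Gunning fog produce the same order, whereas the expected frequency class places The Hound of the Baskervilles second and Romeo and Juliet last.</p>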
        <p>In a second, more extensive experiment we computed FRE and the expected frequency class <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i027.tif"/> for 60 documents of different readability. The readability of these documents ranged from an FRE index of 28.073 for the most difficult document to a value of 88.499 for the easiest one. The value of <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i027.tif"/>, in contrast, ranged between 8.904 and 14.745. The values of FRE and <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="futureinternet-04-00238-i027.tif"/> showed a slight, but not significant, Pearson correlation of 0.47. Thus, we can say that while the expected frequency class can give indications about the readability of a document, as an exclusive feature it is not sufficient to provide a precise analysis for a given text.</p>
        <p>This is quite surprising, given that the expected frequency class provided a good insight into the readability of an entire collection of documents belonging to the same category. One possible explanation for this unexpected behaviour on individual documents is data sparsity. A single document might not contain enough different words to extract a sound model of the term distribution over frequency classes. Hence, it might be necessary to apply smoothing methods to the observed distribution in order to get more reliable results. Another approach that can be pursued in parallel is to improve the results by using more information about the frequency class distribution itself for the prediction of the readability. Such additional information could come in the form of an estimate of the variance or of the actual discrete distribution used directly. Beyond feature engineering, it might additionally be necessary to fit a non-linear function to derive the level of readability from the multi-dimensional input data obtained from the frequency class distribution. However, while we laid the foundation for this analysis, the concrete steps to be taken are left for future work.</p>
      </sec>
    </sec>
    <sec sec-type="conclusions">
      <title>6. Conclusions and Future Work</title>
      <p>In this paper we analysed the relation between readability indices and the World Wide Web. We looked at the topic from two different angles: how to determine the readability of documents on the web, and how web resources can be exploited to indicate parameters for new approaches to determining readability.</p>
      <p>Concerning the application of readability measures to web documents, we showed that the introduction of content extraction filters into the process leads to significantly improved estimates. Further, we developed bias adjustments for CE-based SMOG and FRE indices that lead to still better estimates for the readability of web documents. Given that we focused in our analysis on news documents, it remains to be investigated how the CE methods operate on other types of documents or on documents with different levels of readability.</p>
      <p>On the other hand, we found indications that corpus statistics of the web can be exploited to obtain language-independent measures for readability. We showed that the distribution of terms into frequency classes reproduces very nicely the intuitively perceived difficulty of text categories. However, at the document level the latter results require some further investigation. Predicting the readability of a single document simply based on the expected frequency class does not yet provide results of the desired quality. Overcoming data sparsity in single documents and using more characteristics and features of the frequency class distribution appear to be promising approaches here.</p>
      <p>In future work we will address exactly this task of feature engineering on the frequency class distribution in individual documents. We are confident that by applying smoothing techniques and identifying a set of suitable features it will be possible to estimate the readability of individual documents as well, based on the frequency classes of the contained terms. Once such a metric is established, an interesting question will be whether the observations can be generalized to other languages, thereby providing a language-independent readability metric.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgements</title>
      <p>The research leading to these results has received partial funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257859, ROBUST.</p>
    </ack>
    <ref-list>
      <title>References</title>
      <ref id="B1-futureinternet-04-00238">
        <label>1.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Kienle</surname>
              <given-names>H.</given-names>
            </name>
            <name>
              <surname>Vasiliu</surname>
              <given-names>C.</given-names>
            </name>
          </person-group>
          <article-title>Evolution of Legal Statements on the Web</article-title>
          <source>Proceedings of the 10th IEEE International Symposium on Web Site Evolution</source>
          <conf-loc>Beijing, China</conf-loc>
          <conf-date>3–4 October 2008</conf-date>
          <fpage>73</fpage>
          <lpage>82</lpage>
        </citation>
      </ref>
      <ref id="B2-futureinternet-04-00238">
        <label>2.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Nonaka</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Yumoto</surname>
              <given-names>T.</given-names>
            </name>
            <name>
              <surname>Nii</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Takahashi</surname>
              <given-names>Y.</given-names>
            </name>
          </person-group>
          <article-title>Finding How-to Information Web Pages and Their Ranking by Readability</article-title>
          <source>Proceedings of the IADIS International Conference Internet Technologies and Society (ITS ’10)</source>
          <conf-loc>Perth, Australia</conf-loc>
          <conf-date>29 November 2010</conf-date>
          <fpage>155</fpage>
          <lpage>163</lpage>
        </citation>
      </ref>
      <ref id="B3-futureinternet-04-00238">
        <label>3.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Lau</surname>
              <given-names>T.P.</given-names>
            </name>
            <name>
              <surname>King</surname>
              <given-names>I.</given-names>
            </name>
          </person-group>
          <article-title>Bilingual Web Page and Site Readability Assessment</article-title>
          <source>Proceedings of the 15th International Conference on World Wide Web (WWW ’06)</source>
          <publisher-name>ACM</publisher-name>
          <publisher-loc>New York, NY, USA</publisher-loc>
          <conf-loc>Edinburgh, UK</conf-loc>
          <conf-date>22–26 May 2006</conf-date>
          <year>2006</year>
          <fpage>993</fpage>
          <lpage>994</lpage>
        </citation>
      </ref>
      <ref id="B4-futureinternet-04-00238">
        <label>4.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Miltsakaki</surname>
              <given-names>E.</given-names>
            </name>
            <name>
              <surname>Troutt</surname>
              <given-names>A.</given-names>
            </name>
          </person-group>
          <article-title>Real-Time Web Text Classification and Analysis of Reading Difficulty</article-title>
          <source>Proceedings of the 3rd Workshop on Innovative Use of NLP for Building Educational Applications (EANL ’08)</source>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <publisher-loc>Stroudsburg, PA, USA</publisher-loc>
          <conf-loc>Columbus, OH, USA</conf-loc>
          <conf-date>June 2008</conf-date>
          <year>2008</year>
          <fpage>89</fpage>
          <lpage>97</lpage>
        </citation>
      </ref>
      <ref id="B5-futureinternet-04-00238">
        <label>5.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Hussain</surname>
              <given-names>W.</given-names>
            </name>
            <name>
              <surname>Sohaib</surname>
              <given-names>O.</given-names>
            </name>
            <name>
              <surname>Ali</surname>
              <given-names>A.</given-names>
            </name>
          </person-group>
          <article-title>Improving web page readability by plain language</article-title>
          <source>IJCSI Int. J. Comput. Sci. Issues</source>
          <year>2011</year>
          <volume>8</volume>
          <fpage>315</fpage>
          <lpage>319</lpage>
        </citation>
      </ref>
      <ref id="B6-futureinternet-04-00238">
        <label>6.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Collins-Thompson</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Callan</surname>
              <given-names>J.</given-names>
            </name>
          </person-group>
          <article-title>Information Retrieval for Language Tutoring: An Overview of the REAP Project</article-title>
          <source>Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’04)</source>
          <publisher-name>ACM</publisher-name>
          <publisher-loc>New York, NY, USA</publisher-loc>
          <conf-loc>Sheffield, UK</conf-loc>
          <conf-date>25–29 July 2004</conf-date>
          <year>2004</year>
          <fpage>544</fpage>
          <lpage>545</lpage>
        </citation>
      </ref>
      <ref id="B7-futureinternet-04-00238">
        <label>7.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Gottron</surname>
              <given-names>T.</given-names>
            </name>
          </person-group>
          <article-title>Detecting Website Redesigns via Template Similarity on Streams of Documents</article-title>
          <source>Proceedings of the 3rd International Conference on Internet Technologies and Applications (ITA ’09)</source>
          <conf-loc>Wuhan, China</conf-loc>
          <conf-date>18–20 August 2009</conf-date>
        </citation>
      </ref>
      <ref id="B8-futureinternet-04-00238">
        <label>8.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Flesch</surname>
              <given-names>R.</given-names>
            </name>
          </person-group>
          <article-title>A new readability yardstick</article-title>
          <source>J. Appl. Psychol.</source>
          <year>1948</year>
          <volume>32</volume>
          <fpage>221</fpage>
          <lpage>233</lpage>
          <pub-id pub-id-type="doi">10.1037/h0057532</pub-id>
        </citation>
      </ref>
      <ref id="B9-futureinternet-04-00238">
        <label>9.</label>
        <note>
        <p>From the formal definition it becomes obvious that FRE can also produce values outside the intended range when applied to non-standard texts.</p>
        </note>
      </ref>
      <ref id="B10-futureinternet-04-00238">
        <label>10.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>McLaughlin</surname>
              <given-names>G.H.</given-names>
            </name>
          </person-group>
          <article-title>SMOG grading: A new readability formula</article-title>
          <source>J. Read.</source>
          <year>1969</year>
          <volume>12</volume>
          <fpage>639</fpage>
          <lpage>646</lpage>
        </citation>
      </ref>
      <ref id="B11-futureinternet-04-00238">
        <label>11.</label>
        <citation citation-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Gunning</surname>
              <given-names>R.</given-names>
            </name>
          </person-group>
          <source>The Technique of Clear Writing</source>
          <publisher-name>McGraw-Hill International Book Co.</publisher-name>
          <publisher-loc>New York, NY, USA</publisher-loc>
          <year>1952</year>
        </citation>
      </ref>
      <ref id="B12-futureinternet-04-00238">
        <label>12.</label>
        <citation citation-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Amstad</surname>
              <given-names>T.</given-names>
            </name>
          </person-group>
          <source>Wie Verständlich Sind Unsere Zeitungen?</source>
          <publisher-name>Dissertation, University of Zürich</publisher-name>
          <publisher-loc>Zürich, Switzerland</publisher-loc>
          <year>1978</year>
        </citation>
      </ref>
      <ref id="B13-futureinternet-04-00238">
        <label>13.</label>
        <citation citation-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Köhler</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Altmann</surname>
              <given-names>G.</given-names>
            </name>
          </person-group>
          <article-title>Synergetische Aspekte der Linguistik</article-title>
          <source>Z. Sprachwiss.</source>
          <year>1986</year>
          <volume>5</volume>
          <fpage>253</fpage>
          <lpage>265</lpage>
          <pub-id pub-id-type="doi">10.1515/zfsw.1986.5.2.253</pub-id>
        </citation>
      </ref>
      <ref id="B14-futureinternet-04-00238">
        <label>14.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Tanguy</surname>
              <given-names>L.</given-names>
            </name>
            <name>
              <surname>Tulechki</surname>
              <given-names>N.</given-names>
            </name>
          </person-group>
          <article-title>Sentence Complexity in French: A Corpus-Based Approach</article-title>
          <source>Proceedings of the 17th International Conference Intelligent Information Systems (IIS ’09)</source>
          <conf-loc>Kraków, Poland</conf-loc>
          <conf-date>16–18 July 2009</conf-date>
          <fpage>131</fpage>
          <lpage>144</lpage>
        </citation>
      </ref>
      <ref id="B15-futureinternet-04-00238">
        <label>15.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Petersen</surname>
              <given-names>S.E.</given-names>
            </name>
            <name>
              <surname>Ostendorf</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>Assessing the Reading Level of Web Pages</article-title>
          <source>Proceedings of the ICSLP 9th International Conference on Spoken Language Processing (INTERSPEECH ’06)</source>
          <conf-loc>Pittsburgh, PA, USA</conf-loc>
          <conf-date>17–21 September 2006</conf-date>
          <fpage>833</fpage>
          <lpage>836</lpage>
        </citation>
      </ref>
      <ref id="B16-futureinternet-04-00238">
        <label>16.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Miltsakaki</surname>
              <given-names>E.</given-names>
            </name>
          </person-group>
          <article-title>Matching Readers’ Preferences and Reading Skills With Appropriate Web Texts</article-title>
          <source>Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session (EACL ’09)</source>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <publisher-loc>Stroudsburg, PA, USA</publisher-loc>
          <conf-loc>Athens, Greece</conf-loc>
          <conf-date>30 March–3 April 2009</conf-date>
          <year>2009</year>
          <fpage>49</fpage>
          <lpage>52</lpage>
        </citation>
      </ref>
      <ref id="B17-futureinternet-04-00238">
        <label>17.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Gottron</surname>
              <given-names>T.</given-names>
            </name>
          </person-group>
          <article-title>Evaluating Content Extraction on HTML Documents</article-title>
          <source>Proceedings of the 2nd International Conference on Internet Technologies and Applications (ITA ’07)</source>
          <conf-loc>Wrexham, North Wales, UK</conf-loc>
          <conf-date>4–7 September 2007</conf-date>
          <fpage>123</fpage>
          <lpage>132</lpage>
        </citation>
      </ref>
      <ref id="B18-futureinternet-04-00238">
        <label>18.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Pinto</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Branstein</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Coleman</surname>
              <given-names>R.</given-names>
            </name>
            <name>
              <surname>Croft</surname>
              <given-names>W.B.</given-names>
            </name>
            <name>
              <surname>King</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>W.</given-names>
            </name>
            <name>
              <surname>Wei</surname>
              <given-names>X.</given-names>
            </name>
          </person-group>
          <article-title>QuASM: A System for Question Answering Using Semi-Structured Data</article-title>
          <source>Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’02)</source>
          <publisher-name>ACM</publisher-name>
          <publisher-loc>New York, NY, USA</publisher-loc>
          <conf-loc>Portland, OR, USA</conf-loc>
          <conf-date>14–18 July 2002</conf-date>
          <year>2002</year>
          <fpage>46</fpage>
          <lpage>55</lpage>
        </citation>
      </ref>
      <ref id="B19-futureinternet-04-00238">
        <label>19.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Gottron</surname>
              <given-names>T.</given-names>
            </name>
          </person-group>
          <article-title>Content Code Blurring: A New Approach to Content Extraction</article-title>
          <source>Proceedings of the 19th International Workshop on Database and Expert Systems Applications (DEXA ’08)</source>
          <conf-loc>Turin, Italy</conf-loc>
          <conf-date>1–5 September 2008</conf-date>
          <fpage>29</fpage>
          <lpage>33</lpage>
        </citation>
      </ref>
      <ref id="B20-futureinternet-04-00238">
        <label>20.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Moreno</surname>
              <given-names>J.</given-names>
            </name>
            <name>
              <surname>Deschacht</surname>
              <given-names>K.</given-names>
            </name>
            <name>
              <surname>Moens</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>Language Independent Content Extraction From Web Pages</article-title>
          <source>Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop</source>
          <conf-loc>Enschede, The Netherlands</conf-loc>
          <conf-date>2–3 February 2009</conf-date>
          <fpage>50</fpage>
          <lpage>55</lpage>
        </citation>
      </ref>
      <ref id="B21-futureinternet-04-00238">
        <label>21.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Mohammadzadeh</surname>
              <given-names>H.</given-names>
            </name>
            <name>
              <surname>Gottron</surname>
              <given-names>T.</given-names>
            </name>
            <name>
              <surname>Schweiggert</surname>
              <given-names>F.</given-names>
            </name>
            <name>
              <surname>Nakhaeizadeh</surname>
              <given-names>G.</given-names>
            </name>
          </person-group>
          <article-title>A Fast and Accurate Approach for Main Content Extraction based on Character Encoding</article-title>
          <source>Proceedings of the 8th Workshop on Text-based Information Retrieval (TIR ’11)</source>
          <conf-loc>Toulouse, France</conf-loc>
          <conf-date>29 August–2 September 2011</conf-date>
          <comment>Unpublished work.</comment>
        </citation>
      </ref>
      <ref id="B22-futureinternet-04-00238">
        <label>22.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Gottron</surname>
              <given-names>T.</given-names>
            </name>
            <name>
              <surname>Martin</surname>
              <given-names>L.</given-names>
            </name>
          </person-group>
          <article-title>Estimating Web Site Readability Using Content Extraction</article-title>
          <source>Proceedings of the 18th International World Wide Web Conference (WWW ’09)</source>
          <conf-loc>Madrid, Spain</conf-loc>
          <conf-date>20–24 April 2009</conf-date>
          <fpage>1169</fpage>
          <lpage>1170</lpage>
        </citation>
      </ref>
      <ref id="B23-futureinternet-04-00238">
        <label>23.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Yan</surname>
              <given-names>X.</given-names>
            </name>
            <name>
              <surname>Song</surname>
              <given-names>D.</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>X.</given-names>
            </name>
          </person-group>
          <article-title>Concept-Based Document Readability in Domain Specific Information Retrieval</article-title>
          <source>Proceedings of the 15th ACM International Conference on Information and Knowledge Management</source>
          <publisher-name>ACM</publisher-name>
          <publisher-loc>New York, NY, USA</publisher-loc>
          <conf-loc>Arlington, VA, USA</conf-loc>
          <conf-date>6–11 November 2006</conf-date>
          <year>2006</year>
          <fpage>540</fpage>
          <lpage>549</lpage>
        </citation>
      </ref>
      <ref id="B24-futureinternet-04-00238">
        <label>24.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Rosa</surname>
              <given-names>K.D.</given-names>
            </name>
            <name>
              <surname>Eskenazi</surname>
              <given-names>M.</given-names>
            </name>
          </person-group>
          <article-title>Effect of Word Complexity on L2 Vocabulary Learning</article-title>
          <source>Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications (IUNLPBEA ’11)</source>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <publisher-loc>Stroudsburg, PA, USA</publisher-loc>
          <conf-loc>Portland, OR, USA</conf-loc>
          <conf-date>24 June 2011</conf-date>
          <year>2011</year>
          <fpage>76</fpage>
          <lpage>80</lpage>
        </citation>
      </ref>
      <ref id="B25-futureinternet-04-00238">
        <label>25.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>François</surname>
              <given-names>T.</given-names>
            </name>
            <name>
              <surname>Watrin</surname>
              <given-names>P.</given-names>
            </name>
          </person-group>
          <article-title>On the Contribution of MWE-based Features to a Readability Formula for French as a Foreign Language</article-title>
          <source>Proceedings of the International Conference Recent Advances in Natural Language Processing 2011 (RANLP ’11)</source>
          <conf-loc>Hissar, Bulgaria</conf-loc>
          <conf-date>12–14 September 2011</conf-date>
          <fpage>441</fpage>
          <lpage>447</lpage>
        </citation>
      </ref>
      <ref id="B26-futureinternet-04-00238">
        <label>26.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Weir</surname>
              <given-names>G.R.S.</given-names>
            </name>
            <name>
              <surname>Ritchie</surname>
              <given-names>C.</given-names>
            </name>
          </person-group>
          <article-title>Estimating Readability with the Strathclyde Readability Measure</article-title>
          <source>Proceedings of the ICT in the Analysis, Teaching and Learning of Languages (ICTATLL ’06)</source>
          <conf-loc>Glasgow, UK</conf-loc>
          <conf-date>21–22 August 2006</conf-date>
        </citation>
      </ref>
      <ref id="B27-futureinternet-04-00238">
        <label>27.</label>
        <citation citation-type="confproc">
          <person-group person-group-type="author">
            <name>
              <surname>Quasthoff</surname>
              <given-names>U.</given-names>
            </name>
            <name>
              <surname>Richter</surname>
              <given-names>M.</given-names>
            </name>
            <name>
              <surname>Biemann</surname>
              <given-names>C.</given-names>
            </name>
          </person-group>
          <article-title>Corpus Portal for Search in Monolingual Corpora</article-title>
          <source>Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC ’06)</source>
          <conf-loc>Genoa, Italy</conf-loc>
          <conf-date>24–26 May 2006</conf-date>
        </citation>
      </ref>
      <ref id="B28-futureinternet-04-00238">
        <label>28.</label>
        <citation citation-type="web">
          <article-title>Project Gutenberg</article-title>
          <access-date>(accessed on 28 January 2011)</access-date>
          <comment>Available online: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gutenberg.org/" ext-link-type="uri">http://www.gutenberg.org/</ext-link></comment>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>
