A Semantic Focused Web Crawler Based on a Knowledge Representation Schema

Abstract: The Web has become the main source of information in the digital world, expanding to heterogeneous domains and continuously growing. By means of a search engine, users can systematically search the Web for particular information based on a text query, relying on a domain-unaware web search tool that maintains real-time information. One type of web search tool is the semantic focused web crawler (SFWC); it exploits the semantics of the Web based on some ontology heuristics to determine which web pages belong to the domain defined by the query. An SFWC is highly dependent on the ontological resource, which is created by human domain experts. This work presents a novel SFWC based on a generic knowledge representation schema to model the crawler's domain, thus reducing the complexity and cost of constructing a more formal representation, as is the case when using ontologies. Furthermore, a similarity measure based on the combination of the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean is proposed for the SFWC. This measure filters web page contents in accordance with the domain of interest during the crawling task. A set of experiments was run over the domains of computer science, politics, and diabetes to validate and evaluate the proposed crawler. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the suitability of the proposed SFWC to crawl the Web using a knowledge representation schema instead of a domain ontology.


Introduction
According to the website Live Stats [1], there are more than one billion active websites on the World Wide Web (WWW). As a result, the need for faster and more reliable tools to effectively search and retrieve web pages from a particular domain has been gaining importance. One of the most popular tools to systematically collect web pages from the WWW is the web crawler. A web crawler is a system based on Uniform Resource Locator (URL) indexing to traverse the Web. URL indexing provides a better service to web search engines and similar applications to retrieve resources from the Web [2]. The web crawler searches for any URL reachable from the web page being retrieved by the search engine. Each URL found by the crawler is placed in a search queue to be accessed later by the search engine. The process repeats for each new URL retrieved from the queue. The stop criterion for URL searching varies; the most common is reaching a threshold in the number of URLs retrieved from a seed or reaching a given depth level.
The architecture of a web crawler is composed of three main components: (i) a URL frontier, (ii) a page downloader, and (iii) a repository. The frontier stores the URLs that the web crawler has to visit. The page downloader retrieves and parses the web pages from the URLs in the frontier. Finally, the downloaded web pages are stored in the repository component [3].
Of the huge amount of resources on the Web, most could be irrelevant to the domain of interest. This is why focused web crawlers (FWC) are preferred to retrieve web pages. An FWC is based on techniques such as machine learning (classification) to identify relevant web pages, adding them to a local database [4]. An FWC (Figure 1) adds a topic classifier module to the traditional crawler architecture. This module is feature-based, modeling an input target domain to classify relevant web pages. If the web page is positively classified, its URLs are extracted and queued in the frontier module. In some FWC approaches [5][6][7][8], the classification module is based on document similarity metrics to separate web pages related to a given domain from unrelated ones. However, these approaches do not take into account the expressiveness of web page content, that is, they do not explore its semantic content or use that information in the filtering process. An FWC retrieves a set of topic-related web pages from a set of seed URLs. A seed URL is the starting point to iteratively extract URLs. That is, an FWC analyzes the content of seed URLs to determine the relevance of their content for a target domain. Such content analysis is based on ontologies, machine learning, and query expansion, among other techniques [9]. Some approaches require an initial dataset to create a model (machine learning approaches [10]) or a set of keywords to produce domain-specific queries (query expansion [11]).
The Semantic Web (SW), considered as an extension of today's Web, is based on the Resource Description Framework (RDF) to express information with a well-defined meaning [12]. The SW arranges data as a logically linked data set instead of a traditional hyperlinked Web. An FWC that exploits the semantics of the Web content and uses some ontology heuristics is called a Semantic Focused Web Crawler (SFWC). An ontology is a specification of a conceptualization, describing the concepts and relationships that can exist between a domain's elements [13]. An SFWC determines the relevance of a web page to a user's query based on domain knowledge related to the search topic [14].
An SFWC performs two main tasks [15]: (i) content analysis and (ii) URL traversing (crawling). The content analysis task consists of determining whether a web page is relevant to the topic given by the user's query. Algorithms such as PageRank [16], SiteRank [17], Visual-based page segmentation (VIPS) [18], and densometric segmentation [19] are well-known web page content analyzers. The URL traversing task defines the order in which URLs are analyzed. Techniques like breadth-first, depth-first, and best-first are representative traversing strategies for this task [15].
In an SFWC, an ontology is commonly used to determine if a web page is related to the domain, comparing its text content with the ontology structure through similarity measures such as the cosine similarity [20] or the semantic relevance [12]. The use of a domain specific ontology helps to face problems like heterogeneity, ubiquity, and ambiguity [21] since a domain ontology defines classes and their relationships, limiting its scope to predefined elements.
The main limitation of any SFWC is its dependency on the domain ontology being used, with two main issues in particular [22]: (i) an ontology is designed by domain experts, limiting its representation to the experts' understanding of the domain, and (ii) data are dynamic and constantly evolving.
As an alternative to classic SFWC designs that use ontologies, this work presents a novel SFWC based on a generic knowledge representation schema (KRS) to model a target domain. The KRS analyzes the content of a document to identify and extract concepts, i.e., it maps the content of a document, from an input corpus, to an SW representation. The KRS generated from each document is stored in a knowledge base (KB) [23] to provide access to its content. The KRS is less expressive than a domain ontology (it does not define any rule or restriction over the data), but it is domain independent. Ontologies, in contrast, are structures whose concepts and relations are predefined by domain experts. Additionally, a similarity measure is proposed based on the inverse document frequency (IDF) measure and statistical measures such as the arithmetic mean and the standard deviation to compute the similarity between web page content and the KRS. Our proposed SFWC is simple to build, without the complexity and cost of constructing a more formal knowledge representation such as a domain ontology, but keeps the advantage of using SW technologies like RDFS (Resource Description Framework Schema).
In summary, the main contributions of this work are:
• A similarity measure based on IDF and the statistical measures of arithmetic mean and standard deviation to determine the relevance of a web page for a given topic.
• A KRS that builds a KB from an input corpus without expert intervention, i.e., the KRS is based on content, representing entities as the most important element in a domain.
The rest of this paper is organized as follows. Section 2 discusses the related work. Section 3 presents the methodology for the construction of the SFWC and the similarity measure. Section 4 presents the results from the experiments. Finally, Section 5 concludes this work.

Related Work
This section presents relevant SFWC approaches proposed in the literature and the most recurrent metrics to measure the web page similarity in a given domain ontology.
SFWC approaches [22,24,25] exploit the expressiveness of an ontology to compute the similarity of web page content against a domain ontology. Table 1 summarizes some ontology-based SFWCs targeting different tasks, describing the measure used to determine web page relevance. As shown, the cosine similarity is the most common measure used to determine the relevance of a web page against the content of an ontology. SFWCs can be applied to different domains such as recommendation systems [25][26][27] or cybercrime [28], as Table 1 shows. In all cases, a specific domain ontology must define the most relevant elements and their relationships in the given domain. These approaches leave aside the semantic analysis of the source content, which could be exploited to better discriminate web resources related to the domain. The proposed SFWC tries to alleviate this situation, providing a semantic analysis to represent the relationship between content (words) and source (documents) through the KRS. The proposed KRS defines a set of classes with certain properties based on the SW standard RDFS. The KRS is a lightweight version of an ontology since it does not define complex elements like axioms or formal constraints, but it is also based on SW technologies. The KRS depends on the input corpus to model a topic, i.e., the content of the corpus is used to generate the KRS. The schema is incremental, i.e., the KRS can be expanded with more domain-specific documents since entities are independent of each other but related by their source, e.g., all words from the same document are linked together.

Table 1. Ontology-based SFWC approaches: task, description, and relevance measure.

| Task | Description | Measure |
| --- | --- | --- |
| Cloud service recommendation system [25,26] | A concept ontology-based recommendation system for retrieving cloud services. | Semantic relevance |
| Website models [29] | An ontology-supported website model to improve search engine results. | Cosine similarity |
| Web directory construction [20] | Based on a handmade ontology from WordNet to automatically construct a web directory. | Cosine similarity |
| User-based recommendation system [27] | A knowledge representation model built from a user interest database to select seed URLs. | Concept similarity |
| Concept labeling [5,22] | An ontology-based classification model to label new concepts during the crawling process, integrating new concepts and relations to the ontology. | Cosine similarity, semantic relevance |
| Cybercrime [28] | Enhanced crime ontology using ant-miner focused crawler. | Significance |

Traditional SFWCs are based on metrics like semantic relevance or cosine similarity to determine the relevance of a web page to a given domain. This kind of metric is used to measure the distance between two elements in an ontology. TF-IDF is a metric that has been used by different approaches to characterize a corpus and build a classification model [30][31][32]. Wang et al. [30] present a Naive Bayes classifier based on TF-IDF to extract the features of web page content. Pesaranghader et al. [31] propose a new measure called Term Frequency-Information Content as an improvement of TF-IDF to crawl multi-term topics. A multi-term topic is a compound set of keywords, none of which can be removed without losing the meaning of the whole topic, e.g., web services. Peng et al. [32] present a partition algorithm to segment a web page into content blocks. TF-IDF was also used to measure the relevance of content blocks and to build a vector space model [33] to retrieve topic- and genre-related web pages. Kumar and Vig [34] proposed a Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS). TIDS is a table of words associated with the sum of all TF-IDF values in the corpus. Hao et al. [35] proposed the combination of TF-IDF with LSI (Latent Semantic Indexing) to improve crawling results.
TF-IDF has also been used as a feature space to build structures in tasks like machine learning, multi-term topic crawling, content block analysis, and table indexing, creating complex models to determine the similarity between a document and a target domain. For example, a classification model based on TF-IDF requires a training set and a test set to generate a model, and adding more documents could require generating a new model from new training and test sets and computing the TF-IDF values again. If a new document is added to the corpus, an FWC based on TF-IDF needs to recompute this value for all the words of every document in the corpus. In this work, we propose the use of IDF as a similarity measure since it provides the importance of a word in the corpus. The computation of IDF is faster than that of TF-IDF since it only needs to be computed once for each word in the corpus and not for each word in each document of the corpus. The arithmetic mean and standard deviation are used to provide a dynamic threshold to define the similarity between a web page and the target domain.

Methodology
The proposed methodology for the KRS-based SFWC is divided into two general steps: (i) the KRS construction and (ii) the SFWC design. The following subsections explain each step in detail.

The KRS Construction
The SW provides a set of standards to describe data and their metadata. The Resource Description Framework (RDF) is the SW standard to describe structured data. The basic element of RDF is known as a triple. A triple is composed of a subject, an object, and a predicate that defines the relationship between them. A set of related triples is known as an RDF graph. The SW also provides additional standards to define more complex structures like ontologies, namely RDFS and the Web Ontology Language (OWL). These standards define the rules to build ontologies; however, RDFS is less expressive than OWL since it does not define restrictions or rules over the ontology.
The KRS (Figure 2) is a general and domain-free structure to describe the entities from a text source. In this work, a corpus is represented as a set of KRS stored in a KB. The goal of the KB is to provide the mechanisms to query the content of the KRS to measure the similarity between web page content and the KRS. The KRS is based on RDF and RDFS to define topic entities and relationships. It is built considering the well-known NLP Interchange Format (NIF) [36] vocabulary, which provides interoperability between language resources and annotations. In the KRS, a word is an instance of sfwc:Entity, represented as a phrase (nif:Phrase) or as a single word (nif:Word). Each word is described by the following elements: (i) lemma (nif:anchorOf), (ii) NE tag (sfwc:neTag), (iii) part-of-speech tag (PoS tag) (nif:posTag), (iv) URL (itsrdf:taIdentRef), and (v) IDF (sfwc:idfValue). These elements are used to determine whether a new document is related to the target topic.
A document instance (sfwc:Document) is described only by the title of the source document, and it is related to the target topic (sfwc:Topic). The steps followed to populate the KRS are shown in Figure 3.
Figure 3. The KRS construction steps.
After the KRS is constructed, it is added to the topic classifier step of the SFWC process.
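As an illustration of the KRS description above, the following minimal sketch lists the triples that might describe one enriched noun and its source document. The class and property names are those given in the text; the `ex:` identifiers, all literal values, and the use of plain Python tuples instead of an RDF library are assumptions made only for this example.

```python
# Triples are modeled as plain (subject, predicate, object) tuples.
# The "ex:" identifiers and every literal value are illustrative assumptions.
TRIPLES = [
    ("ex:doc1", "rdf:type", "sfwc:Document"),
    ("ex:algol", "rdf:type", "sfwc:Entity"),
    ("ex:algol", "rdf:type", "nif:Word"),
    ("ex:algol", "nif:anchorOf", "algol"),          # lemma
    ("ex:algol", "sfwc:neTag", "O"),                # NE tag
    ("ex:algol", "nif:posTag", "NN"),               # PoS tag
    ("ex:algol", "itsrdf:taIdentRef",
     "http://dbpedia.org/resource/ALGOL"),          # URL in an SW knowledge base
    ("ex:algol", "sfwc:idfValue", 6.36),            # IDF
]


def describe(subject, triples):
    """Collect the predicate/object pairs of one subject into a dict of lists."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


entity = describe("ex:algol", TRIPLES)
print(entity["nif:anchorOf"], entity["sfwc:idfValue"])
```

Grouping the triples by subject in this way mirrors how a query over the KB would gather the five descriptive elements of a word.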

SFWC Design
The proposed SFWC (Figure 4) was inspired by the basic structure of an SFWC, whose main element is the topic classifier. The topic classifier determines whether the content of a web page is related to the target topic or domain. Traditional approaches integrate a domain ontology in the topic classifier step. The domain ontology provides predefined knowledge about the domain or topic. It describes the relationships between domain elements and can define some rules and restrictions. The KRS is an alternative to the use of domain ontologies, providing a simple schema to represent topic-specific content. The proposed SFWC takes a seed web page as input to start the process and a queue of links to collect the URLs related to the target topic. In general, the proposed SFWC is divided into the following steps:

• S0. KRS construction: a preliminary step of the SFWC process.
• S1. Web page retrieval: downloads the web page to be analyzed locally.
• S2. HTML content extraction: extracts the text content of the downloaded web page.
• S3. Topic classifier: determines whether the web page content is similar to the KRS.
• S4. URL extraction: extracts the URLs of the web page.
• S5. Crawler frontier: stores the extracted URLs in a queue of links.
The first and second steps (S1 and S2) are focused on getting the content from a seed web page. One of the core components of the proposed SFWC is the topic classifier, which is constructed in the third step. Like the domain ontology-based approaches, the topic classifier performs a content analysis of web pages to determine their similarity with the KRS. The topic classifier begins with the text preprocessing and the NNE tasks. These tasks have the same purpose as in the KRS construction. The IDF value for each enriched noun is computed from the KRS using SPARQL queries. SPARQL is the SW standard to query RDF triples from a KB or an RDF graph. In this work, a query retrieves the number of documents containing the noun extracted from the web page content. The retrieved results must match the noun anchor text and they must be described by the same URL.
The similarity measure, described in the next section, calculates the arithmetic mean of the IDF values extracted for each enriched noun. In this work, web page content is considered similar to the KRS if the arithmetic mean is within a threshold.
The last two steps (S4 and S5) extract the corresponding URLs and store them in a queue of links. The process from S1 to S5 is repeated until the queue is empty or the process reaches a predefined number of iterations.
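The loop formed by steps S1 to S5 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the four callables (`fetch`, `extract_text`, `is_relevant`, `extract_links`) are hypothetical stand-ins supplied by the caller, the `seen` set that avoids revisiting URLs is an added assumption, and the toy in-memory "web" only serves to exercise the loop.

```python
from collections import deque


def focused_crawl(seeds, fetch, extract_text, is_relevant, extract_links,
                  max_iterations=100):
    """Repeat S1-S5 until the queue is empty or the iteration limit is reached.

    All four callables are hypothetical stand-ins:
    fetch(url) -> page or None, extract_text(page) -> str,
    is_relevant(text) -> bool, extract_links(page, url) -> iterable of URLs.
    """
    queue = deque(seeds)              # S5: crawler frontier (FIFO queue of links)
    seen = set(seeds)                 # assumption: never enqueue a URL twice
    repository = {}
    iterations = 0
    while queue and iterations < max_iterations:
        iterations += 1
        url = queue.popleft()
        page = fetch(url)                       # S1: web page retrieval
        if page is None:
            continue                            # page could not be processed
        text = extract_text(page)               # S2: HTML content extraction
        if not is_relevant(text):               # S3: topic classifier
            continue
        repository[url] = text
        for link in extract_links(page, url):   # S4: URL extraction
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository


# Toy in-memory web: url -> (text, outgoing links). Purely illustrative.
WEB = {
    "a": ("insulin therapy", ["b", "c"]),
    "b": ("football scores", ["d"]),
    "c": ("insulin pump", []),
}
pages = focused_crawl(
    ["a"],
    fetch=lambda url: WEB.get(url),
    extract_text=lambda page: page[0],
    is_relevant=lambda text: "insulin" in text,
    extract_links=lambda page, url: page[1],
)
print(sorted(pages))
```

Because links are only extracted from accepted pages, page "d" is never queued: it is reachable only through the rejected page "b".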

Similarity Measure
The proposed SFWC compares a web page's content against the KRS. The goal is to determine whether a web page is closely related to a target domain considering the input corpus. The system takes into account the enriched nouns from a source to compare their content against the KRS. Our approach uses the IDF and the statistical measures of arithmetic mean and standard deviation to calculate the similarity between the web page content and the KRS.
Some SFWC proposals [32,34,35] use TF-IDF as a similarity measure. TF is a statistical measure that defines the weight or importance of each word in a document. IDF is a statistical measure that defines the importance of each word with respect to the corpus. The combination of these measures defines the weight of each word with respect to the document and the corpus.
The main issue with TF-IDF is that a noun can be weighted with a different TF-IDF value depending on the document, i.e., a noun appearing in different documents will have different TF-IDF values. To create a unique value per word, a method [34] was proposed to calculate the average of the TF-IDF values for each word; however, this value must be updated if the number of documents in the corpus increases. The proposed similarity measure is based on IDF (Equation (1)) since it defines a unique value for each noun with respect to a corpus, and it is easily updated if the number of documents increases:
$$\mathrm{IDF}(t, C) = \log \frac{N}{n} \tag{1}$$
where t is a term (word) in a document, C is a corpus, N is the number of documents in the corpus, and n is the number of documents containing the target word. The IDF metric tends to be high for uncommon words and low for very common words. However, there is no specification about the ideal IDF value to determine the relevance of a word in the corpus. Equations (2)-(4) define, respectively, the arithmetic mean, the standard deviation, and the similarity measure used in this work, where t_i is an enriched noun whose URI value is not empty. The arithmetic mean is calculated over enriched nouns whose description is linked to a KB of the SW, e.g., Wikidata or DBpedia.
µ is the arithmetic mean of the IDF values in the corpus (µ_C) or in a document (µ_doc), and σ is the standard deviation calculated from the IDF values. In this work, µ and σ define the threshold used to determine the similarity between a web page and the KB. This threshold is calculated as µ(IDF) ± σ. The similarity measure was inspired by the normal distribution, where the threshold tries to represent frequent and uncommon words; that is, we suppose that relevant words are in the range µ ± σ, i.e., the similarity measure selects the most representative words described in the KRS. The calculated threshold is used as a reference to determine whether the content of a web page is related to the KRS or not.
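One way to write Equations (2)-(4) that is consistent with the definitions above is sketched below. The set T of enriched nouns with a non-empty URI, the population form of the standard deviation, and the indicator form of the similarity measure are assumptions, not necessarily the authors' exact notation.

```latex
\begin{align}
\mu &= \frac{1}{|T|} \sum_{t_i \in T} \mathrm{IDF}(t_i, C) \tag{2} \\
\sigma &= \sqrt{\frac{1}{|T|} \sum_{t_i \in T} \bigl(\mathrm{IDF}(t_i, C) - \mu\bigr)^2} \tag{3} \\
\mathrm{sim}(doc, C) &=
  \begin{cases}
    1 & \text{if } \mu_C - \sigma_C \le \mu_{doc} \le \mu_C + \sigma_C \\
    0 & \text{otherwise}
  \end{cases} \tag{4}
\end{align}
```

Here µ_C and σ_C are computed over the enriched nouns of the corpus, and µ_doc over the enriched nouns of the web page being classified.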

Implementation and Experiments
This section presents the implementation of the KRS and the SFWC and the evaluation of the proposed SFWC.

KRS Implementation
The proposed method was evaluated over three topics from Wikipedia: (i) computer science, (ii) politics, and (iii) diabetes.
The implementation of the KRS construction is divided into three steps: (i) corpus gathering, (ii) text processing, and (iii) mapping process.
In the corpus gathering step, the documents for each topic are collected from the Wikipedia online encyclopedia. However, any other source could be used instead of Wikipedia pages, for example, a specific set of domain-related documents or a specific corpus from repositories such as Kaggle (https://www.kaggle.com/) or the UCI repository (https://archive.ics.uci.edu/ml/index.php). Wikipedia is an open collaboration project, and it is a general reference work on the World Wide Web [37][38][39]. It tends to have better structured, well-formed, grammatical, and meaningful natural language sentences compared to raw web data [40]. Table 2 shows the number of pages extracted for each topic, the depth of the extraction system, and the restriction set. The extraction depth refers to the number of subtopics extracted for each topic, and the restriction is the filtering rule used to select the Wikipedia pages. After building the corpus for each topic, the next step is to generate the corresponding KRS.

Text Processing
Figure 5 shows the KRS generation. Each document is analyzed to extract enriched nouns through NLP and SW technologies. The Stanford CoreNLP tool splits a document's content into sentences and extracts information like PoS tags, lemmas, indexes, and NEs. Additionally, each sentence is analyzed with DBpedia Spotlight to look for entities linked to the DBpedia KB, retrieving the corresponding URL. The tasks involved in this process are:
• Sentence splitting: The content of a document is divided into sentences, applying splitting rules and pattern recognition to identify the end of a sentence.
• Lemmatization: The root of each word is identified, e.g., the lemma of the verb producing is produce.
According to Figure 5, the first task identifies and extracts enriched nouns from the corpus and stores them in the NoSQL database MongoDB. Then, the relevance of an enriched noun is computed based on the statistical measure IDF.

Mapping Process
The KRS is produced from MongoDB, where enriched nouns are mapped to the KRS as RDF triples and stored in a KB.
The KB provides the basic functionality of querying RDF triples. It is exposed through a SPARQL endpoint to query its content and retrieve the data needed to compute the similarity between web page content and the KB.
An example of the KRS is shown in Figure 6. The figure shows the Atkins_diet resource of type document (Basal_rate, 15-Anhydroglucitol and Artificial_pancreas are also of type document), associated with the topic of diabetes. The document contains five NEs linked to DBpedia KB (astrup, approach, appetite, analysis, Atkins).

The SFWC Implementation
The implementation of the proposed SFWC is explained in the following paragraphs.

Web Page Retrieval
The first step retrieves a web page from an input seed URL or from a queue of URLs. This module implements two methods to select the set of seed URLs as input for the proposed SFWC: (i) querying a search engine about a topic and (ii) randomly selecting a set of seed URLs from the input corpus. In the first case, the Google search API was used to query and retrieve seed URLs. The API allows for setting up a personalized Google search engine to query. For the experiments, the first five page results from Google were collected (50 URLs). In the second case, the same number of URLs (50 URLs) was randomly selected from the input corpus as in the first case.

HTML Content Extraction
The second step was implemented with the Java library Jsoup. The library contains functions to extract the content of predefined HTML tags, e.g., the <p> tag defines a paragraph. Jsoup is used to retrieve the text enclosed by this tag.
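The idea of keeping only the text enclosed by <p> tags can be sketched with Python's standard `html.parser` module. This is a swapped-in illustration, not the Jsoup-based implementation described above; the class and function names are hypothetical, and the sketch assumes that every paragraph has an explicit closing tag.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False      # True while parsing the content of a <p>
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            # Join the collected pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = ("<html><body><h1>Title</h1>"
          "<p>ALGOL is a <b>language</b>.</p><p>Second.</p></body></html>")
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

Text in inline elements such as <b> is kept because it arrives as data while the parser is still inside the paragraph, whereas the heading is skipped.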

Topic Classifier
The Stanford CoreNLP tool was used to analyze the web page content, defining the PoS tag, lemma, and entity label. DBpedia Spotlight was used to define the semantic annotation for each noun. The enriched nouns are used to compute the similarity of the web page content against the KRS. If the web page content is similar to the KRS content, the web page is stored in a repository of related web pages.

URL Extraction
This module implements a breadth-first approach to extract and add URLs to the crawler frontier. Figure 7 illustrates the breadth-first approach (part A) and how the URLs are stored in the queue (part B).

Crawler Frontier
The crawler frontier was implemented as a queue of URLs, arranged in accordance with the breadth-first algorithm.
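A minimal sketch of URL extraction feeding a breadth-first frontier is given below. It relies only on Python's standard library; the helper names, the decision to drop URL fragments, and the `seen` set used to avoid duplicates are assumptions made for this example and are not taken from the implementation described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def enqueue_links(html, base_url, frontier, seen):
    """Resolve each link against base_url and append unseen URLs to the queue."""
    parser = LinkExtractor()
    parser.feed(html)
    for href in parser.hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            frontier.append(url)    # FIFO order gives breadth-first traversal


base = "https://en.wikipedia.org/wiki/Diabetes"
frontier, seen = deque(), {base}
page = ('<a href="/wiki/Insulin">x</a><a href="#top">y</a>'
        '<a href="https://example.org/a">z</a>'
        '<a href="/wiki/Insulin#History">w</a>')
enqueue_links(page, base, frontier, seen)
print(list(frontier))
```

Appending new URLs at the tail and taking the next URL from the head of the queue means that all links found on one level are visited before those of the next level.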

Results and Evaluation
The experiments were executed in an iMac with a 3 GHz Intel Core i5 processor (Victoria, Tamps, Mexico), 16 GB of RAM and macOS Mojave as an operating system. The implemented application was developed in Java 8.
The experiments were conducted over three different corpora, built from the Wikipedia categories of computer science, politics, and diabetes. A KRS was constructed to represent the content of each corpus. The relevance of web page content to a given topic was computed using a similarity measure based on the statistical measure IDF and a threshold defined by the arithmetic mean and the standard deviation. Table 3 shows the statistics of the three Wikipedia categories. The number of Wikipedia pages retrieved for each category corresponds to the first level of the category hierarchy. For example, the root level of the computer science category contains 19 subcategories and 51 Wikipedia pages. For each subcategory, the corresponding Wikipedia pages are extracted, resulting in 1151 documents (second column in Table 3). The third column presents the total number of enriched nouns extracted and the average number of enriched nouns per document. The fourth column shows the total number of enriched nouns with a URL associated with a KB of the SW and the average value per Wikipedia page. The results of the experiments were analyzed quantitatively and qualitatively. The quantitative analysis focuses on the number of downloaded web pages related to a topic. The qualitative analysis focuses on the quality of the results from the quantitative experiments.

Quantitative Results
The proposed SFWC was evaluated over two sets of seed URLs from different sources: (i) seed URLs retrieved from the Google search engine and (ii) seed URLs selected from the built corpus (Wikipedia category). Tables 4 and 5 show the results per topic after processing both sets of seed URLs. The first column corresponds to the topic. The second column gives the number of seed URLs retrieved from the Google search engine and Wikipedia. The search engine was queried with the topic name, e.g., the query string "computer science" was used to retrieve the web pages related to the topic of computer science. For Wikipedia, the set of seed URLs was randomly selected for each category from the built corpus, e.g., 50 Wikipedia pages were randomly selected from the politics corpus. The last three columns show a summary of the processed seed URLs: (i) crawled, (ii) not crawled, and (iii) not processed. The seed URLs crawled column gives the number of seed URLs whose content was similar to the corresponding topic after computing the similarity measure, i.e., the similarity measure result was within the threshold. The seed URLs not crawled column gives the number of seed URLs whose content was not similar to the corresponding topic, i.e., the similarity measure result was outside the threshold. The last column (seed URLs not processed) gives the number of seed URLs that were not processed because an error occurred, e.g., the seed URL returned the HTTP 400 error code (Bad Request), which means that the request sent to the website server was incorrect or corrupted and the server could not understand it. Google's seed URLs (Table 4) yielded fewer crawled seed URLs than Wikipedia's seed URLs (Table 5), for which more than 50% of the seed URLs were crawled in every topic.
Additionally, Google's seed URLs were prone to errors, the most recurrent being the HTTP 400 error code (bad request): their domain names are heterogeneous, and the content can change drastically in format and structure from one URL to another. In contrast, Wikipedia's seed URLs were not prone to these kinds of errors.

Evaluation
The evaluation is based on the Harvest Ratio (HR) measure [31,32,41] shown in Equation (5). According to Samarawickrama and Jayaratne [42], the HR is the primary metric to evaluate crawler performance. The HR measures the rate at which relevant web pages are acquired and irrelevant web pages are filtered out of the crawling process:
$$HR = \frac{R_p}{T_p} \tag{5}$$
where R_p corresponds to the web pages accepted by the system and evaluated as correct, and T_p corresponds to the total number of accepted web pages downloaded by the SFWC, evaluated as correct or incorrect.

Similarity Measure
The similarity measure is based on the statistical measure IDF computed over the enriched nouns with a URL associated with a KB from the SW. The similarity of web page content against the KRS is calculated as follows:
1. The arithmetic mean (µ) and standard deviation (σ) for the KB are computed over all enriched nouns whose URL value is not empty.
2. For every new web page content, enriched nouns are extracted.
3. The IDF value for the new web page is calculated over all enriched nouns whose URL value is not empty.
4. If the arithmetic mean of the web page content is within µ ± σ, the web page is accepted. Table 8 defines the threshold range for each topic.
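The four steps above can be sketched as a small acceptance test. The function names and the IDF values are made up for illustration, the population standard deviation (`statistics.pstdev`) is an assumption, and the IDF values of the page nouns are taken as already looked up in the KRS.

```python
from statistics import mean, pstdev


def threshold(kb_idf_values):
    """Step 1: return the range (mu - sigma, mu + sigma) for the KB."""
    mu = mean(kb_idf_values)
    sigma = pstdev(kb_idf_values)   # assumption: population standard deviation
    return mu - sigma, mu + sigma


def is_similar(page_idf_values, kb_idf_values):
    """Steps 3-4: accept the page if its mean IDF lies inside the KB range."""
    if not page_idf_values:
        return False                # no enriched noun with a URL was found
    low, high = threshold(kb_idf_values)
    return low <= mean(page_idf_values) <= high


kb_idf = [2.0, 4.0, 4.0, 6.0]       # made-up IDF values for the KB
print(threshold(kb_idf))
print(is_similar([3.0, 5.0], kb_idf), is_similar([6.0, 6.5], kb_idf))
```

With these made-up values, the first page has a mean IDF of 4.0, which falls inside the range and is accepted, while the second page has a mean of 6.25 and is rejected.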
Equation (6) shows the computation of the IDF value for the enriched noun "algol", where N is the total number of documents in the computer science topic and n is the number of documents containing the enriched noun "algol". The computed IDF value is 6.36, which is added to the IDF values calculated for the remaining enriched nouns of the web page content. To illustrate this process, Listing 1 shows the query used to retrieve the number of documents (?total) containing the word "algol" from the computer science topic. The returned value corresponds to the divisor (n) in the IDF equation. The dividend (N) is retrieved by the query shown in Listing 2, which returns the total number of documents in the KB (the subjects whose type is sfwc:Document).
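The arithmetic behind the "algol" example can be checked with a few lines of code. The value N = 1151 is the number of computer science documents reported earlier; the natural logarithm and n = 2 are assumptions, chosen here because together they reproduce the reported value of 6.36.

```python
import math


def idf(total_documents, documents_with_term):
    """IDF as log(N / n); the natural logarithm is an assumption."""
    if documents_with_term <= 0:
        raise ValueError("the term must appear in at least one document")
    return math.log(total_documents / documents_with_term)


N = 1151   # documents in the computer science corpus (from the text)
n = 2      # assumption: number of documents containing "algol"
algol_idf = idf(N, n)
print(round(algol_idf, 2))
```

A word that appears in every document would get an IDF of 0 under this definition, which matches the observation that IDF is low for very common words.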
Listing 1: SPARQL query to retrieve the number of documents containing the word "algol" from the KRS.
The evaluation was conducted by four human raters over a stratified random sample of the crawled web pages. This kind of sample was selected to maintain consistency in the results since a human rater evaluates the results from each topic. The first step of the stratified random sampling consists of calculating the sample size from the whole data (see Equation (7)), where n is the sample size, N is the size of the corpus, σ is the standard deviation, and Z is the confidence value; the equation also depends on the sample error rate. The second step consists of calculating the sample size for the accepted and rejected web pages (see Equation (8)), where n_i corresponds to the sample size of accepted or rejected web pages, N_i is the total number of accepted or rejected web pages, and N is the size of the corpus. Table 9 shows the sample values for accepted and rejected web pages for Google (G) and Wikipedia (W). The sample, for Google and Wikipedia, was randomly selected.
Tables 10 and 11 show the HR results for each rater and the summary per topic for the seed URLs from Google and Wikipedia, respectively. According to the results in Tables 10 and 11, the proposed SFWC was consistent across the seed URLs from Google and Wikipedia. These results demonstrate that the KRS and the similarity measure select the most relevant concepts for each topic. The KRS describes the nouns from the input corpus, and the arithmetic mean and the standard deviation establish a threshold to determine which nouns are the most representative for the topic. The similarity measure determines whether web page content is related to the given topic, i.e., whether the result is within the predefined threshold. The combination of the KRS and the similarity measure helps to select the most related web pages.
The best results were obtained with the diabetes topic, which is more specific than computer science and politics. The average values for the computer science and diabetes topics are close for both sources, whereas for the politics topic there is a notable difference between Google and Wikipedia.
The computer science and politics topics contain several subtopics: the root level of the Wikipedia category of computer science contains 18 subcategories and that of politics contains 38, whereas the category of diabetes contains 10. The corpus for each topic was built only with the first level of the Wikipedia category, leaving aside a significant number of Wikipedia pages; Table 12 shows the number of Wikipedia pages for the first five levels.
The average results obtained for computer science and politics are promising, since their corpora do not contain all the Wikipedia pages from the corresponding categories. The diabetes category is more specialized, containing terms specific to the topic and, as can be seen in Table 12, its number of Wikipedia pages does not increase exponentially level by level. The average results obtained with the diabetes topic are better than those obtained with the remaining categories. In the particular case of the diabetes topic for Google results, the number of seed URLs crawled was 9 and the total number of web pages analyzed was 957, resulting in 265 accepted web pages. These numbers are lower than those for the seed URLs crawled from Wikipedia; however, the average percentage is quite similar, even though the numbers of accepted web pages differ considerably.
The qualitative evaluation was conducted using the Fleiss' kappa measure, shown in Equation (9). Fleiss' kappa is an extension of Cohen's kappa, which is a measure of the agreement between two raters where agreement due to chance is factored out. In this case, the number of raters can be more than two. As with Cohen's kappa, no weighting is used and the categories are considered to be unsorted:
$$\kappa = \frac{\bar{p} - \bar{p}_e}{1 - \bar{p}_e} \tag{9}$$
where $\bar{p}$ is the actual observed agreement and $\bar{p}_e$ represents chance agreement. The factor $\bar{p} - \bar{p}_e$ represents the degree of agreement actually achieved above chance, and the factor $1 - \bar{p}_e$ represents the degree of agreement that is attainable above chance. κ takes the value of 1 if the raters are in complete agreement.
The results obtained by the human raters in the quantitative evaluation are analyzed with the Fleiss' kappa measure. Table 13 shows the results for each topic and for each source of seed URLs (Google and Wikipedia). Table 14 shows the interpretation of the agreement between raters.
According to these values, Wikipedia's seed URLs obtained substantial agreement between the human raters, whereas Google's seed URLs obtained moderate agreement for computer science and politics and substantial agreement for diabetes. The diabetes corpus was consistent in the qualitative evaluation in both cases (Wikipedia and Google).
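To make the agreement computation concrete, the following Python sketch implements the usual textbook formulation of Fleiss' kappa, κ = (p̄ − p̄_e) / (1 − p̄_e). It is an illustration rather than the evaluation code used in this work. The input layout (one row per web page, one column per category, each cell counting the raters who chose that category), the function names, and the toy ratings are assumptions. The `interpret` helper assumes the widely used Landis and Koch bands, which may differ in detail from Table 14.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who assigned item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Proportion of all assignments that fall in each category.
    total = n_items * n_raters
    p_j = [sum(row[j] for row in ratings) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    pairs = n_raters * (n_raters - 1)
    P_i = [(sum(c * c for c in row) - n_raters) / pairs for row in ratings]

    p_bar = sum(P_i) / n_items        # observed agreement
    p_e = sum(p * p for p in p_j)     # chance agreement
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)

def interpret(kappa):
    """Map kappa to a label, assuming the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

# Toy data: four raters label five pages as [relevant, not relevant].
toy = [[4, 0], [3, 1], [0, 4], [4, 0], [1, 3]]
k = fleiss_kappa(toy)
print(round(k, 3), interpret(k))
```

With four raters and two categories, as in the toy data, each row simply records how many raters judged the page relevant and how many did not.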

Discussion
In accordance with the results from Tables 10 and 11, the average HR for each topic is above 70%, that is, the accepted or downloaded web pages are relevant to the corresponding topic. The computer science and politics topics obtained an average HR under 80% because both topics are broader than the diabetes topic and span far more Wikipedia pages, as shown in Table 12. For example, the fourth level of the computer science category contains 79,845 Wikipedia pages, whereas the diabetes category contains 357 Wikipedia pages at the same level.
The SFWC relies on the proposed KRS to describe the content of a corpus from any topic. In the evaluation, the corpus size did not determine the quality of the crawling results. The quality was determined by the content of the corpus and by the selection of the most representative enriched nouns for each corpus in the KRS. For example, the diabetes corpus size is 202, and it was the topic with the best results in the quantitative and qualitative analysis. However, the results for the computer science and politics topics could improve if their corpora were enlarged, since the number of Wikipedia pages per level differs significantly, as shown in Table 12.

Conclusions and Future Work
This work presented a novel semantic focused web crawler (SFWC) based on a knowledge representation schema (KRS), as an alternative to traditional SFWCs that use domain ontologies designed by human experts. The KRS can model any domain and is less complex, less formal, and easier to build than an ontology. The KRS describes the most relevant elements in the domain and can be automatically constructed through a semantic analysis of an input corpus. Even with a relatively low number of input web pages used to construct the corpus, as was the case with the Wikipedia pages in this work, the average results are promising: the SFWC was able to filter relevant web pages with a score above 70%, endorsed by human raters.
As part of the mechanisms for the SFWC to filter web pages, a new metric was used that combines the IDF with statistical measures. The achieved results demonstrated the high capacity (above 69%) of the proposed SFWC to filter relevant web page content based on a quantitative and qualitative evaluation. The SFWC was more effective with specialized topics such as diabetes, whose vocabulary terms are closely related to one another and thus to the content of web pages associated with that domain.
The quantitative and qualitative results demonstrate that the proposed SFWC reaches a substantial agreement between the human raters, obtaining better results (with a score above 80%) with the diabetes topic, which is more specific than the politics and computer science topics (score above 70%).
As future work, alternative approaches will be explored to select the input web page corpus for the KRS construction, that is, to select the most relevant documents for a topic, and the evaluation will be extended to broader topics.