Open Access Article
Sensors 2018, 18(1), 16; https://doi.org/10.3390/s18010016

WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora

1
ESEI: Higher Technical School of Computer Engineering, University of Vigo, 32004 Ourense, Spain
2
CINBIO: Biomedical Research Centre, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
3
CITI: Centre for Research Transference and Innovation, Avda. Galicia 2, Parque Tecnolóxico, San Cibrao das Viñas, 32900 Ourense, Spain
*
Author to whom correspondence should be addressed.
Received: 24 November 2017 / Revised: 16 December 2017 / Accepted: 18 December 2017 / Published: 22 December 2017
(This article belongs to the Special Issue Sensor Networks and Systems to Enable Industry 4.0 Environments)

Abstract

In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets that facilitate experimentation in web spam research. The application lets the user specify multiple criteria that change how new corpora are generated, while reducing the number of repetitive and error-prone tasks involved in maintaining existing corpora. To this end, WARCProcessor supports up to six data sources commonly used in web spam research and can store the output corpus in the standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of websites from the Internet, allowing the user to configure the depth of the links to be followed as well as the behaviour when redirected URLs appear. WARCProcessor offers both an interactive GUI and a command-line utility that can run in the background.
Keywords: web spam research; corpus generation and maintenance; multiple data sources; WARC format 1.0
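
For readers unfamiliar with the WARC 1.0 container mentioned in the abstract, the following is a minimal sketch of how a single record is laid out: a version line, named header fields, a blank line, the content block, and two trailing CRLF pairs. It is not WARCProcessor's own code; the function name `build_warc_record`, its parameters, and the choice of a `resource` record with a `text/html` payload are assumptions made purely for illustration.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      record_type: str = "resource",
                      content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 record (hypothetical helper, not WARCProcessor code)."""
    headers = [
        # The four fields the WARC 1.0 format requires on every record:
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Length", str(len(payload))),
        # Optional fields describing the captured page:
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # A blank line separates the header from the block; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_record("http://example.com/", b"<html></html>")
    print(record.decode("utf-8"))
```

Records built this way can be concatenated into one file to form a corpus; labels such as spam or non-spam would live in the complementary metadata files the abstract mentions, whose layout is not described here.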
This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
MDPI and ACS Style

Callón, M.; Fdez-Glez, J.; Ruano-Ordás, D.; Laza, R.; Pavón, R.; Fdez-Riverola, F.; Méndez, J.R. WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora. Sensors 2018, 18, 16.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Sensors, EISSN 1424-8220, published by MDPI AG, Basel, Switzerland.