Future Internet 2014, 6(3), 518-541; doi:10.3390/fi6030518
Article

ARCOMEM Crawling Architecture

V. Plachouras 1,*, F. Carpentier 2, M. Faheem 3, J. Masanès 2, T. Risse 4, P. Senellart 3, P. Siehndel 4 and Y. Stavrakas 1
1 Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece
2 Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France
3 CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France
4 Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
* Author to whom correspondence should be addressed.
Received: 15 April 2014 / Revised: 11 July 2014 / Accepted: 14 July 2014 / Published: 19 August 2014
(This article belongs to the Special Issue Archiving Community Memories)

Abstract

The World Wide Web is the largest information repository available today. However, this information is highly volatile, and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving rely on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open-source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and in leveraging the information from social media.
Keywords: web archiving; crawling architecture; content acquisition
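
As an illustration only, the idea of an online analysis step that assigns a priority to each Web page to be crawled can be sketched as a priority-ordered crawl frontier. This is a minimal sketch under stated assumptions and not the ARCOMEM implementation: the names `score` and `Frontier` are hypothetical, and the keyword-overlap score is a stand-in for whatever analysis the actual module performs.

```python
import heapq
import itertools


def score(text, topic_terms):
    """Toy relevance score (an assumption, not ARCOMEM's method):
    the fraction of topic terms that occur as words in the text."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


class Frontier:
    """Hypothetical crawl frontier: URLs with higher priority are fetched first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Monotonic counter breaks ties so equal priorities keep insertion order.
        self._counter = itertools.count()

    def add(self, url, priority):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the (url, priority) pair with the highest priority."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

In such a sketch, a crawler loop would call `add` for each outgoing link with a score computed from its surrounding text, and `pop` to decide which page to fetch next, so that pages judged more relevant to the topic are acquired earlier.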
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

MDPI and ACS Style

Plachouras, V.; Carpentier, F.; Faheem, M.; Masanès, J.; Risse, T.; Senellart, P.; Siehndel, P.; Stavrakas, Y. ARCOMEM Crawling Architecture. Future Internet 2014, 6, 518-541.

Future Internet EISSN 1999-5903 Published by MDPI AG, Basel, Switzerland