Design of an Enhanced Web Archiving System for Preserving Content Integrity with Blockchain
Abstract
1. Introduction
2. Related Research
2.1. OAIS
2.2. Web Archive
2.3. Blockchain
2.4. Public Blockchain vs. Private Blockchain
3. Design of BCLinked Web Archiving System
3.1. Web Archiving Method Based on Blockchain
3.2. Extension of WARC Record Format
3.3. Process of BCLinked Web Archiving System
3.4. The System Architecture
- The crawling processor crawls web content;
- The WARC handler adds the extended WARC fields and manages the WARC files stored in the repository;
- The blockchain requester communicates with the blockchain manager to request that new web content information be added to the blockchain;
- In the blockchain manager, the blockchain requester handler is the interface for communicating with the archive manager;
- The blockchain node handler processes the blockchain data.
4. Experiment and Analysis
4.1. Experiment Environment
4.2. Experiment Results
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Public Blockchain | Private Blockchain |
---|---|
Open, anyone can join the network | Restricted and permissioned |
Low speed of transaction accomplishment | Fast speed of transaction accomplishment |
Long transaction approval frequency | Short transaction approval frequency |
High cost of each transaction | Comparatively cheap cost of each transaction |
Anonymous | Known users |
Large energy consumption | Low energy consumption |
Blockchain | Field | Value
---|---|---
Domain blockchain | WARC-domain | Domain name of the website to archive
Domain blockchain | WARC-domain-creationtime | Creation time of the node
Webcontent blockchain | WARC-filename | WARC filename
Webcontent blockchain | WARC-location | WARC file location
Webcontent blockchain | WARC-record-ID | Block record UUID
Webcontent blockchain | WARC-block-digest | Block digest, as digest-algorithm ":" digest-value
Webcontent blockchain | WARC-payload-digest | Payload digest, as digest-algorithm ":" digest-value
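The webcontent block fields above can be illustrated with a minimal Python sketch. It assumes that a block is hashed with SHA-256 over its JSON-serialized fields and that the difficulty labels used in the experiment ("0000", "00000") denote a required prefix of zero hex digits in that hash. The `prev_hash` and `nonce` attributes and the names `WebContentBlock` and `mine` are hypothetical and are not taken from the paper.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class WebContentBlock:
    # Payload fields named after the table above.
    warc_filename: str
    warc_location: str
    warc_record_id: str
    warc_block_digest: str
    warc_payload_digest: str
    # Assumed bookkeeping fields, not listed in the table.
    prev_hash: str = "0" * 64
    nonce: int = 0

    def hash(self) -> str:
        # Sorted keys give a stable serialization, so equal fields hash equally.
        data = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(data).hexdigest()


def mine(block: WebContentBlock, difficulty: str = "0000") -> str:
    """Increment the nonce until the block hash starts with `difficulty`."""
    while not block.hash().startswith(difficulty):
        block.nonce += 1
    return block.hash()


if __name__ == "__main__":
    block = WebContentBlock(
        warc_filename="example.warc",
        warc_location="/repository/example.warc",
        warc_record_id="urn:uuid:00000000-0000-0000-0000-000000000000",
        warc_block_digest="sha1:PLACEHOLDER",
        warc_payload_digest="sha1:PLACEHOLDER",
    )
    recorded = mine(block, "00")
    print(recorded)
    # Any later change to a digest field no longer matches the recorded hash.
    block.warc_payload_digest = "sha1:TAMPERED"
    print(block.hash() == recorded)
```

Under this assumption, each additional zero in the prefix multiplies the expected number of hash attempts by 16, which would be consistent with the longer blockchain processing time reported for the higher difficulty.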
Field | Value |
---|---|
WARC-Added-Chain | True/False |
WARC-domain-block | block hash of the domain blockchain |
WARC-webcontent-block | block hash of the webcontent blockchain |
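As a rough illustration of the extended fields above, the following sketch appends them to the header of a WARC record using plain string handling. The function name `add_bclinked_fields` and the sample header are hypothetical. The sketch assumes the header is passed without the blank line that terminates it, and that each field is written as `Name: value` terminated by CRLF, as in the WARC format.

```python
CRLF = "\r\n"


def add_bclinked_fields(warc_header: str, domain_block: str,
                        webcontent_block: str) -> str:
    """Return the header with the three extended fields appended.

    `warc_header` holds the version line and named fields only; the caller
    is expected to write the terminating blank line and the record block.
    """
    extra = {
        "WARC-Added-Chain": "True",
        "WARC-domain-block": domain_block,
        "WARC-webcontent-block": webcontent_block,
    }
    # Drop empty fragments left by the trailing CRLF before re-joining.
    lines = [line for line in warc_header.split(CRLF) if line]
    lines.extend(f"{name}: {value}" for name, value in extra.items())
    return CRLF.join(lines) + CRLF


if __name__ == "__main__":
    header = "WARC/1.1" + CRLF + "WARC-Type: response" + CRLF
    print(add_bclinked_fields(header, "abc123", "def456"))
```

Because the new fields live in the record header, the record block and its length are left unchanged by this step.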
Component | API Name | Description
---|---|---
The archive manager | /crawl | Collect new web content. :param target_url: target web page URL. :return: the object's job ID
The archive manager | /archive | Archive new web content to the blockchain. :param jobid: the object's job ID. :return: WARC-domain-block, WARC-webcontent-block
The archive manager | /update | Update the WARC. :param jobid: the object's job ID. :param domain_block: WARC-domain-block. :param wcontent_block: WARC-webcontent-block. :return: Boolean
The blockchain manager | /blocks/add | Add a new block. :param location: the WARC file location. :param target_url: target web page URL. :param digest: dictionary of UUID and digest values. :return: dictionary of block hash information
The blockchain manager | /blocks/list | List blocks. :return: list of blocks
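The call sequence implied by the API table above (/crawl, then /archive, then /update) can be sketched with in-memory Python functions. This is only an outline under stated assumptions: the functions stand in for HTTP endpoints, `JOBS` and `CHAIN` are hypothetical stores, the repository path is invented, and the returned block hashes are placeholders rather than real hashes.

```python
import uuid

JOBS = {}   # hypothetical archive manager state: job ID -> job details
CHAIN = []  # hypothetical stand-in for the blockchain manager's ledger


def blocks_add(location, target_url, digest):
    """Stand-in for /blocks/add: record the content and return hash info."""
    CHAIN.append({"location": location, "target_url": target_url,
                  "digest": digest})
    index = len(CHAIN)
    # Placeholder values; a real manager would return actual block hashes.
    return {"WARC-domain-block": f"domain-{index}",
            "WARC-webcontent-block": f"webcontent-{index}"}


def crawl(target_url):
    """Stand-in for /crawl: register a job and return its job ID."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "target_url": target_url,
        "location": f"/repository/{job_id}.warc",  # assumed layout
        "digest": {"record_id": f"urn:uuid:{uuid.uuid4()}",
                   "block_digest": "sha1:PLACEHOLDER"},
        "added_chain": False,
    }
    return job_id


def archive(job_id):
    """Stand-in for /archive: ask the blockchain manager to add the content."""
    job = JOBS[job_id]
    return blocks_add(job["location"], job["target_url"], job["digest"])


def update(job_id, domain_block, wcontent_block):
    """Stand-in for /update: store the block hashes with the job's WARC."""
    job = JOBS.get(job_id)
    if job is None:
        return False
    job.update({"domain_block": domain_block,
                "wcontent_block": wcontent_block,
                "added_chain": True})
    return True


if __name__ == "__main__":
    jid = crawl("https://example.com/")
    hashes = archive(jid)
    print(update(jid, hashes["WARC-domain-block"],
                 hashes["WARC-webcontent-block"]))
```

Keeping /update as a separate step mirrors the table: the WARC file is written first, and the block hashes are attached only after the blockchain manager has answered.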
Item | Specification
---|---
Hardware | Intel i7 CPU, 16 GB RAM
OS | Microsoft Windows 10 Professional
Linux emulation | Cygwin
Web crawler | wget
Development language | Python
Database | MySQL
Difficulty | Metric | BCLinked Web Archiving | Traditional Web Archiving
---|---|---|---
All | Number of archiving | 810 (81 sites × 100 times) | 810 (81 sites × 100 times)
0000 | Total processing time | 3575.08 s | 3493.12 s
0000 | Blockchain processing time | 78.67 s | 0 s
0000 | Total WARC files size | 120,459,510 bytes | 120,375,023 bytes
0000 | Blockchain data size | 1,142,594 bytes | 0 bytes
00000 | Total processing time | 5049.47 s | 3380.76 s
00000 | Blockchain processing time | 1011.34 s | 0 s
00000 | Total WARC files size | 121,408,987 bytes | 120,946,193 bytes
00000 | Blockchain data size | 1,143,229 bytes | 0 bytes
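The relative overhead in the results table can be derived with a few lines of Python. The figures are copied from the BCLinked column; the dictionary layout and the function name `overhead` are illustrative only.

```python
# BCLinked Web Archiving measurements from the results table.
RESULTS = {
    "0000": {"total_s": 3575.08, "blockchain_s": 78.67,
             "warc_bytes": 120_459_510, "chain_bytes": 1_142_594},
    "00000": {"total_s": 5049.47, "blockchain_s": 1011.34,
              "warc_bytes": 121_408_987, "chain_bytes": 1_143_229},
}


def overhead(row):
    """Return blockchain time as a share of total time, and chain/WARC size."""
    return {
        "time_share": row["blockchain_s"] / row["total_s"],
        "storage_ratio": row["chain_bytes"] / row["warc_bytes"],
    }


if __name__ == "__main__":
    for difficulty, row in RESULTS.items():
        o = overhead(row)
        print(f"difficulty {difficulty}: "
              f"blockchain time {o['time_share']:.2%} of total, "
              f"blockchain data {o['storage_ratio']:.2%} of WARC size")
```

With these numbers, blockchain processing accounts for roughly 2% of the total time at difficulty 0000 and roughly 20% at difficulty 00000, while the blockchain data stays below 1% of the WARC size in both cases.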
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Hwang, H.C.; Shon, J.G.; Park, J.S. Design of an Enhanced Web Archiving System for Preserving Content Integrity with Blockchain. Electronics 2020, 9, 1255. https://doi.org/10.3390/electronics9081255