Design of an Enhanced Web Archiving System for Preserving Content Integrity with Blockchain

Abstract: A web archive system preserves web content for the future, and its importance is growing with the explosive growth of web content. The reference model for an open archival information system (OAIS) provides guidance for long-term archiving systems, and most organizations that archive web content follow it. In addition, the web archive (WARC) ISO standard defines a file format for web content archiving. However, neither provides a way to secure content integrity, and it is hard to identify the original content. Because of these limitations, a web archive system is weak when content integrity is disputed. In this paper, we propose the blockchain linked (BCLinked) web archiving system, which uses blockchain technology and extended WARC fields to keep web content integrity metadata in a blockchain. Furthermore, we designed the BCLinked web archiving system and confirmed through an experiment that the proposed system secures content integrity.
... case is an attempt to modify the web content at the beginning position of the blockchain data. In this case, all linked blockchain data and extended defined-fields must be revised, and T is the total number of all blockchain data. The time complexity for $S_{a2}$ is $O(n^2)$, while the time complexity for $S_{a1}$ is $O(n)$. This shows that, with the BCLinked archiving system, unauthorized content modification is almost impossible in the real world because of the massive processing overhead.


Introduction
The purpose of a web archive is to collect web content and preserve it for the long term so that valuable data can be delivered to future generations [1]. A great deal of personal data exists in digital spaces such as the web and email, and many studies address the collection of such personal digital data [2]. Today, most web content is dynamic and personalized rather than static, because the web is the primary channel for delivering personalized information to people. As the volume of content from digital channels, including the web, kept growing, standardization of digital resource archiving became necessary. The Consultative Committee for Space Data Systems (CCSDS) developed a reference model for an open archival information system (OAIS), standardized as ISO 14721, for the long-term preservation of digital records, and many web archive systems follow the standard [3]. However, the OAIS reference model does not provide an implementation method; it only prescribes the requirements for OAIS compliance [4]. Although the OAIS reference model addresses the long-term preservation of digital records, content integrity remains weak against unauthorized modification, and there is no mechanism to guarantee content integrity over the long term. Blockchain is an emerging technology that links each dataset, called a block, to the previous one with a cryptographic hash, in a decentralized system where anyone can share the ledger and review all datasets. Blockchain is becoming more popular in enterprises because these characteristics provide an enhanced trusted system [5]. In this paper, we propose an extended specification of the web archive (WARC) format that integrates with blockchain technology, and we design an enhanced web archiving system based on blockchain technology to provide better web content integrity. Finally, we implemented the proposed enhanced web archiving system and confirmed that it provides enhanced web content integrity.

OAIS
The OAIS reference model was developed by CCSDS in 2002 [3] as technical guidance for long-term archiving and for sharing archived information. The reference model is standardized as ISO 14721, and many organizations that archive content follow it for their archive system infrastructure. It provides (a) the concepts of archive management for the long-term archiving of and access to digital assets; (b) guidance for effective archive management in a typical organization; (c) a conceptual architecture for digital archiving; (d) a way to compare different archiving strategies; (e) a way to compare data models; and (f) a way to identify the key factors for digital archiving. The environment model of an OAIS consists of producers, consumers and management [3]. A producer provides the information to be preserved, management is responsible for controlling the OAIS, and a consumer interacts with OAIS services to query and retrieve preserved information. A producer collects the data to be preserved and delivers it to the OAIS as a submission information package (SIP), and the OAIS manages the data as an archival information package (AIP). A consumer sends a query and receives query responses to retrieve the information of interest from the OAIS. The detailed description of the OAIS functional entities gives guidance on three kinds of services: common services, network services and security services. For security services, the model advises that a data integrity service should ensure that data are not altered or destroyed in an unauthorized manner and that data are kept in permanent storage [3]. Despite this guidance, existing web archiving solutions have no robust standard security specification for securing content integrity in implementation and operation.

Web Archive
The purpose of a web archive is the long-term preservation of current information from the web for future generations [1]. Web archiving is an essential topic because we have a responsibility to archive existing web content for future generations. The challenge is that the amount of web data is growing significantly, and web content is becoming more important because the web is the primary communication channel for delivering sensitive personal information such as legal contracts. The web archiving workflow can be described in four steps, as shown in Figure 1 [6]. The appraisal and selection step evaluates the value of records and decides whether they should be preserved and for how long. The acquisition step archives the records by using web technologies such as a web crawler. The organization and storage step decides how the resources are organized and which storage is used. Finally, the description and metadata step generates metadata and descriptions from the preserved records for post-processing after archiving, such as content searching.
Electronics 2020, 9, x FOR PEER REVIEW 2 of 15

[Figure 1. The four steps of the web archiving workflow: Appraisal and Selection; Acquisition; Organization and Storage; Description and Access.]
In the acquisition step, many web content collection strategies based on web crawlers have been developed for the capture and archiving stage. HTML is the standard markup language for displaying web content, and it has evolved from HTML 1.0 to HTML 5.x. An HTML file links to many resources needed to display the content, such as external CSS (cascading style sheets) files, images and fonts. In this respect HTML differs from PDF, another standard digital document format, which can embed all related resources in a single file [7]. Hence, storing an HTML file and its related resources in a repository is vulnerable, because there is no way to confirm whether all related resources have been stored. Moreover, the versions need to be synchronized, for example by saving the HTML file and its resources with the same timestamp.
The IIPC (International Internet Preservation Consortium) has worked on the WARC (web archive) file format, which is standardized as ISO 28500. WARC provides a standard way to structure, manage and store web content collected from the web and elsewhere [8]. The WARC file format is used by most digital content archiving organizations, such as the National Library of Korea [9]. Archiving software such as Heritrix and wget can save archived content as a WARC file. A WARC file consists of one or more WARC records, each with additional metadata such as a header, as shown in Figure 2. In general, the first WARC record describes the WARC file, and the records after it hold the results of crawling a source, such as HTML, images and CSS files. The WARC record format has defined-fields that describe each record's properties, as shown in Figure 3. Two defined-fields concern content integrity: WARC-block-digest and WARC-payload-digest [8].
WARC-block-digest holds a digest value of the full block of the record, and WARC-payload-digest holds a digest value of the payload referred to or contained by the record, computed with a hash algorithm such as SHA-1. These two fields can be used as proof of the integrity of WARC record content, as shown in Figure 4. Figure 4 shows that WARC-block-digest and WARC-payload-digest carry digest values for the record; the integrity of the WARC content can be confirmed by recomputing the digests and comparing them with these two fields.
However, these two fields are optional defined-fields in the WARC specification, and nothing protects a WARC file against unauthorized modification. The digest value and the content of a WARC file can be altered together. Consequently, there is no way to guarantee that a WARC file is the same as the original and that neither the digest value nor the record value has been revised. Hence, the WARC file format is vulnerable to unauthorized content modification.
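The weakness described above can be illustrated with a minimal Python sketch. It assumes a record block made of HTTP headers followed by a body, and a labelled `sha1:<base32>` digest style; the helper names `warc_digest` and `record_digests` are hypothetical and are not part of any WARC library.

```python
import base64
import hashlib


def warc_digest(data: bytes) -> str:
    """Return a labelled SHA-1 digest in the 'sha1:<base32>' style."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def record_digests(block: bytes) -> dict:
    """Compute both digest fields for a record block (HTTP headers + body)."""
    # Assumption: the payload is everything after the blank line ending the headers.
    _, _, payload = block.partition(b"\r\n\r\n")
    return {
        "WARC-block-digest": warc_digest(block),
        "WARC-payload-digest": warc_digest(payload),
    }


original = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>original</html>"
stored = record_digests(original)

# Integrity check: recompute the digests and compare with the stored fields.
print(record_digests(original) == stored)  # True

# Someone who edits the content can recompute the fields as well,
# so the tampered record is still self-consistent.
tampered = original.replace(b"original", b"modified")
forged = record_digests(tampered)
print(record_digests(tampered) == forged)  # True

# The change is only detectable if the old digests were kept somewhere else.
print(forged == stored)  # False
```

The last comparison is the motivation for storing the digests outside the WARC file, where they cannot be rewritten together with the content.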

Blockchain
Satoshi Nakamoto developed blockchain technology for Bitcoin in 2008 [10]. It uses datasets called blocks, and each block is linked to the previous block by cryptography, as shown in Figure 5 [10]. The linked set of blocks is called a blockchain. Each block contains the cryptographic hash value of the previous block, a timestamp and a transaction dataset. A blockchain is stored across a peer-to-peer network to avoid centralized holding of the data, and it is therefore called a distributed ledger. Anyone who joins the blockchain network can access and download the distributed ledger and read the data of each block. Blockchain is widely used for cryptocurrency, for storing customer contracts in the FSI (financial services industry) and in many other enterprise areas because of two characteristics: (1) the whole chain is broken if any single block is damaged, and (2) anyone in the peer-to-peer network can hold the distributed ledger. Compared to a conventional centralized system, a blockchain has several advantages, including efficiency, security, resilience and transparency [11]. Because of these characteristics, most security solutions regard blockchain as a viable, robust and sustainable cybersecurity solution [12], and many enterprises are considering introducing blockchain technology into their internal legacy systems. For example, the e-document integrated support center and KISA, which are Korean government centers, use blockchain technology for e-document delivery certificates [13].
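The hash linking described in this section can be sketched in a few lines of Python. This is a simplified, single-machine illustration and not Bitcoin's actual block format; the field names `prev_hash`, `timestamp` and `transactions` and the helper functions are assumptions made for the example.

```python
import hashlib
import json
import time


def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Add a block that stores the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "timestamp": time.time(), "transactions": transactions})


def is_valid(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))


chain = []
for tx in (["a pays b"], ["b pays c"], ["c pays a"]):
    append_block(chain, tx)

print(is_valid(chain))  # True

# Damaging one block breaks the link stored in the block after it.
chain[0]["transactions"] = ["a pays z"]
print(is_valid(chain))  # False
```

To hide the change, every later block would have to be rewritten as well, which is what characteristic (1) above refers to.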


Blockchain is a decentralized system, so any user in the peer-to-peer network can create a new block, which may conflict with a block created by another user. Blockchain resolves this with the "proof of work" mechanism [14], in which a node must find a nonce value that makes the block hash meet a target value. Moreover, a "difficulty" parameter adjusts how many blocks can be created per unit of time [14]. The difficulty increases along with the number of nodes. For example, because of the difficulty, Bitcoin creates only about one block every 10 min [14].
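The nonce search can be sketched as follows. This is a simplified illustration: the target is expressed as a required number of leading zero hex digits in a SHA-256 hash, whereas Bitcoin compares a double SHA-256 hash against a numeric target. The function names `meets_target` and `mine` are hypothetical.

```python
import hashlib


def meets_target(block_data: str, nonce: int, difficulty: int) -> bool:
    """True if the hash of data + nonce starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode("utf-8")).hexdigest()
    return digest.startswith("0" * difficulty)


def mine(block_data: str, difficulty: int) -> int:
    """Try nonces in order until one meets the target."""
    nonce = 0
    while not meets_target(block_data, nonce, difficulty):
        nonce += 1
    return nonce


nonce = mine("prev_hash|timestamp|transactions", difficulty=4)
print(nonce)

# Finding the nonce is costly, but checking it takes a single hash.
print(meets_target("prev_hash|timestamp|transactions", nonce, 4))  # True
```

Each extra required zero digit multiplies the expected number of attempts by 16, which is how a difficulty setting slows block creation down.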

Public Blockchain vs. Private Blockchain
A public blockchain is a permissionless blockchain: anyone can add or read data without permission. Bitcoin is the most popular public blockchain, and anyone can download the Bitcoin blockchain to read or add data. A public blockchain such as Bitcoin is decentralized and needs no central entity to control the network. Data on a public blockchain are secure because they cannot be altered once they have been added to the blockchain. A private blockchain is a permissioned blockchain [15]. It is a compromise between a public blockchain and a traditional centralized server architecture. A private blockchain is run by the owners of the network, who control it and can admit new candidates. In a private blockchain, only the admitted participants can add blocks and see transactions; others cannot access it. Hyperledger Fabric from the Linux Foundation [16,17] is a well-known example of a private blockchain. Various types of private blockchain can be designed, such as (a) anyone can read, but only permissioned users can write, or (b) only permissioned users can read and write. Conversely, in a financial enterprise, all customers may write transaction information, whereas business users and authorization organizations have read-only permission to monitor all transactions.
A public blockchain and a private blockchain thus have different characteristics, as summarized in Table 1. A public blockchain is preferred by an enterprise that wants to decentralize, whereas a private blockchain is preferred by an enterprise that wants to keep control of its security assets.
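The read and write rules of the public design and of private designs (a) and (b) can be written down as a small access check. This is only an illustrative sketch; the `Policy` names and the `is_member` flag are assumptions and do not correspond to the configuration of Hyperledger Fabric or any other platform.

```python
from enum import Enum


class Policy(Enum):
    PUBLIC = "anyone reads and writes"
    PUBLIC_READ = "anyone reads, members write"  # private design (a)
    MEMBERS_ONLY = "members read and write"      # private design (b)


def can_read(policy: Policy, is_member: bool) -> bool:
    """Non-members are locked out of reading only under design (b)."""
    return is_member or policy is not Policy.MEMBERS_ONLY


def can_write(policy: Policy, is_member: bool) -> bool:
    """Non-members may write only on a public blockchain."""
    return is_member or policy is Policy.PUBLIC


for policy in Policy:
    print(policy.name, can_read(policy, False), can_write(policy, False))
```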

Web Archiving Method Based on Blockchain
One of the main challenges for current web archives in the acquisition step is that there is no integrated, managed web content collection system and no way to prove content integrity. Even though many organizations try to run web archive systems based on global standards for future generations, users of archived content cannot be sure that it is the same as the original. Moreover, no mechanism prevents unauthorized modification. A web archive system that can prove content integrity can provide a trusted web archive.
To solve this challenge, we propose a new web archiving method based on blockchain, which records the web archiving activity and the content information in a blockchain dataset. Once this information has been recorded in the blockchain, it cannot be deleted or altered. In general, a web archive system collects content from various websites and collects the same page periodically to capture the latest content. Hence, the web archive system needs to track the web archive transactions of each domain. However, the domains are not related to each other, so it is enough for the web archive system to recognize the life cycle of each domain. Hence, we define the BCLinked (blockchain linked) web archiving system, which has two levels of blockchain for the web archive, as shown in Figure 6.
The blocks of the first-level blockchain, named the domain blockchain, contain a domain name, and a new block is added whenever a new domain is selected for web archiving. The domain blockchain can be used for any web domain that exists on the web. The blocks of the second-level blockchain, named the webcontent blockchain, contain the link to the corresponding block in the domain blockchain and the web archive retrieval information. The number of webcontent blockchains equals the number of blocks at the first level. The full content of a WARC file can be too big to store as data in a blockchain block because of the block size limit. Therefore, only the digest value that identifies the content is added to the blockchain, and the original WARC file is stored in a file repository. This web archive method provides a way to record the web archive process activity and the content information in the blockchain to secure content integrity.
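A minimal in-memory sketch of the two-level structure is given below. It is not the implementation evaluated in this paper: the class name `TwoLevelArchive`, the field names and the use of SHA-256 are assumptions chosen for illustration, and a real deployment would run on a blockchain platform rather than on Python lists.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode("utf-8")).hexdigest()


class TwoLevelArchive:
    def __init__(self) -> None:
        self.domain_chain = []       # first level: one block per archived domain
        self.webcontent_chains = {}  # second level: one chain per domain

    def add_domain(self, domain: str) -> dict:
        prev = block_hash(self.domain_chain[-1]) if self.domain_chain else "0" * 64
        block = {"prev_hash": prev, "domain": domain}
        self.domain_chain.append(block)
        self.webcontent_chains[domain] = []
        return block

    def add_capture(self, domain: str, warc_bytes: bytes) -> dict:
        domain_block = next(b for b in self.domain_chain if b["domain"] == domain)
        chain = self.webcontent_chains[domain]
        block = {
            "prev_hash": block_hash(chain[-1]) if chain else "0" * 64,
            # Link from the second level back to the first-level block.
            "domain_block_hash": block_hash(domain_block),
            # Only a digest is stored; the WARC file itself stays in a repository.
            "warc_digest": hashlib.sha256(warc_bytes).hexdigest(),
        }
        chain.append(block)
        return block


archive = TwoLevelArchive()
archive.add_domain("example.org")
archive.add_domain("example.com")
archive.add_capture("example.org", b"first crawl of example.org")
archive.add_capture("example.org", b"second crawl of example.org")

# One webcontent chain exists per block of the domain blockchain.
print(len(archive.domain_chain), len(archive.webcontent_chains))  # 2 2
```

Keeping a separate second-level chain per domain means that periodic captures of one site only extend that site's chain and leave unrelated domains untouched.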

Extension of WARC Record Format
We define fields for blockchain block data and a WARC record to integrate with blockchain and a WARC. The domain blockchain block will have each domain description, and the webcontent blockchain block will have each web retrieval result for each domain. Moreover, blockchain block has a specific block size, and all content of a WARC file can be too big to put into the blockchain block. So, we define the blockchain block data fields as shown in Table 2. Domain blockchain has WARC-domain and WARC-domain-creationtime fields. In case the web archiving system selects a new web domain to archive, then the information of the web domain will be added into a new block in the domain-creationtime blockchain. A new webcontent blockchain block will be added into the Electronics 2020, 9, 1255 7 of 13 webcontent blockchain after a WARC file is created. A webcontent blockchain block does not have all content of a WARC file, but only the fields to identify the WARC as shown in Table 2. A WARC file has multiple WARC records and these records per a WARC will be added into one webcontent blockchain node. It can be hard to synchronize a WARC file across participants who attend the web archive blockchain due to several issues such as file size, whereas the web archive blockchain can be synchronized. Hence, WARC-filename and WARC-location are used to keep the WARC file physical location and it will be added once into one block. WARC-record-ID, WARC-block-digest and WARC-payload-digest are used for each WARC record and it will be added multiple times because the number of a WARC record is greater than zero. These fields will be added as JSON format as shown in Figure 7. location and it will be added once into one block. WARC-record-ID, WARC-block-digest and WARCpayload-digest are used for each WARC record and it will be added multiple times because the number of a WARC record is greater than zero. These fields will be added as JSON format as shown in Figure 7.  
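As a rough illustration of how the webcontent block data could be serialized, the Python sketch below builds a JSON document from the five field names given in the text. The exact layout of Figure 7 is not reproduced here: the grouping key `records`, the helper `webcontent_block_data`, the file name, the location and the digest values are all placeholders invented for the example.

```python
import json
import uuid


def webcontent_block_data(filename: str, location: str, records: list) -> str:
    """Serialize the identifying fields of one WARC file as block data."""
    data = {
        # Added once per block: where the WARC file physically lives.
        "WARC-filename": filename,
        "WARC-location": location,
        # Added once per WARC record ("records" is a hypothetical key name).
        "records": [
            {
                "WARC-record-ID": r["id"],
                "WARC-block-digest": r["block_digest"],
                "WARC-payload-digest": r["payload_digest"],
            }
            for r in records
        ],
    }
    return json.dumps(data, indent=2)


# Placeholder records; real digests would come from the WARC file itself.
records = [
    {
        "id": f"<urn:uuid:{uuid.uuid4()}>",
        "block_digest": f"sha1:PLACEHOLDERBLOCK{i}",
        "payload_digest": f"sha1:PLACEHOLDERPAYLOAD{i}",
    }
    for i in range(2)
]

print(webcontent_block_data("example.org-0001.warc", "/archive/example.org/", records))
```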
Once the information for the WARC file has been added to the blockchain node, the new block is created. The description of that node is then added to the WARC file. We define the extended defined-fields of the WARC record that describe the blockchain node as shown in Table 3. These fields are added to the top part of a WARC file.


Process of BCLinked Web Archiving System
The process of the BCLinked web archiving system is shown in Figure 8. In the content crawling stage, a web crawler collects web content and creates a WARC file. All WARC records are then sent to the BCLinked web archiving system, which checks whether the target domain of the WARC records already exists in the domain blockchain. In the domain information storing stage, the existing blockchain node data is used if the domain already exists; otherwise, a new blockchain block is added. In the content information storing stage, all WARC records are added to a webcontent blockchain block, and in the blockchain information storing stage, the block hash is added to the WARC records for cross-validation.
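The stages above can be sketched with two in-memory chains. This is an illustrative sketch under stated assumptions: the block structure (`index`, `previous_hash`, `data`), the use of SHA-256 over canonical JSON, and all function names are hypothetical, and proof of work is omitted for brevity.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 over the canonical JSON form of a block (assumed hashing scheme)."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain, data):
    """Link a new block to the previous one and return the new block's hash."""
    previous = block_hash(chain[-1]) if chain else "0" * 64
    block = {"index": len(chain), "previous_hash": previous, "data": data}
    chain.append(block)
    return block_hash(block)


def store_archive(domain_chain, webcontent_chain, domain, created, warc_info):
    # Domain information storing stage: add the domain only if it is new.
    known = any(b["data"]["WARC-domain"] == domain for b in domain_chain)
    if not known:
        append_block(
            domain_chain,
            {"WARC-domain": domain, "WARC-domain-creationtime": created},
        )
    # Content information storing stage: one block per WARC file.
    new_hash = append_block(webcontent_chain, warc_info)
    # Blockchain information storing stage: the caller writes this hash
    # back into the WARC records for cross-validation.
    return new_hash


domains, contents = [], []
h1 = store_archive(domains, contents, "example.org", "2020-01-01T00:00:00Z",
                   {"WARC-filename": "a.warc.gz"})
h2 = store_archive(domains, contents, "example.org", "2020-01-02T00:00:00Z",
                   {"WARC-filename": "b.warc.gz"})
```

Archiving the same domain twice adds one domain block and two webcontent blocks, and the second webcontent block stores the hash of the first.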

The System Architecture
The BCLinked web archiving system consists of an archive manager and a blockchain manager, as shown in Figure 9. Subcomponents 1 to 3 below belong to the archive manager, and subcomponents 4 and 5 belong to the blockchain manager:
1. The crawling processor crawls web content;
2. The WARC handler adds the extended WARC fields and manages the storage of WARC files in the repository;
3. The blockchain requester communicates with the blockchain manager to request that new web content information be added to the blockchain;
4. The blockchain requester handler is the interface of the blockchain manager for communicating with the archive manager;
5. The blockchain node handler processes the blockchain data.
In general, web archiving is carried out by an enterprise or a government that needs to keep historical records. Hence, the BCLinked web archiving system does not need to grant users full permissions; read permission is sufficient. The BCLinked web archiving system can therefore run on a private blockchain. A council of organizations can run the system, and the blockchain data is synchronized via the network. However, the WARC files in the archive manager do not need to be synchronized, because synchronizing them would require heavy network traffic. Hence, we designed the system so that each organization in the council keeps its WARC files in its own WARC repository, and only the metadata of the WARC files is added and synchronized via the blockchain manager.

Figure 8. Web data archival process in the BCLinked web archiving method.


Figure 9. System architecture of the BCLinked web archiving system (archive manager and blockchain manager).
Both the archive manager and the blockchain manager run on independent web servers, and each exposes a RESTful API used to start a new archive process and to communicate with the other, as shown in Table 4. The two managers can run in separate physical environments or in a single physical environment. The archive manager is triggered by the /crawl API: it crawls the web content and returns a unique job ID. The /archive API is then triggered to add the information to the blockchain; it executes the /blocks/add API and returns the result. Next, the /update API records the blockchain hash information in the WARC file, and the process is finished. A user can check all blockchain data through the /blocks/list API.
Table 4. RESTful APIs in the BCLinked web archiving system.
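The call order of these APIs can be illustrated without any network code. In the sketch below, plain Python methods stand in for the endpoints; the class and method names, the in-memory job table, and the placeholder block reference are all assumptions made for illustration. That /archive and /update belong to the archive manager is inferred from the description rather than stated in it.

```python
import uuid


class BlockchainManager:
    """Stand-in for the blockchain manager (/blocks/add, /blocks/list)."""

    def __init__(self):
        self._blocks = []

    def blocks_add(self, data):
        block = {"index": len(self._blocks), "data": data}
        self._blocks.append(block)
        # Placeholder reference; a real system would return the block hash.
        return f"block-{block['index']}"

    def blocks_list(self):
        return list(self._blocks)


class ArchiveManager:
    """Stand-in for the archive manager (/crawl, /archive, /update)."""

    def __init__(self, blockchain_manager):
        self._bm = blockchain_manager
        self._jobs = {}

    def crawl(self, url):
        # A real implementation would run the crawler and write a WARC file.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"url": url, "block_ref": None}
        return job_id

    def archive(self, job_id):
        # /archive delegates to /blocks/add on the blockchain manager.
        job = self._jobs[job_id]
        return self._bm.blocks_add({"job_id": job_id, "url": job["url"]})

    def update(self, job_id, block_ref):
        # Record the blockchain reference alongside the WARC file.
        self._jobs[job_id]["block_ref"] = block_ref

    def job(self, job_id):
        return dict(self._jobs[job_id])


def run_archive_job(archive_manager, url):
    """Run /crawl, then /archive, then /update, in the order described above."""
    job_id = archive_manager.crawl(url)
    block_ref = archive_manager.archive(job_id)
    archive_manager.update(job_id, block_ref)
    return job_id, block_ref


bm = BlockchainManager()
am = ArchiveManager(bm)
job_id, block_ref = run_archive_job(am, "https://example.org/")
```

Keeping the blockchain manager behind its own interface is what allows the two managers to run either on one machine or on separate ones.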


Experiment Environment
We verified the BCLinked web archiving system described in this paper through an experiment. We developed the system using open-source software, as shown in Table 5. For the crawling processor, we used wget, a well-known web crawler. Like Heritrix and other archiving tools, wget supports writing WARC files, and has done so since version 1.14 [18]. We developed the archive manager and the blockchain manager in Python.
The BCLinked web archiving system uses a PoW (Proof of Work) algorithm to avoid conflicts when several requests try to create a new block at the same time. We set the difficulty as shown in Figure 10. The appropriate difficulty depends on the situation; for example, the difficulty of Bitcoin is adjusted periodically depending on how much hashing power is distributed on the network. In this experiment, we chose the difficulty based on previous research experience.
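A proof-of-work loop of this kind can be sketched as follows. The sketch assumes that the difficulty value is a required prefix of the hexadecimal SHA-256 digest and that the digest covers the previous hash, the block data and a nonce; the configuration in Figure 10 is not available here, so these details and the function name are illustrative only.

```python
import hashlib


def proof_of_work(previous_hash, data, difficulty="0000"):
    """Search for a nonce whose digest starts with the difficulty prefix."""
    nonce = 0
    while True:
        candidate = f"{previous_hash}{data}{nonce}".encode()
        digest = hashlib.sha256(candidate).hexdigest()
        if digest.startswith(difficulty):
            return nonce, digest
        nonce += 1


# A short prefix keeps this demonstration fast.
nonce, digest = proof_of_work("0" * 64, "example block data", difficulty="00")
print(nonce, digest)
```

Under the prefix assumption, each additional hexadecimal zero multiplies the expected number of attempts by 16: about 65,536 attempts on average for "0000" and about 1,048,576 for "00000".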


Experiment Results
We archived web content with both the BCLinked web archiving system and a traditional web archiving system. We ran two experiments with different difficulty values; the results are summarized in Table 6.
In the first experiment, we set the difficulty value to "0000", selected the 81 most popular websites worldwide, and archived their index pages. We archived these index pages 100 times to simulate periodic archiving. With the BCLinked web archiving system, the total processing time was 3575.08 s, of which 78.67 s was blockchain processing time. The total WARC file size was 120,459,510 bytes, and the blockchain data file size was 1,142,594 bytes. For the traditional web archiving system, the blockchain processing time and the blockchain data size are both zero, because that system uses only the web crawler and no blockchain technology. The minor difference in total WARC file size between the two systems is caused by content updates on the source websites.
In the second experiment, we set the difficulty value to "00000", which makes the proof value harder to find. The total processing time was 5049.47 s, of which 1011.34 s was blockchain processing time. The total WARC file size was 121,408,987 bytes, and the blockchain data file size was 1,143,229 bytes. The results of the traditional web archiving system were very similar to those of the first experiment, because the difficulty property is used only by the BCLinked web archiving system.
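To put these figures in proportion, the short script below computes the share of blockchain processing in the total time and the size of the blockchain data relative to the WARC files. The numbers are taken directly from the two experiments above; the dictionary layout and function name are only for illustration.

```python
experiments = {
    "0000": {
        "total_s": 3575.08,
        "blockchain_s": 78.67,
        "warc_bytes": 120_459_510,
        "chain_bytes": 1_142_594,
    },
    "00000": {
        "total_s": 5049.47,
        "blockchain_s": 1011.34,
        "warc_bytes": 121_408_987,
        "chain_bytes": 1_143_229,
    },
}


def overhead(result):
    """Return the blockchain share of time and of storage as fractions."""
    return {
        "time_share": result["blockchain_s"] / result["total_s"],
        "size_share": result["chain_bytes"] / result["warc_bytes"],
    }


for difficulty, result in experiments.items():
    shares = overhead(result)
    print(
        f"difficulty {difficulty}: "
        f"blockchain time {shares['time_share']:.1%} of total, "
        f"blockchain data {shares['size_share']:.2%} of WARC size"
    )
```

The blockchain accounts for roughly 2.2% of the total time at difficulty "0000" and roughly 20% at "00000", while the blockchain data stays just under 1% of the WARC file size in both experiments.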
We confirmed that the blockchain processing time per block is uniform regardless of the number of blocks and depends only on the difficulty property, as shown in Figure 11. Hence, archiving new web content incurs no unexpected overhead. A web archiving solution collects web content periodically as the content is updated, and the BCLinked web archiving system requires only a small additional blockchain processing time for each new content item.
We confirmed that the block-digest values of all WARC files were stored in the BCLinked web archiving system, so the block-digest in the blockchain can be used for the content-integrity experiment. To test content integrity, we used a Python script to simulate unauthorized content modification of the archived content. In the BCLinked web archiving system, the script modified a content item, its WARC block-digest and its WARC payload-digest, and then rebuilt the BCLinked blockchain data. In the traditional web archiving system, the script modified a content item, its WARC block-digest and its WARC payload-digest, with no blockchain data involved. As a result, modifying the archived web content took less than 1 s in the traditional web archiving system, because there is no linked relationship to other web content and only the original archived content has to be changed. In the BCLinked web archiving system, however, it took 81.99 s with difficulty 0000 and 998.73 s with difficulty 00000, because all blockchain node data must be rebuilt even if only a single content item is modified.
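The effect described above can be reproduced on a toy chain. The sketch below is not the script used in the experiment: the block layout, the prefix-style difficulty, and all function names are assumptions, and a short difficulty is used so that it runs quickly. It shows that changing one stored digest invalidates the chain and that repairing it requires re-mining the modified block and every later block.

```python
import hashlib

DIFFICULTY = "00"  # short prefix for a fast demonstration (assumed scheme)
GENESIS = "0" * 64


def mine(previous_hash, digest):
    """Find a nonce so that the block hash starts with DIFFICULTY."""
    nonce = 0
    while True:
        h = hashlib.sha256(f"{previous_hash}{digest}{nonce}".encode()).hexdigest()
        if h.startswith(DIFFICULTY):
            return nonce, h
        nonce += 1


def build_chain(digests):
    chain, previous = [], GENESIS
    for digest in digests:
        nonce, h = mine(previous, digest)
        chain.append({"digest": digest, "previous_hash": previous,
                      "nonce": nonce, "hash": h})
        previous = h
    return chain


def first_invalid(chain):
    """Return the index of the first inconsistent block, or None if valid."""
    previous = GENESIS
    for i, b in enumerate(chain):
        h = hashlib.sha256(
            f"{b['previous_hash']}{b['digest']}{b['nonce']}".encode()
        ).hexdigest()
        if b["previous_hash"] != previous or h != b["hash"] \
                or not h.startswith(DIFFICULTY):
            return i
        previous = b["hash"]
    return None


def rebuild_from(chain, index):
    """Re-mine block `index` and all later blocks; return how many were redone."""
    previous = chain[index]["previous_hash"]
    for b in chain[index:]:
        b["previous_hash"] = previous
        b["nonce"], b["hash"] = mine(previous, b["digest"])
        previous = b["hash"]
    return len(chain) - index


chain = build_chain(["sha1:a", "sha1:b", "sha1:c", "sha1:d"])
chain[1]["digest"] = "sha1:tampered"      # simulate unauthorized modification
broken_at = first_invalid(chain)          # tampering is detected at block 1
redone = rebuild_from(chain, broken_at)   # blocks 1, 2 and 3 must be re-mined
```

The closer the modified block is to the start of the chain, the more blocks have to be re-mined, which is why the rebuild time grows with both the chain length and the difficulty.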
The processing time of web content archiving can be expressed as shown in Figure 12. The overhead time in the BCLinked web archiving system is $N \times B_t$. The time complexity of both $S_{a1}$ and $S_{a2}$ is $O(n)$, and $B_t$ is much smaller than $C_t$. Therefore, this overhead is a reasonable price for the benefit gained from the blockchain.
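One way to write the relation described above is given below. It is a reading inferred from the surrounding text, not a reproduction of Figure 12: it assumes that $S_{a1}$ is the archiving time of the traditional system, $S_{a2}$ that of the BCLinked system, $N$ the number of archived web contents, $C_t$ the crawling time per content, and $B_t$ the blockchain processing time per content.

```latex
\begin{align}
S_{a1} &= N \times C_t \\
S_{a2} &= N \times (C_t + B_t) \\
S_{a2} - S_{a1} &= N \times B_t
\end{align}
```

Under these assumptions both expressions grow linearly in $N$, which matches the stated $O(n)$ complexity, and the difference is exactly the overhead $N \times B_t$.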
The processing time of web content modification can be expressed as shown in Figure 13. The overhead time in the BCLinked web archiving system lies mostly in updating all linked blockchain data. The worst case is an attempt to modify the web content at the beginning of the blockchain data: all linked blockchain data and extended defined-fields must be revised in this case, and $T$ is the total number of blockchain data entries. The time complexity of $S_{a2}$ is $O(n^2)$, while the time complexity of $S_{a1}$ is $O(n)$. This shows that, with the BCLinked archiving system, unauthorized content modification is almost impossible in the real world because of the massive processing-time overhead.
In conclusion, we confirmed that the BCLinked web archiving system provides enhanced content integrity with a small overhead compared with the traditional web archiving system.


Conclusions
In this paper, we proposed the BCLinked web archiving system, which preserves content integrity by using blockchain technology and extending the WARC specification. Blockchain technology links each block to the previous one by cryptography to secure all data in the chain, and it is a decentralized system that can be shared over a peer-to-peer network. In the BCLinked web archiving system, the block-digest value of a WARC file, which identifies content integrity, is stored in the blockchain. The BCLinked web archiving system can therefore address the weakness of current web archiving in securing content integrity. A user who retrieves web content from the BCLinked web archiving system can also obtain confirmation of its integrity. This system will thus be useful for an enterprise or a government that generates personalized web content for customers, because it can serve as proof in legal disputes. However, we developed the system for the experiments with a simple blockchain written in Python. This is sufficient for a proof of concept, but further research with more stable and proven blockchain platforms such as Ethereum is needed. In future research, we will study a more efficient BCLinked web archiving system in a production environment built on such a proven blockchain platform.