A Framework exploring the balance between Blockchain and IPFS

The current state of the Web, which is dominated by centralized cloud services, raises several concerns regarding governance, privacy, surveillance, and security. A way to address these issues is to decentralize the platforms by adopting new distributed technologies, such as IPFS and Blockchain, which follow a full peer-to-peer model. This work proposes a set of guidelines to design decentralized systems, taking into consideration the different trade-offs these technologies face with regard to their consistency requirements. These guidelines are then illustrated with the design of a decentralized question and answer system. This system serves to illustrate a framework for creating decentralized services and applications that uses IPFS and Blockchain technologies and incorporates the discussion and guidelines of the paper, providing solutions for data access, data provenance and data discovery. Thus, this work proposes a framework for the design of decentralized systems and contributes a set of guidelines to decide in which cases Blockchain technology may be required, and when other technologies, such as IPFS, are sufficient.


Introduction
Nowadays, centralized cloud web services represent an increasingly large portion of the Internet [15]. This trend has accelerated significantly since the emergence of the Web 2.0 model [50], in which web applications enabled user participation and user-generated content. Thus, today's Internet activity is concentrated on highly successful web services that dominate their respective markets [17,24]. In recent years, concerns have grown about the multiple issues this situation raises with respect to, e.g., privacy [44], governance [20,17], legislation [15], surveillance [36] or security [30]. Consequently, there have been several proposals to tackle some of these issues through new legislation [8,51] or through recommendations for platform developers [26]. In parallel, these issues have triggered the emergence of a wide range of technical solutions through different forms of decentralization.
We may divide the proposed decentralized solutions into three waves. The first wave has been through "federated" technology [10,1,49], i.e. multiple central nodes communicating with each other, where users are free to choose the node to interact with. E-mail is a classic example of an open protocol which is federated, together with the more recent XMPP for chatting [45], OStatus for microblogging [52], ActivityPub for social networking [54], OAuth for authentication [23], or SwellRT for real-time collaboration [40]. This approach is based on interoperability across services and servers [10,55,46]. However, many of these technologies are still hindered by several drawbacks, such as the existence of points of failure [42] and control [34], or the lack of interoperability of the data beyond a few applications [46,49].
The second wave of decentralized solutions has been achieved through fully distributed technology, i.e. P2P networks without classical servers but instead ordinary computers (different from classical cluster/grid parallel computing). There have been multiple attempts to offer P2P web services [41,31], such as Freenet for censorship-resistant communication [11], although broad adoption was mostly limited to the field of file-sharing, e.g. eDonkey, BitTorrent [13].
The third wave appeared when some unresolved technical challenges with P2P solutions [33,53] became more evident. This opened the door to a new generation of solutions, most of them relying on cryptographic hashes organized in Merkle trees [37]. The advent of the first fully decentralized digital currency, Bitcoin [39], triggered a plethora of decentralized solutions based on its underlying technology, the Blockchain. In addition, another groundbreaking technology emerged around P2P storage: IPFS, or Inter-Planetary File System [3]. These two new decentralized technologies, often combined, enable a wide range of applications [18,4,19]. Furthermore, CRDT technology [47] enabled real-time collaboration for P2P systems.
Exploring the synergies of these technologies may unveil new decentralization possibilities. IPFS is frequently used as decentralized storage for blockchain applications. However, other non-trivial combinations of these technologies may enable new decentralized system designs. Therefore, there is a need for frameworks and models to explore the limitations and synergies of these recent innovations. This work proposes a combination of IPFS and Blockchain technologies for the design and implementation of open distributed systems. Concretely, it presents the trade-offs decentralized technologies face, and proposes design guidelines to assess the adequacy of the different technologies considered.
The rest of the paper is structured as follows. Section 2 defines the characteristics of the distributed systems considered. Then, Section 3 introduces the decentralization technologies used. Section 4 discusses the trade-offs of open distributed systems design, discusses the tensions and approaches for consistency in such systems and provides design guidelines to assess whether a system may require the use of blockchain technology. Afterwards, Section 5 applies the discussions and design guidelines of the previous sections to propose a distributed system design, using a distributed Question and Answer (Q&A) system as an example. The conclusions follow in Section 6.

Open Peer to Peer Systems
The purpose of this work is to provide a framework and set of guidelines that can facilitate the design of open Peer to Peer services and applications, whose management and governance are decentralized. It focuses on open and fully distributed peer to peer systems. The characteristics of these systems are further explored in the following subsections.

Openness
Open systems should provide means for autonomous agents to enter the system, interact with each other, and leave it.
The concept of open system has long been widely applied in computing and telecommunications (see for instance standardization efforts such as the OSI model [56]). Its main idea is that services (with well specified interfaces) can be provided by different entities with their own implementation. An open system, therefore, specifies the means for communication of its entities, which can enter, interact and leave [16,25].
The evolution of the open system is therefore highly dynamic, which makes it quite complex to have complete knowledge of the whole system state at any time. Entities only have partial knowledge of their environment (the open system) and the only thing all of them hold in common is the ability to communicate with each other [25]. In this sense, the paradigm of multi-agent systems (MAS), which assumes as fundamental the autonomy and the ability to communicate of distributed entities, the agents, is a proper model for the development of open systems. An agent is an autonomous entity, with the assumption that its knowledge of the world is partial [22], so it tries to make the best decision (principle of rationality [48]), and interacts with other agents.

Peer to Peer Full Distribution
Fully distributed Peer to Peer systems are composed of a network of interconnected agents that communicate and coordinate their actions without a central control entity.
Systems such as the Web and P2P file sharing programs are distributed systems composed of web servers and of computers sharing files, respectively [5,43]. While centralized systems depend on a single component for their operation, distributed systems are resilient to the disconnection of some of their components, e.g., if a web server is disconnected, the Web will still be a functional system. However, some distributed systems still depend on single components for parts of the system to work. For instance, if a web server disconnects, its web pages will become unavailable. This work uses the term peer-to-peer systems to refer to distributed systems that are independent from any single node.

Decentralization Technologies
The proposal relies on the blockchain [39] and IPFS [3] decentralization technologies. This section describes these technologies and some of their underlying concepts and properties, such as content-addressability and Merkle-linked structures.
Content Addressability: In centralized and federated systems, content is frequently referred to by addresses that include location information, the Uniform Resource Locators (URLs) [6]. However, references to content can also be independent of their location, using Universal Resource Identifiers (URIs) [27]. In peer-to-peer systems, agents cannot rely on the location of other agents for accessing content, because the content could be provided by any agent. Instead, the hash of any content can be used as its URI. Thus, these hash URIs are used in multiple distributed systems such as IPFS to build scalable content-addressable networks [43,38,28,3].
Merkle Links and Structures: The use of hash values (see previous subsection) to reference data in data structures was first introduced in 1987 by Merkle [37]. Complex data structures can use these links (see Figure 2 for an example). These Merkle-linked structures are key to building technologies such as Git [35], Blockchain [39] and IPFS [3], among others. Section 5.2 proposes the use of these structures for the data representation of the system.
Blockchain: Blockchain was the first technology that enabled a fully distributed digital currency (Bitcoin) [39], solving the double-spending problem in distributed systems. It uses a Merkle-linked list of blocks of transactions (a blockchain) to build a distributed ledger of transactions. To address the double-spending problem, it makes it computationally difficult to propose a candidate for the next block in the distributed ledger and incentivizes nodes to propose blocks with valid transactions. Then, the protocol considers the longest observed chain to be the ledger to trust. Therefore, in order to forge a blockchain, an actor would need half of the computing power of the system, which secures the consistency of the data recorded in the ledger. Section 4.4 proposes the use of Blockchain to provide consistency to open distributed systems.
IPFS: Some peer-to-peer systems, like P2P file sharing software [43], use the hash of the content to address it. Other technologies, such as Git, use complex Merkle-linked structures [35]. IPFS combines the use of complex Merkle-linked structures with the data-addressability of P2P file sharing systems. The content is distributed over a peer-to-peer network. Section 5.1 proposes the use of IPFS for the storage and distribution of data in the framework.
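To make content addressability and Merkle links more concrete, the following is a minimal sketch, not the actual IPFS format. It assumes SHA-256 over a canonical JSON encoding as the identifier; the names `content_id` and `ContentStore` are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def content_id(obj: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of obj."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class ContentStore:
    """Toy content-addressable store: a key is always the hash of its value."""

    def __init__(self) -> None:
        self._blocks = {}

    def put(self, obj: dict) -> str:
        cid = content_id(obj)
        self._blocks[cid] = obj
        return cid

    def get(self, cid: str) -> dict:
        obj = self._blocks[cid]
        # Any peer could have served this block, so re-hash it before trusting it.
        if content_id(obj) != cid:
            raise ValueError("block does not match its identifier")
        return obj


store = ContentStore()
question = store.put({"type": "question", "text": "What is a Merkle link?"})
# A Merkle link is simply the identifier of another block, stored as a value.
answer = store.put({"type": "answer", "text": "A hash reference.", "question": question})

assert store.get(answer)["question"] == question
```

Because the identifier is derived from the content, the reader does not need to trust whoever delivers a block: any modification changes the hash and is detected in `get`.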

Design Trade-offs of Distributed Open Systems
The design of decentralized open systems faces some challenges. Unlike centralized systems, they lack a single entity deciding on the consistency of the system and a complete view of the system state. This section frames these challenges, first introducing the CAP Theorem, which describes the compromises between data consistency, availability, and partition resistance of distributed systems (Subsection 4.1); then presenting how the CALM Principle provides tools to discover if an open distributed system needs coordination technology for consistent behaviour (Subsection 4.2); next, introducing how Conflict-free Replicated Data Types (CRDTs) provide a solution to achieve eventual consistency for these systems without needing coordination technologies (Subsection 4.3); and finally explaining how blockchain enables such coordination while preserving decentralization when CRDTs cannot be used or the system has stronger consistency requirements (Subsection 4.4). This section also provides guidelines to support the design of open distributed systems.

CAP Theorem
The CAP Theorem [7] states that a networked data system can only hold two of these three desirable properties:
1. Consistency: The requests to the distributed system behave as if handled by a single node with updated information.
2. Availability: Every request should receive a response.
3. Partition resistance: The system is able to operate in the presence of network partitions.
Given that the framework considers open systems where agents with partial information can join or leave at any moment, Partition resistance is a needed property for our proposal. Therefore, one of the most important design decisions for the systems built within the framework is to find the best balance between Consistency and Availability.

CALM Principle
Some queries are impossible to resolve in distributed open systems. Intuitively, in a distributed open system some data may not be accessible. Therefore, queries that need to take into account all the information of the system, such as those counting the data that satisfy some constraints (e.g. counting the exact number of web pages that include a certain word), are impossible to resolve.
The Consistency As Logical Monotonicity (CALM) principle describes those queries that can be resolved in a distributed system without coordination [2]. A system is considered logically monotonic if the truth of a given statement cannot change by considering new information. In such systems, the responses to distributed queries are consistent.
The designer of a distributed system can check the monotonicity of its queries as follows:
Order independence: Order independence is a needed condition for logical monotonicity [2], i.e. if the system behaviour depends on the order in which the information is received, then it is non-monotonic. For instance, in the double-spending problem, where an agent tries to spend "the same coin" twice, the state depends on which payment was done first. Therefore, it is a non-monotonic problem.
Monotonicity: By definition, if new information may revoke a previously valid response to a query, the query is non-monotonic. For instance, counting the number of positive votes for an answer in a Q&A system is non-monotonic, since new votes would change the response.
Formal analysis: Formal analysis can prove the logical monotonicity of a system [2].
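The first two checks can be illustrated with a small sketch. It is an informal test over sample data, not a proof of monotonicity, and all function names are hypothetical. It replays votes one at a time to see whether a true answer is ever revoked, and it tries every arrival order of two conflicting payments of the same coin.

```python
from itertools import permutations


def has_positive_vote(votes):
    """Monotonic: once some positive vote is known, more votes cannot undo it."""
    return any(v > 0 for v in votes)


def has_exactly_two_positive_votes(votes):
    """Non-monotonic: a third positive vote revokes a previously true answer."""
    return sum(1 for v in votes if v > 0) == 2


def stays_true_once_true(query, facts):
    """Replay facts one by one and report whether a True answer is ever revoked."""
    seen_true = False
    for i in range(len(facts) + 1):
        answer = query(facts[:i])
        if seen_true and not answer:
            return False
        seen_true = seen_true or answer
    return True


def accepted_payees(balance, payments):
    """Accept payments in arrival order while the balance allows it."""
    accepted = []
    for payee, amount in payments:
        if amount <= balance:
            balance -= amount
            accepted.append(payee)
    return accepted


def is_order_independent(process, items):
    """Compare the outcome of process over every arrival order of items."""
    outcomes = {tuple(sorted(process(list(p)))) for p in permutations(items)}
    return len(outcomes) == 1


votes = [1, 1, 1]
print(stays_true_once_true(has_positive_vote, votes))               # True
print(stays_true_once_true(has_exactly_two_positive_votes, votes))  # False

double_spend = [("alice", 1), ("bob", 1)]  # the same single coin spent twice
print(is_order_independent(lambda p: accepted_payees(1, p), double_spend))  # False
```

The exact-count query and the double-spending example both fail their check, which matches the classification given in the text.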
In distributed open systems, non-monotonic queries may produce inconsistent results without a coordination mechanism. Thus, in the presence of non-monotonic queries, the designer should decide on the consistency requirements of the system.

Guideline 1 Monotonic queries can be consistently resolved in open distributed systems without coordination technologies
Thus, in the presence of network partitions, choosing perfect consistency over availability can be implemented without coordination using logically monotonic systems. If inconsistent behaviour, like missing some votes in a Q&A system, is acceptable for the system, then coordination mechanisms are still not needed.
Guideline 2 Consistency requirements are a design decision. If inconsistent behaviour is acceptable for the non-monotonic queries of the system, coordination technologies are not required for open distributed systems.
Moreover, some non-monotonic open distributed systems may achieve eventual consistency without coordination, as explored in the next subsection.

Eventual Consistency
Eventual consistency is defined as consistency among the nodes of a distributed system once all the messages have been delivered. The Conflict-free Replicated Data Types (CRDTs) proposal enables eventual consistency without coordination mechanisms such as reaching consensus or rolling back [47]. These data types can be defined by the properties of their operations. A data type is said to be a CRDT if its possible concurrent operations are commutative. Note that with eventual consistency, statements that are considered true at a given time can become false after receiving new messages. Thus, this consistency may not be sufficient for systems with strong consistency requirements such as crypto-currencies.
CRDTs guarantee eventual consistency once all the messages have been delivered. Different systems may tolerate different delays of these messages. For instance, while a Q&A system may ignore a vote for a long period of time, for a collaborative document, incorporating relatively old updates may be problematic, regardless of the eventual consistency.
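As an illustration of the idea, the sketch below implements a vote counter as a state-based PN-Counter, a well-known CRDT. This is one possible formulation and an assumption of this example: instead of commuting individual operations, replicas exchange their whole state and merge it with a per-replica maximum, which is commutative, associative and idempotent. The class name `VoteCounter` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VoteCounter:
    """State-based PN-Counter: each replica only increments its own slots."""

    replica_id: str
    ups: dict = field(default_factory=dict)
    downs: dict = field(default_factory=dict)

    def upvote(self) -> None:
        self.ups[self.replica_id] = self.ups.get(self.replica_id, 0) + 1

    def downvote(self) -> None:
        self.downs[self.replica_id] = self.downs.get(self.replica_id, 0) + 1

    def value(self) -> int:
        return sum(self.ups.values()) - sum(self.downs.values())

    def merge(self, other: "VoteCounter") -> None:
        """Take the per-replica maximum of both states."""
        for key, count in other.ups.items():
            self.ups[key] = max(self.ups.get(key, 0), count)
        for key, count in other.downs.items():
            self.downs[key] = max(self.downs.get(key, 0), count)


a, b = VoteCounter("a"), VoteCounter("b")
a.upvote()
a.upvote()
b.downvote()
print(a.value(), b.value())  # 2 -1  (replicas disagree before exchanging state)

a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())  # 1 1  (replicas converge)
```

Before the merge the two replicas report different totals, which is the temporary inconsistency that eventual consistency accepts in exchange for availability.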

Blockchain for distributed consistency
Some non-monotonic problems, such as the double-spending problem in distributed currencies, require strong consistency. Thus, a coordination mechanism is needed to provide that consistency. Blockchain technology enabled the implementation of Bitcoin [39], the first distributed digital currency. It proposed a fully distributed coordination mechanism to establish a consensus on the order of the valid transactions. Thus, it provided consistency to a non-monotonic problem in a fully decentralized system. Indeed, blockchain can be used to provide consistency to other non-monotonic systems, by establishing a consensus on the order in which the information should be considered.
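The ordering mechanism can be sketched with a toy hash-linked chain. This is a deliberately simplified illustration under stated assumptions, not the Bitcoin protocol: the difficulty is a fixed two-character prefix, transactions are plain strings, there are no signatures or incentives, and all names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64
DIFFICULTY_PREFIX = "00"  # toy difficulty: about 256 attempts per block


def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode("utf-8")).hexdigest()


def mine_block(prev_hash: str, transactions: list) -> dict:
    """Search for a nonce whose block hash meets the toy difficulty."""
    nonce = 0
    while True:
        block = {"prev": prev_hash, "transactions": transactions, "nonce": nonce}
        if block_hash(block).startswith(DIFFICULTY_PREFIX):
            return block
        nonce += 1


def is_valid_chain(chain: list) -> bool:
    """Check the Merkle links between blocks and the work behind each block."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev or not block_hash(block).startswith(DIFFICULTY_PREFIX):
            return False
        prev = block_hash(block)
    return True


def choose_chain(candidates: list) -> list:
    """Adopt the longest valid chain among the observed candidates."""
    valid = [chain for chain in candidates if is_valid_chain(chain)]
    return max(valid, key=len, default=[])


# Two conflicting histories of the same coin.
first = mine_block(GENESIS, ["coin -> alice"])
chain_a = [first]
other = mine_block(GENESIS, ["coin -> bob"])
chain_b = [other, mine_block(block_hash(other), ["unrelated payment"])]

ledger = choose_chain([chain_a, chain_b])
print([block["transactions"] for block in ledger])
# [['coin -> bob'], ['unrelated payment']]
```

Every node applying the same selection rule to the same candidates settles on the same ordered history, so only one of the two conflicting payments is recorded.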

Guideline 4 The non-monotonic queries of an open distributed system with strong consistency requirements should be supported by a coordination technology such as Blockchain.
The guidelines are summarized in Figure 1.
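One way to read the guidelines is as a decision procedure applied to each query of the system. The sketch below is an interpretation of the text rather than a reproduction of the figure; the function name, its parameters, and the mapping of CRDTs to Guideline 3 (as that guideline is cited in Sections 5 and 6) are assumptions of this example.

```python
from enum import Enum


class Approach(Enum):
    NO_COORDINATION = "no coordination technology needed"
    CRDT = "eventual consistency through CRDTs"
    BLOCKCHAIN = "coordination through a blockchain"


def recommend(monotonic: bool,
              inconsistency_acceptable: bool,
              eventual_consistency_sufficient: bool,
              operations_commute: bool) -> tuple:
    """Return (approach, guideline number) for one query of the system."""
    if monotonic:
        return Approach.NO_COORDINATION, 1
    if inconsistency_acceptable:
        return Approach.NO_COORDINATION, 2
    if eventual_consistency_sufficient and operations_commute:
        return Approach.CRDT, 3
    return Approach.BLOCKCHAIN, 4


# A non-monotonic query that needs strong consistency, such as double spending.
print(recommend(False, False, False, False))
```

The order of the checks reflects the idea that coordination is the most expensive option and should only be chosen when none of the cheaper alternatives satisfies the requirements.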

Designing a Distributed Question and Answers system
In this section, the trade-offs and design guidelines introduced in this paper are presented through a running example of a simple Q&A system, such as the well-known Stack Overflow. The balance between availability and consistency in the system is discussed, and the need for blockchain technology is assessed.
The proposed system architecture relies on IPFS for fully distributed data storage, public-key identities for data provenance, and a peer-to-peer network for communication. This section introduces how data access, data provenance and data discovery are provided by the proposal.

Accessing data
In centralized Q&A systems such as Stack Overflow, the data is addressed and accessed using a location-centric model, i.e. a server is responsible for providing the data. For instance, a user may search for responses to a programming problem on the Stack Overflow website.
The use of content-addressable models for data access provides a fully distributed alternative. Our architecture relies on the IPFS network to distribute the data as Merkle-linked structures. These data structures provide both a Merkle-linked structure and data-addressability [3]. Concretely, the data in the system is composed of key-value records and of named directed Merkle links to other data (as depicted in Figure 2). This data may be provided by any agent of the system.

Data provenance
In centralized and federated systems, the trustworthiness of the data is provided through direct connection to trusted servers, e.g. the user of a centralized Q&A system trusts a server not to hide or alter the information of the system. Fully decentralized alternatives can also be considered to obtain trustworthy data.
We propose the use of asymmetric cryptography identities to provide trustworthy provenance of data. Data that is digitally signed by trusted identities is trusted in the system. Following the technological choices of the architecture, the IPNS [3] or Ethereum [9] identity infrastructure can be used. Following our Q&A system example, every question, answer and vote is digitally signed by its author. Replicating the behaviour of Stack Overflow, every user can submit questions and answers to the system. Thus, every signed question or answer is considered valid. A simple version of the system may consider every question, answer and vote valid, thus having weak consistency requirements. Such a system would not need coordination technologies (Guideline 2) to work. However, systems such as Stack Overflow implement strategies to avoid system abuse; for instance, Stack Overflow only allows users with at least 15 reputation points to vote. Five reputation points are earned with each positive vote to a question or answer. Thus, to implement such a strategy, our system should only allow the votes of users with at least 3 positive votes. Since these votes also have to be valid, the vote verification is recursive, until it reaches a trusted base case, e.g. identities that were initially allowed to vote without reputation in the system.
Fig. 2 Merkle linked data of an example Q&A system (such as Stack Overflow)
If negative votes are not considered in the system, answering whether a vote is valid is a monotonic problem. Thus, it can be implemented in a distributed system with strong consistency without coordination mechanisms (Guideline 1). However, the recursive nature of the example shows how the size and complexity of the data needed to trust a response may not be trivial.
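The recursive validity check for positive votes can be sketched as follows. Instead of literal recursion, this example uses a fixed-point iteration that starts from the trusted base identities, which reaches the same result while avoiding infinite loops when users vote for each other. It assumes that signatures have already been verified and that each vote is a `(voter, author)` pair; these representations and all names are hypothetical.

```python
from collections import Counter

MIN_VALID_VOTES = 3  # from the text: 15 reputation points at 5 points per vote


def eligible_voters(seed_identities: set, votes: list) -> set:
    """Grow the set of identities allowed to vote until nothing changes."""
    eligible = set(seed_identities)
    while True:
        # Only votes cast by already eligible identities count as valid.
        received = Counter(author for voter, author in votes if voter in eligible)
        newly = {a for a, n in received.items() if n >= MIN_VALID_VOTES} - eligible
        if not newly:
            return eligible
        eligible |= newly


def is_valid_vote(vote: tuple, seed_identities: set, votes: list) -> bool:
    voter, _author = vote
    return voter in eligible_voters(seed_identities, votes)


seeds = {"s1", "s2", "s3"}
votes = [
    ("s1", "alice"), ("s2", "alice"), ("s3", "alice"),   # alice earns the right to vote
    ("alice", "bob"), ("s1", "bob"), ("s2", "bob"),      # bob depends on alice's vote
    ("carol", "dave"), ("carol", "dave"), ("carol", "dave"),
    ("dave", "carol"), ("dave", "carol"), ("dave", "carol"),  # mutual votes, no trusted base
]
print(sorted(eligible_voters(seeds, votes)))
# ['alice', 'bob', 's1', 's2', 's3']
```

The set of eligible identities only grows as more positive votes are observed, which reflects the monotonic nature of the problem when negative votes are not considered.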
The consideration of negative votes to questions and answers that would decrease the reputation of the authors adds complexity to the problem. The question of whether an identity has at least 15 reputation points is no longer monotonic, since observing new negative votes may change the results. Fortunately, adding and subtracting values to a number are commutative operations. Thus, following the CRDTs proposal, we could choose availability over consistency, and be able to operate in the system without knowing all the up and down votes, while trusting that eventual consistency will be achieved (Guideline 3).
Furthermore, digital signatures may not be enough to prove authorship in the system. A malicious agent may sign data previously authored by other agents. Deciding then which was the first author becomes a non-monotonic problem that cannot be resolved with strong consistency without coordination.
This problem is similar to the double-spending problem, and could be resolved using blockchain if the designer considers that the system requires such strong consistency (Guideline 4).
Non-monotonic searches (see Section 4.2) with strong consistency requirements, such as getting the exact number of votes of a question, may need the use of a blockchain as a coordination mechanism. For instance, the votes of a Q&A system or the authorship of questions and answers could be registered in a blockchain to provide consistency to those queries. Our architecture proposes the development of smart contracts using Ethereum [9] to provide such consistency for these systems.

Data Discovery using a Trustless Distributed Protocol
To discover data in our open and distributed system, we propose the use of a query protocol. The queries of the system state the constraints that the responses must satisfy. For instance, a question that contains a given text can be searched in a Q&A system. The query can also constrain the structure of the response (e.g. it has more than one answer and more than one positive vote).
Additionally, a score function can be defined to sort the valid responses. For instance, the questions containing some text can be ranked by the number of positive votes.
The protocol interactions (Figure 3) are described as follows:
1. An agent sends a query (with constraints and a score function).
2. Any agent can reply with a response consisting of a content-centric link to the data satisfying the query and its corresponding score.
3. The querying agent accesses the data, and verifies the responses and scores.
This protocol presents the following characteristics:
1. Lightweight communication: Responses consist of a short link and a numeric value. They are thus only a few bytes long, while they may represent large and complex data structures.
2. Early distributed ranking: Responses may be ranked without accessing their data.
3. Trustless ranking and validity: The validity and ranking of the responses can be assessed without trusting the agents providing the responses or the data.
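The three protocol steps and the trustless verification can be sketched in a few lines. This is a simplified, single-process illustration: the network is a dictionary from identifiers to data, the identifier is a SHA-256 hash of a JSON encoding rather than an IPFS link, votes are plain strings that are not themselves validated, and every function name is hypothetical.

```python
import hashlib
import json


def content_id(obj: dict) -> str:
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def constraint(question: dict) -> bool:
    """Query constraint: the question mentions 'merkle'."""
    return "merkle" in question["text"].lower()


def score(question: dict) -> int:
    """Score function: number of positive votes linked from the question."""
    return len(question["positive_votes"])


def respond(local_data: dict) -> list:
    """Step 2: reply with (link, score) pairs for the matching local data."""
    return [(cid, score(obj)) for cid, obj in local_data.items() if constraint(obj)]


def verify(response: tuple, fetch) -> bool:
    """Step 3: fetch the data by its link, then re-check link, constraint and score."""
    cid, claimed_score = response
    obj = fetch(cid)
    return content_id(obj) == cid and constraint(obj) and score(obj) == claimed_score


questions = [
    {"text": "What is a Merkle link?", "positive_votes": ["v1", "v2"]},
    {"text": "How do CRDTs merge?", "positive_votes": ["v3"]},
]
network = {content_id(q): q for q in questions}

responses = respond(network)
inflated = (responses[0][0], 99)  # a dishonest responder exaggerates the score

# Early ranking uses only the claimed scores; verification happens afterwards.
ranked = sorted(responses + [inflated], key=lambda r: r[1], reverse=True)
print([verify(r, network.__getitem__) for r in ranked])  # [False, True]
```

The inflated response ranks first before any data is fetched, but it is rejected as soon as the querying agent recomputes the score from the content itself.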
The protocol can be implemented using: 1) Merkle-linked data distributed over IPFS; 2) JavaScript pure functions to express query constraints and score functions, using the JavaScript implementation of IPFS; and 3) a bus model for distributed systems communication [29] over IPFS pub-sub channels. Thus, it would enable the implementation of distributed open systems without with

Discussion and conclusions
This work introduces the tensions between consistency, availability and partition resistance in current fully distributed systems. It explores the possibilities and limitations of different approaches and technologies, providing guidelines to design these fully distributed systems. The guidelines help to assess whether blockchain technology may be needed for a distributed system. Four guidelines provide alternatives depending on the consistency and availability requirements of the system. The paper claims that these consistency and availability requirements are design decisions, and that some systems may not have strong requirements for either of them, thus not needing advanced technologies to enhance coordination or availability (Guideline 2). For solutions that require strong consistency, logically monotonic systems can provide such consistency without coordination (Guideline 1). However, not all problems are monotonic, and for non-monotonic problems with strong consistency requirements a blockchain is required to provide such consistency and maintain the system decentralization (Guideline 4). For systems with weaker consistency requirements, CRDTs offer an alternative that favors high availability while relaxing the consistency requirements to eventual consistency (Guideline 3).
The paper then presents an architecture, which is illustrated with a running example of a Q&A system. In this proposal, the data is represented as Merkle-linked structures and distributed with IPFS. Asymmetric cryptography provides trust to the data provenance of the distributed system. Ethereum technology is proposed as the blockchain-based coordination framework to support the non-monotonic strong consistency requirements these systems may have. A query communication protocol enables the data discovery in the open distributed system, providing ranked responses and trust-less verification of responses.
This proposal faces some limitations and challenges, like other blockchain-based and distributed technologies, such as privacy [21,14] and sustainability [12]. Furthermore, the design of distributed systems following our proposal should consider security concerns of distributed systems such as Sybil attacks [39] and generation attacks [32]. Still, the sustainability and privacy of decentralized technologies are often better than those of the centralized alternatives [55].
Future work would help to consolidate and validate the contributions of this paper. Studying the efficiency and performance of the system, the proposal and implementation of new applications, the identification of more suitable network topologies and protocols, or the use of specialized agents such as search agents for specific applications, are some of the opportunities to explore.
Decentralization technologies offer an opportunity to solve some of the challenges of the current Internet. This paper has introduced design guidelines and a framework to design and build these systems using the potential of new decentralizing technologies.