To evaluate the performance, scalability, and data integration of G-IDSS, we designed and carried out a series of experiments to validate its core components: P2P overlay, peer communication, DBMS, data querying mechanisms, and data integration. This section outlines the experimental setup, the types of experiments conducted, and the specific validation strategies for each component, focusing on peer communication, overlay maintenance, database operations, query execution and effectiveness, data integration for each operation, and scalability with regard to computing resources.
4.1. Experimental Setup
Experiments were conducted at various scales to validate the prototype as a proof of concept. Four main testing setups were used, depending on the required experiments and computing power. The first setup was a basic GitHub Codespace with 8 GB of memory, 2 Intel(R) Xeon(R) Platinum 8370C CPUs at 2.80 GHz, and 32 GB of disk space, hereafter referred to as the codespace. The second setup was a virtual machine on a PC with 32 GB of RAM, a 10th-generation Intel Core i7-10710U CPU at 1.10 GHz, and 500 GB of SSD storage; the VM instance was based on the Windows Subsystem for Linux (WSL) running Ubuntu 22.04.5 LTS, with 16 GB of memory and 12 logical processors, hereafter referred to as WSL. The third setup was a server machine running Ubuntu 24.04.2 LTS, with two 12-core Intel(R) Xeon(R) CPUs at 2.00 GHz, 64 GB of memory, and 3 TB of disk space, hereafter referred to as gridsurfer. The fourth setup was another high-performance server running Ubuntu 24.04.3 LTS, with a 40-core Intel(R) Xeon(R) W-2495X CPU, 255 GB of memory, and 6 TB of storage, hereafter referred to as datadog. The prototype’s codebase is primarily implemented in Go (version 1.21) for the P2P overlay and database functionality, and its repository is available on GitHub (
https://github.com/cafaro/IDSS/tree/main accessed on 17 December 2025). A bash script was used to simulate the peers on an overlay, and Python was used to generate synthetic data to populate the graph storage. Protobuf was used for message serialisation in the stream-based communication between peers. A specific release related to the test results reported in this manuscript is available on GitHub (
https://github.com/Lunodzo/idss_graphdb/releases/tag/gidssv01 accessed on 17 December 2025).
The experiments were conducted by running multiple G-IDSS peers and clients, with each G-IDSS peer hosting its own dataset. Clients were run in different configurations to test G-IDSS’s ability to handle multiple client requests and respond accordingly. During the experiments, the number of peers varied from 10 to 10,000 to assess the system’s scalability, with each peer running an instance of the EliasDB graph database and the libp2p-based P2P stack. Synthetic data populated each peer’s database with 10 client nodes, 100 consumption nodes, and 1000 belongs_to edges, unless specified otherwise in the experiments testing scalability. In those scenarios, we generated only two graph nodes per peer, allowing the compute resources to be spent on accommodating an increasing number of peers rather than on synthetic data generation. The Kademlia DHT was initialised with a subset of peers as bootstrap nodes, and the Noise protocol ensured secure communication among peers. Performance metrics were collected using the built-in pprof server (localhost:6060) and custom logging (go-log/v2). Each experiment was repeated several times to draw concrete conclusions, and the best observed performance was recorded.
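A minimal sketch of the per-peer stack used in these experiments is given below, based on the public go-libp2p and go-libp2p-kad-dht APIs: a host secured with the Noise protocol, a Kademlia DHT bootstrapped from a subset of peers, and the pprof endpoint on localhost:6060. The snippet is illustrative and does not reproduce the actual G-IDSS source; in particular, bootstrapAddrs is a placeholder that would be filled with the multiaddresses of already running peers.

package main

import (
    "context"
    "log"
    "net/http"
    _ "net/http/pprof" // exposes profiling endpoints on the default HTTP mux

    "github.com/libp2p/go-libp2p"
    dht "github.com/libp2p/go-libp2p-kad-dht"
    "github.com/libp2p/go-libp2p/core/peer"
    "github.com/libp2p/go-libp2p/p2p/security/noise"
)

func main() {
    ctx := context.Background()

    // Performance metrics: pprof server on localhost:6060, as used in the experiments.
    go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()

    // Create a libp2p host whose streams are secured with the Noise protocol.
    h, err := libp2p.New(
        libp2p.ListenAddrStrings("/ip4/127.0.0.1/tcp/0"),
        libp2p.Security(noise.ID, noise.New),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer h.Close()

    // Placeholder: a subset of already running peers used as bootstrap nodes.
    var bootstrapAddrs []peer.AddrInfo

    // Initialise the Kademlia DHT and populate the routing table.
    kdht, err := dht.New(ctx, h,
        dht.Mode(dht.ModeServer),
        dht.BootstrapPeers(bootstrapAddrs...),
    )
    if err != nil {
        log.Fatal(err)
    }
    if err := kdht.Bootstrap(ctx); err != nil {
        log.Fatal(err)
    }

    log.Printf("peer %s joined the overlay", h.ID())
    select {} // keep the peer running
}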
4.2. Overlay Establishment
In this experiment, the study primarily focused on two key metrics: peer communication and its scalability. During all experiments, peers were able to efficiently create their IDs and overlay properties and then join the G-IDSS overlay (which is launched with its own service tag and protocol). The number of peers in the experiments ranged from 10 to 100 on the codespace, up to 200 on WSL, up to 1500 on gridsurfer, and up to 10,000 on datadog. While scalability test reports for libp2p and other P2P libraries are scarce in the literature, other (unreviewed) experiments using the IPFS DHT with Go-libp2p reportedly launched up to 1000 concurrent peer connections, after which streams were automatically reset. Some libp2p implementations written in Rust report that nodes can maintain up to 10,000 peer connections, including more than 1500 validator connections (
https://github.com/libp2p/rust-libp2p/discussions/3840?utm accessed on 17 December 2025). Currently, there are notable research works that have proposed approaches to enable DHTs to scale to millions of peers [
49]. This implies that scaling to 100,000 nodes with libp2p is viable if the per-node peer degree remains around 100 connections. It is worth noting that limited computational resources can constrain scalability when experimenting with P2P overlays on a single machine. However, with peer discovery mechanisms (mDNS, DHT, and bootstrap nodes), scalability can be further improved because peers do not need to maintain connections to every other peer. Existing scalability evaluations of well-known P2P overlays demonstrate substantial progress, though largely through controlled studies. Early structured overlays, such as Chord, have been shown to scale to approximately 10,000 peers using a P2P simulator [
50], while Pastry and the gossip-enhanced DHT-like system Kelips report scalability up to 100,000 simulated peers [
51,
52]. More recent Kademlia-based deployments—including the BitTorrent Mainline DHT and related variants—have been evaluated at scales of more than a million nodes, either through large-scale simulations or real-world measurements [
53,
54].
In our experiments, the observed limits on the number of peers we could launch were due to resource exhaustion during the launch process. This improves if the launch is performed in chunks, especially when launching more than 1000 peers. Maintaining an overlay does not require significant computational resources, mainly because we set the routing table to refresh once every hour and each peer maintains connections to only a subset of the peers in the overlay. libp2p also uses a routing-table bucket size of 20 peers, which is configurable depending on the requirements; the computational cost therefore grows with the chosen DHT configuration. Important parameters to consider in such scenarios include the number of peers in the overlay, periodic discovery, routing-table updates, liveness checks, the handling of new connections, and the data exchanged while populating the overlay. Furthermore, since communication is encrypted, key exchange operations should also be considered.
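For reference, the two parameters mentioned above map directly onto documented go-libp2p-kad-dht options (a bucket size of 20 and an hourly routing-table refresh). The snippet below is an illustrative configuration sketch rather than the actual G-IDSS initialisation code.

package main

import (
    "context"
    "log"
    "time"

    "github.com/libp2p/go-libp2p"
    dht "github.com/libp2p/go-libp2p-kad-dht"
)

// newTunedDHT creates a DHT whose routing-table behaviour matches the
// configuration described above: buckets of 20 peers, refreshed every hour.
func newTunedDHT(ctx context.Context) (*dht.IpfsDHT, error) {
    h, err := libp2p.New()
    if err != nil {
        return nil, err
    }
    return dht.New(ctx, h,
        dht.BucketSize(20),                       // k-bucket size (the libp2p default)
        dht.RoutingTableRefreshPeriod(time.Hour), // periodic routing-table refresh
    )
}

func main() {
    if _, err := newTunedDHT(context.Background()); err != nil {
        log.Fatal(err)
    }
}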
The observed failures were also due to the synthetic data generation process rather than peer communication itself. This was the case in our experiments because, for a peer to join the overlay, we required it to have a graph data manager with data loaded into it, which can be a slow process depending on the data size. Hence, to test the maximum number of peers we could launch, we set synthetic data generation to the minimum number of nodes and edges.
4.3. Database Management and Data Querying
This experiment aimed to test the tool’s capabilities for data management in terms of querying and integration in a decentralised fashion. For convenience, experiments on data querying and integration were conducted with 10–100 peers on WSL, 10–1000 peers on gridsurfer, and 100–5000 peers on datadog, comprising between 10 and 5 million graph nodes in total, to demonstrate that decentralised graph data can be queried and integrated in a common format.
Each peer is required to host a graph DBMS before joining the overlay. Given the number of peers, this was successful in all experiments, and all peers involved in testing had both the DBMS and graph data loaded. The results described in this section are based on query performance in the context of P2P distributed databases.
G-IDSS implements all basic queries supported by the EQL syntax. Since their retrieval patterns involve fetching nodes under conditions built from comparison operators such as =, !=, <, and >, string operators such as contains and like, and arithmetic operators, running such queries in distributed environments requires only simple aggregation of results among peers. Example queries include the following:
get Client;
get Consumption traverse ::: where name = “Alice”;
lookup Client ’3’ traverse :::;
get Client traverse owner: belongs_to: usage: Consumption;
get Client show name.
The traversal expressions indicated by ::: provide a way to traverse edges and fetch nodes and edges that are connected to each other (analogous to relationships in relational databases). The traversal statements may be followed by relational operators to filter nodes/edges that satisfy a given condition, as in the second example above, where the traversed nodes are filtered by name.
EQL comes with a built-in COUNT function, which can be useful in data querying. For instance, it can be used to fetch nodes that have a certain number of connections (relationships). In this regard, when the count threshold is set to greater than zero, the query returns all nodes in the graph that have at least one connection, i.e., every connected node. Possible queries are presented in
Table 5.
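For completeness, the sketch below shows how a peer can evaluate such an EQL query locally through EliasDB’s Go query package (eql.RunQuery) before the per-peer results are aggregated across the overlay. The in-memory storage, the “main” partition, and the sample node are illustrative and do not reproduce the actual G-IDSS code.

package main

import (
    "fmt"
    "log"

    "devt.de/krotik/eliasdb/eql"
    "devt.de/krotik/eliasdb/graph"
    "devt.de/krotik/eliasdb/graph/data"
    "devt.de/krotik/eliasdb/graph/graphstorage"
)

func main() {
    // Peer-local graph manager over in-memory storage (real peers use disk storage).
    gm := graph.NewGraphManager(graphstorage.NewMemoryGraphStorage("example"))

    // Load a single illustrative Client node.
    n := data.NewGraphNode()
    n.SetAttr("key", "1")
    n.SetAttr("kind", "Client")
    n.SetAttr("name", "Alice")
    if err := gm.StoreNode("main", n); err != nil {
        log.Fatal(err)
    }

    // Evaluate a basic EQL query locally; distributed execution only has to
    // aggregate such per-peer results.
    res, err := eql.RunQuery("example", "main", "get Client", gm)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res)
}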
When such queries are run in a distributed fashion, they require only result aggregation among peers. However, this is not the case for queries that include additional functions, such as sorting result sets or computing averages of data values in a distributed manner. G-IDSS supports the distributed sorting of results using built-in EQL capabilities. To sort data, EQL uses the WITH clause followed by the ordering command, for which a query must specify an ascending or descending directive followed by the property used for sorting. EQL also allows the use of the show command to return only a specified property/column. For example, get Client show name with ordering(ascending name) returns only the name property, sorted in ascending order.
We ran test queries by launching up to 100 peers, each with a varied number of graph nodes. We chose two cases that represent, respectively, a minimal and a reasonably high number of graph nodes to be fetched in a decentralised fashion, as summarised below.
Case 1: Few peers with a huge data load. This was conducted through an experiment launching 10 peers with a large dataset comprising 100,000 client nodes, 1,000,000 consumption nodes, and 1,000,000 edges across the entire overlay. In a complementary configuration, each peer was loaded with 1000 graph nodes, with the number of peers varying between 10 and 5000 on WSL, gridsurfer, and datadog.
Case 2: Increased number of peers with little data. This was conducted by launching 100 peers with 1000 client nodes, 10,000 consumption nodes, and 10,000 edges across the entire overlay. Furthermore, experiments were conducted on the same machines used in Case 1, this time loading each peer with only 10 graph nodes, to understand the variation in performance due to system dynamics.
The two scenarios are intended to capture the complexity of data fetching both in an environment with few peers and in one with many peers, demonstrating the impact of query complexity as the network scales. Three basic queries were considered based on the nature of the result set they return: first, get Client; second, get Consumption; and, third, get Consumption traverse :::. These queries aggregate data across peers and traverse connected nodes, enabling decentralised querying over traversed data. All peers and data generation were initiated by the prepared bash script, which automates the background tasks. With a single client launched, we tested all query types described in this section.
4.3.1. Case 1: Loading Huge Datasets
With 10 peers, the Case 1 query returned all results in 4 to 10 s, fetching between 1000 and 100,000 records across the entire overlay. As a result, we extended the experiments for this use case by testing different numbers of peers with 1000 graph nodes per peer, resulting in 10,000 graph nodes across 10 peers, 20,000 across 20 peers, and 1,000,000 across 1000 peers. These experiments were run on WSL, gridsurfer, and datadog. WSL accommodated up to 100 peers in this case, with 100,000 graph nodes. With this configuration, it was able to fetch all the results in 60 s, 67% of the results in 30 s, and results from only one peer when the query time budget was between 1 and 10 s. A summary of the WSL experiment results is presented in
Table 6.
The same experiment was repeated on gridsurfer where, with the same number of peers and the same data size as simulated on WSL, G-IDSS was able to fetch all results in 30 s and more than half in 10 s. The summary of gridsurfer test results is presented in
Table 7.
We finally ran the same experiment on datadog, the most powerful machine involved in the tests. The datadog machine was loaded with the same amount of data as gridsurfer and WSL. Regarding peers, datadog launched between 100 and 5000 peers. The experiment demonstrated the capacity of datadog to fetch all loaded datasets in 30 s while hosting up to 1000 peers. With 5000 peers, datadog was able to fetch up to 50% of the entire dataset in 60 s. This test highlighted that, with many peers in the overlay, small TTL values cannot fetch results complete enough for consumption. The summary of its performance is presented in
Table 8.
4.3.2. Case 2: Scaling Number of Peers
With an increased number of peers, each holding a smaller dataset, G-IDSS generally takes longer to fetch the required results. The goal of this experiment was to observe how many peers each testing platform could accommodate. G-IDSS needed at least 30 s to fetch all results on the WSL machine, 10 s on gridsurfer, and 1 s on datadog, all with 100 peers. Furthermore, we carried out additional experiments testing different peer counts. In this case, every peer in the overlay held only 10 graph nodes. With WSL, G-IDSS was able to fetch all the results within 30 s when simulating 50 peers. With 100 peers, WSL fetched 89% of the required results in 30 s, 32% in 10 s, and 14% in 1 s. A summary of these results is presented in
Table 9.
On the other hand, G-IDSS on gridsurfer fetched all results from 100 peers within 30 s. In 10 s, G-IDSS fetched 42% of the required results. Furthermore, 1 s was enough to fetch all results with 10 and 20 peers. A summary of these results is presented in
Table 10.
Lastly, we ran the experiments on datadog. In these tests, G-IDSS fetched all loaded results in 10 s when launching 1000 peers. With 5000 peers, G-IDSS fetched 80% of all results in 60 s, while only 1% of the data was fetched in 1 s. Other results are summarised in
Table 11.
We also conducted small-scale tests to observe the impact of the TTL value on querying. We ran the test by launching 50 peers, each with 10 client nodes (500 clients in the overlay), 100 consumption records (5000 consumption records in the overlay), and 5000 edges representing direct connections from clients to consumption records. A “get Client” query executed with a TTL of 3 returned a total of 330 records, indicating that 33 out of 50 peers were successfully reached and returned their results. Running the same query with an increased TTL of 6, G-IDSS fetched all 500 client nodes, thereby reaching all peers in the overlay. Query traversals can also be run in G-IDSS to search for targeted records. For instance, the traversal query “get Client traverse owner: belongs_to: usage: Consumption” retrieves all client records together with their associated consumption records in 7 s, i.e., pairs of nodes connected through an edge following the “owner: belongs_to: usage: Consumption” pattern. A detailed description of the submitted queries and fetched results is provided in the GitHub repository. As stated in [
48], fetching results from G-IDSS is based on a “best effort” approach, given the potentially large number of distributed servers. Considering other factors that may affect an overlay, querying with a TTL of 3, for instance, will not always return the same number of records, even with a constant number of peers.
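Since the TTL also dictates how long a client waits for results, the best-effort behaviour can be summarised by the simplified client-side sketch below: partial results arriving before the TTL-derived deadline are merged, while later responses are simply not counted. The Result type and the queryPeer function are hypothetical stand-ins; the sketch illustrates the mechanism, not the actual G-IDSS forwarding and merging code.

package main

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

// Result is a hypothetical container for the records returned by one peer.
type Result struct {
    PeerID  int
    Records int
}

// collect gathers results under a deadline derived from the TTL: whatever
// arrives before the deadline is merged, anything later is dropped.
func collect(ttl time.Duration, peers int,
    queryPeer func(ctx context.Context, id int) (Result, error)) []Result {

    ctx, cancel := context.WithTimeout(context.Background(), ttl)
    defer cancel()

    ch := make(chan Result, peers)
    for id := 0; id < peers; id++ {
        go func(id int) {
            if r, err := queryPeer(ctx, id); err == nil {
                ch <- r
            }
        }(id)
    }

    var out []Result
    for {
        select {
        case r := <-ch:
            out = append(out, r)
            if len(out) == peers {
                return out // every peer answered before the deadline
            }
        case <-ctx.Done():
            return out // best effort: return whatever has been gathered so far
        }
    }
}

func main() {
    // Simulated peers with varying response times; a small TTL misses the slow ones.
    simulatedPeer := func(ctx context.Context, id int) (Result, error) {
        delay := time.Duration(rand.Intn(6000)) * time.Millisecond
        select {
        case <-time.After(delay):
            return Result{PeerID: id, Records: 10}, nil
        case <-ctx.Done():
            return Result{}, ctx.Err()
        }
    }
    fmt.Println("TTL 3 s:", len(collect(3*time.Second, 50, simulatedPeer)), "peers answered")
    fmt.Println("TTL 6 s:", len(collect(6*time.Second, 50, simulatedPeer)), "peers answered")
}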
4.3.3. Other Querying Support
G-IDSS also supports running local queries, i.e., queries that are not broadcast in the overlay. This allows data to be fetched exclusively from a single peer. Additionally, all operations that modify graph data are limited to local data repositories, as it would not be reasonable to allow a peer to modify another peer’s datasets. To run a local query, a client must submit it with either the -l or -local flag. With the current release, local queries do not require a TTL, since the TTL is intended to control query propagation and dictate how long a client waits before receiving the results.
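As an illustration of how such a switch can be exposed, the sketch below uses Go’s standard flag package to accept both spellings; the actual G-IDSS client option parsing, and the -ttl flag shown here, are assumptions made for the sake of the example.

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Both spellings map to the same variable, mirroring the -l / -local switch.
    local := flag.Bool("local", false, "run the query only against the local peer")
    flag.BoolVar(local, "l", false, "shorthand for -local")
    ttl := flag.Int("ttl", 0, "wait budget in seconds; ignored for local queries")
    flag.Parse()

    query := flag.Arg(0)
    if *local {
        fmt.Printf("running %q locally (TTL ignored)\n", query)
        return
    }
    fmt.Printf("broadcasting %q with a TTL of %d s\n", query, *ttl)
}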
EQL does not inherently support data modification queries; however, EliasDB provides built-in ECAL functions for data modification, which can be executed as EliasDB graph manager operations or transactions. These functions are StoreNode(), which adds a node; RemoveNode(), which removes a node; and UpdateNode(), which updates an existing node. G-IDSS uses custom-defined keywords to facilitate the utilisation of these functions.
To add graph nodes, G-IDSS uses the keyword add to indicate a query that adds a node. The query must adhere to the “add <kind> <key> [properties]” syntax and can be run, for example, as add Client 14 client_name = “John Doe” contract_number = 7437643 power = 7575.
To update/modify a graph node, G-IDSS uses the update keyword to change the value of an attribute or property. For instance, to update the Client with key “14” to a different name, we run the following query: update Client 14 client_name = “John Rhobi”. Lastly, to delete a graph node, a client must use the delete keyword, followed by a graph kind and key in the format delete <kind> <key>. For example, a sample query can be formulated as delete Client 14.
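The sketch below illustrates how these add, update, and delete requests can be mapped onto EliasDB’s Go graph-manager calls (StoreNode, UpdateNode, and RemoveNode). It uses in-memory storage, the conventional “main” partition, and the attribute names from the examples above purely for illustration; the actual G-IDSS keyword handling may differ.

package main

import (
    "log"

    "devt.de/krotik/eliasdb/graph"
    "devt.de/krotik/eliasdb/graph/data"
    "devt.de/krotik/eliasdb/graph/graphstorage"
)

func main() {
    // In-memory storage stands in for the disk-based storage used by a peer.
    gm := graph.NewGraphManager(graphstorage.NewMemoryGraphStorage("example"))

    // add Client 14 client_name = "John Doe" contract_number = 7437643 power = 7575
    node := data.NewGraphNode()
    node.SetAttr("key", "14")
    node.SetAttr("kind", "Client")
    node.SetAttr("client_name", "John Doe")
    node.SetAttr("contract_number", 7437643)
    node.SetAttr("power", 7575)
    if err := gm.StoreNode("main", node); err != nil {
        log.Fatal(err)
    }

    // update Client 14 client_name = "John Rhobi"
    upd := data.NewGraphNode()
    upd.SetAttr("key", "14")
    upd.SetAttr("kind", "Client")
    upd.SetAttr("client_name", "John Rhobi")
    if err := gm.UpdateNode("main", upd); err != nil {
        log.Fatal(err)
    }

    // delete Client 14
    if _, err := gm.RemoveNode("main", "14", "Client"); err != nil {
        log.Fatal(err)
    }
}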
When adding or updating graph nodes, queries may include attributes/properties that were not initially present in the graph data model. This implies that, when fetching the entire dataset, some graph nodes may have nil values for certain properties, unless the data is universally updated or initial data preprocessing is performed within G-IDSS.
The time spent fetching the complete result set is not constant. It is affected by factors such as network traffic, the number of concurrent operations a single server must handle, and the size of the overlay network. The TTL setting has a strong impact on the completeness of the results: a sufficiently high TTL value allows all reachable peers to answer and thus yields complete results. In small networks and controlled overlays, the TTL may not be relevant; however, it becomes relevant in large networks or when querying a global-scale network. We also note that the types of queries submitted did not place a significant computational load on the peers compared to other conditions; nevertheless, query formulation remains an important consideration when large loads are expected.
4.4. G-IDSS Clients
To assess the scalability and reliability of G-IDSS in a multi-client scenario, an experiment was conducted in which multiple G-IDSS clients were connected to a network of 10 server peers, with each client establishing a connection to a distinct server. The experiment aimed to evaluate the system’s performance in terms of connection establishment and data querying, focusing on success rates and the amount of data retrieved. The setup deployed 10 server peers, each running as a separate process on a unique TCP port (/ip4/127.0.0.1/tcp/x) and each instantiating the EliasDB graph database with synthetic data (10 client nodes and 100 consumption nodes). Ten clients were simulated, each executed as a separate process and connecting to a randomly assigned server peer via its multiaddress. The experiment comprised two phases: (1) connection establishment, where clients initiated libp2p streams to their respective servers, and (2) data querying, where each client submitted a mix of local and distributed queries. Metrics included connection success rate, query response time, query success rate, and the amount of data retrieved.
4.4.1. Client Connectivity
The connection phase tested clients’ ability to establish libp2p streams to their assigned server peers using the host.NewStream function in idss_client.go. Client peers were launched across the various computing setups described above. Each client was provided with the multiaddress of its target server, e.g., /ip4/127.0.0.1/tcp/port/p2p/peerID, and the connection success rate was measured as the percentage of successful stream establishments. Across all server and client peers, the connection success rate was 100%, with no failures observed. However, in a large-scale setup, we anticipate some failures due to port conflicts or transient resource contention within the test environment; when this happens, G-IDSS logs the event via log.Error(“Error opening stream”). Connection latency, measured on WSL from stream initiation to successful establishment, averaged 15 ms in local settings and 73 s over the global internet, because the bootstrap process traverses the wider libp2p network. Resource utilisation also peaks as the number of peers increases. The high success rate and minimal latency validate the robustness of the libp2p-based connection mechanism, particularly the lightweight Noise protocol’s secure handshake and the DHT’s rapid peer resolution. However, limitations include potential port exhaustion when scaling beyond 700 peers in the WSL setup, as dynamic port allocation can encounter conflicts, and increased CPU contention with more concurrent clients, suggesting the need for multi-VM deployments in larger scenarios. This was not experienced on gridsurfer, where we launched up to 1500 peers.
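A minimal client-side sketch of the two phases, using the standard go-libp2p calls (Connect followed by host.NewStream), is shown below. The protocol identifier /idss/1.0.0 is a placeholder, and the raw query string stands in for the Protobuf-serialised message used by G-IDSS.

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "time"

    "github.com/libp2p/go-libp2p"
    "github.com/libp2p/go-libp2p/core/peer"
    "github.com/libp2p/go-libp2p/core/protocol"
)

func main() {
    if len(os.Args) < 2 {
        log.Fatal("usage: client <server multiaddress, e.g. /ip4/127.0.0.1/tcp/port/p2p/peerID>")
    }
    ctx := context.Background()

    h, err := libp2p.New() // client-side host with default transports and security
    if err != nil {
        log.Fatal(err)
    }
    defer h.Close()

    // Resolve the assigned server peer from its multiaddress.
    ai, err := peer.AddrInfoFromString(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }

    // Phase 1: connection establishment, timed from initiation to success.
    start := time.Now()
    if err := h.Connect(ctx, *ai); err != nil {
        log.Fatal("Error connecting to server: ", err)
    }
    s, err := h.NewStream(ctx, ai.ID, protocol.ID("/idss/1.0.0")) // placeholder protocol ID
    if err != nil {
        log.Fatal("Error opening stream: ", err)
    }
    defer s.Close()
    fmt.Println("connected in", time.Since(start))

    // Phase 2: data querying; G-IDSS serialises the query with Protobuf,
    // a raw string is written here only for illustration.
    if _, err := s.Write([]byte("get Client")); err != nil {
        log.Fatal(err)
    }
}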
4.4.2. Data Querying
The data querying phase evaluated the system’s ability to handle concurrent queries from multiple clients. One-to-one client-server querying was smooth, and data retrieval was straightforward. Queries were executed sequentially per client, with results processed both locally and globally in a distributed fashion. The query success rate, defined as the proportion of queries returning correct results (verified against the baseline of loaded graph nodes), was close to 100% for both local and distributed queries, with data completeness affected by the TTL value and the load submitted to the server peer.
Distributed queries had a slightly lower data-completeness success rate due to occasional timeouts under high concurrency. Result accuracy was confirmed by checking that all client nodes returned the correct results for the distributed query example (as each had 10 consumption connections). These rates demonstrate G-IDSS’s ability to manage concurrent multi-client queries by leveraging the P2P overlay and EliasDB’s efficiency.
The experiment and broader G-IDSS implementation revealed several limitations across the P2P overlay, database management, and data querying, particularly in the single-VM setup. The P2P overlay faced scalability constraints due to port exhaustion and resource contention, as dynamic port allocation risked conflicts when scaling, and CPU/memory contention increased with concurrent clients/peers, limiting the system testing to a moderate network size. The reliance on the loopback interface eliminated real-world network variability (e.g., packet loss and latency), necessitating emulation via multi-VM testing for realistic conditions. In real-world scenarios, an overlay can scale up even further because each peer has its own autonomy and computing resources, thereby alleviating the problem of resource contention that we experienced in testing.
Database management was constrained by EliasDB’s performance with large datasets; while efficient for small graphs, query latency could increase significantly with an increase in nodes, requiring indexing or caching optimisations. Furthermore, data querying suffered from distributed query timeouts under high concurrency, as TTL-based propagation struggled with simultaneous requests and distributed queries incurred higher latency due to the merging of results. The single-machine setup exacerbated these issues by concentrating resource demands, and the lack of fault tolerance mechanisms (e.g., query retry policies) reduced robustness in the face of peer failures. Although libp2p promises scalability up to a million peers, G-IDSS still needs to be tested in a large-scale scenario that includes diverse datasets and multiple machines to achieve production-ready scalability and reliability.