Systematic Evaluation of LibreSocial—A Peer-to-Peer Framework for Online Social Networks

Peer-to-peer (P2P) networks have been under investigation for several years now, with many novel mechanisms proposed in the literature. Much of the research has focused on showing how a proposed mechanism improves system performance. In addition, several applications have been proposed to harness the benefits of P2P networks. Of these applications, online social networks (OSNs) have raised particular interest because of the scalability and privacy concerns with centralized OSNs, and hence several proposals are in existence. However, accompanying studies on the overall performance of the P2P network under the weight of an OSN application outside simulations are very few, if any. In this paper, we undertake a systematic evaluation of the performance of a P2P framework for online social networks called LibreSocial. Benchmark tests are designed that take into account the random behavior of users, the effects of churn on system stability and the effect of the replication factor. We run benchmark tests for up to 2000 nodes and show the performance against the costs of the system in general. From the results it is evident that LibreSocial's performance is capable of meeting the needs of users.


Introduction
Social networking has experienced tremendous growth since the turn of the 21st century, a fact demonstrated by the number of online social networks (OSNs) available, with studies showing a general overlap between the online and offline networks of many of these users [1]. As a consequence of the growth of these platforms, concerns have arisen on two fronts, technical and social [2]. The technical concerns arose due to a high dependence on centralization in administering the OSNs: with a rapidly growing user base, various scalability and performance issues, and hence increasing costs of management and maintenance of the overall system infrastructure, have emerged. As it currently stands, the OSN providers have succeeded in developing mitigating solutions for the scalability concerns, such as the distributed data management solutions Cassandra [3] and Haystack [4] at Facebook, or cloud services such as Amazon's AWS storage services (https://aws.amazon.com/products/storage/), with the more popular OSNs supporting very large user networks, such as Facebook (2.6 billion), YouTube (2 billion) and WhatsApp (2 billion) as of July 2020 (https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/). However, this level of scaling does not come cheap, and more often than not requires the providers to develop monetization models for revenue generation, which result in certain violations of the users' rights, leading to the second concern. The social concerns have to do with the users' privacy and data.

The remainder of this paper is organized as follows. Section 2 introduces benchmarking as an experimental methodology that can be used for P2P systems and also describes the system quality properties that we aim to investigate. In Section 3, we briefly give a description of LibreSocial, our P2P-based OSN framework solution.
Thereafter, Section 4 introduces the metrics that will be employed in evaluating the system, and further describes the test setup, including the test scenarios and the different workload specifications. Section 5 shows the results obtained with a discussion that interprets them. Finally, in Section 6, we give the concluding remarks.

Benchmarking P2P Systems
As an experimental methodology for evaluating a system's performance, benchmarking uses a synthetic/generic application on an existing, real environment [26], and relevant performance metrics are chosen based on the application's domain. Benchmarking is normally standardized within a given domain of operation, and the results in such experiments are easily obtainable and comparable. The disadvantage with this approach is its inability to fully represent a realistic situation. However, it helps in directing system designers on where to make adjustments based on results obtained. We now briefly focus on the key aspects of the P2P benchmarking process and thereafter highlight the P2P quality properties with respective metrics that are of concern to this work.

P2P Benchmarking Model
Benchmarking can either be vertical or horizontal. In vertical benchmarking only one system is evaluated to identify its boundaries of operation. For horizontal benchmarking, several similar systems are evaluated against each other. Our focus is on the former, meaning that no other alternative implementations are tested. In contrast to other standard benchmarks used for other computing systems, P2P-benchmarks need to define important aspects of the underlying network so that they are reproducible [27]. A benchmark must satisfy the following requirements: (i) be based on workloads representative of real-world applications, (ii) exercise all critical services provided by platforms, (iii) not be tuned/optimized for any specific product, (iv) generate reproducible results, and (v) not exhibit any inherent scalability limitations [28,29]. In order for the benchmark to meet these requirements, three important benchmark parameters, system, workload and environment, which interact with the system under test must be defined [30].
The system parameters are system specific and bound the system under test, such as the size of the routing table or the replication factor. The workload parameters affect the workload generation, such as the number of peers, and also include other application-specific settings such as the number of queries for a given activity. Environment parameters are bound up in the host and the underlying communication links. In the test process, not all parameters are of concern, and only a selected subset of parameters may be varied. This selected subset of parameters is termed the factors, and the most preferred factors are those that have the largest impact on the system's performance [31]. By altering these factors, appropriate results are obtained. These results are usually a collection of metrics which can then be used to measure the quality of the system. Therefore, it is important to define suitable quality properties that can be used along with the metrics collected. Next, we give a description of useful quality properties.

P2P Quality Properties and Relevant Metrics
The efficacy of any benchmark design relies on a well defined set of quality properties and quality metrics. While quality metrics focus on describing a single attribute of a mechanism within a scenario, workload or configuration, the quality properties describe the system's/mechanism's characteristics taking into consideration various individual measurements of quality metrics. Quality properties are distinguished into workload-independent and workload-dependent [30]; both are discussed next, and the relevant quality properties with their respective metrics are given in Table 1. These are easily adopted from computer system performance analysis using metrics that represent the system behavior under workload. They are obtainable through direct measurements or indirectly calculated from other quality aspects. In our test application, we are concerned with performance and cost.

Workload-Independent Quality Properties

• Performance: For a feel of the system's performance, the aspects to consider are responsiveness (how fast the system reacts), throughput (how much load the system can handle in a given time frame) and the extent to which the results match the expectation. The metrics chosen for performance measurements are hop-count, storage time (t_store), retrieval time (t_retr), message sending rate (m_send) and message receiving rate (m_rec).

• Cost: This property describes the amount of resources that are used to ensure a given task is fulfilled or a service is provided. Relevant cost metrics are used storage space, used memory and network bandwidth.

Workload-Dependent Quality Properties
These have a relation to the manner in which the workload is introduced into the system. They are useful in helping us understand and define the workload for the system to be tested, and they are evaluated using the workload-independent quality properties. To show the efficacy of using these properties, there is a need to conduct a baseline evaluation of the system to provide a clear basis for comparison. The properties on which we focus are stability and scalability.

• Stability: This describes the system's ability to continue performing despite inherent system behavior dynamics, such as a high churn rate, and eventually converge to a stable state if the workload remains the same. Stability correlates to resilience under adverse conditions, i.e., sudden changes that may occur in the system architecture due to the workload. These changes may be expected, since the defined protocols account for them, or unexpected, because the protocols do not consider them. To ascertain the level of stability, the system needs to be evaluated against a baseline to show the relative differences of a particular parameter from the stable, baselined system. The relevant metrics are the number of nodes in the network at any given time, the leafset size, routing table size, memory size, number of messages sent and the messaging rate, the amount of data transferred and the data transfer rate, the data items stored and the number of replicas, and the DDS data stored.

• Scalability: Revolves around the system's ability to handle changing workloads. Two scaling dimensions are considered: horizontal, which concerns the number of peers in the system, i.e., increases and decreases in the number of peers in the network, and vertical, which concerns the ability of the P2P system to handle increasing workload from the participating peers. For this study, the focus is on horizontal scaling. The relevant metrics are the same as in the case of stability.
Summary: In this section, we introduced key terminologies that will be used throughout this paper. We have also briefly described the factors to consider in the creation of a P2P benchmark and finally described the important quality properties, with the relevant metrics used in the evaluation summarized in Table 1. In the next section, Section 3, we introduce the P2P framework that we have developed in previous years, aimed at meeting the necessary service requirements for a functional OSN, briefly explaining all of its key components.

A P2P Framework for Online Social Networks
The process of designing and eventually building a P2P-based OSN has to consider the combination of various reliable and secure functions that are implemented on top of unreliable and insecure devices, such as a robust overlay connecting the participants, user management, reliable distributed data storage, access control and secure communication. The users of the system will usually expect a variety of elements, such as messaging walls, support for photo albums, messaging and chatting, as well as audio-visual conversational support. To ensure controlled quality, these services require further supporting elements, such as distributed data structures, communication protocols (publish-subscribe, unicast and multicast) and monitoring mechanisms, to be incorporated. In a previous article [32], we present a detailed description of these requirements and the options available for implementing each required P2P mechanism for an OSN.
LibreSocial [22] (previously LifeSocial.KOM [23][24][25]) is a P2P-based framework that provides all these. The application sits on a P2P framework that was developed based on the Open Services Gateway Initiative (OSGi) service platform (https://osgi.org/download/r7/osgi.core-7.0.0.pdf). The OSGi platform allows the various components of the application to be defined as bundles that can easily be loaded at runtime and plugged into the running application at will. The architecture of LibreSocial is made up of four distinct layers: the P2P overlay, the P2P framework, plugins and applications and the graphical user interface. These four layers are located on top of the Internet (network) layer as shown in Figure 1. In [22], the architecture is discussed in greater detail, but we nevertheless endeavor to briefly highlight the core functionalities. Thereafter, we give a description of the test environment that is bundled with the application.

P2P Overlay
The P2P overlay is the lowest layer of the framework and connects to the network layer of the TCP/IP model. The overlay provides a degree of distribution transparency by abstracting the complexities of the physical connections. The overlay supports logarithmic message routing and object location using a heavily modified FreePastry (http://www.freepastry.org/FreePastry/), an implementation of Pastry [33], which provides an ID space of size 2^160. All higher layers of the framework depend on the overlay for secure and reliable routing. The next layer above the P2P overlay is the P2P framework.
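As a back-of-the-envelope illustration of this logarithmic routing, the sketch below estimates the expected hop count of a Pastry-like overlay at the network sizes used later in our tests. The function name and the choice of b = 4 (hex-digit prefixes, Pastry's common default) are our assumptions, not values taken from the LibreSocial implementation.

```python
import math

def expected_hops(n_nodes: int, b: int = 4) -> int:
    # Pastry resolves one base-2^b digit of the key per hop,
    # giving O(log_{2^b} N) routing hops; b = 4 means hex digits.
    return math.ceil(math.log(n_nodes, 2 ** b))

# Rough hop expectations for the benchmark network sizes.
for n in (100, 500, 1000, 2000):
    print(n, "nodes ->", expected_hops(n), "hops")
```

Under these assumptions the expected route length stays at three overlay hops even for the largest (2000 node) networks tested.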

P2P Framework
The P2P framework provides all the essential services that make up the bulk of the P2P system. There are four core services that the framework offers, i.e., storage, identity, communication, and testing and monitoring, which are discussed in the following.

Storage
Simple file storage is provided via a heavily modified PAST [34,35], a persistent P2P storage utility bundled with FreePastry that also includes replication services. In addition, three types of distributed data structures (DDSs), i.e., distributed sets, distributed linked lists and prefix hash trees [36], are also used for storage. These are useful for complex linked data, such as albums with photos having comments or wall messages with comments. The storage service offers an intelligent caching mechanism that takes updates into account.
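To make the linked-data idea concrete, the following toy sketch models a wall as a distributed linked list over a key-value store. A local dictionary stands in for the DHT/PAST layer, and the hashing scheme and field names are our own simplifications rather than LibreSocial's actual storage format.

```python
import hashlib
import json

dht = {}  # toy stand-in for the distributed storage layer

def put(value: dict) -> str:
    """Store a value under a content-derived key, as a DHT would."""
    key = hashlib.sha1(json.dumps(value, sort_keys=True).encode()).hexdigest()
    dht[key] = value
    return key

def prepend_post(head_key, text: str) -> str:
    # Each wall post links to the previous head, forming a linked list.
    return put({"text": text, "next": head_key})

def read_wall(head_key) -> list:
    posts = []
    while head_key is not None:
        item = dht[head_key]       # one DHT lookup per list element
        posts.append(item["text"])
        head_key = item["next"]
    return posts

head = None
for msg in ["first post", "second post", "third post"]:
    head = prepend_post(head, msg)
print(read_wall(head))  # newest first
```

Note that reading the whole list costs one lookup per element, which is consistent with the noticeable messaging that DDS-heavy plugin actions generate in the evaluation.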

Communication
The framework supports both synchronous and asynchronous messaging, depending on the channel used. It supports unicast (1-to-1) messaging, such as in direct messaging, multicast (1-to-N) messaging, such as streaming to a group, as well as aggregation (N-to-M) mechanisms for the distribution and aggregation of network information. In all cases, authentication, confidentiality and integrity of communication are supported. It also allows direct IP-based communication, where the communicating node retrieves the IP information of the other node and communicates with it directly. LibreSocial also uses Scribe [37], a push-based publish/subscribe utility bundled with FreePastry, offers streaming via WebRTC (https://webrtc.org/) for audio/video conferencing, and has a secure message channel that can be encrypted and signed.

Identity
Three key identity features are offered: identity management, user and group management, and access control. These features required significant modification of FreePastry to ensure secure node identification based on a public key infrastructure mechanism, in this case elliptic curve cryptography (ECC), in which the nodeID is the public key. User management is made possible by performing a mapping of the nodeID to an immutable userID. Access control uses the AES symmetric encryption algorithm and ensures that only the relevant users are authorized to read/write data. The owner of newly generated data creates a symmetric cryptographic key and encrypts the stored data item with it, then encrypts this key with the public key of each user/group who requires read rights. The encrypted data is signed and combined with the public key of the owner as well as the list of encrypted keys to generate the secure storage item that is then stored in the network. Overwriting of the data with a new storage data item is possible if it is signed with the private key corresponding to a specific public key, after verification by the storing node.
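The construction of such a secure storage item can be sketched as follows. This is a structural toy only: LibreSocial uses ECC key pairs, AES and real signatures, whereas here, to stay dependency-free, readers are modeled by shared secrets, "encryption" is a SHA-256-based XOR keystream and the "signature" is an HMAC. Only the shape of the item (ciphertext, per-reader wrapped keys, owner signature) mirrors the description above.

```python
import hashlib
import hmac
import os

def keystream(key: bytes, n: int) -> bytes:
    # Toy keystream (SHA-256 in counter mode); stands in for AES, NOT secure.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def xor_crypt(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; applying it twice restores the plaintext.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def build_secure_item(owner_key: bytes, data: bytes, reader_keys: dict) -> dict:
    """A fresh symmetric key encrypts the data; the key is wrapped once per
    authorized reader; the owner 'signs' the ciphertext."""
    sym_key = os.urandom(32)
    ciphertext = xor_crypt(sym_key, data)
    return {
        "ciphertext": ciphertext,
        "wrapped_keys": {name: xor_crypt(k, sym_key)
                         for name, k in reader_keys.items()},
        "signature": hmac.new(owner_key, ciphertext, hashlib.sha256).hexdigest(),
    }

def read_secure_item(item: dict, reader: str, reader_key: bytes) -> bytes:
    sym_key = xor_crypt(reader_key, item["wrapped_keys"][reader])
    return xor_crypt(sym_key, item["ciphertext"])
```

An authorized reader simply unwraps the symmetric key with their own key and decrypts; a reader without a wrapped-key entry cannot recover the data.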

Testing and Monitoring
The testing and monitoring plugins work in tandem during system testing. The test plugin sends instructions that mimic actual user activities to the different application plugins based on a defined test plan. The monitoring plugin gathers data via an aggregation channel during the testing and allows the tester, and later the interested user, to get a global view of the network status at any point in time during the testing process. The tree-based aggregation protocol SkyEye.KOM [38,39] monitors and gathers relevant measurements during the testing process. The implementation of SkyEye.KOM is independent of the underlying P2P overlay, as seen in Figure 2a, and it provides the ability to aggregate and disseminate information within a tree structure constructed on top of the overlay, thus providing a global view of the performance and costs of the P2P network. The aggregation process for the monitoring data is shown in Figure 2b. The aggregation functions include count, min, max and mean. The services provided via the framework are then used by the application, which is made up of the plugin components. Next we describe the plugins which constitute the application.
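The tree aggregation at the heart of such monitoring can be sketched in a few lines. The sketch keeps (count, min, max, sum) per subtree so that the root can derive the mean; the dictionary tree encoding is our own and far simpler than SkyEye.KOM's overlay-embedded tree.

```python
def aggregate(node: dict) -> tuple:
    """Fold a node's own metric with its children's aggregates,
    returning (count, minimum, maximum, total)."""
    count, lo, hi, total = 1, node["value"], node["value"], node["value"]
    for child in node.get("children", []):
        c, l, h, t = aggregate(child)
        count += c
        lo, hi, total = min(lo, l), max(hi, h), total + t
    return count, lo, hi, total

# A tiny monitoring tree: the root aggregates two subtrees.
tree = {"value": 10, "children": [
    {"value": 4, "children": [{"value": 7}]},
    {"value": 1},
]}
count, lo, hi, total = aggregate(tree)
print(count, lo, hi, total / count)  # 4 1 10 5.5
```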

Plugins and Application
The application components are broken down into plugins which are loaded dynamically at runtime. The plugins include login, profile, notifications, files, search, friends, group/forum, calendar, messaging, multichat, audio/video chat, wall, photos, voting, testing, and monitoring. The plugins, or bundles, are software components that add a specific feature to the system and enhance the system's capabilities. This is possible because the OSGi design framework supports developing the different plugins (or bundles) as modules that are easily extensible and simple to integrate. Each plugin provides an OSGi command interface that can be accessed during testing by the test plugin, which works in connection with the monitoring plugin to support quality monitoring and testing. The plugin contents are then presented to the user via the web application.

Graphical User Interface (GUI)
The GUI is the topmost layer and is the point of interaction between the user and the system. Figure 3 shows screenshots of LibreSocial's GUI. The GUI is composed of three sections, i.e., the plugin template, the plugin logic and the WebProvider. The plugin template typically consists of HTML files and a standard JavaScript implementation based on the Model-View-ViewModel (MVVM) paradigm. The plugin logic transmits user events from the front-end to the REST handler via the WebProvider, transmits the results via the same channel back to the front-end, and renders the data that it has obtained into the desired template. Lastly, the WebProvider is the interface between the plugins and the user's web browser. Via a desired web browser, the user can easily access all required aspects of the OSN at runtime.

Summary: In this section we have introduced LibreSocial, our P2P framework for OSNs, briefly giving a description of how the core P2P components are integrated into a secure, scalable and fully decentralized OSN. The framework consists of four layers, the overlay, the framework, the plugins and applications, and the graphical user interface, with each offering or receiving necessary services from the adjacent layers to guarantee a working OSN. In the section that follows, we define the metrics and describe the experimental scenarios and their respective workloads for the benchmarking process.

The Test Environment
A benchmark set can be considered as a tuple made up of the quality attributes Q, metrics M and the test scenarios S [40]. The quality attributes and relevant metrics to be used are introduced in Section 2.2. We now discuss the relevant scenarios designed for evaluating the system. Our test environment for all scenarios is the same so as to guarantee result consistency. The tests are run on the high performance computing (HPC) cluster at Heinrich Heine University (https://www.zim.hhu.de/high-performance-computing).

Workload
The workloads for the benchmark tests depend on the particular test that is being carried out. The actual workload is generated using photo images and text messages. Photo image sizes range from 100-800 KB, with an average of about 300 KB, and the average message length is 150 characters. Also, for the testing, the photo images are used in place of files/documents. The system parameters for each of the test scenarios are shown in Table 2. A description of the different scenarios of the benchmarking test follows.

Baseline Tests: Plugin Analysis
This set of tests is used to baseline the system's performance. LibreSocial is primarily composed of plugins, bundles that implement the various functionalities of the OSN, and includes a test plugin which interacts with each plugin via a command interface to trigger the functionalities, as shown in Figure 5b. This test allows us to see the plugins in action and the effect of the different plugins on the overall system. The system configurations for this test are shown in Table 2, with Table 3 showing the workload generated via the test plugin. Two network sizes of 100 and 500 nodes are chosen, and for each network we initiate two different sets of workload, specifically a light load (with only two repetitions per test case or action) and a medium load (five repetitions). The repetitions for each test case are allowed to complete over a duration of five minutes.
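A generator matching the workload figures above (photos of 100-800 KB averaging about 300 KB, messages of about 150 characters) could be sketched as below. The triangular distribution, skewed toward small files so the mean lands near the reported average, and the Gaussian message length are our assumptions; the paper does not state which distributions were used.

```python
import random
import string

random.seed(42)  # fixed seed: reproducibility is a benchmark requirement

def photo_size_kb() -> float:
    # Sizes between 100 and 800 KB, skewed toward the lower end.
    return random.triangular(100, 800, 100)

def text_message(avg_len: int = 150) -> str:
    # Messages of roughly 150 characters on average.
    n = max(1, int(random.gauss(avg_len, 30)))
    return "".join(random.choices(string.ascii_letters + " ", k=n))

sizes = [photo_size_kb() for _ in range(10000)]
print(round(sum(sizes) / len(sizes)), "KB mean photo size")
```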

Pseudo-Random Behavior
The testing seeks to model realistic randomized user behavior. The system parameters are given in Table 2. The random behavior algorithm selects tests in a random pattern, as shown in Figure 4, mimicking normal user behavior. The algorithm was designed based on the work in [41], wherein aggregated data about the behavior of more than 35,000 users in four different social networks was studied. In that study, the key parameters are: the average time spent on a social network website, the login frequency throughout the day, and the activities the user participates in and their order. The statistics gathered made it feasible to implement a mechanism that selects the period of time that a user spends online and the sequence of activities based on probabilities. At the end of that time period the user logs out, and may later log back in.
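Such probability-driven session selection can be sketched as follows. The action set, the weights and the five-minutes-per-action pacing are hypothetical placeholders; the actual probabilities in LibreSocial's test plugin are derived from the statistics in [41].

```python
import random

random.seed(7)

# Hypothetical activity weights; the real values come from observed user data.
ACTIONS = {"browse profiles": 0.40, "send message": 0.25,
           "post on wall": 0.20, "upload photo": 0.15}

def session() -> list:
    """Draw a session length, then a probability-weighted activity sequence;
    after the last action the simulated user logs out."""
    minutes = random.randint(5, 60)   # time this user spends online
    steps = max(1, minutes // 5)      # assume roughly one action per 5 min
    return random.choices(list(ACTIONS), weights=list(ACTIONS.values()), k=steps)

print(session())
```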

Scalability/Stability
This test scenario focuses on realizing a network with 1000 nodes to evaluate the system's scalability and stability. Tables 2 and 4 show the system parameters and the workload for this scenario. The test aims to show the ability of the network to scale up smoothly and the effect of churn on the network, and proceeds as a single test in two phases as follows.
(i) Phase 1-Network growth: The test begins with 250 nodes, and the network is incremented in steps of 250 nodes, representing 100%, 50% and 33.3% increments, respectively. At each step, the workload shown in Table 4 is executed. (ii) Phase 2-Network churn: Once 1000 nodes have joined the network, i.e., at the end of the network growth phase, the churn phase begins. 250 nodes are removed in a step-wise manner similar to the network growth, representing a 25%, 33.3% and 50% churn, respectively, until the network has only 250 nodes. Similar to the growth phase, after each churn step, the workload is again executed.

Table 4. Scalability Workload.

Plugin      Action                  Repetitions    Duration * (mins)
Messaging   Send message            5              2
Photos      Upload photo            5              2
Forum       Comment forum thread    5              2
Wall        Send wall post          5              2
* One minute pause after completion of the repetitions for each test case.
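The step percentages quoted for both phases follow directly from the 250-node step size, as this small check illustrates:

```python
def step_percentages(sizes: list) -> list:
    """Percentage change, relative to the previous size, at each step."""
    return [round(100 * abs(b - a) / a, 1) for a, b in zip(sizes, sizes[1:])]

growth = [250, 500, 750, 1000]   # Phase 1: network growth
churn = [1000, 750, 500, 250]    # Phase 2: network churn
print(step_percentages(growth))  # [100.0, 50.0, 33.3]
print(step_percentages(churn))   # [25.0, 33.3, 50.0]
```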

Replication Factor
The replication factor (RF) determines the number of duplicate items in the underlying network, so that in case a peer leaves the network, any previously uploaded data can still be accessed by other peers. In this test scenario the replication factor and its effects are considered. The system parameters are presented in Table 2 and the workload parameters, which are defined in the test plan, are shown in Table 5.
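PAST-style replica placement, on which this scenario rests, can be sketched with numeric IDs: the k nodes whose identifiers are numerically closest to the object key hold the replicas, so the data survives up to k - 1 departures. The tiny ID space below is purely illustrative (LibreSocial's real ID space is 2^160).

```python
def replica_set(node_ids: list, key: int, k: int) -> list:
    """Select the k nodes numerically closest to the key as replica holders."""
    return sorted(node_ids, key=lambda n: abs(n - key))[:k]

nodes = [5, 20, 33, 47, 61, 80, 94]
print(replica_set(nodes, key=45, k=3))  # [47, 33, 61]

# If node 47 leaves, the data is still held by 33 and 61,
# and the next-closest node (20) takes over the third replica.
nodes.remove(47)
print(replica_set(nodes, key=45, k=3))  # [33, 61, 20]
```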

Test Execution and Data Collection
In Figure 5 the setup of the test is portrayed. During testing, we differentiate between two roles for the nodes in the network: the master and the slave nodes. There can only exist a single master node in the P2P network during the testing, and it must also be the first node that is started. Slave nodes bootstrap onto the network formed by the master node. Once the network is established, the master node sends the signal to begin testing to the slave nodes, and does not execute any test cases itself. To perform any test, there must be at least k + 1 slave nodes in the network, where k is the replication factor chosen for the data items in the network. The test roles then enable the execution of a test plan.

The test plan is a simple text file with test cases listed in a desired order, with the number of repetitions per test case and the total time allotted for a particular test case (including the repetitions). A test case is defined as the smallest unit that can be executed during a test and corresponds to a single activity initiated by a user, such as sending a friend request, writing on a wall, sending a chat message, uploading a file and so on. For a test case to be executed, in many cases certain preconditions must be met. For example, to send a chat message to someone, the other person must first be a friend. Therefore, in executing a test case to send a chat message, the precondition to have a friend must also be fulfilled. Thus each test case checks that the necessary preconditions are met, or else meets them first, before the actual test case is executed. The entire process of the test execution with a test plan is shown in Figure 5a. Figure 5b shows the interaction of the test plugin with other plugins at peer level. For each plugin, there are several test cases that can be executed and, similarly, respective system metrics that can be collected.
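A minimal sketch of test plan execution with precondition handling might look as follows. The plan syntax (one "test case, repetitions, minutes" line per case) mirrors the description above, but the concrete action names and the precondition table are hypothetical.

```python
PLAN = """\
messaging.send_message 5 2
photos.upload_photo 5 2
wall.send_wall_post 5 2
"""

# Hypothetical preconditions: chat and wall actions first need a friend link.
PRECONDITIONS = {
    "messaging.send_message": ["friends.add_friend"],
    "wall.send_wall_post": ["friends.add_friend"],
}

def parse_plan(text: str) -> list:
    cases = []
    for line in text.strip().splitlines():
        name, reps, minutes = line.split()
        cases.append((name, int(reps), int(minutes)))
    return cases

def execute(plan_text: str, run) -> None:
    """Run each test case, first fulfilling any unmet preconditions."""
    done = set()
    for name, reps, _minutes in parse_plan(plan_text):
        for pre in PRECONDITIONS.get(name, []):
            if pre not in done:
                run(pre)          # meet the precondition once
                done.add(pre)
        for _ in range(reps):
            run(name)             # the repetitions of the test case itself
        done.add(name)

log = []
execute(PLAN, log.append)
print(log[:3])  # the friend link is created before the first chat message
```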
During the test execution process, monitoring data is collected at five-second intervals and stored in SQL format, as it is easy to handle and presents minimal load to the test instances. The collected data is aggregated by the master node and distributed to the slave nodes as global monitoring data. Hence, each node has a local view as well as a global view of the monitoring statistics.

Evaluation of Results
A detailed discussion of the test results obtained from each of the test scenarios follows.

Baseline Tests: Plugin Analysis
The results are shown in Figures 6 and 7, for the 100 and 500 node tests respectively. Table 6 is a comparative analysis of the completion times, the network messages and network data generated as a consequence of the tests conducted. Table 7 is an analysis of the impact of each plugin action on the message rate and data rate. The results are discussed in the following.

Plugin Actions
Figure 6c-l for the 100 node tests and Figure 7c-l for the 500 node tests are representative of the results due to the workload in Table 3. We include one additional plugin, the Friends plugin (Figures 6 and 7), from which we can deduce that the nodes actually form friendship links with each other. Because of these links, the nodes are able to send direct messages to friends, establish livechat sessions or post comments on the walls of their established friends. The graphs are testament to the usefulness and accuracy of the monitoring component in particular, as it is easy to see the differences due to the different workloads as well as the network sizes.

Table 6 presents the time taken for each individual test case to complete, with the monitoring capturing data at intervals of 5 s. All tests completed within the allotted test time (5 min), with the maximum completion times for the light and medium workloads generally oscillating around 15 s (7.5 s/action) and 55 s (11 s/action), respectively. The longest maximum completion times are observed in the 500 node test under medium workload for Filestorage-store files (100 s), Forum-comment post (95 s), Wall-comment post (80 s) and Voting-vote (75 s). It is also noteworthy that there is generally little or no disparity in completion times for the same workload despite an increasing number of nodes, and an increase in the workload results in longer completion times, as would be expected. The completion times point to a well balanced network that ensures that few overall delays are experienced.

Figure 6a,b for the 100 node tests and Figure 7a,b for the 500 node tests portray the network message and data rates throughout the test duration. From the monitoring data collected, the number of messages and the amount of data due to the plugin actions was deduced, and this is presented in Table 6.
The message and data rates for the duration of each test is then presented in Table 7, from which three distinct cluster groups are visible when comparing the message rates and data rates consequent to the different test cases. We discuss the three groups and base the classification against the 500 node test taking into account the medium workload.

a) Low message and low data rates: These test cases result in fewer than 1000 messages/sec with data rates not exceeding 10 Kb/s. The majority of the tests in this group focus on data retrieval as opposed to acting on the data. With the use of the available caching mechanisms, it is expected that there will be only minimal messaging as well as network calls. The exceptions to this observation are the Create Calendar action, which is a local action rather than a global action with data replication for redundancy, and the livechat message, which sends a short message that does not include large data. The actions in this class are associated with four plugins: Calendar, Livechat, Messaging and Photos.

b) High message and low data rates: The test cases classified in this group are characterized by message rates above 100 and less than 15,000 messages/sec with data rates not exceeding 15 Kb/s. The plugin actions in this class are associated with the Filestorage, Forum, Photo, Voting and Wall plugins. Of particular interest are the Forum, Photo, Voting and Wall plugins because they use DDSs to store data. Most of the plugin actions in this category make repeated calls for the DDSs relating to the data requested, with each call resulting in significant messaging as the entire DDS is retrieved.

c) High message and high data rates: These tests have the largest overall impact on the message and data rates. The message rates range from 500 to slightly less than 23,000 messages/sec. The highest data rate recorded is 51.98 Kb/s, observed for the Filestorage store files action in the 500 node test with light workload. The actions in this classification are associated with Filestorage, Forum, Group, Messaging and Livechat. Actions such as store files by the Filestorage plugin or send message by the Messaging plugin generate many messages as well as much data because they use DDSs.
LibreSocial's modular design, which separates the various OSN functions as plugins (or bundles) from the P2P core service components, supports the implementation of SkyEye [38,39], a tree-based monitoring solution. From the collected monitoring statistics, it is demonstrated that there is a smooth synergy between the application layer and the underlying components, seen in the effect of each plugin on the data and messaging rates. The worst case test completion times for each plugin action are tolerable, being on average about 7.5 and 11 s/action for the light and medium loads, irrespective of the network size. These times may be caused by the replication factor value, and lower times can be anticipated with lower replication factor values, although such a move may significantly affect the retrieval times, as will be demonstrated in the replication test (Section 5.4). The message and data rates are dependent on the plugin action performed. Actions that require the storage of significant data, such as uploading files or photos, tend to generate more messages as well as more data overall. Also, because of the manner in which DDSs are stored (distributed across the network), there is significant messaging as the DDS items are being retrieved, which is expected. This interaction renders the conclusion that the storage mechanism, composed of PAST for simple files as well as the DDSs for complex data types, coupled with the caching and replication, integrates well within the application. With maximum recorded network data rates of 51.98 and 48.75 Kb/s for the light and medium load tests, we believe that the network is capable of handling more and can easily scale up.

Pseudo-Random Behavior
The evaluation looks at the node count, routing maintenance (leafset and routing table sizes), messaging (count and rate), memory usage, network data rate and DDS data retrieval rate. Figure 8 shows the results from this test. Figure 8a portrays the active nodes in the network. There is a notable gradual decline in the network composition over the experimental period. This is probably directly attributable to two aspects: the randomization algorithm and the absence of new participants joining the network. The leafset size, seen in Figure 8b, is stable at 25 throughout the experimental period, as expected. The routing table size shown in Figure 8c, on the other hand, increases as the network grows, reaching a maximum size of 35 at the end of the network joining phase, followed by a gradual decline corresponding to the network size reduction. FreePastry's routing management algorithm performs cleanup by link reorganization, i.e., as one link becomes unusable, the algorithm greedily selects a new link, since several paths exist between any two nodes. Hence, from our observation, we deduce the presence of dynamic route readjustments and proactive refreshing of the routing table. The network messages and network messaging rate, shown in Figure 8d,e respectively, are indicative of increased network activity as the random actions of the plugins begin (roughly after the 160th minute), with a maximum message rate of about 1300 messages/sec. The number of messages sent and the message rate drop with the reduction in network size, as anticipated. One of the requirements for the network is that each node provides a portion of its own memory to the network, which is dynamically adjusted as more memory is needed. Figure 8f shows the maximum memory provided by a single node through the test, which grows steadily to a maximum of about 850 MB at the close of the experiment. At its peak, the sending data rate, seen in Figure 8g, is recorded at about 40 KB/s.
In general, the data rate oscillates between 10 and 30 KB/s, which we consider acceptable performance for a network of this size, also because the data used during the testing process was generally less than 1 MB. The maximum retrieval rate for DDSs is recorded at about 900 kB/s at the start of execution of the algorithm, thereafter dropping steadily as the active node count reduces, as seen in Figure 8h. The results reveal the usability of LibreSocial in the real world. The social network is stable and the monitoring component captures network statistics throughout the test period, which helps in gathering insights relevant to explaining the finer workings of the system. The results thus attest to a properly working social network application, and also provide sufficient evidence that the underlying FreePastry network structure is stable and reliable even in a randomly behaving network. From the results of this test, we may conclude that the randomization algorithm may have a part to play in the eventual reduction in the network size over the test duration, although other factors may also have caused the network to diminish. Since we believe the randomization model is a good portrayal of realistic human interaction, such a significant drop in the network size is not expected, which calls for further investigation into how the model interacts with the application in triggering the test cases.
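The pseudo-random user behavior driving these tests can be illustrated with a small weighted action scheduler. The action names, weights and the leave_network action below are purely illustrative assumptions, not LibreSocial's actual test model:

```python
import random

# Hypothetical plugin actions with illustrative weights (NOT LibreSocial's
# actual probabilities): retrieval-heavy actions are made more likely than
# storage-heavy ones, mirroring typical OSN usage.
ACTIONS = {
    "wall_read": 30, "photo_view": 25, "livechat_send": 20,
    "forum_post": 10, "file_store": 5, "calendar_create": 5,
    "leave_network": 5,  # crudely models the gradual node decline observed
}

def next_action(rng: random.Random) -> str:
    """Pick the next plugin action for a node, weighted by ACTIONS."""
    names = list(ACTIONS)
    weights = [ACTIONS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed for a reproducible trace
trace = [next_action(rng) for _ in range(1000)]
leave_fraction = trace.count("leave_network") / len(trace)
```

Under this kind of model, even a small per-step leave probability compounds over a long run, which is consistent with the gradual decline in active nodes seen in Figure 8a.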

Stability and Scalability
To show the ability of the network to reach a stable state, we look at the system's reaction to both network growth and churn. From the results we can deduce how stable the network is and also make remarks on its ability to scale up and down. As the tests for both growth and churn are carried out in the same experiment, it is easy to make observations and deductions on system behavior in both situations and contrast them. We discuss the results shown in Figure 9 in the following. a) Node count: Throughout the experiment, the node count is stable in both the growth and churn phases, as depicted in Figure 9a. There is also a seamless transition from the growth phase into the churn phase. As the network is able to support up to 1000 nodes, we can deduce that the system can easily scale up. During the churn phases, we notice that despite massive churn (of up to 50%), the network remains operational. b) Routing table size: Figure 9b shows the maximum and average table size for a single node over the experimental duration. Observable is a corresponding adjustment of the routing table during the growth phase as well as during churn. The maximum value recorded is 39 nodes, but on average the maximum is 30 nodes. Despite the relatively high growth (as well as churn), the routing table size does not significantly alter. If we take 250 nodes in the network as the base, with a maximal value of 29 nodes and average of 22 nodes for the routing table size, the routing table maximally adds about 10 nodes and on average up to 9 nodes during the entire growth phase, and loses the same number of nodes during the entire churn phase. This means that the routing table adjusts accordingly to meet the demands of the network. c) Leafset size: From Figure 9c, we observe that the leafset is maintained at 25 nodes. The experimental setting for the leafset size is 24 nodes. The additional value is because the first and last value in the leafset is the node's own ID.
As the network has more than 25 nodes, it is expected that the average and maximum values for the leafset size will be the same. d) Stored data items and replicas: The replication factor for the experiment is set to 4. It would therefore be expected that the number of replicas is at least 3 times the number of local items. However, each node also stores some replicas, which become local to it. That notwithstanding, we note that generally there are about twice as many replicas as locally stored data items. This holds true throughout the experimental duration, except at the last churn step, in which the network is reduced by 50%. This may have caused a drastic reduction in the number of replicas, but nonetheless some are still present. e) Messages sent: The maximum and average messages sent by a single node are shown in Figure 9e,f respectively. There is a steady rise in messages sent per node in the growth phase, evident in both figures, with the maximum value recorded being about 1.1 million messages by a single node at the end of the growth phase. On average, during the growth phase, the maximum value is about 25,000 messages, which is acceptable. During the churn phase, we observe different characteristics in the two figures. While the maximum messages per node show a sharp decline to about 200,000 messages, the average messages show a sharp increase to about 58,000 messages. This rise in average messages may be due to route adjustment queries as nodes are removed from the network and new routes have to be established. f) Message send rate: The maximum and average rates are shown in Figure 9g,h respectively. The maximum recorded message rate is about 430 messages/sec, corresponding to the last increment of the growth phase. Essentially, the maximum message rate during growth oscillates around 100 messages/sec.
The average value for the message rate, on the other hand, reaches a maximum during the last step of churn, with a value of about 15 messages/sec. On average, the message rate oscillates between 1 and 10 messages/sec. g) Data sent: The values recorded for maximum and average data sent per node are shown in Figure 9i,j. Both figures show a direct correlation with the messages sent. The maximum recorded value for data sent by a single node is about 3.7 MB, seen at the end of the growth phase, which then drops to a minimum of slightly more than 1.0 MB during the churn phase. For the average values, during the growth phase we see a steady rise in data sent, reaching 190 kB at the end of the growth phase. During the churn phase however, rather than a decline, there is a sharp rise in data sent, reaching just below 1.0 MB at the end of the experiment. This sharp rise, just as in the case of messages sent, may be due to high messaging as well as readjustments in replica placements for nodes still in the network. h) Data send rate: Figure 9k,l show the maximum and average data sending rates. During the growth phase, the maximum sending rate hardly exceeds 2.5 kB/s. However, during the churn phase, especially as nodes leave the network, there are significant surges in the data rate, with a maximum data rate of 20 KB/s during the last step of churn. The average values are quite different. During the growth phase, the data rate hardly exceeds 20 bytes/sec, and just as with the maximum data send rates, the average shows surges during the churn phase, with a notable high of 900 bytes/sec at the last churn step. The reason for this phenomenon is the same as for the messages sent and data sent. i) DDS data stored: The maximum and average DDS data stored are shown in Figure 9m,n, respectively. During the growth phase, there is an increase in the DDS data after the first growth step.
No DDS data is recorded during the first growth phase as the workload is yet to be executed. At the end of the growth phase, the maximum size of DDS data stored is about 34 MB, with the average maximum size being 2 MB. During the churn phase, rather than a decline, there is an increase in the amount of DDS data stored per node, reaching a maximum of about 52 MB and an average maximum of 4.3 MB just before the end of the experiment. j) Total and used memory: A comparison of the maximum and average values for total allocated and used memory is shown in Figure 9o,p, respectively. We note that both the maximum and average allocated memory correlate with the graphs for messages sent and DDS data stored. This indicates that messages sent and DDS data stored have a large impact on the memory allocated. The maximum memory allocated is 900 MB, and the average maximum is about 550 MB. However, the memory that is used is less than half of the allocated memory. The maximum used is about 580 MB and the average maximum used memory is about 300 MB.
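The replica-to-local ratio discussed under d) can be illustrated with a toy calculation. The local_fraction parameter, the share of replicas a node counts as local because it happens to hold them, is an illustrative assumption introduced here, not a quantity measured in the experiments:

```python
def replica_ratio(replication_factor: int, local_fraction: float) -> float:
    """Expected ratio of replica items to locally counted items per node.

    Naive model (an assumption, not LibreSocial's accounting): each data
    item has 1 primary copy and (replication_factor - 1) replicas, and a
    share `local_fraction` of the replicas a node holds is counted as
    local rather than as a replica.
    """
    replicas = replication_factor - 1
    local = 1 + replicas * local_fraction
    return replicas * (1 - local_fraction) / local

# With replication factor 4 and no reclassification, the naive ratio is 3:
naive = replica_ratio(4, 0.0)
# Even a modest local_fraction pulls the ratio toward the ~2x observed:
adjusted = replica_ratio(4, 1 / 9)
```

This simple model shows why the observed ratio of roughly 2:1 is compatible with a replication factor of 4 once replicas counted as local items are taken into account.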
From the results, we can conclusively state that the network can scale up to at least 1000 nodes. There is a smooth growth of the network which remains stable even during the churn phase. There is an evident dynamic readjustment of the core network management protocols, such as the routing table entries and possibly the leafset entries, to accommodate the changing network in both the growth and churn phases. During the growth phase, the shared network memory is increased proportionally to match the demand and nodes are added to the routing table. During the churn phase, the nodes make adjustments by taking on more load, shown by an increase in memory rather than a decrease. Despite the significant reduction in number of nodes during churn (up to 50% network size reduction), the routing's link reorganization mechanism ensures that the routing table adjusts accordingly and the leafset remains stable. During the tests, the ability to increase the number of nodes was only limited by the resources available in the HPC cluster in terms of memory for the entire setup. Beyond this limitation, it is possible to scale up the network without any hindrance. Additionally, as long as a significant number of nodes is still online (possibly several hundred nodes), the network can adjust itself when there is a high churn rate.
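The link reorganization behavior deduced above can be sketched as a greedy replacement rule. This is an illustration of the idea only, not FreePastry's actual implementation; the routing_table and candidates structures and their latency values are assumptions:

```python
def repair_route(routing_table: dict, failed: str, candidates: dict) -> dict:
    """Replace a failed routing-table entry with a known candidate peer.

    Hedged sketch: when a link becomes unusable, greedily select the
    lowest-latency known replacement, since several paths exist between
    any two nodes in the overlay.
    """
    table = dict(routing_table)
    slot = None
    for s, peer in table.items():
        if peer == failed:
            slot = s
            break
    if slot is None:
        return table  # the failed peer was not in this table
    # candidate peers that are alive and not already routed to
    alive = {p: lat for p, lat in candidates.items()
             if p != failed and p not in table.values()}
    if alive:
        table[slot] = min(alive, key=alive.get)  # greedy: smallest latency
    else:
        del table[slot]  # no replacement known; leave the slot empty
    return table

table = {"row0": "nodeA", "row1": "nodeB"}
repaired = repair_route(table, "nodeB", {"nodeC": 12.0, "nodeD": 7.5})
```

Repeating such repairs as nodes leave is what keeps the routing table sizes in Figure 9b tracking the network size during both growth and churn.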

Effect of the Replication Factor
Figures 10-18 depict the system dynamics with the focus narrowed down to the previously selected metrics for analysis. The deductions for the results are given in the summary provided in Table 8. We considered the performance of the system against the costs and discuss the relevant metrics in the following.

System Performance
The analysis of the performance of the system considered the number of nodes in the system throughout the experiment, the hop counts, the number of replicas in the system, the maximum data storage and retrieval times, the message send and receive rates, the maximum used storage space and the maximum used system memory.

a) Node count: Figure 10a-d portray the behavior of nodes in the system. The figures indicate successful monitoring of the network nodes, and in general of the network, which is stable throughout the experimental period. Of note is the slight reduction by about 100 nodes for both the 1000 and 2000 node experiments. A possible reason for this, evident from the logs, is that the nodes experienced insufficient-memory errors resulting in overall node failure. However, this did not affect the aggregation of monitored statistics in the network. b) Hop count: The hop counts for the experiments are given in Figure 11a-d. In general, the number of hops from sender to receiver is 1 because the number of nodes in the experiments does not exceed the maximum possible limit for the routing table of 160 × 20 = 3200 entries. Hence the average hop count tends to be less than one in the network. It is however observed that as the network gets larger, there is less disparity in the network, as the graphs show an overlap, as is the case with the 1000 and 2000 node graphs. c) Number of replicas: As the number of nodes increases, as can be seen in Figure 12a-d, the number of data replicas also increases. In addition, with an increasing replication factor, there is an equivalent increase in the replicas of the same magnitude, which is expected. d) Storage time: From Figure 13a-d, two important observations can be made. First, increasing the number of nodes results in increased maximum storage time, although the increase is on the order of milliseconds and hence can be ignored. The second, more significant deduction is that increasing the replication factor results in increased maximum storage time. In general, the average storage times, as shown in Table 8, are less than one second, but we note extreme cases of maximum storage time on the order of tens of seconds, particularly the 2000 node experiment with a replication factor of 4, which recorded a maximum storage time of 91 s.
Bearing in mind that storage time includes the time to store replicas, it is possible that this discrepancy is related to the storage of data belonging to distributed data structures, such as wall posts and forum threads, which must be distributed and replicated at the same time. e) Retrieval time: This is shown in Figure 14a-d. The same observations as with storage time apply here. Again, we deduce that large retrieval times are related to distributed data structures. Interestingly, even with more replicas in the network, the maximum retrieval time does not significantly drop. The average system retrieval times are still less than one second. A significant observation is that for the 100 node tests with a replication factor of 16, a higher maximum retrieval time of 15.12 s is seen, in comparison to 0.189 and 0.6 s for replication factors 4 and 32 respectively. It is probable that this is the result of a DDS data retrieval and is an outlier. f) Message send/receive rate: From Figure 15a-d for message sending rates and Figure 16a-d for message receiving rates, it is noted that with a higher replication factor there is a drop in the average messaging rates. Also surprising is that, comparing the 100 and 200 node experiments, a replication factor of 16 seems to generate higher maximum rates than the lower replication factor tests. This may indicate that there are replication factor values that are optimal for network performance; this result may need further investigation. The average rates also dropped as the network became larger and as the replication factor increased.
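The single-hop observation under b) follows from the routing table capacity quoted there (160 × 20 = 3200 entries): every node in a 2000 node network can be addressed directly. A rough sketch, where the logarithmic fallback is the classic Pastry-style estimate rather than a measured LibreSocial property:

```python
import math

def expected_hops(n_nodes: int, id_bits: int = 160, row_entries: int = 20) -> float:
    """Rough expected overlay hops for a Pastry-style routing table.

    Assumes the table capacity quoted in the text (160 rows x 20 entries
    = 3200 entries; these defaults are taken from the paper's figure, not
    from FreePastry's configuration).  If every node fits in the table,
    routing is a single hop; otherwise hops grow logarithmically with
    network size, as in the classic Pastry analysis.
    """
    capacity = id_bits * row_entries  # 3200 for the defaults above
    if n_nodes <= capacity:
        return 1.0
    # O(log_b N) hops; ~20 entries per row corresponds roughly to base 16
    return math.log(n_nodes, 16)

hops_2000 = expected_hops(2000)    # all experiment sizes fit: one hop
hops_large = expected_hops(100000)  # a hypothetical much larger network
```

This is why the hop-count graphs for the 1000 and 2000 node experiments overlap: both sizes are well within the table capacity, so routing stays effectively direct.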

Cost Analysis
The three aspects of the costs to be evaluated are storage space usage, memory usage and network data rate. The analysis follows.

a) Used storage space: From Figure 17a-d, it is seen that if the replication factor is held constant, the average used storage space per node is almost invariably similar, with a deviation of about ±10 MB. The maximum used storage space, on the other hand, varies and no general trend can be deduced. In general, as the replication factor increases, the used space increases, both on average and at maximum. b) Used memory: In Figure 18a-d, the maximum used memory for the four node groups is shown. With fewer nodes in the network, there is very little disparity in maximum memory use for replication factors 4 and 16, but a replication factor of 32 shows a notably large maximum memory usage. For the 1000 and 2000 node tests, the lower replication factor of 4 leads to drastically higher maximum memory consumption than a replication factor of 16. This may be because there are fewer replicas in the network, thus more requests are generated, leading to more memory consumption as a suitable replica is located. It is also possible that with fewer replicas, the storing nodes experience a bottleneck in replying to all the requests. The average memory consumption tended to follow the expected norm, i.e., an increase in replication factor causes an increase in memory consumption. c) Network bandwidth: The last four columns of Table 8 show the average and maximum sending and receiving network data rates due to the workload. It is observed that increasing the number of nodes or the replication factor leads to a corresponding increase in the needed network bandwidth, as shown by the increasing maximum data rates. The average data rates for both sending and receiving are nevertheless less than 1 KB/s, except for the 2000 node test with a replication factor of 16, which recorded values slightly greater than 1 KB/s. Evidently, the replication factor plays a significant role in the system performance.
It is generally expected that the higher the replication factor, the better the system performance, but at higher costs to the network. Invariably, it was observed that the system averages for storage and retrieval are much smaller than one second. Storage and retrieval times below one second are essential in meeting the user's needs in terms of quality of experience. High maximum storage/retrieval times are seen because of the DDS structures. In essence, the use of DDSs, although allowing the system to handle complex data such as forums, albums and wall posts, has a considerable impact on the data access and storage times. When looking at the average message sending/receiving rates, higher replication factors are desired because the rates drop, notwithstanding increasing maximum message rates, which can also be attributed to storage or retrieval of complex data types. The cost implications of a replication factor increase are increased storage space usage and memory usage. In general, therefore, the choice of replication factor for storage must be carefully considered based on a performance-cost analysis. Higher replication values tend to have adverse effects on performance in smaller networks, and lower replication values have a similar effect in large networks. Hence a balance must be struck based on the network size.
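The performance-cost balance described above can be made concrete with a toy model. The 50% survival probability reflects the largest churn step used in the experiments; the item counts and sizes are illustrative assumptions:

```python
def replication_costs(replication_factor: int, items: int, item_kb: float):
    """Toy cost model for choosing a replication factor (illustrative only).

    Storage cost grows linearly with the replication factor, while the
    probability of losing every copy of an item in a churn event where
    each node independently leaves with probability 0.5 (the paper's 50%
    churn step) shrinks exponentially.
    """
    storage_kb = replication_factor * items * item_kb
    p_node_gone = 0.5                       # worst churn step in the tests
    p_item_lost = p_node_gone ** replication_factor
    return storage_kb, p_item_lost

low = replication_costs(4, 1000, 10.0)   # modest storage, ~6% loss risk
high = replication_costs(16, 1000, 10.0)  # 4x storage, negligible loss risk
```

The model captures the trade-off stated above: each increase in the replication factor buys an exponential reduction in loss probability at a linear increase in storage and, by extension, replication messaging.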

Conclusions
As opposed to many other system tests carried out in the form of simulations, in this evaluation of LibreSocial, a P2P framework for OSNs, we endeavored to benchmark large networks of the actual application, reaching up to 2000 active network nodes. Our tests do not just measure the quality and costs of the system, but also demonstrate the possibility of having a wide set of OSN functions that work well together in our P2P solution. We clearly demonstrate that the plugins constituting the OSN application integrate well with the framework layer, which provides the required P2P elements for the plugins. Our contention is that other solutions have neither this proof nor this range of functionality. In addition, to our knowledge, such large tests have not yet been performed with purely P2P-based OSNs, the majority of previous tests being simulations such as [19,21,42-44]. The evaluation presents very insightful information in general. The P2P framework, and the OSN application functions designed in the form of plugin extensions, work well. We clearly show how these OSN plugins impact the messaging and data rates in the scheme of the overall network. In addition, tests were designed to mimic a real environment with respect to random behavior characteristics, the ability to scale up and remain stable under high churn, and the effect of the replication factor on the system performance. The system's performance in a pseudo-real environment based on our randomization model was satisfactory, showing that the system works quite well with minimal errors. However, the network was observed to diminish, a behavior that is attributed to the design of the randomization algorithm.
During the scalability and stability testing, there was evidence of dynamic adjustments to suit the situation in terms of memory requirements, routing readjustments and replica diversion as well as placement, which attests to the ability of the P2P OSN to smoothly scale up and also continue to function well under high churn. This behavior points to a network that can reach a stable state quickly. The replication factor chosen for the system plays a very important role in its performance. The higher the replication factor, the higher the associated costs. However, as the network size grows, lower replication factors generally tend to result in poor performance. This non-linear effect of the replication factor on system performance may require a separate data-driven study to determine the optimal replication factor and ascertain the reason(s) for any optimal behavior observed. In conclusion, we believe that this benchmark can be a useful reference source for future P2P OSNs in terms of performance and cost analysis.

Acknowledgments: Computational support and infrastructure was provided by the "Centre for Information and Media Technology" (ZIM) at the University of Düsseldorf (Germany).

Conflicts of Interest:
The authors declare no conflict of interest.