5.2. Experiment 1: Autonomy and Robustness
In the first experiment, which lasted for more than 120 days, Airchive was exposed to irregular power and network disruptions. We did not interfere with system restoration during the “down” incidents, but let the system self-recover.
In this experiment, a moderate sampling frequency was set for the data capture component, in order to investigate the system’s long-term storage capabilities: a measurement was retrieved from each sensor every 5 min. Over the course of the experiment, 183,850 measurements were gradually collected and served. Table 2 summarizes the 24 network outage events observed during this period. Outage events were logged with UptimeRobot [47], a service that monitors web applications and notifies interested parties when an application is not accessible via the Internet. UptimeRobot was used only to log network failures and did not interfere with our system.
During the first experiment, different users submitted data queries to the system in an ad hoc manner, using the various interfaces: graph visualizations were requested by Web users, raw observations by SOS clients, and derived metadata by OAI-PMH harvesters. We did not observe any malfunction in any of the client operations. Current and historical data were monitored, stored and disseminated appropriately, while the automated recovery worked as expected. OAI-PMH records were generated on-the-fly upon harvester requests, in a timely fashion. We did not observe any notable delays in the system’s capacity to serve its clients.
5.3. Experiment 2: Stress
During the second experiment, we conducted a stress test in order to provide more insight into the system’s limitations. We investigated the number of concurrent user requests beyond which the Airchive system became slow to respond. The sampling frequency was increased to one measurement every 5 s. The experiment lasted three days, during which more than 259,000 measurements were collected and served. We utilized Locust [48], an open source load testing tool written in Python. In Locust, a variable number of clients is deployed to submit concurrent requests to a service. Each Locust client submits a new request only when it receives a response to its previous request.
Locust takes as input the following parameters: (a) the number of concurrent clients; (b) the total number of requests; and (c) a URL pointing to the requested resource. We set up three tests. In all cases, the clients submitted one hundred requests in total. The three tests involved the following requests over the Internet, via HTTP GET.
In the first test, clients request only the Airchive frontpage, which is a static HTML document. No database transactions are involved, and the response size is constant (8740 bytes). Test 1 verifies that Airchive operates properly and examines whether pressure on the Web services/dissemination components has an impact on the data capture component.
In the second test, clients request a set of 20 observations using Airchive’s API, and the response is formatted as a JSON document. This request requires an SQL query to be submitted to the database, and the response size is 538 bytes. Test 2 corresponds to the use case of a Web user that asks for a graph, as Airchive transmits the JSON document and the graph is rendered on the client-side.
In the third test, clients request the same set of observations as in Test 2, but this time over the SOS protocol, which returns an XML document. This request requires exactly the same SQL query to be submitted to the database, but needs additional formatting to render the result in XML. The response size is 16,504 bytes. Test 3 corresponds to the use case of an SOS client requesting data from Airchive.
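To make these requests concrete, a minimal Locust script of this kind could look as follows. The host and endpoint paths are hypothetical, as the exact Airchive URLs are not listed here, and the number of clients and total requests are supplied on the Locust command line (the relevant flags differ between Locust versions).

```python
from locust import HttpUser, task, constant

class AirchiveClient(HttpUser):
    # Hypothetical host; replace with the actual Airchive deployment URL.
    host = "http://airchive.example.org"
    # No think time: each simulated client fires its next request as soon
    # as the previous response arrives, as described above.
    wait_time = constant(0)

    # In the experiments each request type was exercised in a separate run;
    # the three tasks are shown together here only for brevity.

    @task
    def frontpage(self):
        # Test 1: static HTML frontpage, no database transaction involved.
        self.client.get("/")

    @task
    def api_json(self):
        # Test 2: 20 observations via Airchive's API, returned as JSON
        # (hypothetical endpoint and parameters).
        self.client.get("/api/observations?limit=20")

    @task
    def sos_xml(self):
        # Test 3: the same observations via the SOS protocol, returned as XML
        # (hypothetical GetObservation request; the real parameters depend on
        # the SOS server configuration).
        self.client.get("/sos?service=SOS&version=2.0.0&request=GetObservation")
```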
We simulated four scenarios, in each of which we deployed a different number of concurrent clients: 1, 5, 10, and 25. These are realistic figures, as the current system is not intended for large-scale deployment. We repeated the process five times for each test and scenario combination and report two metrics in Table 3: (a) the average response time (ART) in milliseconds; and (b) the number of requests served per second (RPS).
Average response time (ART) is a proxy for the average delay experienced by an external user request. For example, under the scenario of 25 concurrent clients in Test 3, a user would have to wait 49.5 s on average, plus the response time of their own request. As indicated by the results in Table 3, the average response time scales approximately as the product of (a) the number of concurrent clients and (b) the average response time achieved when a single client submits requests. We verify that the number of requests per second (RPS) depends on the type of the requested document and remains rather stable regardless of the number of concurrent clients.
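Stated as a rough approximation in our own notation (the relation itself is the one observed in Table 3), with $ART_1$ the average response time measured for a single client and $N$ the number of concurrent clients:

$$ ART_N \approx N \cdot ART_1, \qquad RPS_N \approx \frac{N}{ART_N} \approx \frac{1}{ART_1}, $$

which is consistent with RPS remaining roughly constant as $N$ grows: requests are effectively served one after another at a rate fixed by the single-client response time.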
The overhead introduced in system response times depends on the request and format type. Requests involving dynamic content are roughly 30 times slower than requests for static content. In the case of dynamic content requests, JSON-formatted responses are served 16% faster than the equivalent XML responses.
Interpreting the results, we can derive the number of concurrent (human) Web users that the system can serve. Assuming that a human user should not wait more than 6 s, we conclude that Airchive can simultaneously serve up to two Web users of the SOS/XML Web service (Test 3), or three Web users of the API/JSON Web service (Test 2). In the case of static content (Test 1), Airchive is able to serve up to 82 clients simultaneously. The numbers above do not represent Airchive’s maximum capabilities, but rather its capacity for serving content to Web users.
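Under the same approximation, these figures amount to the simple bound $N_{\max} \approx \lfloor T_{\max} / ART_1 \rfloor$ with a waiting budget of $T_{\max} = 6$ s; only the 6 s budget and the resulting client counts come from our measurements, while the rearrangement is given here for clarity.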
In contrast, software agents interacting with such a system are usually not bound by any time limitation. We conducted further experiments to determine the threshold beyond which the system started failing to respond to requests. We increased the total number of requests to 500 and kept increasing the number of concurrent clients in steps of five until requests started to fail. Airchive can simultaneously serve, without any failures, up to 254 clients (Test 1), 141 clients (Test 2), and 138 clients (Test 3). Beyond these numbers, the system continued to respond, but with more than one failed request. The results are summarized in Table 4. These tests demonstrate Airchive’s capacity to work reliably with a significant number of clients.
We underline that, despite the heavy workload introduced during the stress tests, AirPi continued to operate normally. In all of the experiments above, we verified against the database content that observations were recorded every 5 s without any loss (i.e., data dissemination does not interfere with data capture).
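Such a check can be run directly against the stored observations. The following is a minimal sketch, assuming an SQLite database with an `observations` table holding ISO-formatted timestamps; the file, table, and column names are hypothetical, as the actual Airchive schema is not reproduced here.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical database file and schema; adjust to the actual Airchive setup.
conn = sqlite3.connect("airchive.db")
rows = conn.execute(
    "SELECT timestamp FROM observations ORDER BY timestamp"
).fetchall()
timestamps = [datetime.fromisoformat(r[0]) for r in rows]

# Any gap noticeably larger than the 5 s sampling period would indicate
# lost measurements during the stress test.
tolerance = timedelta(seconds=6)
gaps = [
    (earlier, later)
    for earlier, later in zip(timestamps, timestamps[1:])
    if later - earlier > tolerance
]
print(f"{len(gaps)} gaps larger than the sampling period")
```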
5.4. Incidents and Lessons
During experiment 1, network failures occurred quite often. These failures impeded only Web connectivity; apart from the Web server, the rest of the Airchive components continued to operate properly. We verified that no data loss occurred by cross-checking the downtime intervals logged by UptimeRobot against the actual observations stored in the database.
We observed that the system was able to handle power failures, and it self-restored without human intervention. For all 24 outage events during experiment 1, Airchive recovered properly by making the Web service available as soon as the Internet connection returned. In this respect, the system demonstrated its persistence and credibility as a repository.
During the experiment setup, we stumbled upon a recurring security incident. Given that the Raspberry Pi was constantly connected to the Internet, it attracted malicious users soon after its first boot. We experienced a brute-force attack against the SSH service that was trying to gain unauthorized access to the device. We hardened Airchive with a dedicated security software solution (fail2ban [49]), which prevented any further security incidents of that kind.
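For reference, a minimal fail2ban jail of this kind could look as follows; the retry and ban values are illustrative, not necessarily the ones used in our deployment.

```ini
# /etc/fail2ban/jail.local -- minimal SSH jail; values are illustrative.
# (The jail is named "sshd" in recent fail2ban releases; older releases use "ssh".)
[sshd]
enabled  = true
port     = ssh
# Ban a host for one hour after five failed attempts within ten minutes.
maxretry = 5
findtime = 600
bantime  = 3600
```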
Another lesson learned had to do with a potential issue that may arise when power and network fail at the same time. The Raspberry Pi lacks a built-in Real-Time Clock (RTC) and synchronizes its system clock over the Internet. If an Internet connection is not available at system boot, the Raspberry Pi system time remains incorrect. In the general use case of Airchive this is not a problem, but in our experiments it would result in errors in the data capture component, which would assign incorrect timestamps to data sensed from the HAT. This problem can be overcome by having the data capture component retrospectively correct these timestamps once the Internet becomes available. Alternatively, an RTC HAT can be purchased and attached to the Raspberry Pi; however, this option increases the total cost.
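One way to implement such a retrospective correction, sketched here under our own assumptions (the record structure and function names below are hypothetical), is to store the monotonic uptime alongside each observation and recompute wall-clock timestamps once the clock has been synchronized:

```python
import time
from datetime import datetime, timedelta, timezone

def capture(read_sensor):
    # Store the monotonic uptime instead of a (possibly wrong) wall-clock time.
    # time.monotonic() is unaffected by clock adjustments, so differences
    # between calls within the same run give the true elapsed seconds.
    return {"uptime": time.monotonic(), "value": read_sensor()}

def backfill_timestamps(pending_records):
    # Call this once the system clock has been synchronized (e.g., via NTP).
    now = datetime.now(timezone.utc)
    uptime_now = time.monotonic()
    for record in pending_records:
        elapsed = uptime_now - record["uptime"]
        record["timestamp"] = now - timedelta(seconds=elapsed)
    return pending_records

# Example: buffer readings while offline, then backfill after synchronization.
records = [capture(lambda: 21.3) for _ in range(3)]
backfill_timestamps(records)
```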
Last but not least, during the setup phase we experimented with booting Raspbian and running Airchive from a USB disk instead of the SD card. This task requires advanced technical skills, is still an experimental option not endorsed by the Raspberry Pi makers, and performance is not guaranteed. USB disks provide a cheaper storage option but proved prone to failure: during the one month we ran this configuration, the filesystem was corrupted twice, requiring the operating system and Airchive to be re-installed. The observed data were not permanently lost, but their retrieval required technical skills. In contrast, no such incidents occurred when the system operated from an SD card for a much longer period.