Folding@home is continuing its growth spurt! There are now 3.5 M devices participating, including 2.8 M CPUs (19 M cores!) and 700 K GPUs.
According to Figure 1, the volunteer computing project Folding@home even had access to more computing power than all TOP-500 supercomputers together could provide! Just two months earlier, only 30,000 devices contributed to this project. So, what caused this tremendous increase in processing power? It was COVID-19 and the willingness of hundreds of thousands of “nerds” to fight COVID-19 by supporting SARS-CoV-2 computational bioscience research that was processed by the Folding@home project.
Our research mainly deals with cloud computing and the corresponding software engineering and architecture questions. However, like many others, we were unexpectedly confronted with the impacts of the COVID-19 shutdown and spontaneously decided on 23 March 2020 to support SARS-CoV-2-related computational bioscience research. Our lab set up two 32-CPU virtual machines on our virtualization cluster, which is usually used for our cloud computing research and student projects.
Moreover, we called on our students for support, and they came on board almost immediately, providing their most valuable assets: gaming graphics cards. On 1 May 2020, we ranked 2145th of 252,771 teams worldwide (https://stats.foldingathome.org/team/245623). Thus, we were suddenly among the top 1% of contributors as a very modest-sized University of Applied Sciences! Even the local newspaper reported on this. Furthermore, it is essential to mention that such teams formed not only in Lübeck but all over the world. In total, the biggest supercomputer on earth emerged more or less by accident.
We were impressed by our students and by these numbers. Nevertheless, we also looked at the usage statistics of our provided machines gathered by our virtualization cluster. These data were disappointing (see Figure 2). Our machines, although configured to run at full power, only reached usage rates between 40% and 50%. In other words, we had assigned two machines, but, on average, only one was used. At the end of March, this was even more severe. In that phase, many teams and individual contributors came on board, and the Folding@home network grew massively. However, the effect was that our machines ran at ridiculously low usage rates. We first checked whether we had misconfigured the machines, but they were operating correctly. The processing pipelines were empty, and the master nodes needed hardware upgrades to handle all these new processing nodes. The control plane of the Folding@home project could not scale out fast enough. In cloud computing, we would say that Folding@home was not elastic enough. It took all of April to make full use of the provided volunteer resources (see Figure 2).
One obvious question arises here. Would it not be better in such a situation to use the contributed nodes for other VC projects? If we cannot fight COVID-19, we might contribute to mathematical research, climate research, cancer research, or any other computationally intensive kind of research. Most contributors will contribute their resources regardless of the specific aim of a VC project. Sadly, the reader will see throughout this paper that VC is not well prepared for shifting processing resources across projects, which can lead to undesirable usage scenarios like the one shown in Figure 2.
Therefore, this paper deals with the question of how to improve the resource sharing of future VC projects. We have done similar transfer research for another domain and asked what could be beneficially transferred from the cloud computing domain to the simulation domain [1]. Because we derived some valuable insights for the simulation domain, we will follow a quite similar methodology here (see Figure 3).
The main contribution of this paper is to provide and explain engineering ideas and principles taken from the cloud computing domain that have been invented, optimized, and successfully implemented by so-called cloud-native companies like Netflix, Uber, Airbnb, and many more. COVID-19 demonstrated that VC has some shortcomings regarding elasticity and efficient use of shared resources. Precisely these two points have been optimized throughout the last ten years in cloud computing [2]. Cloud-native companies invented a lot of technology to make efficient and elastic use of shared resources [3]. These companies were forced to invent these resource optimization technologies because it turned out that cloud computing can be costly if used inefficiently.
Thus, this paper strives to transfer some potentially beneficial lessons learned [4] from the cloud computing domain to the VC domain. Consequently, VC could be better prepared for flexible sharing of processing resources across different VC projects. VC could share donated resources rather than claim them.
We start by reviewing the current state of VC in Section 2 and do the same for the cloud computing domain in Section 3. Section 4 will analyze what both domains could learn from each other and will derive some requirements and promising architectural opportunities for future VC projects. Section 5 will present the corresponding related work from the cloud and VC domains to provide interesting follow-up reading. We will conclude with our insights in Section 6 and forecast more standardized deployment units and more integrated but decentralized control planes for VC.
3. A Review of the Current State of Cloud Computing
The reader should be aware that this section is mainly a summary of [3] to provide a more convenient reading experience. As already stated, the COVID-19 case of Folding@home disclosed some “lock-in” shortcomings of VC regarding elasticity and efficient use of shared resources. This “lock-in” shows some astonishing parallels with cloud computing. Precisely these two points (elasticity and resource utilization) have been optimized throughout the last ten years in cloud computing [4]. These lessons learned originate in plain economic considerations of cloud-native companies like Netflix, Uber, Airbnb, and many more. However, the resulting technological solutions might be rewarding for VC because they happen to address some of the identified shortcomings. Therefore, we want to focus on these core insights because they might be used to address and solve some of the problems of VC identified in Section 2.3.
According to our experiences and action research activities over the last ten years, cloud computing is dominated by two major long-term trends. In the first adoption phase of cloud computing, existing IT systems were merely transferred to cloud environments. The original design and architecture of these applications were not changed. Applications were only migrated from dedicated to virtualized hardware. Over the years, cloud system engineers implemented remarkable improvements in cloud platforms (PaaS) and infrastructures (IaaS). In particular, we investigate resource utilization improvements (Section 3.1) and the architectural evolution of cloud applications (Section 3.2).
3.1. A Review of the Resource Utilization Evolution
Cloud-native applications are built to be elastic. Otherwise, cloud computing would often not be reasonable from an economic point of view [65]. Elasticity is understood as the degree to which a system adapts to workload changes. Over time, systems were designed intentionally for such elastic cloud infrastructures. Accordingly, the utilization rates of the underlying computing infrastructures increased. New deployment and design approaches like containers, microservices, or serverless architectures evolved.
shows a noticeable trend over the last decade. Machine virtualization consolidated plenty of bare-metal machines and formed the technological backbone of IaaS cloud computing. Virtual machines might be more lightweight than bare-metal servers. However, containers are much more fine-grained and improved two things: they standardized deployments, and they increased the utilization rates of virtual machines. Nevertheless, containers are still always-on components. Thus, Function-as-a-Service (FaaS) approaches evolved and introduced time-sharing concepts in container platforms. Using FaaS, units are only executed if requests have to be processed. Therefore, FaaS enables scale-to-zero deployments and improves resource efficiency [66]. Thus, the technology stack, although getting more complicated, followed the trend to run more workload on the same number of physical machines by shrinking the size of standardized deployment units, be it virtual machines, containers, or functions.
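To make the effect of scale-to-zero tangible, the following minimal sketch compares the utilization of an always-on unit with that of a unit that only occupies resources while requests are processed. The workload numbers and function names are assumptions chosen purely for illustration, and cold-start overhead is deliberately ignored.

```python
def occupied_seconds_always_on(period_s: float) -> float:
    """An always-on unit (VM or container) occupies resources for the whole period."""
    return period_s


def occupied_seconds_on_demand(requests: int, seconds_per_request: float) -> float:
    """A function only occupies resources while it processes requests."""
    return requests * seconds_per_request


def utilization(busy_s: float, occupied_s: float) -> float:
    """Share of the occupied time that is spent on useful work."""
    return busy_s / occupied_s if occupied_s > 0 else 0.0


# Hypothetical workload: 10,000 requests per day, 0.2 s of processing each.
DAY_S = 24 * 60 * 60
busy = occupied_seconds_on_demand(10_000, 0.2)  # about 2,000 s of actual work

always_on = utilization(busy, occupied_seconds_always_on(DAY_S))
on_demand = utilization(busy, occupied_seconds_on_demand(10_000, 0.2))

print(f"always-on utilization:     {always_on:.1%}")  # 2.3%
print(f"scale-to-zero utilization: {on_demand:.1%}")  # 100.0%
```

In practice, the on-demand figure would be lower because instances need time to start and are usually kept warm for a while, but the order-of-magnitude gap is the point of the comparison.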
This virtual machine → container → function evolution of resource utilization is accompanied by corresponding architectural approaches:
Service-oriented architectures (SOA) fitted very well with monolithic deployment approaches that can be provided using standardized virtual machines (IaaS).
Microservice architectures are built on top of loosely coupled and independently deployable services. These services can be provided via much smaller and standardized containers. We could rate Microservices as a kind of standardized PaaS cloud service provision model.
Finally, serverless architectures are mainly event-driven service-of-service architectures whose functionality is provided as “nano”-services via functions. Serverless and FaaS are the latest trends in cloud computing, so functions are not standardized yet. However, more and more serverless approaches from the Cloud Native Computing Foundation (CNCF) ecosystem, like Kubeless (https://kubeless.io), Knative (https://knative.dev), or OpenWhisk (https://openwhisk.apache.org), make use of containers to package and deploy functions. Thus, it seems likely that containers might evolve into the de facto deployment unit format not only for microservices but also for functions.
3.2. A Review of the Architectural Evolution
The reader observes that cloud-native applications aim for better resource utilization by applying more fine-grained deployment units—for instance, containers instead of virtual machines or functions instead of containers. Improvements in resource utilization rates always had an impact on cloud-native architecture styles. Let us now investigate the two major architectural trends that might be of most interest from a VC perspective.
3.2.1. Microservice Architectures
Microservices form “an approach to software and systems architecture that builds on the well-established concept of modularization but emphasizes technical boundaries. Each module—each microservice—is implemented and operated as a small yet independent system, offering access to its internal logic and data through a well-defined network interface. This architectural style increases software agility because each microservice becomes an independent unit of development, deployment, operations, versioning, and scaling .”
Faster delivery, improved scalability, and greater autonomy are frequently mentioned benefits of microservice architectures [67]. Different services are independently scalable based on actual request stimuli. Because services can be developed and operated by different teams, they have not only a technological but also an organizational impact. Thus, localized decisions per service regarding programming languages, libraries, frameworks, and more are possible and enable best-of-breed approaches.
Besides the pure architectural point of view, the following tools, frameworks, services, and platforms form the current understanding of the term microservice:
Service discovery technologies decouple services from each other. Services do not need to refer explicitly to network locations.
Container orchestration technologies automate container allocation and management tasks.
Monitoring technologies enable runtime monitoring and analysis of the runtime behavior of microservices.
Latency- and fault-tolerant communication libraries enable efficient and reliable service communication in permanently changing configurations.
Service proxy technologies provide service discovery and fault-tolerant communication features that are exposed over HTTP.
A complex tool-chain evolved to handle the continuous operation of microservice-based cloud applications [3]. We should keep this in mind for VC and adopt only the concepts that are strictly necessary.
3.2.2. Serverless Architectures
The serverless computing model allocates resources dynamically and intentionally out of the control of the service customer. The ability to scale to zero resources might be the most critical differentiator of serverless platforms compared with other IaaS- or PaaS-based cloud platforms. This scale-to-zero capability excludes the most expensive always-on usage pattern [65]. Consequently, the term “serverless” is attracting more and more attention [67]. However, where have all the servers gone? Processing resources must still exist somehow.
Serverless architectures make substantial use of Function-as-a-Service (FaaS) concepts and platforms [69] and integrate third-party backend services more intensively. Figure 7 shows this evolution over the last ten years. FaaS platforms realize time-sharing of resources and increase the utilization factor of computing infrastructures. Cost reductions of 70% are possible [66].
A FaaS platform is nothing more than an event processing system (see Figure 8). Serverless platforms take an event and then determine which functions are registered to process the event [70]. If no function instance is present, a new one is created. Event-based applications are particularly well suited for this approach [70].
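The event processing behaviour described above can be sketched in a few lines. The following toy dispatcher is an illustration only, not the implementation of any real FaaS platform: the class name `MiniFaaS`, the idle timeout, and the event type are hypothetical. It looks up the function registered for an event, creates an instance if none is running (a "cold start"), and removes idle instances again to scale to zero.

```python
import time
from typing import Any, Callable, Dict, Optional

Handler = Callable[[Any], Any]


class MiniFaaS:
    """Toy event processing system: function instances exist only while needed."""

    def __init__(self, idle_timeout_s: float = 60.0) -> None:
        self.registered: Dict[str, Handler] = {}  # event type -> function code
        self.running: Dict[str, float] = {}       # event type -> last-used timestamp
        self.idle_timeout_s = idle_timeout_s
        self.cold_starts = 0

    def register(self, event_type: str, handler: Handler) -> None:
        self.registered[event_type] = handler

    def handle(self, event_type: str, payload: Any, now: Optional[float] = None) -> Any:
        now = time.monotonic() if now is None else now
        if event_type not in self.registered:
            raise KeyError(f"no function registered for {event_type!r}")
        if event_type not in self.running:
            self.cold_starts += 1  # no instance present: create a new one
        self.running[event_type] = now
        return self.registered[event_type](payload)

    def reap_idle(self, now: Optional[float] = None) -> None:
        """Scale to zero: drop instances that have been idle for too long."""
        now = time.monotonic() if now is None else now
        for event_type, last_used in list(self.running.items()):
            if now - last_used >= self.idle_timeout_s:
                del self.running[event_type]


# Hypothetical usage with an explicit clock to keep the example deterministic.
platform = MiniFaaS(idle_timeout_s=60.0)
platform.register("image.uploaded", lambda payload: payload.upper())
result = platform.handle("image.uploaded", "cat.png", now=0.0)
print(result, platform.cold_starts)  # CAT.PNG 1
```

Between events, `reap_idle` leaves no instance behind, which is the time-sharing property that distinguishes this model from always-on containers.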
In summary, the following observable engineering decisions in serverless architectures are worth being mentioned:
Cross-sectional logic, like authentication or storage, is outsourced to external third-party services.
End-user clients or edge devices perform the service composition. Thus, service orchestration is not done by the service provider but by the service consumer via provided applications.
Endpoints using HTTP- and REST-based/REST-like communication protocols that can be provided easily via API gateways are generally preferred.
Only very domain- or service-specific functions are provided on FaaS platforms.
Thus, the serverless design is generally more decentralized and distributed. It makes more intentional use of independently provided services and is therefore much more intangible compared with microservice architectures.
In particular, this distribution characteristic seems to make it more suitable for VC. Thus, serverless principles might be preferable to microservice principles when considering VC.
4. Discussion of Technological and Architectural Opportunities for Future Volunteer Computing
In Section 3, the reader got to know how the design of so-called cloud-native systems has changed. Technologies that have been massively improved, integrated, and simplified throughout the last ten years are containers, image registries, service registries, and service proxies. The primary motivation for these changes has been economic. If this had not been done, cloud-deployed systems would waste valuable (and costly) cloud resources. Similar efficiency problems could be observed during the COVID-19 crisis in the Folding@home project (see Figure 2). Therefore, this section will investigate whether and how these cloud-native improvements could be used to evolve the VC model that has been summarized in Section 2.2 and Figure 6.
In short, this perspective paper proposes to transform the current situation of isolated VC project networks into a more meshed variant (see Figure 9). To this end, all VC projects must share their master endpoints in a standardized way. The vision is that a worker should only need to know one IP address of the global VC network to gain information about all other existing and available VC projects and endpoints. It is then up to the worker to decide to which projects it would like to contribute. In addition, a worker should always be able to contribute to more than one project at a time. Thus, if a favored project does not request resources, the worker could fetch tasks from other VC projects according to a priority list or any other kind of prioritization. If this had been possible, the COVID-19 case of the Folding@home project (see Figure 1) would not have happened. Unused resources would have been automatically spent on other projects that address climate change, cryptography, number theory, or any other field.
4.1. Standardization of Deployment Units
Therefore, this perspective paper proposes in Figure 10 an evolved and container-based master–worker architecture extending the typical VC master–worker architecture (see Figure 6).
The components marked in grey extend or change specific parts of the VC reference architecture explained in Section 2.2. First of all, the proposed architecture considers more than one VC project. All master nodes provide a standard Service Registry component that is shared by all VC projects to enable complete VC project awareness for worker clients. Thus, a worker can choose to which project it would like to provide its computing resources. To make this freedom of choice possible, all VC projects must provide their workload in a standardized but flexible deployment format. The architecture therefore proposes to use container images as the deployment format. Signed images are provided via public Image Registries. A VC project can operate a project-specific image registry. Alternatively, public image registries (like Docker Hub, https://hub.docker.com) would work as well.
Thus, the worker client can be reduced to a standardized container runtime engine (for example, Docker or any other OCI-compliant container runtime engine) that fetches provided images and executes them as Worker Functions in a FaaS-like style (see Figure 8). However, worker functions are accompanied by a so-called side-car container [67] (VC proxy) that handles common VC communication patterns and can be called by the VC Runtime Environment (a lightweight wrapper around an OCI-compliant container runtime environment). The VC proxy and the VC Runtime Environment are designed to be standardized components that need not be adapted by a VC project. Furthermore, the VC proxy provides a standard interface for worker functions. Obviously, the worker function is the VC project-specific part that realizes the data processing. However, because the proposal makes use of container technologies, it can be provided as a standardized deployment unit. Container technologies furthermore enable polyglot programming, so the programming language for implementing worker functions can be chosen freely. Current VC platforms like BOINC often require including specific libraries, which are very often written in C/C++.
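One possible shape of such a language-agnostic interface is sketched below. It assumes, as an illustration only, that a worker function is any program that reads a JSON document from standard input and writes a JSON result to standard output. A plain subprocess stands in for the worker function container here, and the names `WORKER_SOURCE` and `call_worker` are hypothetical.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List

# Hypothetical worker function. It could be written in any language, because
# the contract is only "JSON on stdin, JSON on stdout".
WORKER_SOURCE = """
import json, sys
numbers = json.load(sys.stdin)["numbers"]
json.dump({"sum": sum(numbers)}, sys.stdout)
"""


def call_worker(
    command: List[str], payload: Dict[str, Any], timeout_s: float = 30.0
) -> Dict[str, Any]:
    """Proxy side: hand the input to the worker process and parse its result."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return json.loads(completed.stdout)


# A subprocess stands in for starting the worker function container.
result = call_worker([sys.executable, "-c", WORKER_SOURCE], {"numbers": [1, 2, 3]})
print(result)  # {'sum': 6}
```

Because the proxy only depends on the input and output format, a project could replace the worker with a program written in another language without touching the proxy or the runtime environment.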
summarizes the selected and discussed cloud-native technologies and explains how these technologies can be used to realize the proposed architecture (see Figure 10) to mitigate the open VC issues reported in the literature. To do this, well-tried technologies that form the backbone of modern cloud-native architectures could be adopted: containers, even more fine-grained (but container-based) functions, image registries, and distributed service registries seem very promising here. Thus, like the BOINC approach, a middleware-based approach is proposed to share resources between different VC projects. However, most parts of this middleware already exist (the technologies mentioned above and listed in Table 2). Thus, the proposal does not require a completely new middleware or framework. Still, the proposal requires a VC-specific integration of these technologies that goes beyond the BOINC approach (dating back to the early 2000s, that is, to a pre-cloud era). This VC-specific integration is called VC Runtime Environment in Figure 10.
4.2. Client-Side Service Discovery Initiated Workflow
A publish–subscribe communication model between the client and the master nodes suggests itself. However, the publish–subscribe model assumes to some degree that both client and master components are always-on components (not perfectly, but reasonably reliable). This assumption does not fully hold in volunteer computing, especially not on the client side. Thus, the proposed approach follows the publish–subscribe philosophy but evolves it into a straightforward and purely client-side triggered approach. In VC, the clients form the ephemeral parts of the complete system. Taking the insights of [72], one can simply expect less error-tracking effort if the ephemeral components (clients) query and request the stable parts (masters).
In VC, we are talking about hundreds of VC projects operating thousands of master nodes that mainly operate in an always-on mode. On the worker side, we are talking about a much more volatile setting of millions of client nodes that are only sporadically available. Therefore, we generally advocate client-side service discovery and distributed server-side service registries because the stable parts are on the master side rather than on the client side. Thus, client-side service discovery has to query far fewer moving and changing parts in this setting (thousands of service endpoints on the master side instead of millions of service endpoints on the client side). Whenever a client is available, it can perform (well cacheable) client-side service discovery, ask for VC tasks, and process them according to the client's preferences and priorities. Thus, client-side service discovery can be used to create a naturally occurring workload sharing across different VC projects (and not just within a VC project).
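A minimal sketch of such cacheable client-side discovery is given below. It assumes that each registry replica can be modelled as a callable returning a mapping from project names to master endpoints; the class `ProjectDiscovery`, the time-to-live value, and the example endpoint are hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional

Registry = Callable[[], Dict[str, List[str]]]  # project name -> master endpoints


class ProjectDiscovery:
    """Client-side cache of the project list published by the master nodes."""

    def __init__(self, registries: List[Registry], ttl_s: float = 6 * 3600) -> None:
        self.registries = registries
        self.ttl_s = ttl_s
        self._cache: Dict[str, List[str]] = {}
        self._fetched_at: Optional[float] = None

    def projects(self, now: Optional[float] = None) -> Dict[str, List[str]]:
        now = time.time() if now is None else now
        if self._fetched_at is not None and now - self._fetched_at < self.ttl_s:
            return self._cache  # still fresh: no network traffic at all
        for query in self.registries:  # any reachable registry replica will do
            try:
                self._cache = query()
                self._fetched_at = now
                break
            except ConnectionError:
                continue  # try the next replica
        return self._cache  # possibly stale, but still usable


# Hypothetical registries: the first one is down, the second one answers.
calls = {"primary": 0, "replica": 0}


def primary() -> Dict[str, List[str]]:
    calls["primary"] += 1
    raise ConnectionError("primary registry unreachable")


def replica() -> Dict[str, List[str]]:
    calls["replica"] += 1
    return {"protein-folding": ["https://master.example.org"]}


discovery = ProjectDiscovery([primary, replica], ttl_s=3600)
known = discovery.projects(now=0.0)      # queries primary (fails), then replica
cached = discovery.projects(now=1800.0)  # answered from the local cache
print(known, calls)
```

Since millions of sporadically available clients only refresh their cache once per time-to-live interval, the query load on the thousands of master-side endpoints stays modest.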
The client-side initiated workflow loops through the steps ➊ to ➏ shown in Figure 10:
In Step ➊, the VC Runtime Environment of a worker node queries periodically (for example, each day, every six hours or similar) the distributed VC project discovery service that is formed by master nodes of various VC projects. This updates a worker node’s VC project awareness to decide which master nodes to ask for processing tasks.
In Step ➋, the VC Runtime Environment of a worker node selects a master node according to its updated project awareness and fetches a task (including the data to be processed). If this fails (for instance, the master node might not be available, might have no jobs, etc.), another task from another master node (even from a different project) is fetched according to worker node preferences.
In Step ➌, the VC Runtime Environment analyzes the task and triggers a corresponding Function pull from a public image registry to fetch (if not already present) and start the VC project-specific Worker function container image. Therefore, the address of the image registry, the unique image name of the Worker function, and the image version must be part of the task description. Furthermore, the task description must contain the URL of the data to be processed.
In Step ➍, the VC Runtime Environment hands control over to a VC proxy. This proxy handles the communication with the Worker function and decouples the runtime environment from the Worker function.
In Step ➎, the VC proxy calls the Worker function with the data to be processed and receives the result.
Finally, in Step ➏, the VC proxy can even do the result verification on the worker side (and not on the master). Like the Worker function, the VC proxy is simply a container that is instantiated from a trusted image and may, therefore, contain signed result verification logic that cannot be tampered with unnoticed. As a last step, the VC proxy transmits the result to the assimilation endpoint of the master node (this endpoint must also be part of the task description).
The process would go on with step ➋ (or step ➊ if a periodic update of the VC project awareness is necessary). It would address the resource sharing problem efficiently and with a simple client-side strategy.
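The steps above can be condensed into the following sketch. It is not a validated implementation: the master node, the image pull, the download, and the verification are replaced by in-memory stand-ins, the VC proxy is collapsed into plain function calls, and all field names, function names, and URLs are hypothetical. Only the fields of the task description follow directly from the requirements stated in Steps ➌ and ➏.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class TaskDescription:
    """Fields required by the workflow; the field names are hypothetical."""
    image_registry: str         # address of the image registry
    image_name: str             # unique image name of the worker function
    image_version: str
    data_url: str               # URL of the data to be processed
    assimilation_endpoint: str  # where the result has to be sent


@dataclass
class Master:
    """In-memory stand-in for a master node with a task queue."""
    queue: List[TaskDescription]
    results: Dict[str, int]

    def fetch_task(self) -> Optional[TaskDescription]:
        return self.queue.pop(0) if self.queue else None


def run_once(
    masters: List[Master],  # known from step 1 (project discovery)
    pull_image: Callable[[TaskDescription], Callable[[bytes], int]],
    download: Callable[[str], bytes],
    verify: Callable[[int], bool],
) -> bool:
    """Steps 2 to 6 for a single task; returns False if no master had work."""
    for master in masters:                  # step 2: select a master, fetch a task
        task = master.fetch_task()
        if task is None:
            continue                        # no jobs: try the next master
        worker_function = pull_image(task)  # step 3: function pull
        data = download(task.data_url)
        result = worker_function(data)      # steps 4 and 5: proxy calls the worker
        if verify(result):                  # step 6: verify, then transmit
            master.results[task.assimilation_endpoint] = result
        return True
    return False


task = TaskDescription(
    image_registry="registry.example.org",
    image_name="vc/wordcount",
    image_version="1.0",
    data_url="https://data.example.org/chunk-1",
    assimilation_endpoint="https://master.example.org/results/1",
)
idle_master = Master(queue=[], results={})
busy_master = Master(queue=[task], results={})

did_work = run_once(
    [idle_master, busy_master],
    pull_image=lambda t: (lambda data: len(data.split())),
    download=lambda url: b"to be or not to be",
    verify=lambda r: r >= 0,
)
print(did_work, busy_master.results)
```

The worker skips the idle master without any coordination between the two projects, which is exactly the cross-project overflow behaviour the workflow is meant to provide.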
5. Critical Discussion and Related Work
The proposed approach mainly targets volunteer grid computing projects and, in particular, the mentioned shortcomings of cross-project resource sharing. The main intention is to improve resource sharing across VC projects and to make it easier to set up VC projects using established and well-accepted cloud-native technologies. Therefore, it shares the limitations of every other volunteer grid computing project. Thus, the architecture supports embarrassingly parallel tasks that need little or no communication among the tasks [5]. If this is not the case, volunteer parallel computing projects or even HPC supercomputing might be a better fit. However, this paper does not focus on this kind of parallel VC or on HPC. The paper does not even claim to have or provide answers of value for these parallel computing or HPC supercomputing domains.
Furthermore, the proposed approach conceptually follows a middleware-based path and, therefore, shares comparable limitations with the BOINC approach [13]. Achieving project awareness and establishing the necessary trust are complex tasks in themselves, even in a single-research VC project. Single-research projects might even have advantages here because contributors do not need any complex overview of various research or other projects. However, this aspect gets more complicated for a middleware-based approach [74]. The contributors need project awareness and some means of project prioritization. Some contributors want to support disease-related projects but might not be interested in supporting prime-number mathematical research. In the case of BOINC, a kind of trusted community already exists, all BOINC-based projects have passed a kind of quality gate, and all BOINC-based projects are research-focused. This gatekeeper role of BOINC makes it easier for BOINC contributors to establish trust. However, projects that the gatekeeper is not aware of are not visible to contributors either. Thus, multi-project gatekeeping in VC always involves a kind of censorship.
Therefore, the proposed approach intentionally allows setting up VC infrastructures that could process arbitrary computations, for instance, mundane crypto-mining for purely monetary reasons. We do not think that crypto-mining (or other questionable motivations) would be a useful and ethical form of VC. However, this should not be the decision of a gatekeeping institution without democratic legitimation but the personal choice of every single VC contributor according to their own criteria [75].
Therefore, this proposal intentionally does not assume a well-trusted gatekeeper that filters qualified from non-qualified projects. However, this missing gatekeeping role makes it more complicated for contributors to select VC projects that are worth being supported.
The reader should take additional surveys on VC, like [5], or on cloud computing [2], into account to derive their own conclusions and discuss this perspective paper critically. For instance, Ref. [5] provides a broad and excellent overview of several grid-based, cloud-based, mobile, and parallel VC projects, frameworks, and technological approaches. As the reader may have noticed, this perspective paper is highly influenced by [5]. However, to the best of the author’s knowledge, there does not exist any survey that covers cloud computing and VC in parallel, focusing particularly on the aspect of how to combine concepts of both domains to overcome the project exclusiveness problem in VC.
It is interesting to see that grid- and cloud-based projects form the majority of VC projects (see Figure 4). Thus, there is some kind of attraction in adopting Cloud technologies for VC (see Figure 5). In particular, recent cloud-native trends like standardization of fine-grained deployment units via containers provide exciting opportunities. “The efficient use of available resources and tight integration with the host operating system makes container technologies a plausible choice for VC. [...] More research is important to investigate the suitability and efficiency of container technologies for VC, specifically for volunteer cloud systems”.
However, this paper focuses less on volunteer clouds and instead proposes adopting container technologies in VC to improve the shareability and portability between VC projects and thereby enable overflow processing between different projects. Projects like [77] strive to do something similar by integrating supercomputing with VC and Cloud Computing. However, their focus is more on how to make the high-performance end of supercomputing data centers available for VC.
The COVID-19 pandemic created the largest volunteer supercomputer on earth. However, this largest supercomputer on the planet ran idle for a significant amount of time, which is a considerable waste of resources. Therefore, this perspective paper investigated how the sharing of donated resources across VC projects could be improved. If one cannot fight COVID-19, one might contribute to mathematical research, climate research, other disease research, or any other computationally intensive kind of research. Most VC donors will provide their resources regardless of the specific aim of a VC project.
This perspective paper proposes to tackle the disclosed resource sharing shortcomings of volunteer computing using technologies that have been invented, optimized, and adapted for entirely different purposes by cloud-native companies like Uber, Airbnb, Google, or Facebook. Such promising technologies might be containers, serverless architectures, image registries, and distributed service registries, which can address problems like hardware heterogeneity, sandboxing, code signing and updating, and result verification and, most importantly, overcome project exclusiveness. All these cloud-native technologies have one thing in common: they already exist and are tried and tested in large web-scale deployments.
However, the reader should keep in mind that this paper is a perspective paper. It intentionally does not present a validated solution proposal. Nevertheless, it offers a detailed list of technologies to overcome current shortcomings of VC that the COVID-19 case disclosed. This concrete strategy does not claim to be the “philosopher’s stone” but strives to foster discussions in the VC community on how VC could transform into a more global VC grid of cooperating VC projects that share and do not claim resources. In addition, a lot of interesting research questions will appear if this path is followed. This path is a compromise: each VC project still operates its own master nodes and control plane infrastructure but yields access to donated resources.
Consequently, a win–win situation can be expected: future VC projects should gain access to a much broader set of donated resources. COVID-19 showed us that VC projects can easily build the biggest supercomputer on earth. In addition, VC donors would gain access to a much more diverse spectrum of research projects. The middleware simply has to be more standardized and more focused on resource sharing. Cloud computing has followed this exact path successfully for a bit more than a decade. Perhaps VC should also have a look?