Cloud Computing for Climate Modelling: Evaluation, Challenges and Benefits

Abstract: Cloud computing is a mature technology that has already shown benefits for a wide range of academic research domains that, in turn, utilize a wide range of application design models. In this paper, we discuss the use of cloud computing as a tool to improve the range of resources available for climate science, presenting the evaluation of two different climate models. Each was customized in a different way to run in public cloud computing environments (hereafter cloud computing) provided by three different public vendors: Amazon, Google and Microsoft. The adaptations and procedures necessary to run the models in these environments are described. The computational performance and cost of each model within this new type of environment are discussed, and an assessment is given in qualitative terms. Finally, we discuss how cloud computing can be used for geoscientific modelling, including issues related to the allocation of resources by funding bodies. We also discuss problems related to computing security, reliability and scientific reproducibility.


Introduction
The continuous and rapid increase in computing power has been a major factor in the progress of numerous scientific disciplines over the last few decades. Increased computing power in the field of climate modelling is leading to more accurate assessments of the impact of climate change [1]. This implies huge challenges from the point of view of both hardware and software, one of the most important being the ever-increasing volumes of data generated by both observation and simulation. This change, and the commensurate cycle of requiring ever-greater computational resources, is happening across nearly all research domains [2], but is especially prevalent in the grand challenge areas of climate and geoscience.
Over the last 20 years, many scientists have been running simulations in High Performance Computing (HPC) environments and transferring the output data to local systems for analysis. This was a perfectly reasonable proposition when data volumes were small. However, now that outputs from operational forecasting models are updated hourly and the amounts have reached more than 300 TB per day, this workflow is simply no longer possible. It is, therefore, necessary to bring computing and data together, no longer moving data to computing but computing to data. Hence, the data processing and analytical capabilities associated with the cloud and other distributed computing paradigms are an integral part of future climate modelling.
Cloud computing has emerged in recent years as both a new business model and a sensible technological choice, as it allows users to adapt resources to demand and/or budget relatively easily, reducing the need to manage a computing infrastructure on premises. In the private sector, the migration to cloud computing from traditional IT infrastructure is increasing and is expected to continue over the next few years [3,4]. Beyond the private sector, cloud computing is also increasingly popular in research laboratories around the world [5,6]. For example, in April 2015, Microsoft launched the Azure4Research Climate Data Award Program in support of the White House Climate Data Initiative, and the European Commission established a plan to develop a European Open Science Cloud by the end of 2016 [7] that continues to be developed. Institutions studying weather and climate have begun to explore the use of cloud platforms. For example, the United Kingdom Met Office is developing a distributed data analysis platform that obviates the need for scientists to move data around (https://aws.amazon.com/solutions/case-studies/the-met-office/).
Such platforms are designed to run a Hadoop [8] cluster in a hybrid (i.e., local and remote) cloud that shares storage space with the HPC infrastructure, using Python notebooks as the primary interface for the user. The data cluster of the Science and Technology Facilities Council (STFC, https://www.stfc.ac.uk/) Centre for Environmental Data Archival works in part as a piece of the cloud computing infrastructure [9]. NOAA has partially externalized its data storage and computing (for example, using Amazon AWS Lambda) through partnerships with major vendors in the framework of the Big Data Project [10]. This approach is also used by the Met Office [11]. Other applications of cloud computing in atmospheric and ocean sciences have been compiled [12] and range from data storage and analysis to visualization. However, these examples correspond more to what is known as Infrastructure as a Service (IaaS). In contrast, the use of cloud services to substitute pure computing power (HPC as a Service (HPCaaS)) has not been explored to the same degree. More than ten years ago, a single assessment of a basic cloud computing system was performed [13]. More recently, other experiments have been carried out on weather forecasting [14,15] and climate modelling [16-20].
It is interesting to note the need for and advantages of cloud computing technologies in the growing framework of climate services. When delivering climate data and information to stakeholders, one of the main issues is the compatibility of formats [21] and IT infrastructure between the provider (for example, a national weather service) and the client. The use of cloud computing as a shared platform between the two parties could help to solve such problems.
However, decisions to move from traditional in-house HPC to cloud computing need careful preparation and studies [6]. Some factors to consider include:
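• the suitability of hardware to the particular computing task (e.g., massively parallel tasks, IO-intensive tasks);
• the overhead from using virtualization and the ability to optimize code on cloud resources;
• the cost of computational time;
• the requirement of storing data long term and data transfers out of the cloud (and related costs);
• the ability to process and analyse data within the cloud;
• user interface and ease of use.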
For climate research, issues such as reliability and trust in the results are essential for ensuring that the results provided by the cloud services correspond to the computation that was originally requested. Errors from potential hardware failures are common to all kinds of computational systems. The impacts of these on the results, and how to work around the problems caused, have been documented. To address such issues, a kind of backup infrastructure is proposed, for which cloud computing can provide an optimal solution because cloud services are, by their nature, hosted in hardware facilities much larger than those requested by a single user [22].
Here, to shed some light on the possibilities offered by cloud computing, we explore and discuss these issues, including the computational performance of the various options for running climate models, their monetary cost, security and possible influences on funding models and scientific reproducibility. We focus on HPCaaS because the goal here is to highlight the ability of cloud services to substitute or complement local computing facilities. In our analysis, we use solutions offered by the three leading market providers according to a previous report [23]: Amazon Web Services (AWS, https://aws.amazon.com), Google Cloud Platform (GCP, https://cloud.google.com/compute) and Microsoft Azure (https://azure.microsoft.com).

Evaluation of Climate Model Performance
In this section, we split our analysis into two different parts: single and multiprocessor climate simulations (spcs and mpcs, respectively). There were several reasons for doing it in this way. First of all, the nature of the experiments that we performed in each cloud platform was different: in spcs, we tested the performance of climate simulations deployed as binary files running on single cores of a processor, avoiding compilation tasks in the cloud. However, for the mpcs case, we directly compiled a model on the cloud platform. Because of the focus of the cloud solution offered, Google Compute Engine (GCE) was much better suited to the mpcs problem than AWS or Azure. Furthermore, AWS and Azure are marketed to clients with similar profiles, which differ from those targeted by GCE; it is therefore reasonable to compare AWS with Azure, but not to compare either of them with GCE. This is addressed later, in the section about user experience, and is clear from statistics about cloud adoption [24,25].

Single Processor Climate Simulations
In order to evaluate the options for using cloud services to run models, we performed several experiments focusing on computing performance and cost. The first of these adopted the well-known ClimatePrediction.net (CPDN) infrastructure [26] running Weather@Home2 [27] computational tasks, which uses BOINC [28] as a tool to distribute the computing work. A previous assessment, along with the technical details of this framework, can be found in our earlier work [17], in which experiments using AWS and Azure were presented. Details of the configuration are included in Appendix A.
For this purpose, we ran a set of thirteen-month climate simulations for CPDN in Azure and AWS using a range of different Virtual Machines (VMs) (with different hardware and allocated resources) in an optimised configuration. The simulations were run using the Met Office Hadley Centre regional atmospheric circulation model HadRM3P [29], at a resolution of 50 km that covered the South America CORDEX (Coordinated Regional Downscaling Experiment) region (e.g., [30]), nested within the global atmosphere-only model HadAM3P. These simulations were run on a single processor and lasted between three and five days depending on the VM type.
The results are shown in Figure 1. In order to obtain a (theoretically) similar performance between the two cloud providers, we focused on two similar VMs: Azure Standard F4 and AWS c4.xlarge. Details of the hardware are available in Tables 1 and 2. The results showed that the Azure F4 VM was 13.9% faster; however, the AWS c4.xlarge was 4.6 times less expensive. We should note, however, that the AWS simulations were run using reduced-cost VMs. These VMs are known as spot instances (https://aws.amazon.com/ec2/spot) and are not available from any other vendor. They let us set the maximum price of the VMs (which changes according to the demand that a given AWS region experiences) and run or stop the model according to that limit. Furthermore, it must be noted that running on on-demand instead of spot instances can be up to ten times more expensive [31]. Another conclusion is that the cost of using on-demand AWS VMs is slightly higher than for Azure.
It must be noted that we were not strictly comparing like with like in this instance. Beyond the small technical differences between the servers available in each platform, subtle distinctions in data transfer could also have an impact on the results obtained. For example, an AWS vCPU (virtual/abstracted computing capacity) is a single thread rather than a dedicated CPU core. To select the best solution, as a means of optimizing performance, a user must evaluate the bottlenecks in a given application considering a wide range of hardware and configuration options. For example, if it is necessary to complete a large ensemble of climate simulations with a model that struggles to give a good performance because of the communication between cores (in a given CPU or full instances), the user could decide on the best provider of the cloud service taking into account such a limitation. On the other hand, the possibility of using fewer CPUs to run each member of the ensemble and of running several members simultaneously could be assessed.
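The cost comparison above reduces to simple arithmetic: the cost of one simulation is its wall-clock duration multiplied by the hourly price of the VM. The following is a minimal sketch of that calculation. The helper names (run_cost, cost_ratio) and every runtime and price in the example are hypothetical placeholders chosen for illustration; they are not figures from the experiments reported here.

```shell
#!/usr/bin/env bash
# Cost of one simulation: wall-clock hours x hourly VM price.
# All numbers passed to these helpers below are hypothetical placeholders.

# run_cost HOURS PRICE_PER_HOUR -> total cost, two decimals
run_cost() {
    awk -v h="$1" -v p="$2" 'BEGIN { printf "%.2f\n", h * p }'
}

# cost_ratio COST_A COST_B -> how many times more expensive A is than B
cost_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Made-up example: a faster VM at a higher hourly price versus
# a slower VM at a reduced (spot-like) hourly price.
fast_vm=$(run_cost 96 0.20)
cheap_vm=$(run_cost 110 0.04)
echo "fast VM: $fast_vm  cheap VM: $cheap_vm  ratio: $(cost_ratio "$fast_vm" "$cheap_vm")"
```

A calculation of this kind makes explicit that a VM that finishes sooner can still be the more expensive option once the hourly price is taken into account.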

Multiprocessor Climate Simulations
A different technical approach lies in the possibility of running a climate model directly over the cloud by deploying several VMs working as a cluster. A comparison with a supercomputer (SC) was undertaken using the FinisTerrae II from the Centro de Supercomputación de Galicia, in Spain, and the cloud services provided by the Google Compute Engine (GCE), using Debian GNU/Linux 7 Wheezy as the operating system in the VMs. Again, the technical details can be found in Appendix B. The model selected to perform the test was WACCM [32]. Several versions of WACCM have been run in the past in FinisTerrae I (the FinisTerrae I supercomputer was in service from 2007 to 2015 and ranked 101 on the Top500 in 2007; https://www.top500.org/system/175541/) and the resulting simulations used for international reports and research papers [33-38]. A summary of the details of past performance is also available [39]. The results for the simulations performed here are shown in Figure 2.
It can be seen that the GCE gave a better performance than the SC for a smaller number of cores/MPI tasks. For example, for 32 cores, the performance was approximately 200% better, but this advantage was considerably reduced when the simulation was more demanding and when more MPI tasks were used. The model throughput showed clearly that the SC performance was better beyond approximately 100 cores. Apart from technical differences between the processors of the SC and the GCE, it is plausible that the main cause of the inferior performance of a public cloud computing solution for bigger tasks was the interconnection network. The GCE features a speed of 1.9 Gbits/s when connecting computing nodes, while the SC has an InfiniBand network delivering 19 Gbits/s. Similar conclusions were reached for testing in AWS and ARCHER (U.K. National Supercomputing Service) when running the HadGEM3 model [40].
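One common way to quantify this kind of scaling behaviour is parallel efficiency: the speedup in model throughput relative to a baseline run, divided by the corresponding increase in core count. The sketch below assumes throughput is expressed in any consistent unit (for example, simulated years per day). The function name and the sample values are hypothetical, and this metric is offered as an illustration rather than as the analysis performed for Figure 2.

```shell
#!/usr/bin/env bash
# Parallel efficiency relative to a baseline run:
#   efficiency = (throughput / base_throughput) / (cores / base_cores)
# A value of 1.00 means ideal scaling; lower values indicate overheads
# such as inter-node communication.

# parallel_efficiency BASE_CORES BASE_THROUGHPUT CORES THROUGHPUT
parallel_efficiency() {
    awk -v n0="$1" -v t0="$2" -v n="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", (t / t0) / (n / n0) }'
}

# Hypothetical example: throughput triples while the core count quadruples.
echo "efficiency at 128 cores: $(parallel_efficiency 32 2.0 128 6.0)"
```

Comparing this value across platforms at the same core counts would show where a slower interconnect starts to dominate the run time.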
An analysis of the costs associated with these simulations showed how the GCE was both systematically and substantially less expensive than the SC, based on standard rates charged by the supercomputing centre to external users and GCE pricing (see Figure 2).

User Experience of Cloud Vendors
Different vendors of cloud computing services offer different products, meaning that one might fit a user's needs better than another. This could be related to user experience and quality of service, which can be assessed under the umbrella of what is known as "cloud resource orchestration" [41]. For example, after running our simulations, we could make the following specific observations:

• The prices previously described were based on standard rates; however, different discounts and specific payment plans can be discussed and negotiated directly with providers. Running simulations has costs associated with data storage and transfer. In some cases, these associated costs can be completely insignificant [17], but they can also be slightly higher than for an SC (e.g., comparing the ARC services (https://help.it.ox.ac.uk/arc/services) provided by the University of Oxford to AWS) [31].
• For simulations using large ensembles with BOINC, for example, the main limiting factor is the CPU, not the memory [17]. However, when running a model directly over a cloud service (as, in this case, for the GCE), constraints very similar to those of a supercomputer are found (parallelization, network communication and memory). Nevertheless, a given vendor could provide solutions for the issue of memory and CPU without any problems. These details can be negotiated directly with providers.
• AWS API calls (and related tools) are well documented and easy to integrate (different SDKs are available). Azure's API (and tools) have good documentation, but still have some way to go to achieve the same level as AWS.
• Writing code for Azure seems to be more oriented towards .NET developers than towards the general public, which made it difficult for us to create extensive automation for our simulations such as the agnostic/generic management of hundreds of VMs.
• In the same vein as AWS, the GCE provides an infrastructure that simplifies both the deployment of simulations and the use of VMs.
• AWS, Azure and GCP provide similar basic security mechanisms and systems: access control, audit trail, data encryption and private networks [42,43]. This was relevant for our tests because we wanted to ensure reproducibility and data validation (as well as the distribution of results), which required that data integrity be guaranteed. All the evaluated cloud providers have data encryption available for both local and distributed storage (AWS S3, Google Cloud Storage (GCS) and Azure Storage). The security features (for the three providers) are easy to set up (and sometimes available out-of-the-box, as with the distributed storage). It is worth mentioning that the tested providers manage and process very sensitive data (such as government and medical information), so they have to comply with the highest security standards, like SOC (Service Organization Control) or ISO/IEC 27001, and pass periodic audits [44,45].
At a higher level, selecting a platform/infrastructure (SC vs. cloud) is not trivial: it requires the evaluation of different aspects that will probably have specific weights depending on the model and the experiment or simulation [6]. In Table 3, we provide some general observations, with the main advantages and disadvantages of an SC and the different vendors assessed.

Supercomputer
• Well known and very predictable environment.
• Limited elasticity and scalability.
• Usually, shared environment.
• Better institutional support and budget.
• Expected high queue wait times.

AWS
• Cost optimization can be complex to understand.
• Best support. Biggest number of solutions and integrations.
• Services are tailored to AWS; easy to get into a vendor lock-in situation.

Azure
• Best option for Windows-based software.
• GNU/Linux-based simulations are not the ideal case for Azure.
• Very competitive pricing and waivers.
• Generally speaking, less mature than AWS.

GCP
• Appealing and comprehensive pricing model based on usage.
• Some of the services are still in the very early stages.
• In many cases, services are easier to manage than with other providers.
• Very vanilla; this can also be seen as an advantage in some cases.

Discussion
While it might appear that for (public) cloud computing, there are no limitations on the computing power that a user can access, this perception is misleading. With cloud computing, there is a shift away from competing for computational resources with other users of the same SC to being limited by the computational resources that a user can afford. This can, in turn, make funding bodies and researchers more aware of the real and total cost of funding the research. Indeed, cloud computing was recommended some years ago as an option for inclusion when budgeting for research projects with HPC needs [46]. This idea is consistent with the definition of cloud computing as a business model and points to the growing importance of moving to a public cloud computing infrastructure as a form of privatization or externalization of part of the research process.
It must be remembered that the use of an SC option implies large overheads for manual operations and thus a need for in-house staff dedicated to solving technical issues, rather than providing support for activities that maximize the scientific output of a project, such as more complete or additional analysis of the data, or better organization. An SC option also implies the need for regular hardware updates and upgrades.
A real example in the field of atmospheric sciences (and rather an exception, as it was built from scratch instead of using an external provider) is the model CloudMUSC supported by the Norwegian Meteorological Institute and run on cPouta (https://www.csc.fi/en/web/atcsc/-/pilvilaskentaavuksi-saamallien-kehitykseen), a cloud service based on OpenStack (https://research.csc.fi/poutauser-guide). However, researchers in the atmospheric sciences have been using models since the early studies in the 1950s [47], and transferring these to a cloud-based system could require a considerable upfront investment. Indeed, it is expected that over the next few years, the migration of applications to cloud computing services will become necessary, and that most of the investment will go towards migrating applications or developing them from scratch to adapt them to the cloud [4].
One possible scenario is where data are stored in a cloud service. This can limit the cost of maintaining infrastructure for providers of large datasets (e.g., satellite data, reanalysis). In this way, users can contribute to the maintenance of the repository through payment for data transfer to their local machines or provision of VMs in the cloud to perform research using the data. Data transfers might be faster where mirror copies are established in geographical regions. In such a scenario, the budget allocation could shift partly from data providers to project funding, because budgeting for projects could provide a more realistic idea of the cost of using the data. A real example using a commercial vendor (although without any associated fee for the user) is the availability of the recent ERA5 Reanalysis [48] produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) from AWS (https://registry.opendata.aws/ecmwf-era5/). This is done through the so-called AWS Public Dataset Program (https://aws.amazon.com/opendata/public-datasets/). Furthermore, the CESM LENS climate simulations [49] are made available by the National Center for Atmospheric Research (NCAR) from AWS (http://ncar-aws-www.s3-website-us-west-2.amazonaws.com/CESM_LENS_on_AWS.htm).
It has been claimed that the adoption of cloud computing for research purposes could and should increase [50,51]. In 2012, the U.S. National Science Foundation (NSF) funded several projects concerned with applying this technology in environmental sciences, which had already been evaluated as early as for the period 2009-2011.
In general terms, a boost in investment in cloud computing might be expected over the next few years to consolidate infrastructure as part of attempts to improve data-sharing services in the scientific community. As an example, EU Horizon2020 planned to devote billion euros to cloud computing [7], and projects like the European Open Science Cloud-Synergy (https://www.eosc-synergy.eu) or the NSF BIGDATA program [52] have been very recently launched. Success stories are the EarthCube program, active since 2013 [53], or JASMIN (http://www.jasmin.ac.uk).

Conclusions
Whilst we have not assessed all the market providers for cloud computing services, we can nevertheless state that to be successful in the field of meteorological and climate research, cloud computing should deal effectively with some of the major concerns for any new technology: cost, improvement of daily work and the generation of new opportunities. The costs of cloud services continue to be high, mainly associated with permanent data storage and transfer, but also with computing. A more affordable option could be a private cloud solution [54]. However, cloud computing provides flexibility and is a sensible option when considering responsiveness. With cloud computing services, it is possible to perform tasks very quickly, making research results relevant and timely. For example, the combination of cloud computing with BOINC [17] has the potential to "democratize" access to computing resources by researchers or institutions that do not have the capacity to host and maintain an in-house HPC facility.
However, it should be borne in mind that beyond any monetary arrangements made by institutions or organizations (as key accounts), the low cost of cloud computing services could be affected by the existence of market challengers. Market challengers accept a loss on the price they offer, intending to gain market share. Therefore, any migration of infrastructure to a cloud service should be undertaken with caution, taking into account that prices could increase in the future to reflect actual costs. In order to consider all the variables, several methodologies to assess the "return on investment" of migrations to cloud computing services have been proposed and are available (e.g., [6,55,56]). Furthermore, each provider uses a different billing scheme [57].
All this is also the reason why the analysis performed here was not comprehensive from the point of view of performing every single simulation across all the cloud computing platforms used. That is, this was a feasibility and options study. A complete comparison would not make sense because the best solution for each case depends on the model used, its code, the infrastructure offered by a vendor at a given time and the price available. Therefore, an apples-to-apples comparison would not be possible, and consequently, it would not be more informative than the experiments presented here.
Moreover, users need to evaluate issues related to security when deciding whether a cloud computing environment is the best solution for them, or when considering which approach to cloud computing best fits their needs. Methodologies to evaluate risks associated with the use of cloud computing have already been proposed (e.g., [58]). The perception persists that the security of in-house computing is better than it is for cloud services [59]. However, sometimes, such perceptions are wrong. Commercial providers usually have certifications such as ISO/IEC 27001 (https://www.iso.org/standard/54534.html) that are rarely obtained by in-house HPC facilities; for example, the European Union Agency for Network and Information Security is currently developing Cloud Certification Schemes related to security. Furthermore, where necessary, it is generally possible to deploy mixed environments with both private and public cloud services as an intermediate solution [60]. It is usually the case that the providers of cloud services take care of physical security and the issues related to infrastructure. However, issues related to data transfer, applications, etc., are the responsibility of the customer [61].
Another issue is the reproducibility of research. Scientists are working hard to increase the level of reproducibility of published research. Because some computing applications are now inherent to this process, how we make use of them is key to assuring reproducibility. Related to the previous section, the externalization of computational resources could lead to some scepticism about the reproducibility of the results. However, this should not be a problem if providers of cloud computing are audited and receive certification regarding how the computational resources provided comply with reproducibility practices, such as the use of free software [62-64]. Indeed, applied in the right way, cloud computing could be seen as an opportunity to improve trust in research results.
Finally, although the ideas and results expressed here might appear to encourage the adoption of cloud computing, it has been pointed out that, at least in industry, the benefits of such adoption are usually below expectations [65]. Therefore, we suggest that approaches to cloud computing for HPC and its use in geoscientific modelling must be carefully evaluated.

• Step 4: Configure Instance Details and select Request Spot Instances, selecting the maximum price to pay.
• Step 5: In the instance details, in advance, we added the script that installs, initializes, and runs the BOINC client automatically at instance boot time [31]. The content of the script is:
• Step 6: Add the necessary storage, 64 GB.
• Step 7: Give a name to the instance (for better identification).
• Step 8: Select a security group (in this case, by default, having port 22 open is enough).
• Step 9: Review parameters and Launch.
Please note that shared storage setup is not described here, and S3 buckets (and directories) need to exist before running this script.
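Because the S3 buckets must exist before the boot script runs, a pre-flight check can fail early rather than partway through a simulation. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials. The function names and the bucket name in the usage note are hypothetical, and this check is not part of the script used in the experiments described here.

```shell
#!/usr/bin/env bash
# Pre-flight check: stop early if an S3 bucket is missing or inaccessible.
# Assumes the AWS CLI is installed and configured (an assumption of this sketch).

bucket_exists() {
    # "aws s3api head-bucket" exits with status 0 only if the bucket
    # exists and the caller is allowed to access it.
    aws s3api head-bucket --bucket "$1" >/dev/null 2>&1
}

require_bucket() {
    if bucket_exists "$1"; then
        echo "ok: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}
```

A launcher script could call, for example, `require_bucket my-cpdn-results || exit 1` (with a placeholder bucket name) before requesting any instances.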

Appendix A.2. Microsoft Azure
Azure is the name given to the collection of Microsoft's cloud services, which includes Virtual Machines (VMs, for computing) and shared storage; the former is where the CPDN simulations are run, and the latter will save reports and data from the experiments.
Command-line tooling and Linux integration are not as mature as in AWS (for instance, when our experiments were performed, the shared filesystem was provided over SMB, VM cloning was not a simple and atomic operation, and metadata access within VMs was limited), but they are undergoing continuous improvement.
All these steps are done on the Azure Portal (https://portal.azure.com), under Virtual Machines (Add):

• Step 3: Give a name, user name and password (used for SSH access).
• Step 4: Select VM type/size.
Please note that Azure shared storage setup is not described here, and endpoints (and directories) need to exist before running this script.
REGION=us-central1-a

# 1. Verifies that Google's utilities are installed. If not, the program exits.
command -v gcutil >/dev/null 2>&1 || {
    echo >&2 "gcutil needs to be installed but it couldn't be found. Aborting."
    exit 1
}

# 2. Sets the project name.
projectID=$(gcloud config list | grep project | awk '{ print $3 }')

# 3. Sets the number of nodes.
numNodes=${NUMNODES}

# 4. Sets machine type and image.
machTYPE=${INSTANCETYPE}
imageID=https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/debian-7-wheezy-v20140807

# 5. Adds nodes to the cluster and waits until they are running.
nodes=$(eval echo machine{0..$(($numNodes - 1))})
gcutil addinstance --image=$imageID --machine_type=$machTYPE \
    --zone=${REGION} --wait_until_running $nodes

# 6. Uploads the file install.sh to the slave nodes.
for i in $(seq 1 $(($numNodes - 1))); do
    gcutil push machine$i install.sh .
done

# 7. Executes the previous script in each node and checks if
# the configuration ended successfully in every machine.
for i in $(seq 1 $(($numNodes - 1))); do
    gcutil ssh machine$i "/bin/bash ./install.sh machine$i >& install.log.machine$i" &
done
# Waits for the background installations to finish before checking the logs.
wait
for i in $(seq 1 $(($numNodes - 1))); do
    gcutil ssh machine$i "grep DONE install.log.machine$i"
done

# 8. Finally, configures ssh keys to allow the connection from
# the master node without password.
clave_pub=$(gcutil ssh machine0 "sudo cat ~/.ssh/id_rsa.pub")
for i in $(seq 1 $(($numNodes - 1))); do
    echo "$clave_pub" | gcutil ssh machine$i "cat >> ~/.ssh/authorized_keys"
done
cat << EOF > config
Host *
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
EOF
cat config | gcutil ssh machine0 "cat >> ~/.ssh/config"
rm config

Figure 1. CPDN simulation in Azure and AWS. Orange bars highlight the more similar VMs between vendors.

Figure 2. Performance and price for WACCM runs in Google Compute Engine (GCE) versus FinisTerrae II.

Table 1. Microsoft Azure Linux virtual machines' technical specifications.

Table 2. Amazon Web Services instances' technical specifications.

Table 3. Summary of the platforms' pros and cons.