1. Introduction
With the increasing prevalence of cloud computing and of Infrastructure as a Service (IaaS), various APIs are provided to access storage. The Amazon Simple Storage Service (S3) API [1] is the most widely adopted object storage interface in the cloud. Many cloud storage providers, such as IBM, Google, and Wasabi, offer S3-compatible storage, and a large number of scale-out storage systems such as Ceph [2], OpenStack Swift [3], and MinIO [4] offer a REST gateway that is largely compatible with the S3 interface. HPC applications often use a higher-level I/O library such as NetCDF [5] or ADIOS [6], or the low-level POSIX API directly. Under the hood, MPI-IO and POSIX are still widely used for the interaction with the storage system, while object storage APIs such as DAOS [7] are emerging. If the performance characteristics of S3 are promising, it could serve as an alternative backend for HPC applications. This interoperability would foster the convergence of HPC and cloud computing [8,9] and eventually lead to consistent data access and exchange between HPC applications across data centers and the cloud.
In this research, a methodology and a set of tools to analyze the performance of the S3 API are provided. The contributions of this article are: (a) the modification of existing HPC benchmarks to quantify the performance of the S3 API, displaying relevant performance characteristics and providing a deep analysis of the latency of individual operations; (b) the creation of a high-performance I/O library called S3Embedded [10], which can be used as a drop-in replacement for the commonly used libs3 [11], is competitive with the HPC I/O protocols, and is optimized for use in distributed environments.
The structure of this paper is as follows: Section 1.1 presents related work. Section 2 describes the test scenarios and defines the relevant metrics that will be addressed using our benchmarks; we extend the scope of two existing MPI-parallel HPC benchmarks, namely the IO500 [12] and MD-Workbench [13], to assess the performance of the S3 API. Section 3 describes the experimental procedure, the systems used, and the methodology of the evaluation conducted in this work; the tests are performed on the Mistral [14] supercomputer. Section 3.2.3 analyzes the obtained latency results. Section 4 introduces the S3Embedded library and its possible use. The last section summarizes our findings.
1.1. Background and Related Work
After the release of the Amazon S3 service in 2006, many works were published that assess the performance of this offering. Some of them [15,16] focused only on the download performance of Amazon S3; most of them [15,16,17,18,19] never published or described the benchmarks used; others [15,16,20] are not able to assess S3-compatible storage. The PerfKit Benchmarker from Google was used in [21] to compare the download performance of AWS S3 with Google Cloud Storage (GCS) and Microsoft Azure Storage. However, since these tests were run in the cloud, the obtained results depend heavily on the VM machine type in use and hence on the network limitations enforced by the cloud provider. In contrast, [19] found in their tests, which were also run in the AWS cloud on different EC2 machines against the Amazon S3 implementation, that "there is not much difference between the maximum read/write throughput across instances".
In [22], the performance of the S3 interface offered by several cloud providers is evaluated. However, the test scenario covers only the upload and download of a variable number of files; access patterns of typical HPC applications, such as metadata handling, are not covered.
Gadban et al. [23] investigated the overhead of the REST protocol when using cloud services for HPC storage and found that REST can be a viable, performant, and resource-efficient solution for accessing large files, but they noticed a lack of performance for small files. However, the authors did not investigate the performance of a cloud storage system like S3 for HPC workloads.
The benchmarks we found in the literature for the analysis of S3 performance do not provide a detailed latency profile, nor do they support parallel operations across nodes, which is a key characteristic of HPC applications. This lack of published tools covering HPC workloads pushed us to enhance two benchmarks already used for HPC procurement, namely IOR and MD-Workbench, by developing a module capable of assessing the performance of S3-compatible storage. The IO500 benchmark (https://io500.org, accessed on 23 April 2021) simulates a variety of typical HPC workloads, including the bulk creation of output files from a parallel application, intensive I/O operations on a single file, and the post-processing of a subset of files. The IO500 uses the IOR and MDTest benchmarks under the hood; IOR comes with a legacy S3 backend based on the outdated aws4c [24] library that stores all data in a single bucket, so that a file is one object assembled during write by the parallel job using multipart messages. However, since most recent S3 implementations do not support this procedure, a more recent library is required. We also explore the performance of interactive operations on files using MD-Workbench [13].
2. Materials and Methods
We aim to analyze the performance of the S3 interface of different vendors in an HPC environment to assess the performance potential of the S3 API. To achieve this, a five-step procedure is implemented: (1) identifying suitable benchmarks; (2) modifying the benchmarks to support S3; (3) defining an HPC environment supporting the S3 API to run them; (4) determining a measurement protocol that allows us to identify the main factors influencing the performance of S3 for HPC workloads; and (5) providing alternative implementations for S3 to estimate the best achievable performance.
2.1. Benchmarks
We extend two HPC benchmarks, the IO500 and MD-Workbench, to analyze the potential peak performance of the S3 API on top of the existing HPC storage infrastructure.
The IO500 [12] uses IOR and MDTest in "easy" and "hard" setups; it hence performs various workloads and delivers a single score for comparison. The different access patterns are covered in the following phases:
IOEasy simulating applications with well-optimized I/O patterns.
IOHard simulating applications that utilize segmented input to a shared file.
MDEasy simulating metadata access patterns on small objects.
MDHard accessing small files (3901 bytes) in a shared bucket.
We justify the suitability of these phases as follows: B. Welch and G. Noer [25] found that, inside HPC, between 25% and 90% of all files are 64 KB or less in size; therefore, a typical study of the performance of object storage inside HPC should also address this range, rather than focusing only on large sizes, which are expected to deliver better performance and to be limited only by the network bandwidth [21,23]. This is why, using the IOR benchmark, we highlight this range when exploring the performance for files of sizes up to 128 MB in Section 3.2 and Section 4. Large file sizes are also addressed, since the performed IO500 benchmarks operate on 2 MiB accesses, creating large aggregated file sizes during a 300 s run.
MD-Workbench [13] simulates concurrent access to typically small objects and reports throughput and latency statistics, including the timing of individual I/O operations. The benchmark executes three phases: pre-creation, benchmark, and cleanup. The pre-creation phase sets up the working set, while the cleanup phase removes it. A pre-created environment that is not cleaned can be reused for subsequent benchmarks to speed up regression testing, i.e., the constant monitoring of performance on a production system. During the benchmark run, the working set is kept constant: in each iteration, a process produces one new object and then consumes a previously created object in FIFO order, as sketched below.
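The following minimal C sketch illustrates this constant-working-set pattern; the helper functions are hypothetical stand-ins for the actual MD-Workbench storage plugins and are not part of the benchmark's code.

```c
/* Illustrative sketch of the MD-Workbench benchmark phase: each iteration
 * creates one new object and consumes the oldest remaining one (FIFO), so
 * the size of the working set never changes. The helpers below are
 * hypothetical stand-ins for the real storage plugin calls. */
#include <stdio.h>

static void create_object(int index)
{
    printf("create object %d\n", index);
}

static void read_and_delete_object(int index)
{
    printf("read and delete object %d\n", index);
}

static void benchmark_phase(int precreated, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        create_object(precreated + i);   /* produce the newest object */
        read_and_delete_object(i);       /* consume the oldest object */
    }
}

int main(void)
{
    benchmark_phase(100, 10);   /* 100 pre-created objects, 10 iterations */
    return 0;
}
```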
2.2. Modifications of Benchmarks
For the IO500, an optimistic S3 backend using the libs3 client library is implemented for IOR: it stores each fragment as one independent object and, as such, is expected to generate the best performance for many workloads. For identifying bottlenecks, it supports two modes:
Single bucket mode: created files and directories result in one empty dummy object (indicating that a file exists); every read/write access happens with exactly one object (the object name contains the file name plus a size/offset tuple); deletion traverses the prefix and removes all objects with the same prefix recursively.
One bucket per file mode: for each file, a bucket is created; every read/write access happens with exactly one object (the object name contains the file name plus a size/offset tuple); deletion removes the bucket with all contained objects.
As such, the libs3 implementation gives us the flexibility to test some optimistic performance numbers; a sketch of the per-fragment key naming used in both modes is given below. The S3 interface does not support the "find" phase of the IO500, which we therefore exclude from the results.
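The following C sketch shows one possible way to derive the per-fragment object name from the file name and the size/offset tuple of an access; the exact key layout used by our IOR plugin may differ, so this is purely illustrative.

```c
/* Sketch of how the modified IOR backend can map each IOR access to one S3
 * object (illustrative; the plugin's actual key layout may differ): the
 * object name is derived from the file name plus the (offset, size) tuple. */
#include <stdio.h>

static void fragment_object_name(char *out, size_t outlen, const char *filename,
                                 unsigned long long offset, unsigned long long size)
{
    snprintf(out, outlen, "%s-%llu-%llu", filename, offset, size);
}

int main(void)
{
    char key[256];
    /* a 2 MiB access at offset 2 MiB of file ior_file_0001 */
    fragment_object_name(key, sizeof(key), "ior_file_0001", 2097152ULL, 2097152ULL);
    printf("%s\n", key);   /* e.g. ior_file_0001-2097152-2097152 */
    return 0;
}
```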
MD-Workbench recognizes datasets and objects and also offers two modes:
In both modes, objects are accessed atomically, fitting the S3 API directly.
The libS3 used in the IO500 is the latest version, which only supports AWS signature v4 [26], while the current release of MD-Workbench supports an older version of libs3 that uses AWS signature v2. As such, it is ideal for benchmarking S3-compatible systems that only support the v2 signature, like the one found at DKRZ (the German Climate Computing Center).
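As a reminder of what distinguishes the two schemes (this follows the general AWS documentation and is not code from either benchmark), a v2 signature is a single Base64-encoded HMAC-SHA1 over a short string-to-sign, whereas v4 derives a signing key through a chain of HMAC-SHA256 operations. A minimal v2 example using OpenSSL might look as follows; the credentials and request values are placeholders.

```c
/* Sketch of an AWS signature v2 computation (per general AWS documentation):
 * Signature = Base64(HMAC-SHA1(SecretAccessKey, StringToSign)).
 * Compile with -lcrypto. All values below are placeholders. */
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *secret = "SECRET_ACCESS_KEY";   /* placeholder credential */
    /* StringToSign = Verb\nContent-MD5\nContent-Type\nDate\nCanonicalizedResource */
    const char *string_to_sign =
        "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n/testbucket/testobject";

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len = 0;
    HMAC(EVP_sha1(), secret, (int)strlen(secret),
         (const unsigned char *)string_to_sign, strlen(string_to_sign),
         digest, &digest_len);

    unsigned char signature[64];
    EVP_EncodeBlock(signature, digest, (int)digest_len);   /* Base64 encode */
    printf("Authorization: AWS ACCESS_KEY_ID:%s\n", signature);
    return 0;
}
```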
2.3. Utilizing S3 APIs in Data Centers
Data centers traditionally utilize parallel file systems such as Lustre [27] and GPFS. These storage servers do not natively support the S3 API. In a first iteration, we explore MinIO to provide S3 compatibility on top of the existing infrastructure. MinIO offers various modes of operation, one of which is a gateway mode that provides a natural deployment model: in this mode, every S3 request is converted into POSIX requests to the shared file system. We show the effect of the object size for the different MinIO modes, and we compare the obtained results to the native REST protocol and to Lustre. To achieve convergence between HPC and the Cloud, it must be possible to move workloads seamlessly between both worlds. We aim to give the huge number of S3-compatible applications the possibility to benefit from HPC performance while keeping the compatibility offered by the S3 interface. Therefore, we create an I/O library called S3Embedded based on libs3, in which parts of the S3 stack are replaced or removed to optimize performance; this library performs the translation between S3 and POSIX inside the application address space. Our aim is to make it compatible with standard services, competitive with the HPC I/O protocols, and optimized for use in distributed environments. Additionally, the extended library S3Remote is introduced; it provides a local gateway on each node (an independent process similar to MinIO) but uses a binary protocol over TCP/IP for local communication instead of HTTP, as HTTP might be the cause of the observed performance issues. In Section 4, we assess the performance of S3Embedded/S3Remote inside HPC.
2.4. Measurement Protocol
We measure the performance on a single node and then on multiple nodes while varying the size of the object and the number of processes/threads per node. To assess the performance of the different modes, we establish performance baselines by measuring performance for the network, REST, and the Lustre file system. Then, the throughput is computed (in terms of MiB/s and Operations/s) and compared to the available network bandwidth of the nodes.
2.5. Test System
The tests are performed on the supercomputer Mistral [14], the HPC system for earth-system research provided by the German Climate Computing Center (DKRZ). It provides 3000 compute nodes, each equipped with an FDR InfiniBand interconnect, and a Lustre storage system with 54 PByte capacity distributed across two file systems. The system provides two 10 Gbit/s Internet uplinks to the German research network (DFN), which are accessible on a subset of nodes.
2.6. MinIO Benchmarks in HPC
To create a reference number for the performance of S3 and to explore possible ways to optimize performance, we first use the MinIO server (release 2020-08-18T19-41) for our tests with the modified benchmarks inside our HPC environment.
MinIO Deployment
MinIO supports the following modes of operation:
Alongside these three modes, we introduce two further setups by inserting the Nginx [28] (v1.18.0) load balancer in front of the distributed and gateway configurations; we refer to these setups as srv-lb and gw-lb, respectively. Both variants can utilize a cache on the Nginx load balancer (-cache).
4. S3Embedded
Because of the scalability limitations of the local-gw mode introduced earlier, we create an I/O library called S3Embedded based on libs3, in which parts of the S3 stack are replaced or removed to optimize performance. By linking the S3Embedded library at compile time or at runtime to a libS3-compatible client application, it is possible to use the capabilities of this library. Assuming the availability of a globally accessible shared file system, S3Embedded provides the following libraries:
libS3e.so: This embedded library wrapper converts libs3 calls to POSIX calls inside the application address space.
libS3r.so: This library converts libs3 calls, via a binary protocol, into TCP requests to a local libS3-gw application, which then executes the POSIX calls, bypassing the HTTP protocol.
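To illustrate the embedded approach, the following minimal sketch shows how a put-object call can be served entirely by POSIX calls in the application address space. This is not the actual S3Embedded code: the simplified put interface and the S3E_ROOT mapping of buckets to directories are assumptions that stand in for the callback-based libs3 API.

```c
/* Minimal illustration of the idea behind libS3e.so (not the actual
 * S3Embedded code): an S3 "put object" is mapped to POSIX calls on a
 * shared file system inside the application address space. The bucket is
 * assumed to exist as a directory below S3E_ROOT. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define S3E_ROOT "/mnt/lustre/s3e"   /* assumed bucket-to-directory mapping */

static int s3e_put_object(const char *bucket, const char *key,
                          const void *buf, size_t len)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/%s/%s", S3E_ROOT, bucket, key);

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, buf, len);   /* the object body becomes file data */
    close(fd);
    return (written == (ssize_t)len) ? 0 : -1;
}

int main(void)
{
    const char *data = "hello";
    int rc = s3e_put_object("testbucket", "object42", data, strlen(data));
    printf("put %s\n", rc == 0 ? "succeeded" : "failed");
    return rc == 0 ? 0 : 1;
}
```

At runtime, such a wrapper can take the place of libs3 either by linking against it or, for dynamically linked applications, by preloading it, so that the client application itself does not need to be modified.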
In Figure 9, we display the results of the IOR benchmark while using the libraries mentioned above, in comparison with direct Lustre access and MinIO operating in the local-gw mode already described in Section 3. Note that some values are missing in the MinIO-local-gw results, despite the benchmark being repeated several times. This is because this setup does not scale well with the number of clients, as noted in Section 3.
Using S3Embedded helped us to pinpoint a performance problem in the IOR S3 plugin: the delete phase in the IO500 is time-consuming because, when deleting a bucket, our IOR S3 plugin lists the content of the entire bucket (calling S3_list_bucket()) for each file to be deleted in order to clean up the fragments; since, in the case of S3Embedded, all files are placed in a single directory, this is very time-consuming. One workaround is to use the bucket-per-file option, which effectively creates a directory per file. However, since this workaround does not cover all test workloads in the IO500, we introduce an environment variable called "S3LIB_DELETE_HEURISTICS", specific to the IOR S3 plugin. It defines the file size of the initial fragment at which the list_bucket operation is to be executed; otherwise, a simple S3_delete_object is performed. While this optimization is not suitable for a production environment, it allows us to determine the best-case performance for using S3 with the IO500 benchmark.
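The decision logic is straightforward; the sketch below illustrates it. The helper functions are hypothetical stand-ins for the plugin internals, and the exact comparison in the plugin may differ.

```c
/* Illustrative sketch of the S3LIB_DELETE_HEURISTICS decision (not the
 * exact IOR S3-plugin code): below the configured threshold, only the
 * single object is deleted; at or above it, the bucket is listed to clean
 * up all fragments belonging to the file. */
#include <stdio.h>
#include <stdlib.h>

static void delete_single_object(const char *key)
{
    printf("S3_delete_object(%s)\n", key);
}

static void list_bucket_and_delete_fragments(const char *prefix)
{
    printf("S3_list_bucket(prefix=%s) and delete all fragments\n", prefix);
}

static void delete_file(const char *key, unsigned long long first_fragment_size)
{
    const char *env = getenv("S3LIB_DELETE_HEURISTICS");
    unsigned long long threshold = env ? strtoull(env, NULL, 10) : 0;

    if (first_fragment_size < threshold)
        delete_single_object(key);               /* cheap path for small files */
    else
        list_bucket_and_delete_fragments(key);   /* full fragment cleanup */
}

int main(void)
{
    delete_file("ior_file_0001-0-2097152", 2097152ULL);
    return 0;
}
```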
The results delivered by S3Embedded are very close to those obtained for direct Lustre access, mainly for files larger than 32 MB, and far superior to those supplied by MinIO-local-gw; they are also free from timeout errors.
Benchmarking with the IO500 reflects the performance improvement delivered by S3Embedded/S3Remote, as shown in Table 3.
We notice that Lustre's performance with POSIX is often more than 10x faster than MinIO-local-gw and that the error rate increases with the number of nodes/PPN. In contrast, the S3 API wrappers deliver much better performance, closer to Lustre's native performance, and are more resilient to the number of clients, as shown in Table 4. Even with 10 or 50 nodes, as seen in Table 5 and Table 6, the S3Embedded library yields a performance closer to Lustre, but a performance gap remains.
Some MDTest results show better performance for S3Embedded than for Lustre, although the stat() call is used for both Lustre and S3Embedded. This might be due to the way S3Embedded implements S3_test_bucket(), where the size and access rights of the directory, and not of the actual file, are captured, which appears to be faster.
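A sketch of this behaviour is shown below. It is an assumed simplification rather than the actual S3Embedded code, and the bucket-to-directory mapping under S3E_ROOT is hypothetical: the point is that the existence check only needs the metadata of the directory that represents the bucket.

```c
/* Assumed simplification of S3Embedded's bucket-existence check: only the
 * directory representing the bucket is stat()ed, so no per-object (i.e.,
 * per-file) metadata has to be read. */
#include <stdio.h>
#include <sys/stat.h>

#define S3E_ROOT "/mnt/lustre/s3e"   /* hypothetical bucket-to-directory mapping */

static int s3e_test_bucket(const char *bucket)
{
    char path[4096];
    struct stat st;

    snprintf(path, sizeof(path), "%s/%s", S3E_ROOT, bucket);
    /* size and access rights come from the directory, not from any file */
    return (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) ? 0 : -1;
}

int main(void)
{
    printf("bucket exists: %s\n", s3e_test_bucket("testbucket") == 0 ? "yes" : "no");
    return 0;
}
```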
The radar chart in Figure 10 shows the relative performance of S3Embedded and S3Remote in percent for three independent runs of all benchmarks. Note that the three Lustre runs are so similar that they overlap in the figure. The graph clearly shows the performance gaps of the two libraries. For the sake of comparison, all relative performance numbers for the run with s3embr-3 are listed. Only for the ior-hard-write phase is the number close to 100%, while it often achieves only 5–10% of Lustre's performance. The embedded library also lacks performance for some benchmarks, but is much better overall.
5. Conclusions
The S3 API is the de facto standard for accessing cloud storage, which is why it is the component of choice when building cloud-agnostic applications. By amending the IO500 to benchmark the S3 interface, we broaden the scope of IO500 usage and enable the community to track the performance growth of S3 over the years and to analyze changes in the S3 storage landscape, which will encourage the sharing of best practices for performance optimization. Unfortunately, the results obtained in Section 3 indicate that S3 implementations such as MinIO are not yet ready to serve HPC workloads because of the drastic performance loss and the lack of scalability.
We believe that the remote access to S3 is mainly responsible for the performance loss and should be addressed. We conclude that S3 with any gateway mode is not yet a suitable alternative for HPC deployment, as the additional data transfer without RDMA support is costly. However, as the experiments in Section 4 show, an embedded library could be a viable way to allow existing S3 applications to use HPC storage efficiently. In practice, this can be achieved by linking to an S3 library provided by the data center.
By introducing S3Embedded, a new lightweight drop-in replacement for libs3, we investigate the cause of the performance loss while providing a path toward cloud-HPC-agnostic applications that can be run seamlessly in the public cloud or on HPC systems.
Future Work
In the future, we aim to improve the S3Embedded library further and to explore the conversion from S3 to non-POSIX calls. We also intend to run large-scale tests against cloud/storage vendors on their HPC ecosystems to compare S3 API performance. Ultimately, our goal is to identify which APIs HPC applications need to achieve optimal performance while supporting HPC and cloud convergence.