An Empirical Performance Evaluation of Multiple Intel Optane Solid-State Drives

Abstract: Cloud computing as a service-on-demand architecture has grown in importance over the last few years. The storage subsystem in cloud computing has undergone enormous innovation to provide high-quality cloud services. Emerging Non-Volatile Memory Express (NVMe) technology has attracted considerable attention in cloud computing by delivering high I/O performance in latency and bandwidth. Specifically, multiple NVMe solid-state drives (SSDs) can provide higher performance, fault tolerance, and storage capacity in the cloud computing environment. In this paper, we performed an empirical evaluation study of the performance of recent NVMe SSDs (i.e., Intel Optane SSDs) in different redundant array of independent disks (RAID) environments. We analyzed multiple NVMe SSDs with RAID in terms of different performance metrics via synthetic and database benchmarks. We anticipate that our experimental results and performance analysis will have implications for various storage systems. Experimental results showed that the software stack overhead reduced the performance by up to 75%, 52%, 76%, 91%, and 92% in RAID 0, 1, 10, 5, and 6, respectively, compared with the theoretical and expected performance.


Introduction
Cloud computing is widely used since it provides flexibility to users and increases system utilization [1][2][3]. Clouds, or clusters of distributed computers, provide on-demand resources and services over a network [4]. Cloud computing systems provide various high-performance, large-scale clustered storage devices to handle large amounts of data [5]. In addition, emerging Non-Volatile Memory Express (NVMe) technology has garnered considerable attention in cloud and enterprise storage subsystems by delivering higher I/O performance in terms of latency and bandwidth [6,7]. With this development, industries have begun to adopt NVMe SSDs in various settings. For example, because of their performance benefits compared with SSDs that use traditional interfaces (e.g., SATA, SAS), cloud platforms (e.g., Cloud Platform [8], Amazon Web Service (AWS) [9]) provide NVMe options for their storage solutions. Table 1 shows a simple benchmark result with a Flash-based SATA SSD (Micron CT250MX500SSD1), a Flash-based NVMe SSD (Intel P3700), and an Optane SSD (Intel Optane 900P). As shown in the table, the Optane SSD outperformed the Flash-based SATA SSD by up to 10 times. The read performance of the Flash-based NVMe SSD was close to that of the Optane SSD; however, the write performance of the Optane SSD outperformed that of the Flash-based NVMe SSD. The Optane SSD supports consistent performance on both the buffered and direct I/O paths, for sequential and random patterns, and for both read and write operations. Moreover, the Optane SSD performs in-place updates in 3D XPoint memory, which do not incur garbage collection (GC) overheads. In this article, we evaluated and analyzed the performance of Optane SSDs developed by Intel, focusing on RAID environments, via synthetic and database benchmarks.
The Optane SSD has a capacity of 480 GB and a performance of up to 2.5 GB/s for both read and write operations with a 512 KB block size and multiple threads. We compared the performance of Optane SSDs under various RAID schemes, which have different characteristics. Moreover, we analyzed the results of the synthetic and database benchmarks. By doing so, we identified software bottlenecks that were exposed by the rapid growth of storage performance. (This article is an extended version of our paper published in the International Conference on Information Networking (ICOIN '21) in January 2021 [10].) The contributions of our work are the following: (1) the results of our performance study on a real Optane SSD array, (2) a performance analysis of the Optane SSD array according to workload behavior to show the feasibility and benefits of the Optane SSD array, and (3) the identification of bottlenecks in the storage stack. The rest of this article is organized as follows: Section 2 discusses the background and related work. Section 3 explains the experimental setup. Section 4 evaluates and analyzes the performance using a synthetic benchmark and database benchmarks. Section 5 provides the summary and implications of the research. Finally, Section 6 concludes this article.

Performance Evaluation of Fast Storage Devices
There have been several research efforts to optimize and evaluate the performance of non-volatile memory storage devices. Some works have optimized and evaluated file systems and memory management for fast storage devices [6,[11][12][13]. Wu et al. [14] analyzed a popular NVM-based block device, the Intel Optane SSD. They formalized the rules that Optane SSD users need to follow, provided experiments presenting the impact of violating each rule, and examined the internals of the Optane SSD to provide insights for each rule. The resulting unwritten contract provided implications and pointed to directions for potential research on NVM-based devices. Kim et al. [15] explored the opportunities for PCM technology within enterprise storage systems. They compared the latest PCM SSD prototype with an eMLC Flash SSD to understand the performance characteristics of the PCM SSD as another storage tier, given the right workload mixture, and conducted a modeling study to analyze the feasibility of PCM devices in a tiered storage environment. Xu et al. [7] presented an analysis and characterization of SSDs based on the Non-Volatile Memory Express standard for storage subsystems. They showed that there was a benefit to be gained from re-architecting the existing I/O stack of the operating system, and they verified the rated, raw performance of the NVMe SSDs using synthetic and DB benchmarks. Son et al. [6] evaluated the NVMe SSD's performance with microbenchmarks and database workloads in a number of different I/O configurations and compared it with the SATA SSD's performance. Bhimani et al. [16] investigated the performance effect of increasing the number of simultaneously operating Docker containers supported by the Docker data volume on a striped logical volume of multiple SSDs.
Our study was in line with these studies in terms of evaluating fast storage devices. In contrast, we focused on evaluating and analyzing the performance of fast storage devices in different RAID schemes.

Study on Redundant Array of Independent Disks Schemes
There have been many studies on RAID schemes [17][18][19][20]. Menon et al. [17] described an approach that supports compression and cached RAIDs, called log-structured arrays (LSA). They gave performance comparisons of RAID-5 designs that support compression only in the cache (not on disk) versus LSA, which supports compression both on disk and in the cache. Chen et al. [19] contributed a study of the design issues, the development of mathematical models, and an examination of the performance of different disk array architectures in both small-I/O and large-I/O environments. Le et al. [20] proposed an analytical model to quantify the reliability dynamics of an SSD RAID array. They compared the reliability dynamics of the traditional RAID-5 scheme and the newer Diff-RAID scheme under different error patterns and different array configurations. Our study was similar to these studies in terms of evaluating the RAID environment. In contrast, we focused on evaluating and analyzing the performance of fast storage devices with different RAID schemes.

NVMe SSD
Compared with conventional SATA SSDs, the current generation of Peripheral Component Interconnect Express (PCIe) Flash SSDs is immensely popular in modern industrial environments due to their higher bandwidth and faster random access [21]. However, early PCIe-based SSDs used non-standard specification interfaces. For faster adoption and interoperability of PCIe SSDs, industry leaders defined the NVM Express standard, arguing that standardized drivers reduce verification time. NVMe is a specification for accessing SSDs connected to PCIe buses. It defines optimized register interfaces, command sets, and function sets [6]. These NVM technologies provide significant performance levels for persistent storage. One example is 3D XPoint memory [22,23] from Intel and Micron, available on the open market under the brand name Optane [24]. The Optane SSD is based on 3D XPoint memory, which is claimed to provide up to 1000 times lower latency than NAND Flash [23]. The influence of this extremely low-latency, high-bandwidth storage on computing is significant.

Redundant Array of Independent Disks
Initially, RAID [18] was designed to use relatively inexpensive disks as a disk array in response to the rapid increase in CPU and memory performance. Since then, RAID schemes have become popular in cloud computing and big data environments because of their characteristics, providing higher fault tolerance, performance, and other benefits [25][26][27].
Moreover, RAID has three main strategies. First, striping divides the flow of data into blocks of a specified chunk size and writes each block to a different member device. This strategy improves the performance of RAID. Second, mirroring stores an identical copy of the requested data simultaneously on each RAID member. This strategy improves fault tolerance and read performance. Finally, parity is calculated by a specific parity function over a block of data. The parity strategy allows data to be recalculated via the checksum method, providing RAID with fault tolerance. In addition, RAID has various characteristics depending on the level. We chose five RAID types that are in general use, briefly described as follows:
• RAID 0 provides a striping strategy;
• RAID 1 provides a mirroring strategy;
• RAID 5 provides striping with parity;
• RAID 6 provides striping with double parity;
• RAID 10 provides a stripe of mirrors (a combination of RAID 1 and 0).
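As an illustration of the parity strategy, the following Python sketch computes an XOR parity chunk over a stripe and recovers a lost chunk from the survivors. The tiny four-byte chunks and function names are chosen only for demonstration and do not reflect the actual RAID implementation in the Linux kernel:

```python
# Illustrative sketch of XOR parity (as used by RAID 5's parity strategy).
# Chunk size and data values are hypothetical, chosen only for this example.
CHUNK = 4  # bytes per chunk, tiny for illustration

def xor_parity(chunks):
    """Compute the XOR parity chunk over a stripe of equally sized chunks."""
    parity = bytes(CHUNK)
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return parity

# A stripe written across three data devices plus one parity device.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(stripe)

# If one device fails, its chunk is recalculated from the survivors + parity.
lost = stripe[1]
recovered = xor_parity([stripe[0], stripe[2], parity])
assert recovered == lost
```

Because XOR is its own inverse, recomputing the parity over the surviving chunks and the old parity yields exactly the missing chunk, which is what gives the parity levels their fault tolerance.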

Setup
For the experimental setup, we used servers equipped with an Intel(R) Xeon(R) Gold 6242 (2.8 GHz) CPU, which has 64 physical cores and 64 GB of memory. For the storage devices, we used eight Intel Optane SSD 900P (480 GB) devices with RAID 0, 1, 5, 6, and 10 [18]. Our server had eight PCIe sockets, so we could use up to eight Optane SSDs. We used Linux Kernel 5.4.1, Ubuntu 20.04 LTS, and EXT4 [28] as the filesystem. We evaluated the performance of the Optane SSDs in different RAID environments via synthetic and database benchmarks. We used the FIO benchmark [29] as the synthetic benchmark. For the database benchmarks, we used DBbench with RocksDB v6.1.18 and TPC-C [30] with MySQL 5.7 InnoDB [31] to evaluate the Intel Optane SSDs in terms of bandwidth and latency. Note that all experimental results were the average of five runs. Table 1 shows the baseline performance of a single Intel Optane SSD via the FIO benchmark; the configuration of the FIO benchmark is explained in Section 4.1. As shown in Table 1, no significant difference existed in performance across direct I/O, buffered I/O, read operations, and write operations. Note that the maximum throughput of the Intel Optane SSD 900P was 2.5 GB/s for sequential reads and 2.4 GB/s for sequential writes, for both direct and buffered I/O.
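For reference, the FIO configuration used in Section 4.1 (64 threads, a 4 GB file per thread, and a 512 KB block size) could be expressed as a jobfile along the following lines; the mount directory, I/O engine, and job names are assumptions for illustration rather than details taken from our scripts:

```ini
; Hypothetical FIO jobfile matching the configuration in Section 4.1.
; directory, ioengine, and job names are assumptions, not from the paper.
[global]
ioengine=libaio
bs=512k            ; 512 KB block size (the default RAID chunk size)
size=4g            ; each thread writes a 4 GB file
numjobs=64         ; 64 concurrent threads
directory=/mnt/raid
group_reporting=1

[seq-write-direct]
rw=write
direct=1           ; direct I/O path (bypasses the page cache)

[rand-read-buffered]
rw=randread
direct=0           ; buffered I/O path (through the page cache)
```

Switching `rw` among write/randwrite/read/randread and toggling `direct` covers the four access-pattern and I/O-path combinations evaluated below.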

Synthetic Benchmark
As mentioned earlier, we used the FIO benchmarking tool as the synthetic benchmark. We executed 64 threads, each writing a 4 GB file with a 512 KB block size (the default chunk size for RAID). We evaluated both buffered and direct I/O to analyze the different I/O paths. Figure 1 shows the throughput and the latency for different numbers of devices in a RAID 0-level environment with different options. Theoretically, when constructing RAID 0 with N devices, the capacity of RAID 0 is (N × one device capacity), the expected write performance is (N × write performance of one device), and the expected read performance is (N × read performance of one device). Figure 1a shows experimental results with the buffered read operation. In this configuration, no significant difference existed in the read performance between random and sequential access patterns. Moreover, the read throughput increased and the latency decreased as the number of devices increased. As shown in the figure, the read performance of RAID 0 reached up to 4.8 GB/s, 7.2 GB/s, 8.1 GB/s, and 9 GB/s in the case of 2, 4, 6, and 8 devices, respectively. This was because the RAID 0 level uses a striping strategy, as mentioned above, so performance improves as the number of devices increases. However, the read throughput was still lower than the expected read performance of RAID 0. Figure 1b shows experimental results with the buffered write operation. As shown in the figure, the write performance was similar regardless of the number of devices: even as the number of devices increased, no significant difference existed in the write performance. There was also no significant performance difference between sequential and random writes. Figure 1c shows experimental results with the direct read operation. As shown in the figure, the performance increased as the number of devices increased.
The read performance reached up to 5.1 GB/s, 9.9 GB/s, 12.4 GB/s, and 16.5 GB/s in the case of 2, 4, 6, and 8 devices, respectively. Figure 1d shows experimental results with the direct write operation. As shown in the figure, the write performance also increased as the number of devices increased, improving from 4.5 GB/s for two devices to 16.9 GB/s for eight devices. Since direct I/O bypasses the Linux kernel page caching layer, it can reduce the influence of the OS; with less OS influence, direct I/O outperformed buffered I/O. Moreover, the direct I/O performance was close to the expected performance of RAID 0. Figure 2 shows the throughput and the latency for different numbers of devices in a RAID 1-level environment with different options. Theoretically, when constructing RAID 1 with N devices, the capacity of RAID 1 is (1 × one device capacity), the expected write performance is (1 × write performance of one device), and the expected read performance is (N × read performance of one device). Figure 2a shows experimental results with the buffered read operation. In this scheme, the performance of sequential reads was similar to that of random reads. The figure also shows that the read throughput increased and the latency decreased as the number of devices increased. As shown in the figure, the read performance of RAID 1 reached up to 4 GB/s, 6.8 GB/s, 7.1 GB/s, and 7.7 GB/s in the case of 2, 4, 6, and 8 devices, respectively. This outcome was because the RAID 1 level uses a mirroring strategy, as mentioned above, resulting in improved read performance as the number of devices increased. Figure 2b shows experimental results with the buffered write operation. The write performance of two devices was lower than that of a single device (1.7 GB/s). Furthermore, the performance did not change even as the number of devices increased (1.5 GB/s for eight devices).
As the write operations were conducted using a mirroring strategy, the data were written to one device, and copies of the data were written to all other devices. Even though this mirroring strategy reduced the write performance, it increased the fault tolerance and read performance. Figure 2c shows experimental results with the direct read operation. As shown in the figure, the read performance increased as the number of devices increased, reaching up to 5.1 GB/s, 9.8 GB/s, 13.3 GB/s, and 15.2 GB/s in the case of 2, 4, 6, and 8 devices, respectively. The read performance of direct I/O was significantly higher than that of the buffered read. However, the read performance of direct I/O did not reach the expected read performance of RAID 1. Figure 2d shows experimental results with the direct write operation. As in Figure 2b, the write performance did not increase as the number of devices increased; it decreased from 2.3 GB/s for two devices to 1.9 GB/s for eight devices. Since the copies of the data were written to all other devices, the write performance decreased slightly as the number of devices increased. In the case of the write performance, we could not observe an upper bound. Since the mirroring strategy uses one device for data and the remaining devices as mirrors, RAID 1 could not reach the upper bound of the buffered write performance (4.2 to 4.6 GB/s in Figure 1). Figure 3 shows the throughput and the latency for different numbers of devices in a RAID 10-level environment with different options. Theoretically, when constructing RAID 10 with N devices, the capacity of RAID 10 is (N × one device capacity ÷ 2), the expected write performance is (N × write performance of one device ÷ 2), and the expected read performance is (N × read performance of one device). Note that RAID 10 is a combination of RAID 0 and 1; therefore, RAID 10 requires at least four devices.
Thus, we configured the number of devices as 4, 6, and 8. Figure 3a shows experimental results with the buffered read operation. As shown in the figure, the read performance reached up to 7 GB/s, 8.3 GB/s, and 8.3 GB/s in the case of 4, 6, and 8 devices, respectively. Contrary to expectations, the read performance did not scale well as the number of devices increased; the results were similar to RAID 1's buffered read results. Figure 3b shows experimental results with the buffered write operation. From the experimental results, the write performance was affected by the redundant write operations of the RAID 1 scheme. The write performance of RAID 10 was very similar to that of RAID 1 in terms of throughput and latency, and it did not increase as the number of devices increased. Figure 3c shows the experimental results with the direct read operation. As shown in the figure, the read performance reached up to 9.0 GB/s, 10.9 GB/s, and 13.9 GB/s in the case of 4, 6, and 8 devices, respectively. The read performance of direct I/O was higher than the buffered read performance. Since the RAID 10 scheme is nested and consists of RAID 0 and 1, the read performance showed results similar to RAID 1 and RAID 0. The experimental results revealed that the read performance was affected by the overhead from the page cache (as in the RAID 0 and 1 results). Figure 3d shows the experimental results with the direct write operation. As shown in the figure, the write performance reached up to 4.4 GB/s, 6.4 GB/s, and 7.9 GB/s in the case of 4, 6, and 8 devices, respectively. Compared to Figure 3b, we could observe differences in performance between buffered write and direct write: the write performance of buffered I/O showed results similar to RAID 1, whereas the write performance of direct I/O showed results similar to RAID 0.
Figure 4 shows the throughput and the latency for different numbers of devices in a RAID 5-level environment with different options. Theoretically, when constructing RAID 5 with N devices, the capacity of RAID 5 is ((N − 1) × one device capacity), the expected write performance is ((N − 1) × write performance of one device ÷ 4), and the expected read performance is ((N − 1) × read performance of one device). Figure 4a shows experimental results with the buffered read operation. In this scheme, the performance of random reads was higher than that of sequential reads. Moreover, the read throughput increased and the latency decreased as the number of devices increased. As shown in the figure, the read performance of RAID 5 reached up to 6.1 GB/s, 7 GB/s, 8 GB/s, and 8.7 GB/s in the case of 3, 4, 6, and 8 devices, respectively. Note that RAID 5 requires at least three devices; thus, we configured the number of devices as 3, 4, 6, and 8. Figure 4b shows experimental results with the buffered write operation. The write performance in all the RAID 5 cases was significantly lower than that of a single device, reaching only 499 MB/s, 611 MB/s, 750 MB/s, and 672 MB/s in the case of 3, 4, 6, and 8 devices, respectively. This was because the write operation was performed with block-interleaved distributed parity. This scheme must perform parity operations to increase fault tolerance for every write operation, and the computed parity is also stored on other devices. This result demonstrated that these operations can significantly reduce performance. Figure 4c shows the experimental results with the direct read operation. Similar to the previous experimental results (RAID 0, 1, and 10), the read performance increased as the number of devices increased. The difference in the read performance between buffered I/O and direct I/O was also similar to previous experiments.
Figure 4d shows the experimental results with the direct write operation. The write performance reached up to 329 MB/s, 321 MB/s, 342 MB/s, and 384 MB/s in the case of 3, 4, 6, and 8 devices, respectively. For writes, the direct I/O performance was lower than the buffered I/O performance. Since the RAID 5 write operation is performed with block-interleaved distributed parity, RAID 5 must calculate the parity when processing each write operation. Buffered I/O uses the page cache, which can accelerate the parity calculation via page cache hits. Thus, under the parity strategy, buffered writes can outperform direct writes.
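The read-modify-write path behind RAID 5's write penalty of 4 can be sketched as follows; this is a simplified model (hypothetical function name, one-byte chunks) used only to illustrate why each logical write costs four device I/Os:

```python
# Sketch of RAID 5's small-write (read-modify-write) path, which explains
# the write penalty of 4 in the expected-performance formula above.
def raid5_small_write(old_data, old_parity, new_data):
    """Update one data chunk and its parity; returns (new_parity, io_count)."""
    # The new parity is old_parity XOR old_data XOR new_data, so only the
    # affected data chunk and parity chunk need to be read, not the stripe.
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
    ios = 4  # read old data, read old parity, write new data, write new parity
    return new_parity, ios

# Example stripe: d0 = 0x0f, d1 = 0xf0, so parity = 0xff; update d1 to 0x33.
new_parity, ios = raid5_small_write(old_data=b"\xf0", old_parity=b"\xff", new_data=b"\x33")
assert new_parity == bytes([0x0f ^ 0x33])  # equals d0 XOR the new d1
assert ios == 4
```

RAID 6 extends the same path with a second, independently computed parity chunk, raising the per-write cost to six I/Os, which matches the ÷6 factor in its expected write performance.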
Across the results of Figures 1-3, the performance of direct I/O was significantly higher than that of buffered I/O, even though, in general cases, buffered I/O performance is higher than direct I/O [32]. According to our observation, a page caching issue occurred within the Linux kernel. The issue did not depend on the file system; therefore, when direct I/O was used, the page cache overhead could be avoided since direct I/O bypasses the page cache. Figure 5 shows the throughput and the latency for different numbers of devices in a RAID 6-level environment with different options. Theoretically, when constructing RAID 6 with N devices, the capacity of RAID 6 is ((N − 2) × one device capacity), the expected write performance is ((N − 2) × write performance of one device ÷ 6), and the expected read performance is ((N − 2) × read performance of one device). Note that RAID 6 requires at least four devices. Thus, we configured the number of devices as 4, 6, and 8. Figure 5a shows experimental results with the buffered read operation. As shown in the figure, the read performance of RAID 6 reached up to 7.1 GB/s, 7.8 GB/s, and 8.5 GB/s in the case of 4, 6, and 8 devices, respectively. Figure 5b shows experimental results with the buffered write operation. The write performance in all the RAID 6 cases was significantly lower than the single-device write performance, similar to RAID 5. The write performance of RAID 6 reached only 493 MB/s, 646 MB/s, and 570 MB/s in the case of 4, 6, and 8 devices, respectively, because the write operation was performed by striping with double parity. Each write operation in this scheme must read the data, read the first parity, read the second parity, write the data, write the first parity, and, finally, write the second parity. The direct I/O results for 4, 6, and 8 devices showed a performance difference similar to the experimental results for RAID 5.
As mentioned above, the direct write performance was lower than the buffered write performance due to page caching.
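The theoretical formulas used throughout this section can be collected into one small model. The single-device throughput figures come from Table 1 (2.5 GB/s read, 2.4 GB/s write); the helper function itself is an illustrative sketch, not part of our measurement tooling:

```python
# Sketch of the expected-performance model used in Section 4.1.
# Single-device figures (GB/s) are from Table 1; the function itself
# is an illustration, not from the paper's tooling.
READ_1DEV, WRITE_1DEV = 2.5, 2.4

def expected(level, n):
    """Return (expected read GB/s, expected write GB/s) for n devices."""
    if level == 0:    # striping
        return n * READ_1DEV, n * WRITE_1DEV
    if level == 1:    # mirroring: all reads scale, writes go to every mirror
        return n * READ_1DEV, 1 * WRITE_1DEV
    if level == 10:   # stripe of mirrors
        return n * READ_1DEV, n * WRITE_1DEV / 2
    if level == 5:    # single parity (small-write penalty of 4)
        return (n - 1) * READ_1DEV, (n - 1) * WRITE_1DEV / 4
    if level == 6:    # double parity (small-write penalty of 6)
        return (n - 2) * READ_1DEV, (n - 2) * WRITE_1DEV / 6
    raise ValueError(f"unsupported RAID level: {level}")

# Eight-device RAID 0: expected 20 GB/s read, versus the measured
# 16.5 GB/s with direct I/O and 9 GB/s with buffered I/O (Figure 1).
r, w = expected(0, 8)
print(f"RAID 0, 8 devices: read {r:.1f} GB/s, write {w:.1f} GB/s expected")
```

Comparing each measured throughput against this model is how the software-stack overhead percentages quoted in the abstract and conclusion were derived.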

Database Benchmark
As mentioned earlier, we used DBbench and TPC-C [30] to evaluate the Intel Optane SSDs. DBbench is a popular benchmark provided by RocksDB to evaluate KV stores, and it provides various I/O operations. We evaluated four operations: fill random (FillRand), fill sequential (FillSeq), read random (ReadRand), and read sequential (ReadSeq). We configured the maximum number of read/write buffers at 64 and the batch size at 8 and used 100 million read and write operations. We also used direct I/O for the flush, compaction, and write and read operations. Figures 6 and 7 show experimental results under various RAID schemes. Figure 6 shows the DBbench results on the Intel Optane SSDs under the RAID 0, 1, and 10 schemes. Figure 6a shows the performance results for RAID 0: 1.3 MOPS, 1.8 MOPS, 1.1 MOPS, and 60.1 MOPS, with latencies of 46.3 µs, 35.2 µs, 52.7 µs, and 1.06 µs per operation, for fill random, fill sequential, read random, and read sequential, respectively. Figure 6b shows the performance results for RAID 1: 0.7 MOPS, 1.74 MOPS, 1.1 MOPS, and 60.8 MOPS, with latencies of 83.9 µs, 36.5 µs, 56.6 µs, and 1.05 µs per operation, respectively. Figure 6c shows the performance results for RAID 10: 0.9 MOPS, 1.8 MOPS, 1.1 MOPS, and 52.7 MOPS, with latencies of 69.3 µs, 34.8 µs, 54.5 µs, and 1.2 µs per operation, respectively. As we expected, similar strategies showed similar results; thus, RAID 0 was the fastest in most operations. Since transaction overhead exists in the database application, the performance of DBbench could not reach the full performance of the storage array (Figures 1-3). Figure 7 shows the DBbench results on the Intel Optane SSDs under the remaining RAID schemes.
Figure 7a shows the performance results for RAID 5: 0.49 MOPS, 1.7 MOPS, 1.1 MOPS, and 57.8 MOPS, with latencies of 129.5 µs, 36.3 µs, 54.2 µs, and 1.1 µs per operation, for fill random, fill sequential, read random, and read sequential, respectively. Figure 7b shows the performance results for RAID 6: 0.9 MOPS, 1.8 MOPS, 1.1 MOPS, and 59.6 MOPS, with latencies of 159 µs, 36.8 µs, 55.5 µs, and 1.07 µs per operation, respectively. Comparing Figure 6 with Figure 7, no significant difference in performance existed among the various RAID schemes, and almost all performance, except for the sequential read workload, was in the range of 84.4 to 214.8 MB/s. The performance overhead from the database concealed the characteristics of the various RAID schemes that were observed in the previous experiments (Figures 1-5). The TPC-C benchmark is a mixture of read and write (1.9:1) transactions [30] that simulates online transaction processing (OLTP) application environments. We configured the user buffer size to 1 GB, the page size to 16 KB, the flushing method to direct I/O (O_DIRECT), the ramp-up time to 180 s, and the execution time to 10 min. Figure 8 shows the TPC-C results on the Intel Optane SSDs under different RAID environments. The performance of each scheme was 143,027, 94,375, 42,183, 34,581, and 112,536 tpmC for RAID 0, 1, 5, 6, and 10, respectively. In the case of RAID 0, the tpmC was the highest because all data were only striped. The second-highest performance was RAID 10, owing to the advantages of striping (RAID 0) and the increased read performance through mirroring (RAID 1).
For RAID 1, the results showed high performance due to the increased read performance: although its write performance was very low, the TPC-C benchmark for OLTP workloads has a larger ratio of reads than writes (a read-to-write ratio of 1.9:1). As mentioned earlier, RAID 5 exhibited lower performance than the other RAID schemes due to the overhead of parity operations. Moreover, RAID 6 had even lower performance than RAID 5 due to the overhead of the double-parity operations.
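The bandwidth range quoted above for DBbench can be recovered from the reported MOPS figures with a back-of-the-envelope conversion. The 116-byte record size (16-byte keys plus 100-byte values, the db_bench defaults) is an assumption on our part, since the paper does not state the key/value sizes:

```python
# Back-of-the-envelope conversion from DBbench MOPS to MB/s.
# The 116-byte record (16 B key + 100 B value, db_bench defaults) is an
# assumption; the exact sizes are not stated in the paper.
KEY_BYTES, VALUE_BYTES = 16, 100
RECORD_BYTES = KEY_BYTES + VALUE_BYTES

def mops_to_mbps(mops):
    """Convert millions of operations/s into MB/s for fixed-size records."""
    return mops * 1e6 * RECORD_BYTES / 1e6  # simplifies to mops * RECORD_BYTES

# RAID 1 fill random (0.7 MOPS) and RAID 0 fill sequential (1.8 MOPS):
low, high = mops_to_mbps(0.7), mops_to_mbps(1.8)
print(f"{low:.1f} MB/s to {high:.1f} MB/s")
```

Under this assumption, the slowest and fastest non-sequential-read workloads map to roughly 81 and 209 MB/s, close to the 84.4 to 214.8 MB/s range reported above, which supports the observation that the database layer, not the RAID scheme, bounds the throughput.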

Performance Comparison under Multiple OSes and SSDs
We performed the evaluation and compared the results on multiple OSes (i.e., Ubuntu 20.04 and CentOS 7) and multiple SSDs (i.e., Intel Optane SSDs (900P) and Intel Flash-based NVMe SSDs (P3700)). Figure 9 shows experimental results on multiple OSes and SSDs in the case of RAID 0. Note that the single-device performance is depicted in Table 1. Figure 9a shows the experimental results of RAID 0 with the Intel Optane SSD (900P) on both OSes. As shown in the figure, there was no significant performance difference between Ubuntu and CentOS, since both have a similar I/O path (e.g., page cache layer). Figure 9b shows the experimental results of RAID 0 with the Intel Flash-based NVMe SSD (P3700) on both OSes. As shown in the figure, there was still no significant performance difference between Ubuntu and CentOS. Comparing the two SSDs, the Flash-based NVMe SSDs performed better by up to 22.1% compared with the Optane SSDs in the case of read operations. This showed that the Flash-based NVMe SSDs can outperform the Optane SSDs for read operations in a RAID configuration. Meanwhile, the Optane SSD performed better by up to 18.4% compared with the Flash-based NVMe SSD in the case of write operations. As mentioned earlier, the 3D XPoint technology of the Optane SSD does not incur GC overhead, so it can provide higher and more consistent write performance. Through the experimental results, we can recommend the Optane SSD (900P) for write-intensive workloads and the Flash-based NVMe SSD (P3700) for read-intensive workloads.

Comparison with Related Works
In this section, we provide a comparison of our results and those of related works in the case of a single device, as shown in Table 2. All the works used the same Intel Optane SSD (900P), but with different configurations. For the FIO configuration, Zhang et al. [33] ran FIO for 30 s and stored fixed-size data (i.e., 20 GB) on the SSD before the performance evaluation. Yang et al. [34] ran FIO with a 4 KB block size, four threads, and an I/O depth of 32. Our experimental setup was described in Section 3. As shown in the table, there was no significant difference except for the latency. We assumed that the larger I/O depth in Yang et al. [34] could increase the latency compared with the other works. Since the number of cores in our setup was larger than those of the related works, we could assume that our evaluation results provided slightly higher throughput and lower latency than the others due to the higher parallelism.

Summary and Implication
In this section, we summarize the implications of the evaluation study performed in this article. Our main findings and insights were as follows:
• Through the performance baseline of the Intel Optane SSD in Section 3, we showed no difference in performance between direct I/O and buffered I/O in a single-device environment. However, when compared in a RAID environment (Figures 1-5), there was a difference in performance. This showed that the storage stack has an overhead;
• As shown in Figure 3, nested RAID (RAID 10) performed similarly to RAID 1 (Figure 2). This means that there were certain bottlenecks in the software RAID environment;
• As shown in Figures 4 and 5, the parity operations caused serious overhead to the write performance in RAID. This was because each parity operation must read the data, read the parity, write the data, and, finally, write the parity.

Conclusions and Future Work
In this paper, we evaluated and analyzed the performance of Optane SSDs via synthetic and database benchmarks in different RAID schemes. First, we presented the results of our performance study on a real Optane SSD array. Second, we analyzed the performance of the Optane SSD array according to the workload behavior to examine the feasibility and benefits of the Optane SSD array. Finally, we identified bottlenecks in the storage stack. Our experimental results showed that the software stack overhead reduced the performance by up to 75%, 52%, 76%, 91%, and 92% for RAID 0, 1, 10, 5, and 6, respectively, compared with the expected performance. This result revealed a bottleneck in the OS software stack, exposed by the significant performance improvement of storage devices. In the future, we plan to analyze and optimize the storage stack to improve the performance of different RAID schemes.