UHNVM: A Universal Heterogeneous Cache Design with Non-Volatile Memory

Abstract: During recent decades, non-volatile memory (NVM) has been anticipated to scale up main memory, improve application performance, and narrow the speed gap between main memory and storage devices, while supporting persistent storage to cope with power outages. However, to fit NVM, existing DRAM-based applications have to be rewritten by developers. The developer must therefore understand the targeted application code well enough to manually identify and store the data suited to NVM. To facilitate NVM deployment for existing legacy applications intelligently, we propose a universal heterogeneous cache hierarchy (UHNVM) that automatically selects and stores the appropriate application data in non-volatile memory, without requiring code understanding. In this article, a program context (PC) technique is proposed in user space to help UHNVM classify data. Compared to the conventional hot or cold file categories, the PC technique categorizes application data in a fine-grained manner, enabling us to store them in either NVM or SSDs efficiently for better performance. Our experimental results using a real Optane dual-inline-memory-module (DIMM) card show that our heterogeneous architecture reduces elapsed time by about 11% compared to a conventional kernel memory configuration without NVM.


Introduction
Solid state drives (SSDs) are often used as a cache to expand the memory capacity of data-intensive applications [1][2][3][4][5]. SSDs provide an order of magnitude larger capacity than DRAM while offering higher throughput than HDDs. For this reason, placing SSDs between DRAM and HDDs is a reasonable choice to boost application speed economically. However, the long latency of SSDs compared to DRAM inevitably degrades performance, especially for applications with a large memory footprint. To overcome this limitation, NVM is actively being considered to enlarge the memory capacity of computing systems [6][7][8][9]. NVM not only has a higher density than DRAM but also provides an access latency comparable to DRAM; it thus reduces the performance penalty of memory scaling compared to SSDs. Since Intel released a commercial NVM product, the Intel Optane DIMM, the technology has been drawing increasing attention from large-scale internet service providers.
However, there is a significant limitation to using NVM in real systems. To take advantage of NVM, existing software must be revised to identify the data suitable for NVM and to maintain them in NVM using proprietary libraries. As an example, Intel offers a set of libraries for maintaining data within NVM, such as the persistent memory development kit (PMDK) [10], which includes implementations of various data structures and logging functionalities tailored to NVM. This lack of compatibility requires a laborious and time-consuming development process, significantly hindering the adoption of NVM in practical systems [7,8].
Motivated by this observation, we present a transparent cache design that resides between the user application and the kernel layer and transparently maintains data over the heterogeneous memory. This property eliminates the need to modify existing software to exploit NVM, making the use of NVM much easier. The universal heterogeneous cache hierarchy design for NVM (UHNVM) interposes on the system calls that user-level applications issue to access file data. We then check whether the requested data reside in NVM, which acts as a file cache for the disks. If the data are found in NVM, they are served directly to the user program; otherwise, they are retrieved from storage.
To increase the cache hit ratio of UHNVM, we further optimize the caching policy by taking into consideration the physical characteristics of NVM. Our empirical analysis reveals that NVM (i.e., Intel Optane DIMM) is more sensitive to workload changes than DRAM. Specifically, NVM performance drops by 80% for random workloads compared to sequential ones, while DRAM shows only a 20% performance gap between the two [4]. This implies that NVM is more suitable for caching data blocks that are likely to be accessed in a sequential pattern rather than a random one. Based on this observation, UHNVM internally predicts the access patterns of data blocks from the program context (PC) of the request, which refers to the call stack that finally invokes the file access operations [11,12]. Because the access patterns of data blocks are highly dependent on their invocation contexts, the PC is known to be a good indicator for predicting them [11,12]. Using this information, UHNVM classifies the access patterns of data blocks into four types: looping, cluster-hot, sequential, and random. UHNVM preferentially caches the first two categories because they have high locality and are thus likely to increase the cache hit ratio. For the latter two, UHNVM prefers data blocks with sequential patterns to those with random patterns, so as to take full advantage of the NVM.
The rest of this article is organized as follows. We explain the background of Intel Optane DIMM in Section 2. Section 3 describes the motives and main thoughts of UHNVM. Section 4 introduces the design of UHNVM. The experimental results are shown in Section 5. Finally, we conclude in Section 6 with a summary.

Background
Before the appearance of commercial NVM, researchers emulated NVM by implementing software emulators over DRAM [13][14][15][16]. DRAM has to be managed by software to simulate NVM, a process that inevitably introduces distortion. Others used hardware emulators, including Intel's persistent memory emulator platform that mimics the latency and bandwidth of NVM [17,18], plain DRAM designs [19][20][21][22], and battery-backed DRAM designs [23]. Such hardware emulators are rather complicated ways to reproduce NVM behavior. Despite these numerous efforts, previous studies conducted without real NVM devices often fail to produce meaningful results, because most of their experiments are based on emulation mechanisms that do not accurately model the actual behavior and performance of product NVMs [4].
The release of a real NVM product, the Intel Optane DIMM, opens the possibility of overcoming this limitation, enabling studies in more realistic environments. The Intel Optane DIMM is connected to the memory bus like DRAM, but it provides data persistence like storage; compared to other persistent storage, it has an order of magnitude lower latency. The Intel Optane DIMM can be configured in two operating modes, memory mode and app direct mode, so that it can satisfy various demands simultaneously. Figure 1 briefly shows the data access paths in memory mode (Figure 1a) and app direct mode (Figure 1b).
The memory mode provides an abstraction of extended main memory, using DRAM as a cache of the NVM. This mode does not expose the presence of the NVM to user applications; no modification of current applications is required, but the applications cannot use the NVM as they intend. In contrast, app direct mode supports in-memory persistence, which opens up a new approach: it allows more software layers to be bypassed, since an application can read from or write to an Optane DIMM without using either a driver or the file system (see Figure 1b). There are four access paths to the Optane DIMM. Two are standard file accesses, through a new NVM-aware file system [24] and a traditional file system, respectively. The NVM-aware file system is designed to take advantage of the new features of the Optane DIMM, and its performance is superior to that of the standard file system. Application programs can also communicate directly with the Optane DIMM as if it were standard host DRAM. These direct accesses are handled by the direct access channel, usually abbreviated "DAX". Across the three access modes in Figure 1b, load/store access, raw device access, and standard file access, software overhead increases gradually; the load/store access used in our design achieves the minimum overhead.

Motivation
This section investigates the performance benefits obtained by adopting the Optane DIMM as a storage cache, in terms of the system architecture (Section 3.1) and the performance characteristics of the Optane DIMM (Section 3.3) with respect to the workload patterns indicated by the PC technique (Section 3.4), which can be exploited when maintaining a cache within it.

Memory Architecture with Optane DIMM
Using a fast storage medium (such as an SSD) as a cache is a common method to scale up the memory capacity of computing systems. With the advent of high-density memory devices such as the Optane DIMM, many computing systems consider using a scalable memory device as a storage cache, as an alternative or supplement to fast storage. Figure 2 compares the data access flow when a traditional SSD is used as a cache of the storage (Figure 2a) with that when an Optane DIMM is deployed (Figure 2b). Figure 2a shows the hierarchical memory system where the SSD serves as a cache between DRAM and HDD. In this architecture, the host experiences a performance drop when a page fault occurs: the application is blocked until the in-kernel page fault handler loads the requested pages from the SSD (or the HDD in the worst case) into DRAM. For applications with a large memory footprint, the number of page faults increases sharply, leading to serious performance degradation. When data accesses are random and small, the page fault overhead is further aggravated because SSDs perform I/O at block granularity (i.e., 4 KB) [3,[25][26][27]. Figure 2b shows the parallel memory architecture where the Optane DIMM is deployed as a cache of the storage. As opposed to the SSD-based cache architecture, the Optane DIMM requires no data load operation because it allows applications to access it directly. This not only avoids the software overhead of going through multiple in-kernel stacks but also eliminates unnecessary data movement between DRAM and disks.

Related Work of NVM
In this section, we present an overview of previous studies relating to persistent memory programming. Prior NVM studies span almost the entire system stack. Some designs built intricate NVM-related data structures for special operations in applications, such as logging and transaction processing [28][29][30]. Others built custom NVM file systems by adding NVM support to existing file systems [20,31,32]. For the NVM-related structures, application programmers must be careful throughout the application design process, which is inconvenient and hard to deploy. As for the NVM file systems, it is unclear how effectively legacy file systems can evolve to accommodate NVM and how much effort is required to adapt legacy applications to exploit it [33]. Compared to the above studies, UHNVM balances performance and complexity thanks to its convenient user-space interfaces and its PC technique.

Performance Analysis of Optane DIMM
To investigate the performance characteristics of the Optane DIMM, we measure the read and write latency of a 256 GB Optane DIMM, generating one million read and write operations with individual load instructions (8-byte unit) in different patterns. The Optane DIMM is deployed in app direct mode. Figure 3 shows the performance of the Optane DIMM compared to that of DRAM. In the figure, we observe that the Optane DIMM has higher read and write latencies than DRAM due to its inherent physical characteristics. One interesting finding is that the Optane DIMM shows a sequential-random gap of almost 2X for reads and 4X for writes, while DRAM performs almost identically across workload patterns. This result indicates that the Optane DIMM is more sensitive to workload changes than DRAM and is more appropriate for servicing sequential workloads than random ones. This feature opens the possibility of further optimizing the Optane DIMM management mechanism.

Efficient Data Patterns Indicated by Program Contexts (PCs)
Given the importance of data patterns, classifying them is a key approach to optimizing the Optane DIMM. As far as we know, most existing studies have programmers tag data patterns according to their understanding of the code [24,33]. In contrast to traditional designs that need manual code modifications, the PC technique in UHNVM separates data automatically according to the application rules described in Section 4.2, without requiring code understanding. UHNVM is a lightweight and efficient design. To further illustrate the efficiency of the PC technique, Figure 4 shows the space-time graph of data references from two applications, cscope and fio-random, executing concurrently. As Figure 4 shows, it is hard to separate these data using only the coordinate axes, time and block number, whereas the PC technique can identify them. For example, the data encircled by the top dotted black circle are represented by PC1 (looping pattern), and the data in the bottom dotted red circle are represented by PC2 (random pattern).

Design of UHNVM
An overview of the UHNVM architecture is exhibited in Figure 5a; it comprises two main modules, the 'Data classification module' (Section 4.2) and the 'Data distribution module' (Section 4.3). The Optane DIMM in this design is configured in app direct mode with Intel's 'libpmemblk' library. In the following subsections, we give detailed explanations of the UHNVM design. First, we compare the architectures of Figure 5a,b to illustrate their differences in Section 4.1. Second, the process by which the program context technique produces finer granularity is described in Section 4.2. Third, to explain the 'Data distribution module', a read I/O request example is used to exhibit the basic data structures in Section 4.3. Figure 5 shows the detailed architectures of UHNVM and of traditional designs for NVM. NVM is the general concept of non-volatile memory, while UHNVM denotes our new architecture built on an NVM device. In Figure 5a, our unified interfaces begin with the 'U_' prefix, an abbreviation of 'universal', while the 'S_' prefix (abbreviation of 'standard') denotes standard interfaces from common GNU libraries. Interfaces with the 'O_' prefix (abbreviation of 'Optane') are the basic Intel 'libpmemblk' library calls, used only for Optane DIMM operations. UHNVM wraps the interfaces with 'S_' and 'O_' prefixes and provides new universal interfaces with 'U_' prefixes to developers.

Architecture Comparison
Through the 'libpmemblk' library, the Optane DIMM in UHNVM is configured and treated as one storage pool. The size of each block unit in this pool is 4 KB. In Figure 5a, the universal interfaces with the 'U_' prefix automatically split the data stream between Optane and host DRAM by judging the pattern results from the 'Data classification module'. The red dotted line represents the bridge between the Optane DIMM and host DRAM for migrating data, which is described in Section 4.3. The red dotted box in Figure 5a is named the 'added universal layer' (AU layer), which is compiled into a static library. With the AU layer, developers can use batch scripts to replace standard interfaces with the universal interfaces of UHNVM, because our universal interfaces use the same function parameters as their traditional counterparts. After adding the AU layer to an application, programmers no longer need to modify its source code manually.
Under the traditional architecture in Figure 5b, data classification for the Optane DIMM can only be performed during the code design of each application. The 'Load' and 'Store' instructions in Figure 5b are the basic commands with which developers manipulate the Optane DIMM. For example, a 'Load' instruction reads a complete XPLine into the XPBuffer [4], while a 'Store' instruction flushes and fences data to persistence. Rewriting all legacy applications for the Optane DIMM in this way is inconvenient and inefficient.

Data Classification Module
The key hypothesis behind our data classification module is the strong correlation between program contexts (PCs) and data patterns [34,35]. To illustrate this correlation and the PC calculation, Figure 6 plots an example of function routines from two applications running at the same time. All function routines sit above our AU layer, first mentioned in the red dotted box of Figure 5a. In Figure 6, we can see that a single terminal function (funct4 or functx) can invoke a read or write function. However, knowing only that a request comes from a single function is not enough to explore the detailed data flow in an application. For example, at least two functions, funct2 and funct3, invoke funct4. They make up two different function routines, 'funct1->funct2->funct4' and 'funct1->funct3->funct4'. If, while the application runs, the first routine (through funct2) is accessed frequently while the second routine (through funct3) is accessed only once, their access frequencies differ.
Since funct4 alone cannot distinguish these two access frequencies, some policies use all functions in each function routine, rather than a single function, to distinguish routines. Every function in an application has a corresponding function address. By summing all function addresses in a routine, a program context (PC) indicator value is created to represent that routine. In our example, there are four routines in application1 and one routine in application2, so all I/O operations are indicated by distinct PCs. However, summing all function addresses can introduce new problems; for example, computing the indicator over recursive functions results in unbearable overhead. Limiting the number of function candidates in each routine is therefore necessary. Our empirical candidate count is five to eight; five function candidates are enough to identify all routines in the experiments of this article.
Based on the theory above, the data detection process comprises two steps: first, creating a unique PC indicator for every routine, and then determining the pattern of each routine by statistical analysis. After these two steps, each routine pattern is associated with its unique PC indicator. The first step simply sums the limited set of function addresses. The second step, pattern prediction for a PC, is somewhat more complicated. In the 'Data classification module', UHNVM statistically analyzes all pages in every request. There are four pattern classification criteria in UHNVM. First, a group of consecutive, repeatedly accessed pages indicated by the same PC value is tagged as a looping pattern. Second, the sequential pattern means consecutive pages appear only once under one PC. Third, a group of repeatedly accessed pages under the same PC, but not consecutive, is detected as a cluster-hot pattern. Finally, all remaining data are considered random.
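The four criteria reduce to two observable properties per PC: whether its pages form consecutive runs, and whether the same pages recur. A hedged sketch of the decision logic follows; the struct fields and names are our invention, standing in for whatever statistics the real module gathers.

```c
enum pattern { LOOPING, SEQUENTIAL, CLUSTER_HOT, RANDOM };

/* Hypothetical per-PC statistics collected by the classification module. */
struct pc_stats {
    int sequential;  /* nonzero if pages under this PC form consecutive runs */
    int repeated;    /* nonzero if the same pages recur under this PC        */
};

/* Apply the paper's four criteria:
 *   consecutive + repeated  -> looping
 *   consecutive + one-shot  -> sequential
 *   scattered   + repeated  -> cluster-hot
 *   otherwise               -> random                                       */
static enum pattern classify(const struct pc_stats *s)
{
    if (s->sequential && s->repeated) return LOOPING;
    if (s->sequential)                return SEQUENTIAL;
    if (s->repeated)                  return CLUSTER_HOT;
    return RANDOM;
}
```

Looping and cluster-hot PCs are then favored for caching, and sequential PCs are preferred over random ones when Optane space is allocated.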

An Example of Data Distribution Module
To describe the 'Data distribution module' in detail, Figure 7 gives an example of a concrete read request for nine pages of file2. As mentioned earlier, the pages of a file can fall into several patterns; for instance, the Redis AOF log file merges and updates only the multi-operation commands for the same data, not all log commands. To simulate this complicated situation, in this example we place the first four pages (page1 to page4), enclosed by the dotted line, in a looping pattern, which is suitable for Optane DIMM storage. The remaining pages are in a random pattern and should be stored in DRAM space.


Basic Data Structures Illustration
In Figure 7, the two-level hash map is designed to show the mapping between file pages and blocks in Optane. Using the hash map, UHNVM knows which pages are stored in which blocks of the Optane DIMM. A file node in the file list contains several members, such as file_inode and file_path. A page node in the page list has five variables, page_number (4 bytes), pmemblk_num (4 bytes), start_offset (12 bits), end_pos (12 bits), and dirty_flag (1 bit), which together occupy 11 bytes. The meanings of these five variables follow from their names. The file_inode is used to search for and locate files. Since we treat the Optane DIMM as a cache in 4 KB units, start_offset and end_pos need only 12 bits each to cover the 4 KB space. With pmemblk_num, the 'Data distribution module' can locate the block position in the Optane cache. The dirty_flag decides whether a page is discarded or flushed back to SSD/HDD when new pages arrive and the Optane space is full. After consulting the two-level hash map in step 1 of Figure 7, if a page hits in the hash map, the target data can be obtained from the Optane DIMM in step 2; the remaining pages are then read from host DRAM in step 3. Based on the block status in Optane, we build an LRU list to organize all empty blocks in the Optane DIMM, so UHNVM knows which blocks are already occupied (O) and which are empty (E) to accommodate incoming data.
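The page node can be sketched as a C struct with bitfields matching the sizes given above; the field names approximate those in the text, and the 25 bits of flags round the 11 bytes of payload up to 12 bytes under typical alignment.

```c
#include <stdint.h>

/* Sketch of the per-page node in the second hash level, following the
 * field sizes stated in the text (names are approximations). */
struct page_node {
    uint32_t page_number;        /* 4 bytes: page index within the file       */
    uint32_t pmemblk_num;        /* 4 bytes: block index in the Optane pool   */
    uint32_t start_offset : 12;  /* 12 bits: valid-data start within the 4 KB */
    uint32_t end_pos      : 12;  /* 12 bits: valid-data end within the 4 KB   */
    uint32_t dirty_flag   : 1;   /* 1 bit: flush to SSD/HDD before eviction?  */
};
```

Keeping the node this small matters because one node is kept per cached 4 KB page, so metadata overhead stays below 0.3% of the cached data.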

Cache Data Migration Process
In the request example of Figure 7, besides the direct accesses there are two special pages in file2: the fourth page and the seventh page. By our assumption, the pages enclosed by the dotted line are looping-pattern data, so page4 of file2 should be stored in the Optane DIMM; however, it is found in DRAM. Conversely, page7 of file2 should stay in DRAM given its random pattern, yet according to the two-level hash map it currently resides in Optane. For page4, UHNVM first reads it from host DRAM and then moves it into the Optane DIMM; after migration, UHNVM updates the Optane hash map. For page7, UHNVM first reads it from the Optane DIMM and then evicts it toward SSD/HDD. This eviction must consider the page's dirty mark: if the page is clean, it is simply discarded from Optane and the hash map is updated; otherwise, it is flushed back to SSD/HDD first.
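The two fix-ups can be captured by one placement rule per page. The toy model below, entirely our own construction over an in-memory array, mirrors the logic: promote looping pages found in DRAM, evict random pages found in Optane (flushing first if dirty).

```c
#define NPAGES 16
enum loc { IN_DRAM, IN_OPTANE, IN_DISK };

/* Toy model: where each page of the example file currently resides. */
static enum loc where[NPAGES];

/* A looping-pattern page found in DRAM is promoted into Optane; a
 * random-pattern page found in Optane is evicted toward SSD/HDD.
 * The dirty flush is only modeled as a branch here. */
static void fix_placement(int page, int looping, int dirty)
{
    if (looping && where[page] == IN_DRAM) {
        where[page] = IN_OPTANE;     /* promote, then update the hash map */
    } else if (!looping && where[page] == IN_OPTANE) {
        if (dirty) { /* flush back to SSD/HDD before dropping */ }
        where[page] = IN_DISK;       /* evict, then update the hash map   */
    }
}
```

Applied to the example, page4 (looping, in DRAM) moves to Optane and page7 (random, in Optane) is evicted, which is exactly the migration the text walks through.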

Small Access (<4 KB) and File Feature Update in UHNVM
Byte-addressability is available on the Optane DIMM, so UHNVM can access Optane in both byte and block units. However, because of Optane's pattern-dependent behavior and physical limitations, we prefer to avoid such small writes. A block (4 KB) is the UHNVM read unit. A smaller read request (<4 KB) reads the whole block from Optane and then extracts only the requested part. A smaller write (<4 KB) fetches the whole block and modifies the targeted data through a read-modify-write operation. The last concern is updating file metadata, such as the file size and file pointer position changed by a write request, or the file pointer position changed by a read request. All of these file parameters must be updated in real time whenever a data operation happens on the Optane DIMM. UHNVM updates them by invoking the lseek and fallocate functions.
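The sub-block write path can be sketched as follows. This is an illustrative stand-in, not UHNVM's code: a small static array substitutes for the libpmemblk pool, and the helper names are ours.

```c
#include <string.h>
#include <stddef.h>

#define BLK 4096                     /* UHNVM block (I/O) unit */

/* Tiny in-memory stand-in for the 4 KB-block Optane pool. */
static unsigned char pool[4][BLK];

static void blk_read(unsigned b, void *buf)        { memcpy(buf, pool[b], BLK); }
static void blk_write(unsigned b, const void *buf) { memcpy(pool[b], buf, BLK); }

/* A sub-block (<4 KB) write as read-modify-write: fetch the whole block,
 * patch only the requested byte range, then write the whole block back. */
static void small_write(unsigned blkno, size_t off, const void *src, size_t len)
{
    unsigned char buf[BLK];
    blk_read(blkno, buf);            /* read   */
    memcpy(buf + off, src, len);     /* modify */
    blk_write(blkno, buf);           /* write  */
}
```

Because the read and write legs always cover a full 4 KB regardless of `len`, shrinking or growing the request below the block size barely changes the cost, which is the behavior the fio write results in Section 5 exhibit.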

Function Blacklist of UHNVM
Some cases are not suitable for the Optane DIMM, so we build a blacklist of certain interfaces in UHNVM. For instance, the mmap interface performs a direct address mapping rather than a real data movement, so UHNVM should handle it in host DRAM as usual. When an application encounters the mmap interface, UHNVM consults the blacklist and prevents data related to mmap from entering the Optane DIMM. If such data already exist in the Optane DIMM, UHNVM moves them from Optane to SSD/HDD. In other words, data from the mmap interface can only be stored in SSD/HDD as usual, regardless of their patterns. We also allow developers who adopt the AU layer to add the names of special files to this blacklist if they need those files to bypass the Optane DIMM. The blacklist is saved as a configuration file across host power-off. Note that all other Optane metadata, the two-level hash map and the empty-block list, are also saved as configuration files. During host start-up (power-up), all these configuration files are read from the Optane DIMM to configure and set up Optane in UHNVM.
In conclusion, UHNVM is composed of two main components: the 'Data classification module', which identifies data patterns, and the 'Data distribution module', which distributes data to the Optane DIMM or DRAM through our modified interfaces. Beyond these two modules, there are also key data structures for managing the data flow, such as the 'Two-level Hash Map', and the 'Function Blacklist' for bypassing the Optane DIMM for specific data.

Experimental Setup
We have implemented UHNVM in the C programming language as a library containing the commonly used interfaces. Our machine is equipped with a 6-core Intel Core i5-8600 CPU running at 3.6 GHz and 32 GB of DRAM. The Intel Optane DIMM is configured in app direct mode with a capacity of 128 GB. We use two workloads to test the interfaces: a dd utility experiment and a fio experiment. They are designed and analyzed from three angles: (1) the importance of pattern recognition; (2) the influence of different data granularities; (3) the impact of different proportions of Optane DIMM to host DRAM. These experiments show that UHNVM improves the performance of data-intensive applications by balancing Optane DIMM and host DRAM according to our refined data patterns.

dd Utility Experiment
To quantify the benefit of our data pattern recognition for storage performance, we conduct the dd test. The dd application is a command-line utility for Unix and Unix-like operating systems whose main purpose is to convert and copy files. In this test, the ratio of looping data to random data is set to 1:10. The Optane DIMM and host DRAM sizes can be configured arbitrarily; here the size ratio is set to 1:1, with 2 GB of Optane DIMM and 2 GB of host DRAM. We record the elapsed time in milliseconds when running the dd utility with the same instructions. This setting creates the same data size, about 20 GB, for the three designs, 'Random NVM', 'No NVM', and 'Intelligent NVM'. Because the data size is the same, we can compare their performance by completion time. Since the 'No NVM' design uses the traditional configuration, we normalize the other two designs to it. This experiment is designed to prove the necessity of data pattern identification.
The importance of pattern recognition. The first test randomly stores data in the Optane DIMM or host DRAM while the dd application runs. In the second test, the data are stored only in host DRAM; since no Optane DIMM is used, we configure the host DRAM to 4 GB. In the third test, we evaluate the dd application with UHNVM, which can recognize looping data and store it in the Optane DIMM. The performance of these three tests, referred to as 'Random NVM', 'No NVM', and 'Intelligent NVM', respectively, is compared in Figure 8, with results normalized to 'No NVM'. The third test outperforms the second because UHNVM can identify repeated data and keep it in the Optane DIMM without frequent transfers between host DRAM and the storage device; the first test performs worst. Compared to the second test, the elapsed time of the third test is reduced by up to 11%. This comparison proves the importance of pattern recognition, which enables the intelligent use of the Optane DIMM.
Since UHNVM is composed of two parts, the Optane DIMM (NVM) and the traditional DRAM cache, we define the number of cache-hit blocks in the Optane DIMM as N_ho and the number of cache-hit blocks in DRAM as N_hd. The overall cache hit ratio H_r is their sum divided by the total number of blocks N_t: H_r = (N_ho + N_hd) / N_t.
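As a one-line sanity check of the formula (function name ours), the combined hit ratio is simply:

```c
/* Overall hit ratio H_r = (N_ho + N_hd) / N_t over the two-tier cache. */
static double hit_ratio(unsigned long n_ho, unsigned long n_hd, unsigned long n_t)
{
    return n_t ? (double)(n_ho + n_hd) / (double)n_t : 0.0;
}
```

For example, 30 Optane hits and 20 DRAM hits out of 100 accessed blocks give H_r = 0.5.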
From Figure 9, we can observe that 'Intelligent NVM' outperforms the other two designs. The reason is that 'Intelligent NVM' has a better pattern-identification ability, which helps it make more rational decisions on cache prefetching and cache eviction.
In summary, UHNVM can recognize memory accesses with high data reuse and keep them in Optane without worrying about power failure. By contrast, if Optane accesses are dominated by low data reuse, as in the random pattern, the Optane DIMM fails to improve storage efficiency. This experiment proves the importance of pattern recognition.

fio Experiment
In addition to the dd workload, we also evaluate the performance of UHNVM under the fio application to explore the optimal configuration parameters for the Optane DIMM. fio is a representative tool that can spawn a number of processes to perform particular I/O operations specified by the user; it is flexible enough to be run with different interfaces, data patterns, and data granularities. According to the different fio engines, we choose two pairs of interfaces to be replaced by UHNVM, Write/Read and Pwrite/Pread. Two configurations of the Optane DIMM need to be discussed: the data granularity and the ratio of Optane DIMM to host DRAM.
The influence of data granularity. To understand how data granularity affects Optane DIMM performance, we vary the granularity from 1 KB to 8 KB while the Optane DIMM and host DRAM sizes are fixed. Our test workbench consists of a looping-simulated thread (repeated sequential accesses) and a random thread. In Figure 10, the two pairs of interfaces are compared, with all results normalized to the 4 KB unit size. We find that the larger the access unit, the better the performance. For the Write/Pwrite interfaces, UHNVM running with an 8 KB unit is 4.1× and 4× faster, respectively, than with a 1 KB unit. For the Read/Pread interfaces, UHNVM running with an 8 KB unit is 7.3× and 7.1× faster, respectively, than with a 1 KB unit.
From the Write/Pwrite results in Figure 10b,d, we can also see that write performance does not scale linearly as the Read/Pread results in Figure 10a,c do. The reason is that the Optane DIMM in our design operates in 4 KB units (one block). An Optane DIMM write of less than 4 KB consists of reading a block from the Optane DIMM into a buffer, modifying the data in the buffer, and writing the modified block back to a new Optane DIMM block, which we call a read-modify-write operation. As the write granularity increases, there is no obvious speed improvement because the read-modify-write action dominates the entire write operation. The Optane DIMM read, however, is simple and accounts for only a small proportion of an entire read operation, so increasing the read granularity significantly improves read speed. This test shows that data granularity is a key factor in Optane DIMM deployment.

The impact of the proportion of Optane DIMM to host DRAM. To examine the influence of the proportion between Optane DIMM and host DRAM, we vary the ratio of Optane DIMM to host DRAM from 0.1 to 0.5. The test workbench runs one simulated looping thread and two random threads; the looping part takes up about 30% of the workload and the remaining random data about 70%. Figure 11 shows the performance under the different configurations on the x-axis. As shown in Figure 11, the experiment performs best only when the Optane DIMM to host DRAM proportion is configured as 0.3, matching the looping-data proportion (0.3) in the workload; that is why all results are normalized to the 0.3 position on the x-axis. Our data distribution strategy allocates Optane DIMM only to designated patterns, such as the looping and sequential patterns.
In our design, even when Optane DIMM has free space, it is not used for random data, especially random writes. Therefore, for UHNVM, the size ratio between Optane DIMM and host DRAM should be set in advance based on empirical evaluation if the best performance is desired.   For the Pread interface, when Optane DIMM space is insufficient, looping data must be transferred repeatedly between the storage device and Optane DIMM, which incurs greater overhead. For example, in Figure 11a, at the 0.1 position on the x-axis, looping data have to be read from the storage device and written into Optane DIMM frequently. At the 0.4 position, host DRAM is too small, so data must be transferred frequently between the storage device and host DRAM, because Optane DIMM serves only the looping pattern even when it has idle space; this leads to worse performance. The Pwrite interface in Figure 11b behaves the same way, for the same reason.
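The pattern-based placement rule just described can be sketched as below. The partition sizes, the `admit` helper, and the FIFO eviction are illustrative assumptions for exposition; UHNVM's actual classifier and replacement policy are not shown here.

```python
# Sketch of pattern-based admission: blocks classified as looping or
# sequential go to the NVM partition; random blocks go to host DRAM
# even if NVM has free space. Capacities mirror the 0.3 ratio in the
# text (3 NVM blocks : 7 DRAM blocks out of a 10-block budget).
NVM_CAPACITY = 3
DRAM_CAPACITY = 7

nvm, dram = [], []  # simple FIFO partitions, enough for the sketch

def admit(block_id, pattern):
    """Place a cached block according to its classified access pattern."""
    if pattern in ("looping", "sequential"):
        if len(nvm) >= NVM_CAPACITY:
            nvm.pop(0)       # evict the oldest NVM-resident block
        nvm.append(block_id)
    else:
        # Random data never occupies NVM, even when NVM has idle space.
        if len(dram) >= DRAM_CAPACITY:
            dram.pop(0)
        dram.append(block_id)
```

Under this rule, an undersized NVM partition thrashes the looping data (the 0.1 case in Figure 11a), while an oversized one starves DRAM without helping, since random data is never admitted to NVM (the 0.4 case).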
In Figure 12b, the hit ratio increases gradually as the Optane DIMM proportion grows, because UHNVM prefers to keep in Optane DIMM those repeated blocks that will be accessed again, which raises the cache hit ratio. In addition, comparing Figure 12a with Figure 12b shows that the cache hit ratio is not the only factor that benefits I/O performance.   Cache hit ratio comparisons among existing designs. To demonstrate the efficiency of UHNVM, we compare it against three existing designs: LRU, CFLRU, and I/O-Cache. We run the dd utility to create a complex workload composed of looping, sequential, and random patterns, and set the Optane DIMM to DRAM ratio to 1:1. As Figure 13 shows, UHNVM performs better and more stably than the other designs. Note that I/O-Cache suits a sequential-dominant workload rather than one with complicated patterns, whereas UHNVM can identify data patterns in advance, even with a small cache size. Furthermore, as mentioned earlier, UHNVM replaces the traditional linear memory hierarchy with a parallel architecture, which avoids a significant cache overhead: frequent data migration among NVM, DRAM, and SSD/HDD. The existing LRU, CFLRU, and I/O-Cache strategies all follow the traditional linear architecture. Finally, UHNVM has proved its effectiveness in storage systems and in Optane DIMM deployment. To further maximize the benefit of UHNVM, developers should tune the two configuration parameters discussed above according to the workload's patterns.
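The weakness of plain LRU on looping workloads, which the Figure 13 comparison exploits, is easy to demonstrate with a small trace simulation. The sketch below replays a block-access trace through a textbook LRU cache; it stands in for the LRU baseline only, not for UHNVM, CFLRU, or I/O-Cache.

```python
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    """Replay a block-access trace through a plain LRU cache and
    return the hit ratio, i.e. the baseline behavior UHNVM is
    compared against in Figure 13."""
    cache = OrderedDict()
    hits = 0
    for blk in trace:
        if blk in cache:
            hits += 1
            cache.move_to_end(blk)          # refresh recency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least-recently-used
            cache[blk] = True
    return hits / len(trace)

# A looping trace defeats LRU whenever the loop exceeds the cache:
loop = list(range(10)) * 5  # 10 distinct blocks, revisited 5 times
print(lru_hit_ratio(loop, 9))   # loop larger than cache: 0.0 (pure thrash)
print(lru_hit_ratio(loop, 10))  # loop fits: 0.8 (only the first pass misses)
```

A pattern-aware policy that pins the looping blocks, as UHNVM's classifier does, avoids exactly this thrashing even when the overall cache is small.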

Conclusions
In this paper, we propose a universal cache management technique that intelligently classifies data of different patterns and places it in NVM or host DRAM without requiring an understanding of application code. The design is flexible and automated enough to help users deploy NVM devices. To classify data patterns more accurately, we move the program context technique from the conventional kernel space into user space. The results show that UHNVM not only relieves the host buffer burden economically, but also improves the performance of storage systems.

Conflicts of Interest:
The authors declare no conflict of interest.