Practical Enhancement of User Experience in NVMe SSDs

Abstract: When processing I/O requests, the current Linux kernel does not adequately consider the urgency of user-centric tasks, which are closely related to user experience. To solve this problem, we developed a practical method in this study to enhance user experience in a computing environment wherein non-volatile memory express (NVMe) solid-state drives (SSDs) serve as storage devices. In our proposed scheme, user-centric tasks are assigned higher process priorities, their I/O requests are identified in the block I/O layer, and the multi-queue block I/O layer and NVMe device driver are modified so that these requests are serviced as quickly as possible.


Introduction
Non-volatile memory express (NVMe) is an open logical device interface specification for accessing fast storage media, such as solid-state drives (SSDs) [1][2][3][4]. For SSDs to provide high-speed I/O, NVMe supports up to 64K submission and completion queues, each capable of queuing up to 64K commands [5][6][7][8][9][10]. Such a scalable architecture facilitates the full utilization of the internal parallelism of SSDs [11][12][13][14][15][16]. A multi-queue block I/O layer was introduced in the recent Linux kernel to efficiently support NVMe SSDs in the host. This layer uses two levels of queues to improve scalability: the software queues (SWQs) alleviate the lock contention problem in multi-core environments, and the hardware queues (HWQs) support storage devices that provide multiple dispatch queues, such as NVMe SSDs [17][18][19].
In the previous single-queue block I/O layer, all I/O requests originating from tasks running on each CPU core were handled via a single request queue (Figure 1a). This resulted in a performance bottleneck due to lock contention for accessing the single request queue [18,20]. In addition, it could not fully exploit the potential of storage devices that support multiple dispatch queues. To alleviate these problems, the multi-queue block I/O layer employs two levels of queues (Figure 1b): SWQs and HWQs. The I/O requests originating from tasks running on any CPU core are first sent to the SWQ mapped to that core and then moved to an HWQ before being dispatched to the storage device.

This paper presents a practical scheme to enhance user experience by modifying the multi-queue block I/O layer. The main focus of this study is the fast I/O handling of user-centric tasks, such as foreground or interactive tasks, owing to their large impact on user experience [25]. The current Linux kernel does not adequately consider the urgency of user-centric tasks, especially when they issue I/O requests. To solve this problem, we first assigned higher process priorities to the user-centric tasks and identified the I/O requests originating from them in the block I/O layer. Subsequently, the structure of the multi-queue block I/O layer was modified to handle these I/O requests as quickly as possible. In particular, an NVMe feature that dispatches I/O requests from multiple SQs in a round-robin fashion is considered. The results obtained from various experiments demonstrate that the proposed scheme significantly enhances user experience from various perspectives.
The remainder of this paper is organized as follows. Related works and the details of our proposed scheme are presented in Sections 2 and 3, respectively. In Section 4, we present the evaluation results of our proposed scheme obtained with an I/O benchmark tool (fio, the flexible I/O tester) and five open-source programs. The paper is then concluded in Section 5.


Related Works
Several studies on improving the multi-queue block I/O layer for NVMe SSDs have been reported in the literature. Joshi et al. [17] implemented a mechanism that supports four I/O priorities using two features. One feature is the I/O scheduling classes of Linux that consist of real-time, best-effort, none, and idle; the other feature is the WRR of the NVMe SSDs, a method for NVMe SSDs to retrieve more I/O requests from SQs with higher priorities. The authors increased the number of SQs in a single set allocated to each CPU core from 1 to 4, and each SQ belonging to the set had one of the following SQ priorities: urgent, high, medium, and low. By mapping the I/O scheduling classes to the SQ priorities, differentiated I/O services were provided according to the I/O classes.
Lee et al. [21] solved a write interference problem, a situation in which a small number of write requests in a read-intensive workload negatively affects the performance of the workload. The problem was solved by splitting I/O requests and inserting them into different SWQs according to the I/O type, that is, read or write. The I/O requests isolated in different SWQs are also sent to different HWQs and SQs. This approach alleviates the write interference and consequently increases read performance by 33%.
Qian et al. [12] analyzed runtime behaviors in non-uniform memory access (NUMA) architectures consisting of multiple CPUs and NVMe SSDs in terms of I/O performance and energy efficiency. Based on this analysis, the authors proposed an energy-efficient I/O scheduler that manages the I/O threads accessing NVMe SSDs, not only to reduce energy consumption and CPU usage, but also to guarantee I/O throughput and latency. Ahn et al. [26] also studied NUMA-based systems. The authors proposed an I/O resource management technique, weight-based dynamic throttling, to facilitate efficient sharing of I/O resources in the Linux cgroup on NUMA multi-core systems that use high-performance NVMe SSDs.
Kim et al. [27] solved the problem of the multi-queue block layer of the current Linux kernel being unable to reflect process priority when a process requests I/O operations sent to NVMe SSDs. The authors added queues between the existing SWQs and HWQs to hold I/O requests that were issued by processes but lack the opportunity (referred to as a token in their paper) to be sent to the NVMe SSD at that point. Considering these works, it is clear that several studies have solved various problems, especially those caused by the structure of the current multi-queue block layer. However, no study has proposed a Linux-kernel-level solution that gives I/O-intensive, user-centric tasks better service when the SSD is already processing a large number of I/O requests issued by non-user-centric tasks.

Redesign of the Multi-Queue Block I/O Layer to Improve User Experience
This section describes the redesign of the multi-queue block I/O layer to improve user experience. The exclusive focus of the multi-queue block I/O layer on I/O bandwidth may result in a bad user experience. To address this concern, this study aims to optimize the I/O processing time of user-centric tasks by swiftly sending the I/O requests they issue through the complex multi-queue block I/O layer to the SSD device. We first assign a higher process priority to user-centric tasks than to non-user-centric or background tasks. When a program is launched for the first time or a program running in the background switches to the foreground, the user-centric tasks are automatically assigned a high process priority via a modified shell program. The modified shell can differentiate foreground and background tasks and easily modify the process priority by using the setpriority() system call. This facilitates a faster execution of user-centric tasks compared to non-user-centric ones through the task scheduling of a CPU scheduler, such as the completely fair scheduler. Consequently, user-centric tasks issue their I/O requests to the multi-queue block I/O layer sooner [28,29].
However, the current Linux kernel does not support I/O services differentiated by process priority. Thus, this approach alone is inadequate for preferentially processing the I/O requests issued by user-centric tasks. Furthermore, process priority information disappears at the level of the block I/O layer by default. As the first step toward preferentially processing I/O requests from user-centric tasks, we passed the priority information to the block I/O layer by adding it to bio and request, the basic structures used for I/O processing in the multi-queue block I/O layer. Once the I/O requests in the SWQ are passed to the SQ via the HWQ, the host loses control over them. To process the I/O requests issued by user-centric tasks first, we divide the SWQ of every core into two: one queue for I/O requests from user-centric tasks, and the other for I/O requests from non-user-centric tasks. By referring to the priority information passed through the bio and request structures, each I/O request is sent to the appropriate SWQ. If there are I/O requests in the SWQ for user-centric tasks, they are moved to the HWQ first.
I/O requests located in the HWQ are moved to the SQ immediately if there is sufficient space in the SQ. If there are other pending I/O requests in the HWQ and/or SQ, such as HWQ 2 and SQ 2 in Figure 2, the I/O requests from user-centric tasks, such as T U in Figure 2, cannot be served until the other I/O requests are retrieved from the queue holding them and processed. Moreover, NVMe SSDs typically dispatch I/O requests from multiple SQs in a round-robin fashion. Therefore, the I/O requests from user-centric tasks should be moved to the HWQ and SQ with the smallest number of pending I/O requests so that they are processed in minimal time. To this end, we first modified the NVMe device driver of the Linux kernel, as the current driver cannot obtain the number of I/O requests pending in an SQ. We measure this number using two pieces of information for each SQ: the head, which is recorded in the SQ head pointer field of the CQ entry, and the tail, which is managed by the NVMe device driver.

Suppose that a user-centric task T U is running on CPU 2, and other tasks, denoted T NU, are simultaneously running on other CPUs. In this example, a single I/O request issued by T U is passed to a separate software queue SWQ U 2 that is assigned to handle user-centric tasks and mapped to CPU 2. In the original kernel (Figure 2), an I/O request issued by T U would wait until all I/O requests pending in the SWQ, HWQ, and SQ are processed. In our proposed scheme, by contrast, the I/O request does not wait in SWQ NU 2, as it uses SWQ U 2, which is dedicated to user-centric tasks. In addition, unlike in the original kernel, it is migrated to HWQ 1 and SQ 1 instead of HWQ 2 and SQ 2, as SQ 1 has the smallest number of pending I/O requests. Consequently, the I/O request is processed swiftly compared to the other I/O requests.

As our scheme tries to process I/O requests issued by user-centric tasks first, there may be a concern that I/O requests issued by non-user-centric tasks could suffer from starvation if the SSD is heavily loaded. However, I/O requests from a user-centric task are directed to the HWQ and SQ with the shortest queue length each time, so I/O requests from non-user-centric tasks can go to the other queues instead. In addition, as all SQs are dispatched in a round-robin or weighted round-robin fashion, pending I/O requests from non-user-centric tasks are eventually processed. In summary, even if the SSD is overburdened and a user-centric task issues I/O requests continuously, the I/O requests issued by non-user-centric tasks may wait for a long time, but no infinite waiting occurs.

Details of the operations in our modified multi-queue block I/O layer are shown in Figure 4. After an I/O request from a task running on the n-th CPU reaches the block I/O layer, a bio, a data structure describing a single I/O operation, is turned into a new request (r n) or merged into an already existing request. The block I/O layer then determines whether r n was requested by a user-centric process. If so, the x-th SQ (SQ x), which contains the smallest number of pending I/O requests, is selected as the target SQ for the NVMe I/O command of r n instead of the initially mapped SQ. This early selection of the target SQ x is possible because the current kernel determines the SWQ, HWQ, and SQ for r n at this level by default. Subsequently, r n is enqueued to HWQ x, which is mapped to SQ x, via the n-th SWQ for user-centric processes (SWQ U n). After being dequeued from HWQ x, at the level of the NVMe device driver, r n is converted into the NVMe I/O command format and enqueued to SQ x if it is not full. The NVMe device driver finally announces the insertion by updating the doorbell for SQ x. If r n is requested by a non-user-centric process, it is enqueued to SWQ NU n. Before r n is enqueued to HWQ n, the layer checks whether there are pending I/O requests in the SWQ for user-centric processes, SWQ U n. If they exist, the I/O requests located in SWQ U n are dequeued first and enqueued to HWQ n so that they are processed before the I/O requests from non-user-centric processes. Owing to these approaches, the modified block I/O layer can process I/O requests issued by user-centric processes faster.

Performance Evaluation
The experimental environment is presented in Table 1. To emulate I/O-intensive applications, we used the fio benchmark tool, which is widely used for generating I/O workloads with various configurations [30,31]. To generate a sufficient I/O load, 50 fio tasks continuously generating random read requests were executed: one of them was set as a user-centric task, and the others were set as non-user-centric tasks. As mentioned earlier, the type of each task is determined by its priority. To verify the effectiveness of the ideas employed in the proposed scheme, performance evaluations were performed under the various combinations of ideas described in Table 2. Note that we repeated all experiments 10 times to make the results reliable.

Figure 5 depicts the performance of the proposed scheme in terms of execution time, input/output operations per second (IOPS), and I/O bandwidth. The plot shows only the average values of the repeated experiments because the deviation is negligibly small. It can be observed that merely increasing the priority of user-centric tasks resulted in a significant performance boost (high-priority). In this case, the execution time, IOPS, and I/O bandwidth of user-centric tasks improved by 10.50%, 11.79%, and 11.49%, respectively, compared to the original kernel. In addition, when additional SQs were assigned to handle user-centric tasks (separated) and when I/O requests from user-centric tasks were sent to the shortest submission queue (shortest), all metrics improved by up to 14.62% and 16.75%, respectively. When all ideas were employed together (proposed), all metrics of user-centric tasks improved by up to 19.54%. The proposed scheme also improved the performance of non-user-centric tasks by up to 2.89%. This appears to be because, in our experimental environment, after the fio task executed as a user-centric task outputs its results and exits, the fio tasks executed as non-user-centric tasks use the remaining resources.
This behavior can also be observed in the average I/O latency, which was measured in the block I/O layer. The average latency of I/O requests from user-centric tasks improved from 23.57 µs to 20.53 µs, while that of I/O requests from non-user-centric tasks slightly improved from 23.57 µs to 22.20 µs.
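A workload along the lines described above might be expressed as the following fio jobfile. This is a hypothetical reconstruction: the paper does not publish its exact fio parameters, so the device path, block size, I/O engine, runtime, and nice values are assumptions.

```ini
; 50 jobs issuing continuous random reads; one high-priority (user-centric)
; job and 49 low-priority (non-user-centric) jobs.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
runtime=60
time_based

[user-centric]
filename=/dev/nvme0n1
numjobs=1
nice=-5         ; run as a high-priority task (requires privileges)

[non-user-centric]
filename=/dev/nvme0n1
numjobs=49
nice=5          ; run as low-priority background tasks
```

fio's per-job nice option mirrors the shell-based priority assignment, so the modified kernel classifies the two job groups differently.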

Notation       Description
original       Evaluates with an unmodified kernel and shell program
high-priority  Raises the process priority of the user-centric tasks through the modified shell program
separated      Provides a separate software queue for user-centric tasks, based on high-priority
shortest       Delivers I/O requests from user-centric tasks to the shortest SQ, based on high-priority
proposed       Includes all ideas: high-priority, separated, and shortest

Figure 6 depicts the launch times of five widely used Linux programs (Table 3), a crucial metric for users. As they are rather large programs with graphical windows, many files, such as executables, configuration files, and libraries, must be read at start-up. This inevitably entails a long latency and can make users wait longer when the storage is overburdened. We measured the launch times while several fio workloads ran in the background. The start time of an application can be measured by monitoring when exec() is called in the shell, but it is not easy to clearly define and measure the completion of a launch. In our experiments, as all the applications use windows, we measured the completion of a launch by monitoring window creation with wmctrl. Compared to the original scheme, the proposed scheme improved the launch times of the five programs on average by 28.42%, 48.71%, 63.54%, 59.83%, and 65.13%, respectively. As shown in the figure, the deviation of the launch times is quite large depending on the system state, but the proposed scheme consistently shows a significant performance improvement in all experiments. As the launch time significantly affects user experience, these improvements are more substantial than those of the previous experiments, which concerned accumulated performance.
Table 3. Target programs to measure launch time.

Program  Description
firefox  Open-source web browser supporting multiple platforms [32]
totem    GNOME's desktop movie player [33]
writer   LibreOffice's word processor [34]
calc     LibreOffice's spreadsheet program [34]
impress  LibreOffice's presentation program [34]


Conclusions
This paper presents a scheme to enhance user experience in a computing environment using NVMe SSDs as storage devices. By assigning a higher priority to user-centric tasks and modifying the shell, the multi-queue block I/O layer, and the NVMe device driver, I/O requests issued by user-centric tasks can be preferentially serviced. The results of the various experiments performed in this study reveal that the proposed scheme significantly improves user experience by giving user-centric tasks higher priority in terms of I/O processing. In the future, we will continue to study operating-system-level support for improving user satisfaction by prioritizing user-centric tasks. As discussed in the performance evaluation section regarding the effect of process priority, more refined CPU-scheduling-level support is important for improving the response time of user-centric tasks in a multi-core environment. In addition, memory allocation for user-centric tasks can be delayed by non-user-centric tasks due to lock contention, so it should also be considered when optimizing response time. We believe that a synthetic analysis of the relationships among the different layers in complicated environments with multiple cores and storage devices with many queues is required, and that a cross-layer design considering all of them remains an open problem that should be studied continuously.