NetAP: Adaptive Polling Technique for Network Packet Processing in Virtualized Environments

Abstract: In cloud systems, computing resources, such as the CPU, memory, network, and storage devices, are virtualized and shared by multiple users. In recent decades, methods to virtualize these resources efficiently have been intensively studied. Nevertheless, the current virtualization techniques cannot achieve effective I/O virtualization when packets are transferred between a virtual machine and a host system. For example, VirtIO, which is a network device driver for KVM-based virtualization, adopts an interrupt-based packet-delivery mechanism, and incurs frequent switch overheads between the virtual machine and the host system. Therefore, VirtIO wastes valuable CPU resources and decreases network performance. To address this limitation, this paper proposes an adaptive polling-based network I/O processing technique, called NetAP, for virtualized environments. NetAP processes network requests via a periodic polling-based mechanism. For this purpose, NetAP adopts the golden-section search algorithm to determine the near-optimal polling interval for various workloads with different characteristics. We implemented NetAP in a Linux kernel and evaluated it with up to six virtual machines. The evaluation results show that NetAP can improve the network performance of virtual machines by up to 31.16%, while only using 32.92% of the host CPU time used by VirtIO for packet processing.


Introduction
Artificial intelligence, cloud computing, big data, and the Internet of Things are the main technologies of the fourth industrial revolution. The importance of these technologies is emerging as an international issue that requires significant involvement by academia and industry. Cloud computing provides the main infrastructure of the fourth industrial revolution; it lays the foundation for other major technologies. The computing environment of enterprises and individuals is rapidly moving to the cloud. Many companies already have their own cloud infrastructures, and others are using public cloud platforms, such as Amazon EC2. Moreover, several artificial intelligence services, big data processing, and Internet of Things applications are currently performed on the cloud.
Virtualization is a key technology that enables and sustains cloud computing. Through virtualization, computing resources, such as the CPU, memory, storage, and network devices, of multiple physical nodes are consolidated and managed. Virtualization provides resources that are required by users in the form of virtual machines (VMs) or containers. However, because the resources are virtualized, applications running in cloud computing experience performance deterioration, in contrast with applications executed in a native environment. The virtualization overhead for CPUs has been addressed with some success, but the overhead for I/O devices, such as network and storage devices, remains. This overhead is a major bottleneck for overall system performance [1,2].
The efficient virtualization of I/O devices is a critical factor in cloud computing. In a cloud system, many users share a small number of I/O devices. If the virtualization overhead increases in proportion to the number of users, the efficiency of the entire system decreases. Cloud computing service providers then need to install additional equipment to meet user demand. This situation leads to a decrease in competitiveness in cloud computing with an increase in the overall cost.
Another issue in I/O virtualization is that the virtualization overhead may appear in the form of CPU consumption. The current virtualization techniques are heavily supported by hardware, but the role of processing I/O devices is still often handled by software. For example, Linux incorporates a kernel-based virtual machine (KVM) virtualization software and adopts an I/O driver for virtualization, called VirtIO [3,4]. VirtIO is responsible for processing requests and responses to and from I/O devices. Therefore, if VirtIO inefficiently uses the CPU for processing I/O operations, it results in additional cost by wasting CPU resources that might otherwise be provided to users in the cloud system.
In this study, we implement NetAP, a novel network I/O processing method for virtualized systems in KVM-virtualized environments. NetAP introduces a periodic polling technique in VirtIO in order to maximize the use of network and CPU resources. The existing VirtIO processes I/O requests in a batch manner when the number of requests is above a certain threshold. Upon meeting the condition, the VM generates a virtual interrupt (vIRQ) to notify VirtIO to process the requests. However, this mechanism causes expensive VM-Exit and VM-Entry operations, which decreases the I/O performance [5,6]. The periodic polling technique of NetAP regularly monitors the I/O request buffers and handles the network requests in a timely manner. This mechanism prevents vIRQs from the VM, because the requests in the buffer are processed before the number of requests reaches the threshold.
NetAP uses the golden-section search algorithm to dynamically find the near-optimal polling interval for various workloads. The golden-section search technique helps NetAP to find the maximum or minimum value within a given range of a unimodal function. The performance evaluation results of NetAP show that it successfully finds a near-optimal polling interval for various scenarios consisting of multiple VMs. NetAP improves the network performance by up to 31.16%, while only using 32.93% of the CPU time that the original VirtIO used for packet processing in the host system.
Our contributions can be summarized as follows: • We provide a new I/O processing mechanism that is based on adaptive periodical polling; this method uses the network and CPU more efficiently than traditional interrupt-based mechanisms in virtualized environments. • We provide a fast and accurate adaptation technique that is based on the golden-section search algorithm. Our evaluation demonstrates that the adaptation technique successfully finds a near-optimal polling interval that maximizes network and CPU utilization for various workloads.
This paper is organized as follows. In Section 2, we provide background information about VirtIO and describe the related work. Section 3 presents the design of NetAP, including the new polling-based mechanism for I/O processing and the adaptation mechanism that is based on the golden-section search algorithm. The experimental results are presented in Section 4. Section 5 presents our conclusions.

VirtIO
VirtIO modules are para-virtualized drivers for virtualized systems based on KVM; such modules are widely used for network and block devices in KVM. Figure 1 shows the structure of VirtIO with request and response flows. When a user application running in a guest OS requests a packet transmission via a system call, the network stack of the guest OS prepares a packet for transmission and delivers the packet to the VirtIO front-end device driver, which is a virtual network device. The front-end driver uses shared memory for data communication between the VM and the host; the network packets are buffered into the queue in the shared memory region. A vIRQ is then generated in order to notify the host OS that packets are ready for transmission. The vIRQ is generated when the number of packets in the queue exceeds a certain threshold, rather than generating a vIRQ for each packet. Upon receiving the vIRQ, the host OS schedules the virtual host (vHost) user thread to deliver the packet fetched from the shared queue to the network stack of the host in kernel space. The vHost thread adopts a combined interrupt-polling method, which combines the interrupt and polling mechanisms. After the first vIRQ interrupt, the vHost thread fetches consecutive packets from the shared queue with a polling mechanism and then delivers them to the network stack. The fetched packets are sent to the MacVTap bridge, which is a virtual interface that transmits packets through the physical device. Packet reception is similarly performed in the opposite direction.
The main overhead occurs when the VM requests packet processing from the host. The VM notifies the host system, using a vIRQ, that the packet to be processed is ready in the shared memory. The host system then receives the interrupt and wakes up the sleeping vHost thread to perform packet processing. Each notification transfers control between the VM and KVM through VM-Exit and VM-Entry operations, and their frequent occurrence incurs a substantial virtualization overhead [5,6].

Interrupt-Based and Polling-Based I/O Processing Techniques
Interrupt-based and polling-based I/O processing techniques have complementary advantages and disadvantages, and they are used selectively depending on the workload characteristics. For example, G. Lettieri et al. analyzed the producer-consumer model in I/O processing in a virtualized environment [7]. They compared the busy-waiting method and the sleeping method. The busy-waiting method uses a polling-based technique that continuously processes network packets on a dedicated CPU core. The sleeping method periodically executes polling operations in order to process the network packets stored in a buffer [7]. They found that the most efficient I/O processing method depends on the characteristics of the workload. Specifically, they found that the I/O processing efficiency of the sleeping method varied with the period of the polling operations, which motivated our research.
Recently, high-speed I/O devices, such as 10 GbE and NVMe SSDs, have emerged. As interrupts occur at very high speeds in these devices, the overhead of interrupt delivery becomes a bottleneck in the I/O processing path [8]. To alleviate the overhead, busy-waiting polling techniques are used in many systems that focus on high-speed network processing [9,10]. The busy-waiting techniques successfully improve network performance by processing incoming network packets continuously. However, more than one CPU core should be dedicated to the busy-waiting workers for packet processing. The workers occupy the dedicated CPU cores even though there are no incoming packets. This leads to inefficient CPU management, because other processes cannot utilize the dedicated CPU cores. This paper proposes an adaptive polling technique that occupies CPU cores when there are incoming packets in order to overcome the inefficient CPU management of busy-waiting techniques.
There have been similar approaches to enhancing the busy-waiting techniques for storage devices. For example, T.Y. Kim et al. proposed the use of a periodic polling technique for NVMe SSDs [11]. However, their evaluation results show that the periodic polling technique is less efficient than interrupt handling. This is because the storage device used in their research only handles 400 K requests per second. In contrast, this paper focuses on a 10 GbE network device that handles 14,880 K requests per second for 64 B packets, which is more than 30 times the request rate of the storage device. Even for 1500-B packets, the network device handles 812 K requests per second, which imposes more interrupt-processing overhead than in the case of the storage device. Additionally, previous engineering work suggests adaptive polling, which is similar to NetAP, for block devices in KVM environments [12]. That work limits the time spent in the busy-waiting polling algorithm and increases or decreases the polling duration through a simple conditional loop. When an I/O event occurs, polling starts for a certain duration, for example, 32 us (microseconds). If additional I/O events occur, the polling duration increases. Otherwise, the polling duration decreases. This simple optimization does not guarantee finding the optimal polling time, and it lacks a mathematical proof and a runtime analysis, unlike the golden-section search algorithm that we adopt. NetAP determines the near-optimal polling interval using the golden-section search algorithm, which converges in a bounded, logarithmic number of steps.
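The request rates quoted above follow from Ethernet line-rate arithmetic. The sketch below is a minimal illustration, assuming the standard 20 B of per-frame wire overhead (8 B preamble plus 12 B inter-frame gap) and, for the 1500-B case, that 1500 B is the payload of a 1518-B frame. The function and constant names are illustrative and do not come from NetAP.

```python
def max_frames_per_second(link_bps: int, frame_bytes: int) -> int:
    """Upper bound on frames per second at line rate (integer division)."""
    wire_overhead_bytes = 20  # assumption: 8 B preamble + 12 B inter-frame gap
    bits_per_frame = (frame_bytes + wire_overhead_bytes) * 8
    return link_bps // bits_per_frame


TEN_GBE = 10_000_000_000  # 10 Gb/s link

small = max_frames_per_second(TEN_GBE, 64)    # minimum-size frames
large = max_frames_per_second(TEN_GBE, 1518)  # assumption: 1500 B payload + 18 B header/FCS
print(small, large)
```

Under these assumptions, the two results are consistent with the 14,880 K and 812 K requests per second cited above.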

Efficient I/O Processing in a Virtualized Environment
Research for improving the I/O processing performance in virtualized environments can be categorized into two types: one is to use additional hardware devices to increase I/O processing performance and the other is to reduce the data-copying overhead in I/O processing.
First, single root I/O virtualization (SR-IOV) is a technique that allows VMs to directly access physical network devices without the intervention of the host system [13][14][15]. Allowing the VMs to process I/O requests directly increases the I/O processing performance significantly. However, in SR-IOV, I/O processing is based on interrupts; therefore, it incurs a large overhead when the interrupts occur at high speeds. To reduce the overhead, there have been attempts to minimize the VM-Exit and VM-Entry operations using software methods [16] or hardware methods [17]. Furthermore, previous studies propose assigning specific CPU cores to dedicated vHost threads to mitigate the interrupt handling overhead [18,19]. However, assigning CPU cores for I/O processing degrades the CPU utilization at moderate I/O processing rates [20]. Thus, in existing cloud environments, SR-IOV is only used for users who require very high or stable network performance at high prices.
The second approach is to reduce the data-copying overhead. To this end, one study proposed a zero-copy technique that eliminates memory copying in the packet forwarding path to speed up processing and reduce CPU usage [21]. This approach is complementary to the proposed method: using the adaptive polling of NetAP together with a zero-copy method would be even more efficient.

Research for Optimizing the Polling Interval
Many previous studies have tried to overcome the limitations of interrupt-based and polling-based techniques. Some studies [22] suggest changing the period of polling operations by creating and evaluating a simple cyclic transformation equation through modeling and simulation. They focus on general network processing, rather than on I/O processing in virtualized environments. Hence, the results of previous studies [22][23][24] are difficult to apply to virtualized environments. Furthermore, in a real system environment, there are other considerations, such as the period of performance monitoring and the change in the packet-processing speed of the VM, which are not included in the previous studies. Therefore, more in-depth research is necessary in order to optimize the polling intervals in virtual environments. This study provides a search method to optimize the polling interval and a detailed analysis of the period of performance monitoring, which is essential for the search method.

Design
This section describes the design goals, the periodic polling technique, and the algorithm for determining the optimal polling interval of NetAP.

Design Goals
The design goals of NetAP include improving deployability and obtaining the optimal polling interval for each VM, which are explained as follows. First, to improve deployability, NetAP only modifies the vHost thread of the host system. If we modified the front-end device driver of each guest OS, users would have to re-compile the front-end driver and re-install the driver module into the guest OS. Therefore, by modifying only the vHost thread of the host OS, we preserve the interface with an unmodified front-end driver.
Second, NetAP tries to obtain the optimal interval for each VM to reflect its individual workload characteristics. A vHost thread is dedicated to its own VM and it fetches network requests from its dedicated shared queue. Because each VM generates and consumes packets at a different speed, each vHost thread needs to have different polling intervals. Therefore, we apply a novel algorithm to determine the polling interval for each vHost thread.

Periodical Polling for Network Packet Processing
NetAP develops a periodic polling technique in the vHost thread to maximize the utilization of network and CPU resources. The modified vHost thread in NetAP periodically polls the shared queue instead of waiting for a vIRQ from the front-end device driver. At each polling interval, the vHost thread is scheduled, and it processes packets sent from the guest OS. After the vHost thread completes packet processing, it sleeps until the next polling interval. During this sleep time, the guest OS generates packets, and they are buffered into the shared queue. At the next interval, the vHost thread wakes up and processes the buffered packets in the shared queue.
The main difference between NetAP and the original VirtIO is that NetAP frequently monitors the shared queue and proactively handles the buffered packets before a vIRQ is generated from the front-end driver. In original VirtIO, the front-end driver produces packets and inserts them into the shared queue. When the number of buffered packets in the queue is more than a certain threshold, a vIRQ is delivered to the vHost thread to wake up the thread. However, NetAP processes the buffered packets at an appropriate rate with periodic polling before the number of packets exceeds the threshold. Therefore, the front-end driver does not generate vIRQs. This mechanism can prevent expensive VM-Exit and VM-Entry operations and, therefore, improve network performance with efficient CPU use.
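The effect of polling before the threshold is reached can be illustrated with a toy, tick-based queue model. This is a sketch of the mechanism as described above, not of the VirtIO implementation: the threshold of 64 packets, the arrival rate of one packet per tick, and all names are hypothetical.

```python
def count_virqs(ticks, arrivals_per_tick, threshold, poll_every=None):
    """Count the notifications the guest raises in a toy shared-queue model.

    Each tick the guest enqueues `arrivals_per_tick` packets. If a poller is
    configured, it drains the queue every `poll_every` ticks. Otherwise, or
    between polls, the guest raises a vIRQ once the queue exceeds `threshold`,
    and the host then drains the queue.
    """
    queue = 0
    virqs = 0
    for tick in range(1, ticks + 1):
        queue += arrivals_per_tick
        if poll_every is not None and tick % poll_every == 0:
            queue = 0            # periodic poll drains the queue, no vIRQ needed
        elif queue > threshold:
            virqs += 1           # threshold exceeded: guest notifies the host
            queue = 0
    return virqs


interrupt_only = count_virqs(1000, 1, threshold=64)
fast_polling = count_virqs(1000, 1, threshold=64, poll_every=50)
slow_polling = count_virqs(1000, 1, threshold=64, poll_every=100)
print(interrupt_only, fast_polling, slow_polling)  # 15 0 10
```

In this model, a polling interval shorter than the time needed to fill the queue to the threshold removes the notifications entirely, whereas an interval that is too long only removes some of them. This is why the interval has to match the rate at which each VM produces packets.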

Periodic Polling Algorithm
Algorithm 1 details the periodic polling technique.
The core function is the_packet_handler_of_a_host_system(), which processes packets in a continuous loop. In each iteration of the loop, it processes the packets loaded in the RX and TX queues (lines 21 and 22), sleeps for an interval (line 24), and starts the next iteration. Until it sleeps, interrupts are disabled for packet arrivals (lines 8 and 23). A round is a time unit for observing performance. To adapt NetAP to a given workload, we change the polling interval for each round according to the performance observed in the previous round. Furthermore, round_length is the predetermined length of a round (e.g., one second). Line 14 sets the start time of the next round by adding round_length to current_time, and line 9 checks the time. At the start of a new round, the interval is updated by running the golden_section_searching() function on line 10. The parameters are the number of the current round, the total size of the packets processed in the round (bytes), and the current interval. The function is explained in detail in the next section.
The algorithm also provides a way to restart adaptation in response to changes in the workload characteristics after adaptation has completed. When the interval is no longer being updated, line 11 checks whether the size of the transferred data has changed by more than the predefined ratio MinChange. Line 17 checks whether the current round exceeds the predefined maximum number of rounds, MaxRound. If either condition is satisfied, adaptation is started again by setting the round number to 0. This allows NetAP to keep tracking the appropriate interval as the workload characteristics of the running VM change.
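The loop described above can be sketched in user-space Python as follows. This is a simplified illustration under stated assumptions, not the kernel implementation: FakeClock, drain_queues, and update_interval are hypothetical stand-ins for the wall clock, the RX/TX queue processing, and golden_section_searching(), and interrupt masking is omitted.

```python
class FakeClock:
    """Deterministic stand-in for a wall clock, counted in microseconds."""

    def __init__(self):
        self.now_us = 0

    def time(self):
        return self.now_us

    def sleep(self, us):
        self.now_us += us


def packet_handler(clock, drain_queues, update_interval, *, interval_us,
                   round_length_us, max_round, min_change, run_for_us):
    """Poll periodically; re-tune the interval once per round."""
    round_no = 0
    bytes_in_round = 0
    prev_bytes = 0
    next_time = clock.time() + round_length_us
    history = []                                   # interval chosen at each round
    while clock.time() < run_for_us:
        if clock.time() >= next_time:              # a new round starts
            new_interval = update_interval(round_no, bytes_in_round, interval_us)
            settled = new_interval == interval_us  # interval no longer updated
            shifted = prev_bytes > 0 and (
                abs(bytes_in_round - prev_bytes) > min_change * prev_bytes)
            if (settled and shifted) or round_no >= max_round:
                round_no = 0                       # restart adaptation
            else:
                round_no += 1
            interval_us = new_interval
            history.append(interval_us)
            prev_bytes, bytes_in_round = bytes_in_round, 0
            next_time = clock.time() + round_length_us
        bytes_in_round += drain_queues()           # process RX and TX queues
        clock.sleep(interval_us)                   # sleep until the next poll
    return history, round_no


history, final_round = packet_handler(
    FakeClock(),
    drain_queues=lambda: 100,                                     # 100 B per poll
    update_interval=lambda rnd, nbytes, cur: max(250, cur // 2),  # toy update rule
    interval_us=1000, round_length_us=10_000,
    max_round=100, min_change=0.5, run_for_us=50_000)
print(history, final_round)  # [500, 250, 250, 250] 1
```

In this run, the toy update rule settles at 250 us in the third round. Because the transferred bytes doubled relative to the previous round, the MinChange condition resets the round counter once, and the counter then advances normally.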

Adaptation Using Golden-Section Search
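We use the golden-section search algorithm to find the near-optimal polling interval for various workloads dynamically. The golden-section search technique searches for the maximum or minimum value within a given range of a unimodal function [25]. The method to find the maximum value of the target function $f(x)$ is as follows. First, we set two x-coordinates, $x_l$ and $x_u$, that bound the search. We assume that the maximum value lies between $x_l$ and $x_u$. Second, we choose an intermediate x-coordinate that divides the range between $x_l$ and $x_u$ according to the golden ratio. These three x-coordinates are golden-section triplets. Third, we set a new x-coordinate inside the larger of the two sub-ranges, again according to the golden ratio, and compare the values of $f$ at the two interior x-coordinates. The interior x-coordinate with the larger value and its two neighbors are chosen as the new, narrower golden-section triplet. Subsequently, we iterate this process until the outer x-coordinates are closer than a predetermined minimal distance, at which point the algorithm terminates.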
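The procedure can be sketched as a textbook golden-section maximizer. This is the standard formulation rather than the NetAP kernel code; the function name, the quadratic test function, and the tolerance are illustrative assumptions.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618


def golden_section_max(f, lower, upper, tolerance):
    """Return an x close to the maximizer of a unimodal f on [lower, upper]."""
    distance = INV_PHI * (upper - lower)
    x_hi = lower + distance       # upper interior point
    x_lo = upper - distance       # lower interior point
    f_hi, f_lo = f(x_hi), f(x_lo)
    while upper - lower > tolerance:
        if f_hi > f_lo:
            # The maximum lies in [x_lo, upper]; the old x_hi becomes the new x_lo.
            lower = x_lo
            x_lo, f_lo = x_hi, f_hi
            x_hi = lower + INV_PHI * (upper - lower)
            f_hi = f(x_hi)        # only one new evaluation per iteration
        else:
            # The maximum lies in [lower, x_hi]; the old x_lo becomes the new x_hi.
            upper = x_hi
            x_hi, f_hi = x_lo, f_lo
            x_lo = upper - INV_PHI * (upper - lower)
            f_lo = f(x_lo)
    return (lower + upper) / 2


best = golden_section_max(lambda x: -(x - 1500.0) ** 2, 100.0, 4000.0, 1.0)
print(round(best))
```

Because the interior points are placed by the golden ratio, one of them can be reused after every shrink, so each iteration costs a single new evaluation of the target function.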
The main advantage of the golden-section search algorithm is its efficiency. Evaluating $f(x)$ takes considerable time for real-world workloads, and the golden-section algorithm only requires one additional evaluation for each iteration. Moreover, the complexity of the algorithm is logarithmic. The big-O notation of the algorithm is shown in Equation (1), where N is the number of iterations required to find the solution [26,27].
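The logarithmic cost follows from the fixed factor by which each iteration shrinks the search range. The short derivation below uses the standard golden-section argument. The symbols $L_0$ (initial range), $L_N$ (range after $N$ iterations), and $\epsilon$ (termination distance) are introduced here for illustration, and the numerical example assumes a range of 100 to 4000 us with a termination distance of 30 us.

```latex
\begin{align}
L_N &= R^{N} L_0, \qquad R = \frac{\sqrt{5}-1}{2} \approx 0.618, \\
L_N \le \epsilon \;&\Longleftrightarrow\; N \ge \frac{\ln(L_0/\epsilon)}{\ln(1/R)}
  \quad\Longrightarrow\quad N = O\!\left(\log\frac{L_0}{\epsilon}\right), \\
N &= \left\lceil \frac{\ln(3900/30)}{\ln(1/0.618)} \right\rceil
   = \lceil 10.1 \rceil = 11 .
\end{align}
```

With these example values, the bound gives 11 iterations, so even a wide range of candidate intervals is narrowed down in about a dozen evaluations.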

Considerations for Alternative Algorithms
We choose the golden-section search rather than other line search algorithms, such as the bisection, Newton's, Quasi-Newton's, and Nelder-Mead methods, because the golden-section search is unconstrained and derivative-free [28].
First, we need an unconstrained method, because the target function to be searched is not specified. Because NetAP should support various workloads with different characteristics, we cannot assume any constraints for the search algorithm. We only assume that the target function is one-dimensional, so we avoid using complex multi-dimensional algorithms, such as the gradient, random search, and Nelder-Mead methods [29].
Second, we need a derivative-free method, because we do not know the complete function in mathematical form. We only know a few points that are collected from the execution results during each round. In addition, the derivative cannot be correctly calculated in the Linux kernel, because floating-point operations are not allowed in kernel mode so that the user context held in the floating-point unit is preserved [30]. Thus, we avoid algorithms that are based on derivatives, such as the bisection search, Newton's method, and Quasi-Newton's method [31].

Algorithm 1: Network packet handler with periodical polling
Data: round is the number of the current round, round_length is the predefined length of a round in seconds, next_time is the next time at which the interval is updated, current_time() returns the current wall-clock time, bytes is the total size of the packets processed in the current round, interval is the polling interval in the current round, MaxRound is the maximum number of rounds without changing the interval, MinChange is the minimum ratio of processed packet sizes between the previous and current rounds, and RX_queue and TX_queue are the queues for the packets received and the packets to be transmitted.

Preliminary Examination for Golden-Section Search Algorithm
Before we apply the algorithm to solve our problem, there are two requirements to examine: the target function must be unimodal, and a near-optimal value must exist within a range of possible intervals to perform a search. Thus, we demonstrate in advance whether our problem can be solved by the golden-section search algorithm.
We performed a preliminary examination to determine whether a golden-section search can find near-optimal polling intervals. As a result, we made two observations. First, the change in performance over various polling intervals appears as a unimodal function. Second, there is a range of periods that can be commonly applied to various workloads.
The environment of the experiment is as follows: the host system is equipped with a CPU with six physical cores, 16 GB of memory, and a 10 GbE NIC. Simultaneous multi-threading was disabled, and an additional 1 GbE NIC for controlling the host system was installed to increase the reliability of the experiment. We used Linux kernel v5.1.5, released in May 2019. We ran several VMs on this host system and performed Netperf benchmarks on each VM that was connected to a separate Netperf server system. In the experiment, 64 B packets were transmitted using TCP, and we measured the aggregated network bandwidth in Mbps.
In this preliminary experiment, we executed one, two, three, and six Netperf threads on each of one, two, three, and six VMs. We increased the polling interval by 500 us every 60 s, from 500 us to 5000 us, and observed the performance change that corresponded to each interval for each workload. We did not evaluate each interval separately, but continuously changed the interval. This is to determine how rapidly the effect of changing the polling interval becomes apparent. Figure 2 shows the result. The first item on the X-axis shows the performance of unmodified VirtIO without applying the periodic polling technique. Furthermore, the figure shows the performance that corresponds to up to 5000 us with gradual increments of 500 us. The performance is shown at five-second intervals. The overall trend forms a unimodal function in every case, and we can find the point where the performance is maximized. As summarized in Table 1, intervals up to a maximum of 3000 us maximized the performance for all workloads. The bandwidth comparison in Table 1 is the relative performance of the best network bandwidth as compared to the performance of unmodified VirtIO.
Through the preliminary examination, we determined that the golden-section search algorithm can be applied to find the near-optimal polling interval in NetAP. We also found that the network performance changed immediately after the polling interval was changed.

Golden-Section Search Algorithm
Algorithm 2 obtains the near-optimal interval using golden-section search. First, R is the inverse of the golden ratio, which is approximately 0.618, and MaxInterval and MinInterval are the predetermined maximum and minimum intervals, respectively. MinDistance is a predetermined minimum distance to stop the algorithm.
The golden_section_searching() function is called by Algorithm 1 with the round, bytes, and previous interval as arguments. In the first and second rounds, interval_low and interval_high are set using MaxInterval and MinInterval, respectively, with the same value of distance, and the performance is collected for a round_length. From the third round, the function checks which of interval_low and interval_high achieved higher performance. If interval_low has higher performance, interval_high is replaced by interval_low, and interval_low is replaced by (interval_high + distance). Conversely, if interval_high provides higher performance, interval_low is replaced by interval_high, and interval_high is replaced by (interval_low − distance). Subsequently, the updated interval is returned. Furthermore, distance is decreased by the golden ratio in each round. When distance is less than MinDistance, the algorithm terminates, and the previous interval is returned without an update.
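A round-driven search of this kind, in which each call consumes the performance measured in the previous round and returns the interval to try next, can be sketched as follows. This uses the standard golden-section update and floating-point arithmetic, so it is a user-space illustration rather than a transcription of Algorithm 2. The class name, the step() method, and the quadratic throughput model are hypothetical.

```python
import math

R = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618


class RoundDrivenSearch:
    """Golden-section maximization that consumes one measurement per round."""

    def __init__(self, min_interval, max_interval, min_distance):
        self.lo, self.hi = float(min_interval), float(max_interval)
        self.min_distance = min_distance
        self.x_low = self.x_high = None   # the two interior intervals
        self.f_low = self.f_high = None   # their measured performance
        self.pending = None               # which point is being measured
        self.best = None                  # set once the search has finished

    @property
    def done(self):
        return self.best is not None

    def step(self, measured=None):
        """Record the last round's performance; return the interval to try next."""
        if self.done:
            return self.best
        if self.pending is None:          # first round: probe the upper point
            distance = R * (self.hi - self.lo)
            self.x_high = self.lo + distance
            self.x_low = self.hi - distance
            self.pending = "high"
            return self.x_high
        if self.pending == "high":
            self.f_high = measured
        else:
            self.f_low = measured
        if self.f_low is None:            # second round: probe the lower point
            self.pending = "low"
            return self.x_low
        if self.f_high > self.f_low:      # maximum lies in [x_low, hi]
            self.lo = self.x_low
            self.x_low, self.f_low = self.x_high, self.f_high
            self.x_high = self.lo + R * (self.hi - self.lo)
            self.pending, kept = "high", self.x_low
        else:                             # maximum lies in [lo, x_high]
            self.hi = self.x_high
            self.x_high, self.f_high = self.x_low, self.f_low
            self.x_low = self.hi - R * (self.hi - self.lo)
            self.pending, kept = "low", self.x_high
        if R * (self.hi - self.lo) < self.min_distance:
            self.best = kept              # stop: keep the better known point
            return self.best
        return self.x_high if self.pending == "high" else self.x_low


def throughput(interval_us):              # hypothetical unimodal workload
    return -(interval_us - 1500.0) ** 2


search = RoundDrivenSearch(100, 4000, 30)
interval, rounds = search.step(), 0
while not search.done:
    interval = search.step(throughput(interval))
    rounds += 1
print(round(interval), rounds)
```

With a range of 100 to 4000 and a minimum distance of 30, this sketch stops after 11 measured rounds, and the returned interval lies within the final search range around the maximizer of the assumed throughput model.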
Through the aforementioned process, we can quickly search for the interval between MinInterval and MaxInterval that maximizes the performance. For the implementation in the Linux kernel, R and distance were calculated and entered in advance as constant values. This is because floating-point operations cannot be performed within the Linux kernel. MinInterval and MaxInterval were set to 100 and 4000 us, respectively. The numbers were chosen according to the results of our preliminary examination, which showed that the near-optimal polling intervals can be found between 500 and 3000 us. In addition, MinDistance was set to 30, so that the number of searches was limited to 11 iterations. The set of all distances is as follows: 2410, 1490, 921, 569, 352, 217, 134, 83, 51, 32. Note that the first distance is used twice, in the first and second rounds.
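The constant distances can be generated offline and then embedded as integers, so that no floating-point arithmetic is needed at run time. The sketch below is one way to do this, assuming that each distance is the search range multiplied by successive powers of R and rounded to the nearest integer. The rounding rule and the function name are assumptions, chosen because they reproduce the values listed above.

```python
import math


def distance_table(min_interval, max_interval, min_distance):
    """Precompute integer search distances until they drop below min_distance."""
    ratio = (math.sqrt(5) - 1) / 2          # inverse golden ratio, about 0.618
    span = max_interval - min_interval
    table = []
    power = 1
    while True:
        distance = round(span * ratio ** power)
        if distance < min_distance:
            break
        table.append(distance)
        power += 1
    return table


DISTANCES = distance_table(100, 4000, 30)
print(DISTANCES)
# [2410, 1490, 921, 569, 352, 217, 134, 83, 51, 32]
```

The ten entries, with the first one used in both of the first two rounds, account for the 11 iterations mentioned above.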

Evaluation
We implemented NetAP on Linux v5.1.5 and evaluated the performance. We attempted to answer the following questions. The experimental environment was the same as that described in Section 3.4.2. In addition, for all experiments, the existing unmodified VirtIO processing was run at the beginning. Subsequently, we applied NetAP after 30 s (in Section 4.1) or 20 s (in Sections 4.2 and 4.3) so that the impact of NetAP on the performance of VMs could be observed easily. In the first and second experiments, we measured network bandwidth for 64 B packets only. This was because a substantial amount of data center traffic consists of small packets, which consume more CPU resources than large packets [32,33]. Furthermore, perf and pidstat were used to measure CPU utilization.

First Experiment: Round Length
The first experiment evaluated the adaptation performance of NetAP depending on the round length. To find the near-optimal polling interval, the golden-section searching algorithm collects performance over a period (i.e., a round length). Subsequently, the changed interval is applied to VirtIO processing until the next round. If the round is very short, the polling interval changes frequently, which makes it difficult for the algorithm to find the near-optimal polling interval. In contrast, a long round length can degrade the average performance because inappropriate polling intervals can affect the performance negatively. As a result, it is necessary to determine a proper round length for the golden-section search algorithm.
We have four experimental scenarios: 1VM-6T, 2VM-3T, 3VM-2T, and 6VM-1T. This indicates the number of Netperf threads and the number of VMs on which the aforementioned threads are executed. For example, 3VM-2T means that two Netperf threads run on each of three VMs. In total, six Netperf threads and three vHost threads are executed simultaneously for the 3VM-2T scenario. We ran all experiments for 120 s, and unmodified VirtIO processing was executed in the first 30 s. Subsequently, we measured the bandwidth of the 10G NIC shared by all VMs every 5 s. The lengths of the rounds selected for the experiment were 0.25, 0.5, 1, 2, 3, and 5 s. Figure 3 shows the performance change in aggregate bandwidth, depending on round length. In most cases, NetAP shows performance improvement compared to the unmodified VirtIO running for the first 30 s of the experiment. In addition, we observe that the performance of VMs decreases during the time for the searching algorithm to find the near-optimal polling interval when the round length is longer than one second. This is the aforementioned negative effect of a long round length. During the long round, an inappropriate value calculated by the golden-section search algorithm may be applied to the polling interval; this will result in performance degradation. Table 2 depicts the performance shown in Figure 3 relative to the best bandwidth measured in Table 1. We divide the results presented in Figure 3 into two parts: the second quarter (30-60 s) and the last quarter (90-120 s). The last quarter shows the performance when the adaptation is completed, and the second quarter shows the performance when the negative effect of adaptation is maximized.
In Table 2, we find that NetAP can find the near-optimal polling interval and achieve a bandwidth close to that of the best case when the round length is longer than one second. However, in 6VM-1T, the relative performance is less than 90% for all round lengths, which indicates that NetAP has difficulty in finding the near-optimal polling interval. We assume the reason is that the golden-section search algorithm is executed in a distributed manner; furthermore, it does not consider the effects of the interference between multiple vHost threads that are executed in parallel. We plan to overcome this limitation of NetAP by enhancing the algorithm in future work.
In addition, Table 2 shows the performance under different round lengths during adaptation (i.e., 30-60 s). Unlike the results after adaptation (i.e., 90-120 s), the performance obtained for round lengths longer than two seconds is relatively low. This is because the inappropriate polling interval reduces the performance of VMs for a long period, as illustrated in Figure 3. Based on this result, we selected one second as the default round length for NetAP and performed the following experiments.
Table 2. Performance relative to the best bandwidth in Figure 2.

Second Experiment: CPU Utilization and End-User Latency
This subsection identifies the impact of NetAP on CPU utilization and end-user latency. We constructed four scenarios, 2VM-4T, 3VM-3T, 4VM-2T, and 5VM-1T, which were not tested in Section 4.1. We measured the aggregate bandwidth and polling intervals every second for 60 s. In this experiment, the unmodified VirtIO runs for the first 20 s, and NetAP starts to run at 20 s. Figure 4 illustrates the aggregate network bandwidth (left y-axis) and the polling interval (right y-axis) of each VM over time. In all scenarios, the aggregate bandwidth increases when NetAP is applied at 20 s. NetAP periodically checks whether there are pending packets for processing before a vIRQ is generated. This procedure prevents expensive VM-Exit and VM-Entry operations and improves the overall network performance. Furthermore, we find that NetAP offers more stable performance than the unmodified VirtIO. In the unmodified VirtIO, a vIRQ to notify the vHost thread can be generated only after the corresponding VM is scheduled by the CPU scheduler. Therefore, the vIRQ can be delayed, and this may cause unstable performance.
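The effect of the polling interval on periodic polling can be illustrated with a toy model. The sketch below is not the NetAP kernel code: the function name, the integer microsecond timestamps, and the rule that each poll drains every packet that has arrived so far are assumptions made for illustration.

```python
def simulate_polling(arrivals_us, interval_us, duration_us):
    """Poll every interval_us and drain all packets that have arrived so far.

    Returns the number of polls, the number of polls that found no packet,
    and the longest time a packet waited before being picked up.
    Integer microseconds are used to avoid floating-point drift.
    """
    arrivals = sorted(arrivals_us)
    next_packet = 0
    polls = empty_polls = max_wait = 0
    for now in range(interval_us, duration_us + 1, interval_us):
        polls += 1
        batch = 0
        while next_packet < len(arrivals) and arrivals[next_packet] <= now:
            max_wait = max(max_wait, now - arrivals[next_packet])
            next_packet += 1
            batch += 1
        if batch == 0:
            empty_polls += 1  # CPU time spent without useful work
    return {"polls": polls, "empty_polls": empty_polls, "max_wait_us": max_wait}


# Hypothetical arrival times (in microseconds) of five packets.
arrivals = [10, 20, 30, 150, 160]
long_interval = simulate_polling(arrivals, interval_us=100, duration_us=200)
short_interval = simulate_polling(arrivals, interval_us=10, duration_us=200)
```

In this toy trace, a 100 µs interval handles the five packets in two polls but lets one packet wait 90 µs, whereas a 10 µs interval removes the wait at the cost of 15 empty polls out of 20. This trade-off is what makes the choice of interval workload-dependent.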
In the figure, the polling interval of each VM varies for approximately 10 s when NetAP is applied. This is because the algorithm searches for the near-optimal polling interval 11 times for an adaptation. After the near-optimal interval is found, NetAP continuously maintains the interval.
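A generic golden-section search for a maximum can be sketched as follows. This is a textbook formulation, not the in-kernel NetAP implementation: the function name, the search bounds, the synthetic bandwidth curve, and the reading of "11 times" as a budget of 11 bandwidth measurements (one per round) are assumptions.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def golden_section_max(measure, lo, hi, evaluations=11):
    """Approximate the maximizer of a unimodal measure() on [lo, hi].

    Each call to measure() stands for one round in which a candidate
    polling interval is applied and the resulting bandwidth is observed.
    """
    if evaluations < 2:
        raise ValueError("at least two evaluations are required")
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = measure(c), measure(d)
    for _ in range(evaluations - 2):
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = measure(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = measure(d)
    return c if fc > fd else d


# Synthetic, hypothetical bandwidth curve that peaks at an interval of 300.
probes = []


def synthetic_bandwidth(interval):
    probes.append(interval)
    return -(interval - 300.0) ** 2


best = golden_section_max(synthetic_bandwidth, 0.0, 1000.0)
```

Every step after the first two measurements shrinks the bracket by a factor of about 0.618, so 11 measurements narrow the initial range to roughly 1.3% of its width. The search assumes a single peak; noisy or interfering measurements can steer it toward the wrong sub-range.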
In addition, the near-optimal polling interval differs from VM to VM in Figure 4 when multiple VMs run concurrently. As described in Section 4.1, the reason is that the golden-section search algorithm is executed in a distributed manner: a separate instance of the search algorithm runs for each vHost thread during the round. However, NetAP still improves the aggregate bandwidth of VMs by adjusting the polling interval of each VM independently.
Tables 3 and 4 show the aggregate bandwidth, CPU utilization, and latency for each 20 s period in each scenario: 0 to 20 s is the time for which unmodified VirtIO runs (VirtIO period), 20 to 40 s is when the algorithm searches for the near-optimal polling interval (Adapting period), and the last 20 s correspond to the time after the near-optimal interval has been applied (Adapted period). First, in the Adapted period, NetAP improves the aggregate bandwidth by up to 30.37% compared to that in the VirtIO period. In each Adapted period, the CPU utilization of the host is lower than in the VirtIO period because NetAP reduces the number of VM-Exit and VM-Entry operations that occur with the interrupt-based method of the existing VirtIO. In addition, NetAP increases the guest CPU utilization because it processes incoming packets in a timely manner, which allows the corresponding VM to place more packets in the queue. As a result, NetAP achieves improved efficiency in packet processing.
In addition, end-user latency decreases by up to 89.49%. The standard deviation in the Adapted period is also reduced, to a maximum level of 40%, and it improves on the VirtIO period in all scenarios. From these results, we can infer that the time for delivering vIRQs is variable in the unmodified VirtIO. A VM can generate a vIRQ only when it is selected by the CPU scheduler; therefore, the delivery time is not deterministic. This causes fluctuating end-user latency. In contrast, NetAP offers stable and low latency by operating at periodic intervals determined by the search algorithm, thereby increasing the efficiency of packet processing.

Third Experiment: Different Packet Sizes and System Environments
In this experiment, we evaluated NetAP for different packet sizes and system environments. We intended to demonstrate that NetAP with the current parameters is effective in different environments, and that, consequently, many of the pre-examination and workload analysis steps are not mandatory.
First, we evaluated NetAP for various packet sizes: 128, 256, 512, and 1024 B. In this experiment, we ran a single VM executing Netperf with a single thread for 60 s and applied NetAP after 20 s, as in the experiments described in Section 4.2. Figure 5 shows that performance decreases briefly when NetAP starts to run at 20 s in order to determine the near-optimal interval; then, the performance increases significantly. Table 5 depicts the relative bandwidth, CPU utilization, and latency compared to the unmodified VirtIO, which shows that NetAP improves the performance by at least 53.53% and by a maximum of 148.83%. The results of unmodified VirtIO were collected in the first 20 s, and the results of NetAP were collected in the last 20 s. The CPU utilization of the host drastically decreased while the overall performance increased. Such efficient CPU utilization is an important factor in cloud systems: it indicates that NetAP handles network packets more efficiently, allowing other CPU-intensive threads or different VMs to use more CPU resources. In addition, the average latency decreased by 15.36%. For 256-B packets, the standard deviation of latency increased by 15.36%, but the average latency decreased significantly, by 52.22%. As a result, NetAP provides higher network performance with fewer CPU resources for both large packets and the small packets that incur severe CPU contention.
Second, we constructed an additional experimental setup, as follows: two servers equipped with Intel Xeon E5-2650 v2 CPUs (8 cores, 3.4 GHz) and 64 GB of memory were connected to a 10 GbE switch. The VM specifications and the network configurations are the same as in previous experiments. This experimental setup shows the impact of CPU performance, which has a critical effect on network performance in a virtualized environment. In terms of BogoMips, a single CPU core that was used in the previous experiment recorded 6192, whereas a single CPU core in this experiment recorded 5200. Thus, the CPUs that were used in this experiment exhibit approximately 20% lower performance. We applied NetAP to this system without additional modifications. The experimental configuration and method are identical to those presented in Section 4.2.
Figure 6 shows similar results to Figure 4, which indicates that NetAP improved the aggregate bandwidth as compared to the unmodified VirtIO. However, the overall network performance was lower than that shown in Figure 4. This is because the CPU cores used in this experiment are slower than those used to obtain the results shown in Figure 4. Tables 6 and 7 show that the performance of NetAP improved by up to 31.16%, while showing results that are similar to those in Tables 3 and 4 for CPU utilization and latency. These results show that NetAP is also effective in systems with different hardware specifications. Thus, we can conclude that the NetAP polling mechanism with the default parameters is effective in various system environments.

Conclusions
This paper presented NetAP, an adaptive polling technique for improving the performance of packet processing in VMs. The periodic polling approach of NetAP resolves the inefficient packet processing of the existing VirtIO, which is interrupt-based. NetAP uses the golden-section search algorithm to find the near-optimal polling interval. We implemented NetAP on the KVM hypervisor in Linux v5.1.5 and evaluated it with up to six VMs. The evaluation results showed that NetAP improves the network performance of VMs by up to 31.16%, while using the host CPU for packet processing at only 32.92% of the usage exhibited by the existing VirtIO technique.
In this paper, we discovered areas for future work in adaptive polling-based network packet processing. First, accurate control or adaptation for performance isolation between VMs should be provided for real-world cloud systems. When VMs share the same host machine, their workloads influence each other. This "neighborhood effect" of the virtualized system causes performance interference between VMs on the same host machine. When the performance of a VM decreases due to other co-located VMs, the golden-section search algorithm receives performance measurements distorted by the neighborhood effect, and it has difficulty in finding the near-optimal polling interval that achieves the maximum performance.
Another issue is the estimation of the optimal polling interval. If we can determine the optimal polling interval for a workload without carrying out adaptation, the performance degradation during adaptation will disappear. The data that can be employed for the estimation of the optimal polling interval include the average packet size, packet incoming rate, current throughput of the physical NIC, and the number of vHost and client threads. An accurate model is not easy to develop because several of these measurements are themselves related to the interval. In our future work, we aim to develop an AI-based model in order to solve this problem.