Review Reports
Lukas Beierlieb, Alexander Schmitz, Anas Karazon, et al.
Reviewer 1: Anonymous
Reviewer 2: Alejandro Medina
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This article compares the performance of four VMI breakpoint implementation variants of two VMI applications (DRAKVUF, SmartVMI) on the Xen hypervisor - EPT (SLAT view) switching with and without fast single-step acceleration support, instruction repair, and instruction emulation - on 20 Intel Core processors from the 4th to the 13th generation. The authors found a consistent pattern in the time required to handle breakpoint hits across all platforms. When revising the manuscript, attention must be paid to the following issues.
- The authors' abstract states that, regarding the time required to handle breakpoint hits, instruction repair > EPT switching > EPT switching with fast single-step support > instruction emulation. What is the main basis for this argument?
- The author mentioned a contribution: this article provides breakpoint benchmark test results measured on 20 devices equipped with Intel 4th to 13th generation Core processors using the aforementioned disk images. From the perspective of sample size, is the data on 20 devices sufficient? Please provide a reasonable explanation.
- On page 5, the author discusses vCPU. What are the main connections and differences between it and the CPU?
- Fig.1 gives the software architecture for the benchmark using SmartVMI as an example. Please provide a more specific explanation.
- The references in this article are not sufficient. Some reference formats need to be carefully corrected. For example, in reference 2, "IEEE" appears three times.
Author Response
# Review 1
Comment 1: The authors' abstract states that, regarding the time required to handle breakpoint hits, instruction repair > EPT switching > EPT switching with fast single-step support > instruction emulation. What is the main basis for this argument?
Response 1: This statement was the summary of the measurement results, where we observed this execution-time ranking on all hardware platforms. We rephrased the abstract to be clearer about the observation (l. 11-19).
Comment 2: The author mentioned a contribution: this article provides breakpoint benchmark test results measured on 20 devices equipped with Intel 4th to 13th generation Core processors using the aforementioned disk images. From the perspective of sample size, is the data on 20 devices sufficient? Please provide a reasonable explanation.
Response 2: Added "While this sample size is relatively small for precise statistical generalization, it is sufficient to reveal general trends and differences in breakpoint performance across CPU generations. The dataset captures a broad range of architectures, providing meaningful insights, even if fine-grained quantitative conclusions would require more devices." to the Experimental Design section (previously Approach section) (l. 536-540).
Comment 3: On page 5, the author discusses vCPU. What are the main connections and differences between it and the CPU?
Response 3: Added the explanation "A vCPU is allocated by the hypervisor to a VM; it represents scheduled access to a physical core rather than a physical processor itself. A CPU refers to the physical processor chip, which may contain multiple cores, each capable of executing instructions independently. Thus, a vCPU runs on a core of a CPU, but its timing and performance can vary depending on the hypervisor and other VM activity." to the introduction (l. 30-34).
Additionally, we replaced all occurrences of "virtual CPU" with the acronym "vCPU" (l. 165, 199).
Comment 4: Fig.1 gives the software architecture for the benchmark using SmartVMI as an example. Please provide a more specific explanation.
Response 4: We are not sure how the explanation can be more specific. Should it be "more generic"? When it comes to the VMI application, not much changes when replacing SmartVMI with DRAKVUF or another implementation, especially when libVMI is involved. We made the illustration more "generic" by showing not only the architecture for Xen but also for KVM.
Comment 5: The references in this article are not sufficient. Some reference formats need to be carefully corrected. For example, in reference 2, "IEEE" appears three times.
Response 5:
Does "The references in this article are not sufficient" mean that there are not enough references, or does the statement belong to the following sentences critizing the bibliography formatting? We did not modify the bibliography style from the template and used publicly available bibtex entries for our references. In the mentioned second reference, IEEE is the organization and publisher, as well as part of the booktitle. We are very much open to make changes but do not want to remove information without a concrete guideline. No changes applied so far.
Reviewer 2 Report
Comments and Suggestions for Authors
Comments:
1. The abstract is clear and concise. It adequately presents the problem, methodology, and main findings. The keywords are relevant. However, it lacks explicit mention of the main contribution in terms of relative performance among the evaluated techniques.
2. The introduction provides a good context for the problem of virtual machine introspection (VMI) and the importance of hypervisor breakpoints. Foundational work is cited and the need for performance evaluation is justified. The research question is well-formulated. It would be helpful to include a brief mention of the study's limitations (e.g., restriction to Intel and XEN).
3. The background section is comprehensive and well-structured. It clearly explains the three breakpoint mechanisms (repair, emulation, and SLAT/EPT view switching), as well as their advantages and disadvantages. Table 1 is very useful. It is suggested to add a conceptual diagram illustrating the flow of each method.
4. The methodology is robust and reproducible. The description of the bpbench tool, the workloads (WL1–WL4), and the configurations to improve accuracy (priority, pinning, SMT, fixed frequency) are well detailed. The effort to control confounding variables is noteworthy. The justification for using XEN instead of KVM is valid.
5. The hardware selection is diverse and representative of multiple Intel Core generations. Table 6 is very comprehensive. The transparency in the firmware configurations (Table 7) and performance tuning in XEN (Table 8) is appreciated. Special cases are well documented.
6. The results are presented clearly and systematically. Boxplots are appropriate for showing the distribution of latencies. Normalization with respect to emulation (Figure 6) facilitates comparison. The identification of SmartVMI anomalies and their exclusion from the analysis reinforces the validity of the results.
7. The discussion is consistent with the results. The research question is clearly answered: emulation is the fastest, followed by EPT with FSS, EPT without FSS, and finally repair. Consistency across all platforms is a robust finding. The discussion of hardware evolution is cautious and well-founded.
8. Related Works: The section is well organized into categories and cites relevant work. It is suggested to include mention of more recent commercial tools or frameworks, if available, although the coverage is already adequate.
9. The conclusions are consistent with the results. The study is adequately summarized, and limitations are highlighted. The future work section is realistic and relevant, especially the extension to KVM and the improved reproducibility using Nix.
Additionally, I could mention that:
Abstract
Can be improved. It succinctly states the problem and methodology. However, it should more forcefully highlight the key takeaway: that instruction emulation was the fastest method across all 20 tested platforms, which is a strong and generalizable result. The phrase "regarding the time required..." is awkward; consider "We found that instruction repair was the slowest method, followed by EPT switching, EPT switching with FSS, and instruction emulation, which was the fastest."
Introduction
Can be improved. Provides good background on VMI. The research question is clear. However, the contributions could be stated more prominently. The second bullet point ("We provide breakpoint-benchmark results...") is the core of the paper and should be the first and primary contribution. Consider merging the two points for greater impact.
Background
Can be improved. The technical explanations of the three breakpoint methods (2.2.1 - 2.2.3) are excellent and clear. However, Section 2.1 (VMI Software Architecture) feels out of place and disrupts the flow. It contains information critical for understanding the experiment's setup but not its concepts. This should be moved to the Approach/Methodology section (Section 3). Section 2.3 and 2.4 on acceleration and stealth are relevant but could be more concise.
Approach (Methodology)
Must be improved. This is the weakest section structurally. It is overly long and mixes high-level design decisions with extremely low-level, repetitive configuration details (e.g., nice values, specific core pinning for every odd machine). This drowns the reader in details before they understand the big picture.
Recommendation: Split this section. Create a new "Experimental Design" section that discusses the high-level choices: why XEN, why this hardware range, the design of bpbench (WL1-WL4), and the overall software stack (which can be summarized here, with Table 3, and the details of 2.1 moved here). Then, create a "Measurement Setup and Configuration" subsection (or appendix) to list all the specific tuning steps (priority, pinning, SMT/clock disabling). This will dramatically improve readability.
Hardware Platforms
Can be improved. The data is excellent and valuable. Tables 6, 7, and 8 are very useful. The special cases subsection (4.1) is important. The main issue is integration with the previous section; much of the content here (e.g., disabling SMT/C-states) is a repeat of Section 3.6. Consolidate all configuration details in one place.
Measurements
Can be improved. The results are clear and the analysis is sound. The use of median and minimum values is appropriate for this type of latency measurement. The figures are good but could be more clearly explained in the text. The narrative around Figure 2 and 3 is confusing; it takes too long to understand what is being shown. Lead with the conclusion first, then use the figure as support.
The "SmartVMI Anomalies" part is critical information but is presented as a discovery during analysis. If this was a known risk of the SmartVMI setup, it should be mentioned in the methodology section as a limitation of the measurement approach for that tool.
Conclusion
Can be improved. The summary adequately recaps the findings. The "Future Work" section is relevant and well-justified. The conclusion could be stronger by more broadly stating the significance of the result (e.g., "Our results provide VMI developers with a clear, evidence-based hierarchy of breakpoint techniques to prioritize for performance-critical applications.").
Comments on the Quality of English Language
Requires significant editing for clarity, conciseness, and flow.
Author Response
# Review 2
Comment 1: The abstract is clear and concise. It adequately presents the problem, methodology, and main findings. The keywords are relevant. However, it lacks explicit mention of the main contribution in terms of relative performance among the evaluated techniques.
Response 1: We updated the abstract to highlight the consistent ranking of the approaches in the measurement study (l. 11-19).
Comment 2: The introduction provides a good context for the problem of virtual machine introspection (VMI) and the importance of hypervisor breakpoints. Foundational work is cited and the need for performance evaluation is justified. The research question is well-formulated. It would be helpful to include a brief mention of the study's limitations (e.g., restriction to Intel and XEN).
Response 2: We now mention the limitations (l. 113-117).
Comment 3: The background section is comprehensive and well-structured. It clearly explains the three breakpoint mechanisms (repair, emulation, and SLAT/EPT view switching), as well as their advantages and disadvantages. Table 1 is very useful. It is suggested to add a conceptual diagram illustrating the flow of each method.
Response 3: We extended Figures 2 and 3, which already show the flow of instruction repair and emulation, to also illustrate the workflow on a type 2 hypervisor (KVM). We have not added a similar control-flow diagram for EPT switching, as it is very similar to instruction repair (single-stepping workflow).
Comment 4: The methodology is robust and reproducible. The description of the bpbench tool, the workloads (WL1–WL4), and the configurations to improve accuracy (priority, pinning, SMT, fixed frequency) are well detailed. The effort to control confounding variables is noteworthy. The justification for using XEN instead of KVM is valid.
Response 4: Thank you (note: we now mention KVM performance in the conclusion, l. 1035-1038).
Comment 5: The hardware selection is diverse and representative of multiple Intel Core generations. Table 6 is very comprehensive. The transparency in the firmware configurations (Table 7) and performance tuning in XEN (Table 8) is appreciated. Special cases are well documented.
Response 5: Thanks!
Comment 6: The results are presented clearly and systematically. Boxplots are appropriate for showing the distribution of latencies. Normalization with respect to emulation (Figure 6) facilitates comparison. The identification of SmartVMI anomalies and their exclusion from the analysis reinforces the validity of the results.
Response 6: We restructured the text to give tangible results up front and reduce the long build-up discussion.
Comment 7: The discussion is consistent with the results. The research question is clearly answered: emulation is the fastest, followed by EPT with FSS, EPT without FSS, and finally repair. Consistency across all platforms is a robust finding. The discussion of hardware evolution is cautious and well-founded.
Response 7: Thanks!
Comment 8: Related Works: The section is well organized into categories and cites relevant work. It is suggested to include mention of more recent commercial tools or frameworks, if available, although the coverage is already adequate.
Response 8: We followed the great suggestion and added three commercial applications of VMI technology. Note that based on another reviewer's suggestion, we moved the Related Work section forward to just behind the background section.
Comment 9: The conclusions are consistent with the results. The study is adequately summarized, and limitations are highlighted. The future work section is realistic and relevant, especially the extension to KVM and the improved reproducibility using Nix.
Response 9: Thank you.
Comment 10 (Abstract): Can be improved. It succinctly states the problem and methodology. However, it should more forcefully highlight the key takeaway: that instruction emulation was the fastest method across all 20 tested platforms, which is a strong and generalizable result. The phrase "regarding the time required..." is awkward; consider "We found that instruction repair was the slowest method, followed by EPT switching, EPT switching with FSS, and instruction emulation, which was the fastest."
Response 10: Done, please see Response 1.
Comment 11 (Introduction): Can be improved. Provides good background on VMI. The research question is clear. However, the contributions could be stated more prominently. The second bullet point ("We provide breakpoint-benchmark results...") is the core of the paper and should be the first and primary contribution. Consider merging the two points for greater impact.
Response 11: We reordered and reworded the contribution summary in the introduction (l. 120-130).
Comment 12 (Background): Can be improved. The technical explanations of the three breakpoint methods (2.2.1 - 2.2.3) are excellent and clear. However, Section 2.1 (VMI Software Architecture) feels out of place and disrupts the flow. It contains information critical for understanding the experiment's setup but not its concepts. This should be moved to the Approach/Methodology section (Section 3). Section 2.3 and 2.4 on acceleration and stealth are relevant but could be more concise.
Response 12: We find Section 2.1 relevant as a base for the following sections. We tried to improve the flow of reading by providing an introductory paragraph at the start of the section, which justifies the section's existence (l. 158-161).
Comment 13 (Approach (Methodology)): Must be improved. This is the weakest section structurally. It is overly long and mixes high-level design decisions with extremely low-level, repetitive configuration details (e.g., nice values, specific core pinning for every odd machine). This drowns the reader in details before they understand the big picture. Recommendation: Split this section. Create a new "Experimental Design" section that discusses the high-level choices: why XEN, why this hardware range, the design of bpbench (WL1-WL4), and the overall software stack (which can be summarized here, with Table 3, and the details of 2.1 moved here). Then, create a "Measurement Setup and Configuration" subsection (or appendix) to list all the specific tuning steps (priority, pinning, SMT/clock disabling). This will dramatically improve readability.
Response 13: Thanks a lot for this great recommendation; we split the Approach section into the two suggested sections, "Experimental Design" and "Measurement Setup and Configuration".
Comment 14 (Hardware Platforms): Can be improved. The data is excellent and valuable. Tables 6, 7, and 8 are very useful. The special cases subsection (4.1) is important. The main issue is integration with the previous section; much of the content here (e.g., disabling SMT/C-states) is a repeat of Section 3.6. Consolidate all configuration details in one place.
Response 14: We consider it an important feature to describe the intended firmware configurations (with the other software configurations) first, without mentioning the concrete devices yet. Later duplications in "Hardware Platforms" are not ideal but serve the purpose of giving helpful context.
Comment 15 (Measurements): Can be improved. The results are clear and the analysis is sound. The use of median and minimum values is appropriate for this type of latency measurement. The figures are good but could be more clearly explained in the text. The narrative around Figure 2 and 3 is confusing; it takes too long to understand what is being shown. Lead with the conclusion first, then use the figure as support. The "SmartVMI Anomalies" part is critical information but is presented as a discovery during analysis. If this was a known risk of the SmartVMI setup, it should be mentioned in the methodology section as a limitation of the measurement approach for that tool.
Response 15: We reworked the section. Please see Response 6 for details.
Comment 16 (Conclusion): Can be improved. The summary adequately recaps the findings. The "Future Work" section is relevant and well-justified. The conclusion could be stronger by more broadly stating the significance of the result (e.g., "Our results provide VMI developers with a clear, evidence-based hierarchy of breakpoint techniques to prioritize for performance-critical applications.").
Response 16: We added a statement of the significance (l. 109-110).
Reviewer 3 Report
Comments and Suggestions for Authors
The article examines the performance of different Virtual Machine Introspection (VMI) breakpoint implementations, comparing EPT switching, instruction repair, and instruction emulation through DRAKVUF and SmartVMI on the Xen hypervisor across different Intel processor generations. The study shows variation in the times taken to process breakpoint hits, with instruction repair being more time-consuming and newer processors achieving shorter latency. The results provide useful performance benchmarks, even though the abstract suggests little discussion of broader applicability or issues of deployment.
However, the following elements need to be addressed in the manuscript:
- An explicit novelty statement is missing from the Introduction. A clear and concise novelty statement can help the readers understand the novel element of this research study.
- The Related Works section should be moved right after the Background section instead of the end.
- Given that XEN was selected over KVM because it was more stable and had better feature support, how can the contribution of XEN's own scheduling, event handling, and VM exit latencies be eliminated from the reported differences of breakpoint mechanisms?
- Since firmware restrictions made it impossible to fully standardize CPU features on all 20 systems, how well can the assumption be justified that the performance differences observed are purely due to breakpoint implementations instead of heterogeneous hardware control?
- How can it be ensured that exclusion of the constant overhead of breakpoint from analysis does not inadvertently bias conclusions, especially if microarchitectural implications interact with various implementations differently?
- To what extent does working with a single workload (series of RET/NOP instructions) risk underrepresenting performance effects on real workloads with varying instruction mixes and branch behaviors?
- How good a representation of security monitoring cases where breakpoints fire under heterogeneous multi-threaded, I/O-bound, or kernel-intensive workloads are the benchmarked microbenchmarks?
- Since SmartVMI and DRAKVUF have different internal designs, how can performance differences be traced solely to breakpoint handling mechanisms and not to the greater VMI framework design?
- How can the impact of disabling efficiency cores, Turbo Boost, and SMT be distinguished from natural breakpoint handling performance behavior, when real-world productions may be heavily relying upon them?
- Since Windows VM and Dom0 are both utilizing the fourth physical core in most setups, how are cross-VM contention effects measured and distinguished from actual breakpoint latency?
- How does the use of adapted versions of DRAKVUF and SmartVMI affect reproducibility, and are there any safeguards in place to ensure the adaptations do not introduce performance artifacts inadvertently?
- As bpbench relies on QueryPerformanceCounter in Windows, how does this timer's accuracy degrade in the presence of virtualization overhead, and how might this bias latency measurements at sub-microsecond scales?
- As WL2 had around the same values as WL1 and was thrown away, how can one be sure that this will not remove subtle cache/TLB side effects that would be present only under mixed instruction workloads?
- For WL3 and WL4, how does trapping at the page level using EPT interplay with modern microarchitectural extensions like speculative execution, and might this interplay introduce stealthy latency variations not captured by bpbench?
- As measurement relies on process priority assignment and pinning the CPU, how resistant are the results to OS scheduling aberrations or interrupts that do occur anyway despite affinity and priority control?
- Where clock frequency changes did not remain constant at base speed, how are resulting frequency variations considered in the latency analysis?
- How might restriction to one hypervisor limit external validity of results to environments in which KVM or Hyper-V is the standard?
Author Response
# Review 3
Comment 1: An explicit novelty statement is missing from the Introduction. A clear and concise novelty statement can help the readers understand the novel element of this research study.
Response 1: Added statement "This study features an important novelty for VMI researchers and developers by comparing common methods for hyper-breakpoint handling and hiding for VMI solutions for the first time and provides comparable measurements of their runtime overhead." to the introduction (l. 110-112).
Comment 2: The Related Works section should be moved right after the Background section instead of the end.
Response 2: Thank you for the suggestion, we moved Related Work behind the Background.
Comment 3: Given that XEN was selected over KVM because it was more stable and had better feature support, how can the contribution of XEN's own scheduling, event handling, and VM exit latencies be eliminated from the reported differences of breakpoint mechanisms?
Response 3: We extended Figures 1, 2, and 3 to show how the different architecture of KVM affects the control flow during breakpoints. Lacking KVM measurements, we cannot and do not want to claim that the results will be the same on KVM. However, considering the control flows and estimated costs of operations, we do expect similar performance on KVM (added statement in Conclusion, l. 1035-1038).
Comment 4: Since firmware restrictions made it impossible to fully standardize CPU features on all 20 systems, how well can the assumption be justified that the performance differences observed are purely due to breakpoint implementations instead of heterogeneous hardware control?
Response 4: Our goal was not to standardize features across all CPUs to make them comparable; many other factors, such as memory speed, would also prevent us from having a level playing field on consumer devices. Instead, our goal was to make each CPU individually as consistent as possible. We want to avoid the same CPU running with boost clock speeds for one of the approaches and with less or no boost for another approach; the same goes for process scheduling/pinning. The results reflect our ambitions (and success): the times between processors vary quite a bit, probably strongly influenced by the base clock speed, but comparing the approaches on the same machine always shows very similar relative results.
Comment 5: How can it be ensured that exclusion of the constant overhead of breakpoint from analysis does not inadvertently bias conclusions, especially if microarchitectural implications interact with various implementations differently?
Response 5: The measured timer overhead was very constant on each system (and also quite comparable between systems) and significantly smaller than the differences between breakpoint/read execution times. Given the choice of subtracting the median timer overhead from all results or reporting the exact measured values, we chose the latter. The conclusions do not change in either case.
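For illustration, the following minimal C sketch shows how such a constant timer overhead can be calibrated by timing back-to-back QueryPerformanceCounter calls and taking the median. It is our simplified reconstruction, not the actual bpbench code; the sample count and output format are arbitrary choices.

```c
/* Illustrative calibration of the constant timer overhead (not the actual
 * bpbench source): measure back-to-back QueryPerformanceCounter calls and
 * report the median. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_ll(const void *a, const void *b)
{
    LONGLONG x = *(const LONGLONG *)a, y = *(const LONGLONG *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    enum { SAMPLES = 10000 };           /* arbitrary sample count */
    static LONGLONG delta[SAMPLES];
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);
    for (int i = 0; i < SAMPLES; i++) {
        QueryPerformanceCounter(&t0);
        QueryPerformanceCounter(&t1);   /* empty region: pure timer cost */
        delta[i] = t1.QuadPart - t0.QuadPart;
    }
    qsort(delta, SAMPLES, sizeof delta[0], cmp_ll);
    printf("median timer overhead: %.0f ns\n",
           delta[SAMPLES / 2] * 1e9 / (double)freq.QuadPart);
    return 0;
}
```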
Comment 6: To what extent does working with a single workload (series of RET/NOP instructions) risk underrepresenting performance effects on real workloads with varying instruction mixes and branch behaviors?
Response 6: bpbench provides the workload WL2, where the whole page is executed, with the intention of punishing approaches that use EPT permissions to trap execute accesses (as the original "altp2m" approach does, https://xenproject.org/blog/stealthy-monitoring-with-xen-altp2m/), thus trapping on every NOP before recognizing in the VMI software that no breakpoint was placed at this address and resuming. None of the tested approaches utilized this technique; all rely on INT3s instead. For this reason, we do not consider WL2 further, given how little it differs from WL1. Thus, the relevant workload is only the CALL to the page and the RET with the breakpoint. We added a statement in l. 496-506 that this study uses a microbenchmark that focuses on a single aspect. Varying instruction mixes are deliberately not considered. However, an important aspect is that instruction-emulation performance depends on which instruction is emulated, so here we potentially underrepresent the variety of performance that instruction emulation could show. We added a disclaimer in l. 570-580.
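To make the workload description concrete, the following C sketch shows a guest-side measurement loop in the spirit of WL1: an executable page is filled with NOPs, terminated with a RET, and a timed CALL enters directly at the RET. This is an illustrative reconstruction under our own assumptions, not the actual bpbench implementation; the VMI-side breakpoint placement (e.g., the INT3 on the RET) happens externally and is not shown.

```c
/* Illustrative WL1-style guest-side loop (not the actual bpbench code). */
#include <windows.h>
#include <stdio.h>
#include <string.h>

typedef void (*page_fn)(void);

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    /* One executable page: 0x90 = NOP, last byte 0xC3 = RET. */
    unsigned char *page = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE,
                                       PAGE_EXECUTE_READWRITE);
    if (page == NULL)
        return 1;
    memset(page, 0x90, 4096);
    page[4095] = 0xC3;

    /* WL1 enters directly at the RET; WL2 would enter at the start of the
     * page and execute all NOPs before reaching the RET. */
    page_fn enter_at_ret = (page_fn)(page + 4095);

    for (int i = 0; i < 1000; i++) {
        QueryPerformanceCounter(&t0);
        enter_at_ret();                     /* breakpoint hit happens here */
        QueryPerformanceCounter(&t1);
        /* convert counter ticks to nanoseconds */
        printf("%.0f ns\n",
               (t1.QuadPart - t0.QuadPart) * 1e9 / (double)freq.QuadPart);
    }
    VirtualFree(page, 0, MEM_RELEASE);
    return 0;
}
```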
Comment 7: How good a representation of security monitoring cases where breakpoints fire under heterogeneous multi-threaded, I/O-bound, or kernel-intensive workloads are the benchmarked microbenchmarks?
Response 7: The microbenchmarks give a trend, but in more complex systems a lot of other factors can influence the performance and perhaps even change the ranking. In our measurements, we have empty breakpoint handlers. For implementing only hit counters, the workload is comparable, but often breakpoints are combined with other VMI operations such as register access or memory scanning. We believe there is still value in microbenchmarks, controlled setups, and reproducible results. Again, l. 570-580 contains our statement regarding microbenchmarking.
Comment 8: Since SmartVMI and DRAKVUF have different internal designs, how can performance differences be traced solely to breakpoint handling mechanisms and not to the greater VMI framework design?
Response 8: That is a very valid question. We are limited by the available implementations and can actually only claim insights for the respective implementations, not the general approaches.
Comment 9: How can the impact of disabling efficiency cores, Turbo Boost, and SMT be distinguished from natural breakpoint handling performance behavior, when real-world productions may be heavily relying upon them?
Response 9: We try to keep the measurement setup to as few variables as possible to get reliable and reproducible results for scenarios that we can fully specify. Introducing features that are indeed relevant for real-world performance, such as boosting or SMT, introduces more randomness into the measurements, which requires more complex statistical analysis to draw valid conclusions. The effect of boosting/clock speed can be observed partly across the different hardware platforms, for which we documented the clock speeds. SMT can cause slowdowns when busy tasks are scheduled on the same core, which we try to avoid.
Comment 10: Since Windows VM and Dom0 are both utilizing the fourth physical core in most setups, how are cross-VM contention effects measured and distinguished from actual breakpoint latency?
Response 10: There were no other workloads running except the VMI software and bpbench, so the pinning is mostly relevant for transitioning control back and forth between the Windows VM, the hypervisor, and the VMI application. The impact can be quite significant; we noticed around a 20% slowdown when the VMI application ran on the same core as the VM. Because the impact of pinning is not our focus, we made sure to use the same pinning on all machines and to document what it was.
Comment 11: How does the use of adapted versions of DRAKVUF and SmartVMI affect reproducibility, and are there any safeguards in place to ensure the adaptations do not introduce performance artifacts inadvertently?
Response 11: Our only modifications were to write a plugin for DRAKVUF and SmartVMI to place the breakpoint correctly for bpbench, as well as slight modifications to enable instruction emulation in SmartVMI. The modified version of SmartVMI is available on GitHub and can be built reproducibly with Nix. Anyone wishing to reproduce the results can use the same binaries, which are located on the reproducer image.
Comment 12: As bpbench relies on QueryPerformanceCounter in Windows, how does this timer's accuracy degrade in the presence of virtualization overhead, and how might this bias latency measurements at sub-microsecond scales?
Response 12: We added a discussion around the timing in the experimental design (l. 587-615).
Comment 13: As WL2 had around the same values as WL1 and was thrown away, how can one be sure that this will not remove subtle cache/TLB side effects that would be present only under mixed instruction workloads?
Response 13: We cannot be sure, but we focused on reducing the workload as much as possible to avoid such workload-specific effects.
Comment 14: For WL3 and WL4, how does trapping at the page level using EPT interplay with modern microarchitectural extensions like speculative execution, and might this interplay introduce stealthy latency variations not captured by bpbench?
Response 14: We unfortunately cannot give helpful explanations for the microarchitectural domain. In more complex scenarios (e.g., a VM with multiple vCPUs), other effects might become visible. We are not trying to make any statements beyond the scenario we describe.
Comment 15: As measurement relies on process priority assignment and pinning the CPU, how resistant are the results to OS scheduling aberrations or interrupts that do occur anyway despite affinity and priority control?
Response 15: The issue with other tasks and interrupt handlers being scheduled in between is that, if such scheduling happens during a breakpoint, the measured time will be orders of magnitude slower than if no such event occurs. The individual results show clearly that most measurements concentrate around the median, not much above the minimum, but there is a number of outliers that take far longer. Our scheduling priorities and the pinning of system processes to other CPUs are supposed to reduce the number of times such outliers happen. The results show that we clearly have not avoided the problem entirely, but we seem to have addressed it well enough.
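As an illustration of the summary statistics we refer to (minimum, median, and the share of outliers), here is a small C sketch. It is illustrative only; the 10x-median outlier threshold and the example values are our arbitrary choices, not criteria or data from the paper.

```c
/* Illustrative summary of latency samples: minimum, median, and count of
 * outliers far above the median. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

static void summarize(long long *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_ll);
    long long min = samples[0];
    long long median = samples[n / 2];
    size_t outliers = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] > 10 * median)   /* arbitrary outlier threshold */
            outliers++;
    printf("min=%lld median=%lld outliers=%zu of %zu\n",
           min, median, outliers, n);
}

int main(void)
{
    /* made-up example latencies in nanoseconds */
    long long samples[] = { 210, 205, 1990, 212, 208, 207, 5020, 209 };
    summarize(samples, sizeof samples / sizeof samples[0]);
    return 0;
}
```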
Comment 16: Where clock frequency changes did not remain constant at base speed, how are resulting frequency variations considered in the latency analysis?
Response 16: We did not compare much from platform to platform (except for the one case where we could be sure about fixed frequencies). The behavior on the exceptional platforms was very similar to that on all the other platforms.
Comment 17: How might restriction to one hypervisor limit external validity of results to environments in which KVM or Hyper-V is the standard?
Response 17: We cannot know whether the performance ranking of the approaches will be the same for KVM. It probably is, given the operations involved and their associated costs (which we can only estimate). Unfortunately, KVM does not have established environments where VMI is performed; SmartVMI's instruction repair (restricted to a single vCPU) is the only breakpoint implementation known to us that works on KVM. If there were more VMI usage of KVM and more breakpoint implementations available, we would have liked to include them in the comparison. For Hyper-V, there is no VMI interface to our knowledge.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
- There are grammar and repetition issues. The modified parts must be carefully checked. For example, "In this paperTo realize this contribution" (lines 112-113); "Section 3 highlights relevant existing works" (lines 133,138); "Section 5 gives additional specifications which software and configurations the measurement setup contains" (lines 135-136).
- On lines 165-166, "... virtual CPUs(vCPUs)vCPUs...". What are the differences between vCPUs of different colors? Why are two identical abbreviations linked together? Similar issues exist in other places as well.
- As a scientific paper, maintaining consistency throughout the entire text is beneficial for readers' understanding. In Section 3, when retelling the relevant work, both the past tense (Ref 16) and the general present tense (Ref 17) are used. On lines 429 and 435, "Wahbe et al. expands his work" and "Dangl et al. introduce RapidVMI". One of the verbs in the two sentences is incorrect.
- If the author is not familiar with the format of the references, it is strongly recommended to carefully read the “Manuscript Preparation” on the homepage of this journal. Authors should strictly follow the journal's requirements for editing reference formats, rather than simply using publicly available BibTeX entries as references. For example, both are conference papers. Reference [8] contains the city and country, while reference [2] does not. The city and country of reference [2] should be Boston, USA.
The English must be improved to more clearly express the research.
Author Response
Comment 1: There are grammar and repetition issues. The modified parts must be carefully checked. For example, "In this paperTo realize this contribution" (lines 112-113); "Section 3 highlights relevant existing works" (lines 133,138); "Section 5 gives additional specifications which software and configurations the measurement setup contains" (lines 135-136).
Response 1: We highlighted in red all text that was in the original submission but removed in the revision; likewise, we marked all new additions from the revision in green. All the mentioned issues are resolved when the red text is removed (it was kept in for a better overview of the modifications).
Comment 2: On lines 165-166, "... virtual CPUs(vCPUs)vCPUs...". What are the differences between vCPUs of different colors? Why are two identical abbreviations linked together? Similar issues exist in other places as well.
Response 2: Same here: the red text is actually removed and replaced by the green text. In this case, it might be a bit more confusing, because we could have highlighted "virtual CPUs(" and ")" in red to achieve the same modification. However, in the LaTeX code we also switched from the typed "vCPU" to using the acronym "\ac{VCPU}", so we chose to use the green marking for new text.
Comment 3: As a scientific paper, maintaining consistency throughout the entire text is beneficial for readers' understanding. In Section 3, when retelling the relevant work, both the past tense (Ref 16) and the general present tense (Ref 17) are used. On lines 429 and 435, " Wahbe et al. expands his work and " Dangl et al. introduce RapidVMI''. One of the verbs in the two sentences is incorrect.
Response 3: Thank you for the helpful pointer. We reworded Section 3 to use consistent past tense for everything published in previous works.
Comment 4: If the author is not familiar with the format of the references, it is strongly recommended to carefully read the “Manuscript Preparation” on the homepage of this journal. Authors should strictly follow the journal's requirements for editing reference formats, rather than simply using publicly available BibTeX entries as references. For example, both are conference papers. Reference [8] contains the city and country, while reference [2] does not. The city and country of reference [2] should be Boston, USA.
Response 4: Thank you for pointing us to the manuscript preparation page. We updated the BibTeX entries to remove duplicate information (IEEE mentioned multiple times in different fields; removal is indicated by red highlighting) and added missing publication details to our previously unmodified Google Scholar entries (all new information is highlighted in green).
Reviewer 2 Report
Comments and Suggestions for Authors
Comments Partially Addressed
Comment 12 (Background): While an introductory paragraph was added to justify Section 2.1, the placement of this section remains questionable. The information on VMI software architecture could be better integrated into the methodology section.
Comment 14 (Hardware Platforms): The author justifies the duplication of configuration details as a feature, but this could be optimized to avoid redundancy.
Comments on the Quality of English Language
The English could be improved to more clearly express the research. While generally clear, some sections contain:
- Occasional awkward phrasing (e.g., "across on all machines" in Section 7.2)
- Minor grammatical inconsistencies
- Some long, complex sentences that could be simplified for better readability
Specific suggestions:
- Consider professional proofreading to polish sentence structure
- Simplify complex technical explanations where possible
- Ensure consistent use of technical terminology throughout
Author Response
Comment 12 (Background): While an introductory paragraph was added to justify Section 2.1, the placement of this section remains questionable. The information on VMI software architecture could be better integrated into the methodology section.
Response: We double-checked the placement of Section 2.1. The methodology section (called Experimental Design) explains the setup for measuring differences between the different breakpoint variants. The section does not cover what the different variants are--this information is located in the background, Section 2.2. It is well placed there, because it is existing knowledge that we present to give the reader context and a fundamental understanding of the domain. Section 2.2 explains the execution paths of breakpoint handling, going through all parts of the software stack (guest VM, hypervisor, VMI application). Section 2.1 provides an overview of the software setup and is therefore crucial as an introduction to it. On the other hand, the content of Section 2.1 is not relevant for Section 4, and Section 5 specifies which concrete software and versions we use--introducing the generic architecture in the background seems like a reasonable split.
Comment 14 (Hardware Platforms): The author justifies the duplication of configuration details as a feature, but this could be optimized to avoid redundancy.
Response: After additional consideration, we agree that there is unnecessary duplication. As argued previously, Section 5 gives the reasoning behind pinning CPUs, disabling SMT, and avoiding dynamic clock speeds. Mentioning the specific technologies and firmware identifiers is not necessary there; it is sufficient to have these names in the more concrete Section 6. Thus, we removed all concrete-technology-related information from Section 5.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all the comments.
Author Response
Thank you for your thorough initial review and the critical questions!