Latency-Aware and Auto-Migrating Page Tables for ARM NUMA Servers
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors

Limitations

1. The evaluation focuses primarily on specific benchmark applications (GUPS, XSBench, Hashjoin) rather than a broader range of real-world workloads or production environments. This limits the generalizability of the performance claims to other types of applications that might exhibit different memory access patterns.
2. The manuscript lacks a detailed analysis of the potential overhead introduced by page-table replication and migration operations. While the performance improvements are substantial, the associated costs in terms of memory consumption, migration latency, and system resource utilization are not thoroughly addressed.
3. While the implementation works effectively on the tested ARM servers, the manuscript provides insufficient analysis of how the approach would scale with increasing NUMA node counts or in systems with more complex memory hierarchies.
4. The evaluation focuses primarily on the performance of the target applications but does not adequately address potential impacts on other concurrent processes or on overall system performance in multi-tenant environments.
5. The manuscript does not fully explore how the PTL-aware mechanism adapts to changing workload characteristics over extended execution periods, which is particularly relevant for long-running services in data center environments.

Recommendations

1. The authors should expand their evaluation to include a more diverse set of applications, particularly those commonly deployed in modern data centers. This would strengthen the claim that their approach is broadly applicable in real-world scenarios.
2. A more detailed analysis of the memory overhead, migration costs, and system resource utilization should be included to provide a complete picture of the solution's practical implications.
3. The authors should include an analysis of how their approach scales with increasing NUMA node counts and in systems with more complex memory hierarchies to better demonstrate its applicability in larger-scale deployments.
4. Evaluating the approach in multi-tenant scenarios would better demonstrate its practicality in shared computing environments, which are common in modern data centers.
5. The manuscript would benefit from a more comprehensive comparison with alternative approaches beyond the cited Mitosis, vMitosis, and WASP techniques to better position the work within the broader research landscape.

Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors

The paper aims to improve NUMA performance on ARM servers by implementing a latency-aware, auto-migrating page table mechanism. While the effort to adapt existing x86 techniques to the ARM architecture is acknowledged, the presentation and depth of analysis require significant improvement.
Firstly, the manuscript lacks sufficient clarity and depth in explaining the proposed mechanism. The description of the PTL-aware Auto-PTM technique is somewhat cursory, and the rationale behind the design choices is not adequately justified. Specifically, the analysis of scanning frequency and handling of Transparent Huge Pages (THP) requires more detailed explanation and empirical validation. The argument that the dynamic rate-limiting heuristics of AutoNUMA introduce overhead on ARM servers needs stronger evidence and a more thorough comparative analysis.
Secondly, the comparison between the x86 and ARM architectures, while noted, is not explored with sufficient rigor. The differences in page table structures and kernel implementations are presented, but the implications of these differences on the proposed mechanism’s performance and stability are not thoroughly examined. A more in-depth analysis of the architectural nuances and their impact on the implementation would strengthen the paper. Additionally, the experimental evaluation, while showing improvements, should include a wider range of workloads and a more detailed performance analysis. The experimental setup and parameter choices should be justified more robustly.
Finally, the writing and organization of the paper need significant revision. The manuscript contains numerous instances of unclear prose, and the flow of information is often disjointed. The related work section, while comprehensive, could benefit from a more critical analysis of existing techniques and a clearer articulation of the gaps that this work aims to address. The introduction and conclusion should also be strengthened to provide a more compelling narrative and a clearer summary of the paper’s contributions and limitations.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper introduces PTL-aware Auto-PTM, a mechanism to mitigate NUMA effects in ARM servers by optimizing page-table migration based on access latency (PTL). The authors highlight that existing solutions (e.g., Mitosis, WASP) are either x86-specific or lack PTL-awareness, leading to suboptimal performance under memory contention. Their approach combines Auto-PTM with dynamic PTL measurements to migrate page-tables to nodes with the lowest latency.
1. PTL-aware Auto-PTM – Dynamically migrates page-tables based on measured latency, improving performance under interference.
2. ARM/x86 compatibility – Adapts page-table replication/migration for ARM’s architectural differences (e.g., TTBR registers, 4-level paging).
3. Evaluation – Shows significant speedups (GUPS: 3.53x, XSBench: 1.77x) over state-of-the-art PTM on ARM NUMA servers.
- The paper claims PTL-aware Auto-PTM reduces overhead by limiting scans, but lacks a quantitative breakdown of overhead (e.g., CPU cycles, energy impact).
- The scanning frequency heuristic (skipping 2^n scans) is empirical but not rigorously justified. Why not use adaptive thresholds?
- The bandwidth/latency tests (Fig. 10) explain ARM’s lower performance but don’t clarify why PTL-awareness is less critical for x86. Is it due to x86’s better memory controllers?
- The conclusion mentions virtualization as future work, but this is a major limitation. Cloud environments rely heavily on virtualization—why not preliminary KVM/ARM results?
- Mostly appropriate but missing key NUMA papers:
- No discussion of NUMA balancing in Linux (e.g., Mel et al., USENIX ATC ’12).
- Huge-page NUMA optimizations (e.g., Gaud et al., USENIX ATC ’14) are relevant but unmentioned.
- Over-reliance on self-citations (WASP, Mitosis)—more independent validation needed.
- Table 1: Clear but could include concrete latency numbers (e.g., ARM TTBR0 vs. x86 CR3 access times).
- Figures 6–9:
- Normalized runtime bars are hard to compare—add absolute latency values in a supplement.
- Fig. 6(c) (THP impact) needs error bars—was performance stable across runs?
- Figure 5 (PTL-aware workflow): Too abstract—add a pseudo-algorithm in the text.
- How does PTL-aware Auto-PTM handle multi-tenant cloud workloads (mixed interference patterns)?
- Does it integrate with Linux’s AutoNUMA or require manual tuning?
- Would this work on AMD’s NUMA or Intel’s Sub-NUMA Clustering?
- How does it interact with ARM’s SMMU (IOMMU for devices)?
- No discussion of power trade-offs—does PTL-aware migration increase energy use?
- Strengthen the related work with more comparisons to Linux’s native NUMA balancing.
- Add overhead metrics (e.g., % CPU time spent scanning, memory footprint).
- Clarify the PTL probing—how often is `lat_mem_rd` run? Does it perturb workloads?
- Discuss limitations:
- Scalability beyond 4 nodes.
- Impact on non-memory-intensive workloads (e.g., compute-bound tasks).
- Improve reproducibility:
- Release kernel patches and benchmark scripts.
- Document ARM server BIOS settings (e.g., prefetchers, NUMA interleave).
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors

Accept in present form.