Search Results (4)

Search Parameters:
Keywords = AArch64

26 pages, 7368 KiB  
Article
Latency-Aware and Auto-Migrating Page Tables for ARM NUMA Servers
by Hongliang Qu and Peng Wang
Electronics 2025, 14(8), 1685; https://doi.org/10.3390/electronics14081685 - 21 Apr 2025
Viewed by 661
Abstract
The non-uniform memory access (NUMA) architecture is the de facto norm in modern server processors. Applications running on NUMA processors may suffer significant performance degradation (NUMA effect) due to the non-uniform memory accesses, including data and page table accesses. Recent studies show that the NUMA effect of long-running memory-intensive workloads can be mitigated by replicating or migrating page tables to nodes that require accesses to remote page tables. However, this technique cannot adapt to the situation where other applications compete for the memory controller. Furthermore, it was only implemented on x86 processors and cannot be readily applied to ARM server processors, which are becoming increasingly popular. To address this issue, we designed the page table access latency aware (PTL-aware) page table auto-migration (Auto-PTM) mechanism. We then implemented it on Linux ARM64 (the Linux kernel name for AArch64) by identifying the differences between the ARM architecture and the x86 architecture in terms of page table structure and the implementation of the Linux kernel source code. We evaluated it on real ARM NUMA servers. The experimental results demonstrate that, compared to the state-of-the-art PTM mechanism, our PTL-aware mechanism significantly enhances the performance of workloads in various scenarios (e.g., GUPS by 3.53x, XSBench by 1.77x, Hashjoin by 1.68x).

16 pages, 3979 KiB  
Article
Performance Comparison of CFD Microbenchmarks on Diverse HPC Architectures
by Flavio C. C. Galeazzo, Marta Garcia-Gasulla, Elisabetta Boella, Josep Pocurull, Sergey Lesnik, Henrik Rusche, Simone Bnà, Matteo Cerminara, Federico Brogi, Filippo Marchetti, Daniele Gregori, R. Gregor Weiß and Andreas Ruopp
Computers 2024, 13(5), 115; https://doi.org/10.3390/computers13050115 - 7 May 2024
Cited by 3 | Viewed by 2302
Abstract
OpenFOAM is a CFD software widely used in both industry and academia. The exaFOAM project aims at enhancing the HPC scalability of OpenFOAM while identifying its current bottlenecks and proposing ways to overcome them. For the assessment of software components and for code profiling during development, lightweight but representative benchmarks should be used. The answer was to develop microbenchmarks with a small memory footprint and short runtime. The name microbenchmark does not mean they are the smallest possible test cases; rather, they have been sized to fit in a single compute node, which usually has dozens of compute cores. The microbenchmarks cover a broad band of applications: incompressible and compressible flow, combustion, viscoelastic flow and adjoint optimization. All benchmarks are part of the OpenFOAM HPC Technical Committee repository and are fully accessible. The performance of HPC systems with Intel and AMD processors (x86_64 architecture) and Arm processors (aarch64 architecture) has been benchmarked. For the workloads in this study, the mean performance with the AMD CPU is 62% higher than with Arm and 42% higher than with Intel. The AMD processor seems particularly well suited, resulting in an overall shorter time-to-solution.
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)

16 pages, 465 KiB  
Article
Fine-Grained Isolation to Protect Data against In-Process Attacks on AArch64
by Yeongpil Cho
Electronics 2020, 9(2), 236; https://doi.org/10.3390/electronics9020236 - 1 Feb 2020
Cited by 1 | Viewed by 2459
Abstract
In-process attacks are a new class of attacks that circumvent protection schemes centered around inter-process isolation. Against these attacks, researchers have proposed fine-grained data isolation schemes that can protect sensitive data from malicious accesses even within the same process. Their proposals, based on salient hardware features such as the ARM® architecture's domain protection, are quite successful, but they cannot be applied to a specific architecture, namely AArch64, as it does not provide the same hardware features. In this paper, therefore, we present Sealer, a fine-grained data isolation scheme applicable to AArch64. Sealer achieves its objective by harmonizing two hardware features of AArch64: eXecute-no-Read memory and the Cryptographic Extension. Sealer provides application developers with a set of application programming interfaces (APIs) so that developers can apply fine-grained data isolation in their own way.
(This article belongs to the Section Computer Science & Engineering)

13 pages, 499 KiB  
Article
Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures
by Xing Su and Fei Lei
Electronics 2018, 7(12), 359; https://doi.org/10.3390/electronics7120359 - 27 Nov 2018
Cited by 5 | Viewed by 3211
Abstract
The Basic Linear Algebra Subprograms (BLAS) library is fundamental numerical software, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole workload of GEMM is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is equally partitioned among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that reduces the harm of the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that our method reduces synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
(This article belongs to the Section Computer Science & Engineering)
