Search Results (18)

Search Parameters:
Keywords = hardware prefetching

15 pages, 647 KB  
Article
Design and Implementation of a Prefetcher in a Key Performance Subsystem of RISC-V Processors
by Guoqiang He, Yanbo Zhao, Yang Xiang and Li Li
Electronics 2026, 15(2), 319; https://doi.org/10.3390/electronics15020319 - 11 Jan 2026
Viewed by 153
Abstract
The prefetcher is one of the key performance subsystems in RISC-V processors: a well-designed prefetcher can significantly enhance memory access efficiency, reduce latency, and improve overall processor performance. This paper conducts in-depth research on prefetcher design methods for RISC-V processors and proposes a practical implementation scheme that balances performance and usability. The proposed hybrid prefetching technology integrates two classic modes, automatic hardware prefetching and software-prefetch instructions, and introduces a software template prefetcher whose implementation logic is elaborated in detail. For the hardware prefetcher, the paper further proposes a hierarchical prefetching strategy based on the cache hierarchy and clarifies the prefetcher design corresponding to each cache level. This design balances prediction accuracy, performance, power consumption, and design complexity, employing different prefetching strategies and algorithms at each level to achieve efficient memory access and boost the processor's overall performance. Both the processor and the prefetcher are designed in Verilog HDL; implementation and verification are completed on an FPGA prototype platform, and a 12 nm processor chip is designed and implemented, with the resulting processor core occupying an area of 5.128 mm². On the FPGA platform, the processor equipped with this prefetcher outperforms the Xuantie C908 and Xuantie C910 by 25% to 35.8%. In addition, comparing the processor with the prefetcher enabled against the same processor with it disabled shows a performance improvement of 25.67% to 61%.
(This article belongs to the Section Computer Science & Engineering)
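
For readers unfamiliar with the mechanism, here is a minimal C++ sketch of the PC-indexed stride table that hardware prefetchers of this kind typically build on; the table size, confidence threshold, and all names are illustrative assumptions, not the paper's design.

```cpp
// Minimal sketch of a PC-indexed stride prefetcher table, the classic
// building block behind hierarchical hardware prefetchers. TABLE_SIZE and
// the confidence threshold are illustrative, not taken from the paper.
#include <cstdint>
#include <cstdio>

struct StrideEntry {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    int      confidence = 0;   // saturating counter
};

constexpr int TABLE_SIZE = 256;
StrideEntry table_[TABLE_SIZE];

// Returns a prefetch address, or 0 if confidence is too low.
uint64_t on_access(uint64_t pc, uint64_t addr) {
    StrideEntry &e = table_[pc % TABLE_SIZE];
    int64_t new_stride = (int64_t)addr - (int64_t)e.last_addr;
    if (e.last_addr != 0 && new_stride == e.stride) {
        if (e.confidence < 3) e.confidence++;   // saturate at 3
    } else {
        e.confidence = 0;
        e.stride = new_stride;
    }
    e.last_addr = addr;
    // Issue a prefetch one stride ahead once the pattern repeats twice.
    return (e.confidence >= 2 && e.stride != 0) ? addr + e.stride : 0;
}

int main() {
    // A regular stream with stride 64 (one cache line) trains the entry.
    for (uint64_t a = 0x1000; a < 0x1400; a += 64)
        if (uint64_t p = on_access(/*pc=*/0x400123, a))
            printf("access 0x%llx -> prefetch 0x%llx\n",
                   (unsigned long long)a, (unsigned long long)p);
}
```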

17 pages, 9165 KB  
Article
An FPGA-Based Reconfigurable Accelerator for Real-Time Affine Transformation in Industrial Imaging Heterogeneous SoC
by Yang Zhang, Dejun Chen, Huixiong Ruan, Hongyu Jia, Yong Liu and Ying Luo
Sensors 2026, 26(1), 316; https://doi.org/10.3390/s26010316 - 3 Jan 2026
Viewed by 368
Abstract
Real-time affine transformation, a core operation for image correction and registration of industrial cameras and scanners, faces challenges including the high computational cost of interpolation and inefficient data access. In this study, we propose a reconfigurable accelerator architecture based on a heterogeneous system-on-chip (SoC). The architecture decouples tasks into control and data paths: the ARM core in the processing system (PS) handles parameter matrix generation and scheduling, whereas the FPGA-based acceleration module in programmable logic (PL) implements the proposed PATRM algorithm. By integrating multiplication-free design and affine matrix properties, PATRM adopts Q15.16 fixed-point computation and AXI4 burst transmission for efficient block data prefetching and pipelined processing. Experimental results demonstrate 25 frames per second (FPS) for 2095×2448 resolution images, representing a 128.21 M pixel/s throughput, which is 5.3× faster than the Block AT baseline with a peak signal-to-noise ratio (PSNR) exceeding 26 dB. Featuring low resource consumption and dynamic reconfigurability, the accelerator meets the real-time requirements of industrial scanner correction and other high-performance image processing tasks.
(This article belongs to the Section Sensing and Imaging)
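
The Q15.16 fixed-point arithmetic the abstract mentions is easy to demonstrate in software. A hedged sketch follows: the matrix values, truncation choice, and interface are invented for illustration and do not reproduce the PATRM algorithm.

```cpp
// Software reference for a Q15.16 fixed-point coordinate mapping of the
// kind a hardware affine engine would pipeline. A minimal sketch under
// assumed conventions, not the paper's implementation.
#include <cstdint>
#include <cstdio>

using q16 = int32_t;                      // Q15.16: 1 sign, 15 int, 16 frac bits
constexpr q16 Q_ONE = 1 << 16;

q16 to_q16(double v)       { return (q16)(v * Q_ONE); }
int64_t qmul(q16 a, q16 b) { return ((int64_t)a * b) >> 16; }

// Map an output pixel (x, y) back to source coordinates with the inverse
// affine matrix [a b tx; c d ty] in Q15.16.
void map_pixel(int x, int y, const q16 m[6], int &sx, int &sy) {
    int64_t fx = qmul(m[0], x * Q_ONE) + qmul(m[1], y * Q_ONE) + m[2];
    int64_t fy = qmul(m[3], x * Q_ONE) + qmul(m[4], y * Q_ONE) + m[5];
    sx = (int)(fx >> 16);                 // truncate fractional part
    sy = (int)(fy >> 16);
}

int main() {
    // Inverse of a 30-degree rotation about the origin (illustrative values).
    const q16 m[6] = { to_q16(0.8660), to_q16(0.5),    0,
                       to_q16(-0.5),   to_q16(0.8660), 0 };
    int sx, sy;
    map_pixel(100, 50, m, sx, sy);
    printf("dst (100,50) <- src (%d,%d)\n", sx, sy);
}
```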

31 pages, 2573 KB  
Article
Hardware Design of DRAM Memory Prefetching Engine for General-Purpose GPUs
by Freddy Gabbay, Benjamin Salomon, Idan Golan and Dolev Shema
Technologies 2025, 13(10), 455; https://doi.org/10.3390/technologies13100455 - 8 Oct 2025
Viewed by 1755
Abstract
General-purpose computing on graphics processing units (GPGPU) faces significant performance limitations due to memory access latencies, particularly when traditional memory hierarchies and thread-switching mechanisms prove insufficient for the complex access patterns of data-intensive applications such as machine learning (ML) and scientific computing. This paper presents a novel hardware design for a memory prefetching subsystem targeted at DDR (Double Data Rate) memory in GPGPU architectures. The proposed prefetching subsystem features a modular architecture comprising multiple parallel prefetching engines, each handling a distinct memory address range with dedicated data buffers and adaptive stride detection algorithms that dynamically identify recurring memory access patterns. The design incorporates robust system integration features, including context flushing, watchdog timers, and flexible configuration interfaces, for runtime optimization. Comprehensive experimental validation using real-world workloads examined critical design parameters, including block sizes, prefetch outstanding limits, and throttling rates, across diverse memory access patterns. Results demonstrate significant performance improvements, with average memory access latency reductions of up to 82% compared to no-prefetch baselines and speedups in the range of 1.240–1.794. The proposed prefetching subsystem successfully enhances memory hierarchy efficiency and provides practical design guidelines for deployment in production GPGPU systems, establishing clear parameter optimization strategies for different workload characteristics.
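
As a rough illustration of the modular organization described above — parallel engines that each own an address range, limited by an outstanding-prefetch throttle — here is a hedged C++ sketch; the engine count, range split, and throttle limit are assumptions, not the paper's parameters.

```cpp
// Sketch of routing memory requests to per-range prefetch engines with an
// outstanding-prefetch throttle. All sizes and limits are illustrative.
#include <cstdint>
#include <vector>
#include <cstdio>

struct Engine {
    uint64_t last_addr = 0;
    int64_t  stride = 0;
    int      outstanding = 0;              // prefetches in flight
    static constexpr int MAX_OUTSTANDING = 4;

    void on_request(uint64_t addr, std::vector<uint64_t> &prefetches) {
        int64_t s = (int64_t)addr - (int64_t)last_addr;
        if (last_addr && s == stride && outstanding < MAX_OUTSTANDING) {
            prefetches.push_back(addr + stride);  // issue one block ahead
            outstanding++;
        }
        stride = s;
        last_addr = addr;
    }
    void on_fill() { if (outstanding) outstanding--; }  // prefetch returned
};

int main() {
    constexpr int NUM_ENGINES = 4;
    constexpr uint64_t RANGE_BITS = 28;    // 256 MiB per engine (assumed)
    Engine engines[NUM_ENGINES];
    std::vector<uint64_t> prefetches;

    // Two interleaved streams land in different engines by address range.
    for (int i = 0; i < 8; i++) {
        engines[(0x00000000ull >> RANGE_BITS) % NUM_ENGINES]
            .on_request(0x00000000ull + i * 256, prefetches);
        engines[(0x10000000ull >> RANGE_BITS) % NUM_ENGINES]
            .on_request(0x10000000ull + i * 4096, prefetches);
    }
    for (auto p : prefetches)
        printf("prefetch 0x%llx\n", (unsigned long long)p);
}
```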

20 pages, 1456 KB  
Article
DirectFS: An RDMA-Accelerated Distributed File System with CPU-Oblivious Metadata Indexing
by Lingjun Jiang, Zhaoyao Zhang, Ruixuan Ni and Miao Cai
Electronics 2025, 14(19), 3778; https://doi.org/10.3390/electronics14193778 - 24 Sep 2025
Cited by 1 | Viewed by 1143
Abstract
The rapid growth of data-intensive applications has imposed significant demands on the performance of distributed file systems, particularly for metadata operations. Traditional systems rely heavily on metadata servers to handle indexing tasks, leading to Central Processing Unit (CPU) bottlenecks and increased latency. To address these challenges, we propose Direct File System (DirectFS), a Remote Direct Memory Access (RDMA)-accelerated distributed file system that offloads metadata indexing to clients by leveraging one-sided RDMA operations. Further, we propose a range of techniques, including hash-based namespace indexing and hotness-aware metadata prefetching, to fully unleash the performance potential of RDMA hardware. We implement DirectFS on top of Moose File System (MooseFS) and compare it with state-of-the-art distributed file systems using a variety of workloads from Filebench v1.4.9.1 and from MDTest in the IOR suite v4.0.0. Evaluation results demonstrate that DirectFS achieves significant performance improvements on metadata-intensive benchmarks compared to other file systems.
(This article belongs to the Section Computer Science & Engineering)
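
The key property of hash-based namespace indexing is that a client can compute where a directory entry lives without asking the server's CPU. A hedged C++ sketch under assumed layout choices (FNV hashing, linear probing, fixed-size dentries); the one-sided RDMA read is stood in for by a local array access and none of this reflects DirectFS's actual on-wire format.

```cpp
// Client-side namespace indexing sketch: hash (parent inode, name) to a
// slot in a server-exported metadata region so a one-sided RDMA read can
// fetch the dentry. Layout and hashing are illustrative assumptions.
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <string>

struct Dentry { uint64_t parent; char name[24]; uint64_t inode; bool used; };
constexpr size_t NUM_SLOTS = 1 << 16;
static Dentry metadata_region[NUM_SLOTS];   // stands in for RDMA-mapped memory

uint64_t fnv1a(uint64_t parent, const std::string &name) {
    uint64_t h = 1469598103934665603ull;
    auto mix = [&](uint8_t b) { h = (h ^ b) * 1099511628211ull; };
    for (int i = 0; i < 8; i++) mix((uint8_t)(parent >> (8 * i)));
    for (unsigned char c : name) mix(c);
    return h;
}

void insert(uint64_t parent, const std::string &name, uint64_t inode) {
    for (uint64_t i = fnv1a(parent, name); ; i++) {       // linear probing
        Dentry &d = metadata_region[i % NUM_SLOTS];
        if (!d.used) {
            d = {parent, {}, inode, true};
            strncpy(d.name, name.c_str(), sizeof(d.name) - 1);
            return;
        }
    }
}

// Lookup = compute the slot locally, then (conceptually) one RDMA read
// per probe, with no server CPU involvement.
uint64_t lookup(uint64_t parent, const std::string &name) {
    for (uint64_t i = fnv1a(parent, name); ; i++) {
        const Dentry &d = metadata_region[i % NUM_SLOTS];
        if (!d.used) return 0;                            // not found
        if (d.parent == parent && name == d.name) return d.inode;
    }
}

int main() {
    insert(/*root inode*/1, "data.bin", 42);
    printf("inode of /data.bin = %llu\n",
           (unsigned long long)lookup(1, "data.bin"));
}
```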

19 pages, 8359 KB  
Article
A Generalized Optimization Scheme for Memory-Side Prefetching to Enhance System Performance
by Yuzhi Zhuang, Ming Zhang and Binghao Wang
Electronics 2025, 14(14), 2811; https://doi.org/10.3390/electronics14142811 - 12 Jul 2025
Cited by 1 | Viewed by 1487 | Correction
Abstract
In modern multi-core processors, memory request latency critically constrains overall performance. Prefetching is a promising technique that mitigates memory access latency by pre-loading data into faster cache structures. However, existing core-side prefetchers lack visibility into the DRAM state and may issue suboptimal requests, while conventional memory-side prefetchers often default to simple next-line policies that miss complex access patterns. We propose a comprehensive memory-side prefetch optimization scheme comprising a prefetcher that utilizes advanced prefetching algorithms and an optimization module. The prefetcher detects more complex memory access patterns, improving both prefetch accuracy and coverage, while the optimization module, informed by DRAM memory access characteristics, minimizes the negative impact of prefetch requests on DRAM by coordinating them with memory operations. The prefetcher also works in conjunction with core-side prefetchers to form a multi-level prefetching hierarchy, enabling further gains through coordinated and complementary prefetching strategies. Simulation results using Gem5 and SPEC CPU2017 workloads show that our approach delivers an average performance improvement of 10.5% and reduces memory access latency by 61%.
(This article belongs to the Special Issue Computer Architecture & Parallel and Distributed Computing)
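
One concrete form of the DRAM-state coordination a memory-side prefetcher enables (and a core-side one cannot) is issuing a prefetch only when it hits the currently open row of its bank, so it never forces a row conflict. A hedged C++ sketch; the address mapping, row size, and policy are illustrative assumptions, not the paper's optimization module.

```cpp
// Row-buffer-aware prefetch filtering sketch: a prefetch candidate is
// issued only if it falls in the row its bank currently has open.
#include <cstdint>
#include <cstdio>

constexpr uint64_t ROW_BITS = 13;          // 8 KiB rows (assumed)
constexpr int NUM_BANKS = 8;
uint64_t open_row[NUM_BANKS];              // row currently open in each bank

// Row-interleaved bank mapping (assumed): consecutive rows alternate banks.
uint64_t row_of(uint64_t addr)  { return addr >> ROW_BITS; }
int      bank_of(uint64_t addr) { return (int)(row_of(addr) % NUM_BANKS); }

// Prefetch candidate: only sent on a row-buffer hit.
bool try_prefetch(uint64_t candidate) {
    return open_row[bank_of(candidate)] == row_of(candidate);
}

int main() {
    uint64_t demand = 0x40080;
    open_row[bank_of(demand)] = row_of(demand);      // demand opened this row

    uint64_t near = demand + 64;                     // same row: cheap
    uint64_t far  = demand + (1 << ROW_BITS);        // row conflict: dropped
    printf("prefetch 0x%llx: %s\n", (unsigned long long)near,
           try_prefetch(near) ? "issue" : "drop");
    printf("prefetch 0x%llx: %s\n", (unsigned long long)far,
           try_prefetch(far) ? "issue" : "drop");
}
```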

20 pages, 2143 KB  
Article
Greedy Prefetch for Reducing Off-Chip Memory Accesses in Convolutional Neural Network Inference
by Dengtian Yang and Lan Chen
Information 2025, 16(3), 164; https://doi.org/10.3390/info16030164 - 21 Feb 2025
Viewed by 2102
Abstract
The high parameter and memory access demands of CNNs highlight the need to reduce off-chip memory accesses. While recent approaches have improved data reuse to lessen these accesses, simple and efficient prefetching methods are still lacking. This paper introduces a greedy prefetch method that exploits data repetition to optimize the prefetching route, thus decreasing off-chip memory accesses. The method is also implemented in a hardware simulator to organize a deployment strategy with additional optimizations. Our deployment strategy outperforms recent works, with a maximum data reuse improvement of 1.98×.
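
The greedy idea — always fetch next the tile that shares the most data with what is already on chip, so repeated data is not re-read from off-chip memory — can be sketched compactly. The 1-D tile-overlap model below is invented for illustration; the paper's cost model and route construction differ.

```cpp
// Hedged sketch of greedy prefetch ordering by data repetition: pick the
// remaining tile with maximum overlap with the current one.
#include <vector>
#include <algorithm>
#include <cstdio>

struct Tile { int id; int start, end; };   // 1-D address interval [start, end)

int overlap(const Tile &a, const Tile &b) {
    return std::max(0, std::min(a.end, b.end) - std::max(a.start, b.start));
}

std::vector<int> greedy_order(std::vector<Tile> tiles) {
    std::vector<int> order;
    Tile cur = tiles.front();                 // start anywhere
    while (!tiles.empty()) {
        auto best = std::max_element(tiles.begin(), tiles.end(),
            [&](const Tile &a, const Tile &b) {
                return overlap(cur, a) < overlap(cur, b); });
        order.push_back(best->id);
        cur = *best;
        tiles.erase(best);
    }
    return order;
}

int main() {
    // Overlapping input tiles of a convolution layer (illustrative).
    std::vector<Tile> tiles = {{0, 0, 100}, {1, 80, 180},
                               {2, 300, 400}, {3, 160, 260}};
    for (int id : greedy_order(tiles)) printf("fetch tile %d\n", id);
}
```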

19 pages, 3425 KB  
Article
A Deadlock-Free Deterministic–Adaptive Hybrid Routing Algorithm for Efficient Network-on-Chip Communication
by Ning Ji and Yintang Yang
Electronics 2025, 14(5), 845; https://doi.org/10.3390/electronics14050845 - 21 Feb 2025
Cited by 2 | Viewed by 1513
Abstract
In the era of multi-core technology, efficient communication among numerous IP cores has become a critical challenge. Network-on-chip (NoC) technology provides a scalable and effective solution, attracting significant attention in academia and industry. This paper introduces a novel deterministic–adaptive hybrid routing (DAHR) algorithm designed to enhance performance while ensuring deadlock-free operation. The DAHR algorithm leverages pre-fetched deterministic information and real-time congestion feedback from neighboring nodes to make dynamic routing decisions. Before packet injection, the source–destination positional relationship and required hops are pre-calculated and encoded into the packet's head flit. Routing decisions are then based on the availability of free virtual channels in the determined directions, eliminating the need for a complex routing calculation unit. Simulation results demonstrate that DAHR reduces average packet delay by at least 5.8% and improves saturation throughput by at least 9.0% compared to conventional routing schemes without introducing additional hardware overhead.
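
The per-hop decision the abstract describes — pre-computed hop counts in the head flit, adaptive selection among productive directions by free virtual channels — fits in a few lines. A hedged C++ sketch of that decision logic only; field widths, the tie-break rule, and the deadlock-avoidance machinery are assumptions not taken from the paper.

```cpp
// Sketch of a DAHR-style route decision: use hop counts pre-computed at
// the source, and adapt between the two productive directions using the
// neighbors' free-virtual-channel counts.
#include <cstdio>

enum Dir { EAST, WEST, NORTH, SOUTH };

struct HeadFlit { int hops_x, hops_y; Dir dir_x, dir_y; }; // set at injection

// free_vc[d]: free virtual channels reported by the neighbor in direction d.
Dir route(const HeadFlit &f, const int free_vc[4]) {
    if (f.hops_x == 0) return f.dir_y;        // only one productive direction
    if (f.hops_y == 0) return f.dir_x;
    // Both X and Y are productive: follow the congestion feedback and
    // prefer the direction with more free virtual channels.
    return free_vc[f.dir_x] >= free_vc[f.dir_y] ? f.dir_x : f.dir_y;
}

int main() {
    // Packet needs 3 hops east and 2 hops north; the east link is congested.
    HeadFlit f{3, 2, EAST, NORTH};
    int free_vc[4] = {0, 2, 3, 1};            // EAST has 0 free VCs, NORTH 3
    printf("chosen direction: %d (0=E,1=W,2=N,3=S)\n", route(f, free_vc));
}
```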

14 pages, 2007 KB  
Article
Hardware-Assisted Low-Latency NPU Virtualization Method for Multi-Sensor AI Systems
by Jong-Hwan Jean and Dong-Sun Kim
Sensors 2024, 24(24), 8012; https://doi.org/10.3390/s24248012 - 15 Dec 2024
Cited by 1 | Viewed by 3656
Abstract
Recently, AI systems such as autonomous driving and smart homes have become integral to daily life. Intelligent multi-sensors, once limited to single data types, now process complex text and image data, demanding faster and more accurate processing. While integrating NPUs and sensors has improved processing speed and accuracy, challenges like low resource utilization and long memory latency remain. This study proposes a method to reduce processing time and improve resource utilization by virtualizing NPUs to simultaneously handle multiple deep-learning models, leveraging a hardware scheduler and data prefetching techniques. Experiments with 30,000 SA resources showed that the hardware scheduler reduced memory cycles by over 10% across all models, with reductions of 30% for NCF and 70% for DLRM. The hardware scheduler effectively minimized memory latency and idle NPU resources in resource-constrained environments with frequent context switching. This approach is particularly valuable for real-time applications like autonomous driving, enabling smooth transitions between tasks such as object detection and route planning. It also enhances multitasking in smart homes by reducing latency when managing diverse data streams. The proposed system is well suited for resource-constrained environments that demand efficient multitasking and low-latency processing.
(This article belongs to the Section Intelligent Sensors)
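
The overlap this kind of system targets — prefetching the next scheduled model's data while the current model computes, so a context switch does not stall on memory — can be shown schematically. A toy C++ sketch; the queue policy, model names as workload stand-ins, and timing are illustrative assumptions about any NPU virtualization layer, not this paper's hardware scheduler.

```cpp
// Toy sketch: while one model's layer executes, prefetch the weights of
// the next scheduled task to hide its memory latency behind compute.
#include <cstdio>
#include <queue>
#include <string>

struct Task { std::string model; int layer; };

void prefetch_weights(const Task &t) {
    printf("  prefetch: weights of %s layer %d\n", t.model.c_str(), t.layer);
}
void execute(const Task &t) {
    printf("execute:  %s layer %d\n", t.model.c_str(), t.layer);
}

int main() {
    // Two models time-share one (virtualized) NPU, round-robin by layer.
    std::queue<Task> q;
    for (int l = 0; l < 3; l++) {
        q.push({"NCF", l});
        q.push({"DLRM", l});
    }
    while (!q.empty()) {
        Task cur = q.front(); q.pop();
        if (!q.empty()) prefetch_weights(q.front()); // hide next task's loads
        execute(cur);                                // ... behind this compute
    }
}
```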

22 pages, 762 KB  
Article
BTIP: Branch Triggered Instruction Prefetcher Ensuring Timeliness
by Wenhai Lin, Yiquan Lin, Yiquan Chen, Shishun Cai, Zhen Jin, Jiexiong Xu, Yuzhong Zhang and Wenzhi Chen
Electronics 2024, 13(21), 4323; https://doi.org/10.3390/electronics13214323 - 4 Nov 2024
Viewed by 2348
Abstract
In CPU microarchitecture, caches store frequently accessed instructions and data by exploiting their locality, reducing memory access latency and improving application performance. However, contemporary applications with large code footprints often experience frequent Icache misses, which significantly degrade performance. Although Fetch-Directed Instruction Prefetching (FDIP) has been widely adopted in commercial processors to reduce Icache misses, our analysis reveals that FDIP still suffers from Icache misses caused by branch mispredictions and late prefetches, leaving considerable opportunity for performance optimization. Priority-Directed Instruction Prefetching (PDIP) has been proposed to reduce the Icache misses caused by branch mispredictions in FDIP, but it neglects misses due to late prefetches and suffers from high storage overhead. In this paper, we propose a branch-triggered instruction prefetcher (BTIP), which prefetches the Icache lines that FDIP cannot handle efficiently, including those missed due to branch misprediction and late prefetch. We also introduce a novel Branch Target Buffer (BTB) organization, BTIP BTB, which stores prefetch metadata and reuses information from existing BTB entries, effectively reducing storage overhead. We implemented BTIP on the ChampSim simulator and evaluated it in detail using traces from the 1st Instruction Prefetching Championship (IPC-1). Our evaluation shows that BTIP outperforms both FDIP and PDIP: BTIP reduces Icache misses by 38.0% and improves performance by 5.1% compared to FDIP, and it outperforms PDIP by 1.6% while using only 41.9% of the storage space PDIP requires.
(This article belongs to the Special Issue Computer Architecture & Parallel and Distributed Computing)
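
The branch-triggered idea can be pictured as attaching a little prefetch metadata to branch entries: lines that missed after this branch in the past get prefetched as soon as the branch is next seen, ahead of the fetch stream. A hedged C++ sketch; the table sizes, delta encoding, and training rule are assumptions, not the paper's BTIP BTB layout.

```cpp
// Minimal sketch of branch-triggered instruction prefetching: keep a few
// cache-line deltas per branch, trained on past Icache misses, and issue
// them when the branch is fetched.
#include <cstdint>
#include <cstdio>

constexpr int BTB_SETS = 1024;
constexpr int DELTAS_PER_ENTRY = 2;
constexpr uint64_t LINE = 64;

struct BtbPrefetchMeta {
    int32_t deltas[DELTAS_PER_ENTRY] = {0, 0};  // lines relative to target
    int n = 0;
};
BtbPrefetchMeta meta[BTB_SETS];

// Training: an Icache miss at miss_addr occurred shortly after this branch.
void train(uint64_t branch_pc, uint64_t target, uint64_t miss_addr) {
    BtbPrefetchMeta &m = meta[(branch_pc >> 2) % BTB_SETS];
    int32_t delta = (int32_t)((miss_addr / LINE) - (target / LINE));
    if (m.n < DELTAS_PER_ENTRY) m.deltas[m.n++] = delta;
}

// Prediction time: the branch is fetched; prefetch its recorded miss lines.
void on_branch(uint64_t branch_pc, uint64_t target) {
    BtbPrefetchMeta &m = meta[(branch_pc >> 2) % BTB_SETS];
    for (int i = 0; i < m.n; i++)
        printf("prefetch Icache line 0x%llx\n",
               (unsigned long long)((target / LINE + m.deltas[i]) * LINE));
}

int main() {
    train(0x40100, 0x48000, 0x48180);   // historically misses 6 lines past target
    on_branch(0x40100, 0x48000);
}
```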

21 pages, 7471 KB  
Article
Vectorization Programming Based on HR DSP Using SIMD
by Chunhu Xie, Huachun Wu and Jian Zhou
Electronics 2023, 12(13), 2922; https://doi.org/10.3390/electronics12132922 - 3 Jul 2023
Cited by 3 | Viewed by 2919
Abstract
Single instruction multiple data (SIMD) vector extension has become an essential feature of high-performance processors. Architectures such as x86, ARM, MIPS, and PowerPC have specific vector extension instruction sets and SIMD micro-architectures. SIMD vectorization programming can significantly improve the performance of application algorithms while keeping hardware overhead low. Other methods can further enhance algorithm performance, such as selecting the best SIMD vectorization model for the algorithm, ensuring sufficient instruction streams, implementing reasonable and effective cache data prefetching, and aligning data access and storage addresses according to instruction characteristics. The goal of this paper is three-fold. First, we introduce the basic structural characteristics of a general RISC processor, Hua Rui (HR) DSP, whose custom vector instruction set is compatible with the MIPS64 fixed-point and floating-point instruction sets, as well as a Fei Teng (FT) processor compatible with the ARMv8 instruction set. Second, we summarize the fundamental principles of SIMD vectorization programming design for the HR DSP, offering guidance to other researchers and engineers studying algorithm performance under SIMD vectorization optimization. Third, we implement representative algorithms on the HR and FT platforms. The experimental results show that SIMD vectorization optimization following the vector programming design principles summarized in this article improves single-core performance over scalar implementations (without vectorization, instruction streaming, or cache data prefetching) by 4–22 times for mean filter, accumulation, and matrix–matrix multiplication, significantly better than the 3–13 times improvement achieved on the FT platform. Moreover, matrix–matrix multiplication using the best vectorization model on the HR platform performs about 84% better than with the common SIMD vectorization model.
(This article belongs to the Special Issue High-Performance Computing and Its Applications)
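
Two of the principles the abstract lists — keeping the instruction stream full with independent accumulators, and prefetching ahead of the access pointer — can be shown in portable C++. GCC/Clang's `__builtin_prefetch` stands in for a DSP prefetch instruction; the unroll factor and prefetch distance are assumptions, and this does not use the HR DSP's custom vector instructions.

```cpp
// Sketch of two vectorization-friendly principles: multiple independent
// accumulators (sufficient instruction streams) plus explicit prefetch.
#include <cstddef>
#include <cstdio>
#include <vector>

float sum_accumulate(const float *x, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;      // 4 streams hide FP add latency
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __builtin_prefetch(x + i + 64);        // ~16 iterations ahead (assumed)
        s0 += x[i];     s1 += x[i + 1];
        s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];             // scalar tail
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    printf("sum = %.0f\n", sum_accumulate(v.data(), v.size()));
}
```

With independent accumulators the loop also auto-vectorizes cleanly, since each partial sum forms its own dependence chain.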

24 pages, 7561 KB  
Article
Efficient Management and Scheduling of Massive Remote Sensing Image Datasets
by Jiankun Zhu, Zhen Zhang, Fei Zhao, Haoran Su, Zhengnan Gu and Leilei Wang
ISPRS Int. J. Geo-Inf. 2023, 12(5), 199; https://doi.org/10.3390/ijgi12050199 - 13 May 2023
Cited by 2 | Viewed by 2895
Abstract
The rapid development of remote sensing image sensor technology has led to exponential increases in available image data. The real-time scheduling of gigabyte-level images and the storage and management of massive image datasets are incredibly challenging for current hardware, networking and storage systems. This paper proposes three novel strategies (ring caching, multi-threading and tile-prefetching mechanisms) designed to comprehensively optimize the remote sensing image scheduling process from the image retrieval, transmission and visualization perspectives. A novel remote sensing image management and scheduling system (RSIMSS) is designed using these three strategies as its core algorithms, the PostgreSQL database and HDFS distributed file system as its underlying storage system, and the multilayer Hilbert spatial index and image tile pyramid to organize massive remote sensing image datasets. Test results show that the RSIMSS provides efficient and stable image storage performance and allows real-time image scheduling and view roaming.
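
View-driven tile prefetching of the kind that makes "view roaming" smooth is simple to sketch: when the viewport settles, queue the ring of tiles just outside it so panning hits warm cache. The one-tile margin and the cache structure below are illustrative assumptions, not RSIMSS's ring-cache design.

```cpp
// Sketch of viewport-margin tile prefetching for a tiled image pyramid.
#include <cstdio>
#include <set>
#include <utility>

using TileId = std::pair<int, int>;          // (row, col) at a fixed zoom level

std::set<TileId> cached;                     // tiles already resident

void prefetch_margin(int r0, int r1, int c0, int c1) {
    // One-tile ring around the visible window [r0,r1] x [c0,c1].
    for (int r = r0 - 1; r <= r1 + 1; r++)
        for (int c = c0 - 1; c <= c1 + 1; c++) {
            bool visible = (r >= r0 && r <= r1 && c >= c0 && c <= c1);
            if (!visible && r >= 0 && c >= 0 && !cached.count({r, c})) {
                cached.insert({r, c});       // would enqueue an async fetch
                printf("prefetch tile (%d,%d)\n", r, c);
            }
        }
}

int main() {
    prefetch_margin(10, 12, 20, 23);         // viewport: rows 10-12, cols 20-23
}
```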

19 pages, 1269 KB  
Article
A Highly Pipelined and Highly Parallel VLSI Architecture of CABAC Encoder for UHDTV Applications
by Chen Fu, Heming Sun, Zhiqiang Zhang and Jinjia Zhou
Sensors 2023, 23(9), 4293; https://doi.org/10.3390/s23094293 - 26 Apr 2023
Cited by 2 | Viewed by 3191
Abstract
Recently, specifically designed video codecs have been preferred due to the expansion of video data in Internet of Things (IoT) devices. Context Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module widely used in recent video coding standards such as HEVC/H.265 and VVC/H.266. CABAC is a well-known throughput bottleneck due to its strong data dependencies: because the context model required by the current bin often depends on the result of the previous bin, the context model cannot be prefetched early enough, which results in pipeline stalls. To solve this problem, we propose a prediction-based context model prefetching strategy, effectively eliminating the clock cycles the context model spends accessing data in memory. We also offer a multi-result context model update (MCMU) to reduce the critical path delay of context model updates in the multi-bin/clock architecture. Furthermore, we apply pre-range-update and pre-renormalization techniques to reduce the path delay of the multiplexed BAE arising from steps that do not fully depend on the preceding encoding process. To further speed up processing, we propose handling four regular bins and several bypass bins in parallel with a variable bypass bin incorporation (VBBI) technique. Finally, a quad-loop cache is developed to improve the compatibility of data interactions between the entropy encoder and other video encoder modules. As a result, the pipeline architecture based on the context model prefetching strategy removes up to 45.66% of the coding time caused by regular-bin stalls, and the parallel architecture saves 29.25% of the coding time spent on model updates on average when the Quantization Parameter (QP) equals 22. At the same time, the throughput of our proposed parallel architecture reaches 2191 Mbin/s, sufficient for 8K Ultra High Definition Television (UHDTV). Additionally, the hardware efficiency (Mbin/s per k gates) of the proposed architecture is higher than that of existing advanced pipeline and parallel architectures.
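
The dependency being broken here is that bin i+1's context cannot be fetched until bin i's value is known. A toy C++ sketch of the prediction idea — fetch the context for the predicted bin value early, paying a stall only on a misprediction; the context-selection rule and prediction source (the current context's most-probable symbol) are invented for illustration and are not the paper's strategy.

```cpp
// Toy model of prediction-based context prefetching for CABAC-like coding.
#include <cstdio>

struct Context { int state; int mps; };           // probability state + MPS
Context ctx_mem[64];                               // context model memory

// Hypothetical rule: the next context index depends on the bin just coded.
int next_ctx_index(int cur_idx, int bin) { return (2 * cur_idx + bin) % 64; }

int main() {
    ctx_mem[5] = {12, 1};
    int cur = 5;
    for (int step = 0; step < 4; step++) {
        int predicted_bin = ctx_mem[cur].mps;                  // predict bin
        int prefetched    = next_ctx_index(cur, predicted_bin); // fetch early
        int actual_bin    = (step == 2) ? 1 - predicted_bin : predicted_bin;
        if (actual_bin == predicted_bin) {
            printf("step %d: hit, context %d ready\n", step, prefetched);
            cur = prefetched;
        } else {                                  // misprediction: one stall
            cur = next_ctx_index(cur, actual_bin);
            printf("step %d: miss, stall to fetch context %d\n", step, cur);
        }
    }
}
```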

19 pages, 5007 KB  
Article
Adaptive Regression Prefetching Algorithm by Using Big Data Application Characteristics
by Mengzhao Zhang, Qian Tang, Jeong-Geun Kim, Bernd Burgstaller and Shin-Dug Kim
Appl. Sci. 2023, 13(7), 4436; https://doi.org/10.3390/app13074436 - 31 Mar 2023
Cited by 2 | Viewed by 3356
Abstract
This paper presents an innovative prefetching algorithm for a hybrid main memory structure consisting of DRAM and phase-change memory (PCM). To enhance the efficiency of hybrid memory hardware in serving big data technologies, the proposed design employs an application-adaptive algorithm based on big data execution characteristics. Specifically optimized for graph-processing applications, which exhibit complex and irregular memory access patterns, a dual prefetching scheme is proposed. This scheme comprises a fast-response model with low-cost algorithms for regular memory access patterns and an intelligent model built on an adaptive Gaussian-kernel-based machine-learning prefetch engine. The intelligent model acquires knowledge from real-time data samples, capturing distinct memory access patterns via an adaptive Gaussian-kernel-based regression algorithm. These methods allow the model to self-adjust its hyperparameters at runtime, facilitating locally weighted regression (LWR) for the Gaussian process of irregular access patterns. In addition, we introduce an efficient hybrid main memory architecture that integrates the two memory technologies, DRAM and PCM, providing cost and energy efficiency over a DRAM-only memory structure. Based on simulation-based experimental results, our proposed model achieved a performance enhancement of 57% compared to the conventional DRAM model and of approximately 12% compared to existing prefetcher-based models.
(This article belongs to the Special Issue Application Research in Big Data Technologies)
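
Locally weighted regression with a Gaussian kernel is compact enough to show directly: fit a local line around the query point, weighting nearby samples more. A hedged C++ sketch predicting the next address offset from recent samples; the bandwidth, window, and feature choice are assumptions, not the paper's engine.

```cpp
// Gaussian-kernel LWR sketch: weighted least squares of y = a + b*t
// around a query point tq, used here to extrapolate an access stream.
#include <cmath>
#include <cstdio>
#include <vector>

double lwr_predict(const std::vector<double> &t, const std::vector<double> &y,
                   double tq, double bandwidth) {
    double sw = 0, swt = 0, swtt = 0, swy = 0, swty = 0;
    for (size_t i = 0; i < t.size(); i++) {
        double d = (t[i] - tq) / bandwidth;
        double w = std::exp(-0.5 * d * d);         // Gaussian kernel weight
        sw += w; swt += w * t[i]; swtt += w * t[i] * t[i];
        swy += w * y[i]; swty += w * t[i] * y[i];
    }
    double det = sw * swtt - swt * swt;
    double a = (swtt * swy - swt * swty) / det;    // local intercept
    double b = (sw * swty - swt * swy) / det;      // local slope
    return a + b * tq;
}

int main() {
    // Access number vs. address (cache-line units); the stride drifts.
    std::vector<double> t = {0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<double> y = {0, 8, 16, 24, 32, 48, 64, 80};
    double next = lwr_predict(t, y, /*tq=*/8.0, /*bandwidth=*/2.0);
    printf("predicted next line offset ~ %.1f\n", next);
}
```

The bandwidth controls locality: a small bandwidth tracks the recent drift in stride, a large one averages over the whole window — which is precisely the hyperparameter a runtime-adaptive scheme would tune.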

17 pages, 3942 KB  
Article
BIOS-Based Server Intelligent Optimization
by Xianxian Qi, Jianfeng Yang, Yiyang Zhang and Baonan Xiao
Sensors 2022, 22(18), 6730; https://doi.org/10.3390/s22186730 - 6 Sep 2022
Cited by 2 | Viewed by 2317
Abstract
Servers are the infrastructure of enterprise applications, and improving server performance under fixed hardware resources is an important issue. Conducting performance tuning at the application layer is common, but it is not systematic and requires prior knowledge of the running application. Some works have performed tuning by dynamically adjusting the hardware prefetching configuration with a predictive model. Similarly, we design a BIOS (Basic Input/Output System)-based dynamic tuning framework for a Taishan 2280 server, comprising dynamic identification and static optimization. We simulate five workload scenarios (CPU-intensive, etc.) with benchmark tools and perform scenario recognition dynamically with performance monitor counters (PMCs). The adjustable configuration space provided by the Kunpeng processor reaches 2^N (N > 100) combinations, so we propose a joint BIOS optimization algorithm using a deep Q-network. Configuration optimization is modeled as a Markov decision process that starts from a feasible solution and optimizes gradually; to improve the continuous optimization capability, a neighborhood search method under state machine control is added. To assess its performance, we compare our algorithm with the genetic algorithm and particle swarm optimization. Our algorithm improves performance by up to 1.10× compared to the experience-based configuration and performs better at reducing the probability of server downtime. The dynamic tuning framework is extensible, can be trained to adapt to different scenarios, and is well suited to servers with many adjustable configurations. Compared with heuristic intelligent search algorithms, the proposed joint BIOS optimization algorithm generates fewer infeasible solutions and is less easily disturbed by initialization.
(This article belongs to the Section Sensor Networks)
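
The shape of the optimization loop — a bit-vector configuration state, a move to a neighboring configuration, keep it if measured performance improves — can be shown with a greedy stand-in for the deep-Q policy. A toy C++ sketch; the synthetic score function replaces running a real benchmark, and nothing here reflects the paper's network or state machine.

```cpp
// Toy neighborhood search over a BIOS-flag bit-vector: flip one flag,
// keep the change only if the (synthetic) benchmark score improves.
#include <bitset>
#include <cstdio>
#include <random>

constexpr int N = 16;                              // adjustable BIOS flags

// Synthetic benchmark score; a real system would run a workload here.
double score(const std::bitset<N> &cfg) {
    double s = 0;
    for (int i = 0; i < N; i++) s += cfg[i] ? (i % 3 ? 1.0 : -0.5) : 0;
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::bitset<N> cfg;                            // baseline: all flags off
    double best = score(cfg);
    for (int step = 0; step < 200; step++) {
        int bit = pick(rng);
        cfg.flip(bit);                             // propose a neighbor
        double s = score(cfg);
        if (s > best) best = s;                    // accept improvement
        else cfg.flip(bit);                        // reject: undo the flip
    }
    printf("best score %.1f with config %s\n", best, cfg.to_string().c_str());
}
```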

28 pages, 10517 KB  
Article
Scalable Post-Processing of Large-Scale Numerical Simulations of Turbulent Fluid Flows
by Christian Lagares, Wilson Rivera and Guillermo Araya
Symmetry 2022, 14(4), 823; https://doi.org/10.3390/sym14040823 - 14 Apr 2022
Cited by 7 | Viewed by 3309
Abstract
Military, space, and high-speed civilian applications will continue contributing to the renewed interest in compressible, high-speed turbulent boundary layers. To further complicate matters, these flows present complex computational challenges ranging from the pre-processing to the execution and subsequent post-processing of large-scale numerical simulations. Exploring more complex geometries at higher Reynolds numbers will demand scalable post-processing. At the same time, application developers and scientists face increasingly diversified and heterogeneous computing hardware, which significantly complicates the development of performance-portable applications. To address these challenges, we propose Aquila, a distributed, out-of-core, performance-portable post-processing library for large-scale simulations. It is designed to relieve domain experts of the burden of writing applications targeted at heterogeneous, high-performance computers while retaining strong scaling performance. We provide two implementations, in C++ and Python, and demonstrate their strong scaling and their ability to reach 60% of peak memory bandwidth and 98% of peak filesystem bandwidth while operating out of core. We also present our approach to optimizing two-point correlations by exploiting symmetry in Fourier space. A key distinction of the proposed design is an out-of-core data pre-fetcher that gives the illusion of in-memory availability of files, yielding up to a 46% improvement in program runtime. Furthermore, we demonstrate a parallel efficiency greater than 70% for highly threaded workloads.
(This article belongs to the Special Issue Turbulence and Multiphase Flows and Symmetry)
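
The out-of-core pre-fetcher idea reduces to double buffering: a background task reads chunk k+1 from disk while the main thread processes chunk k. A hedged C++ sketch; the file name, chunk size, and single-file layout are illustrative assumptions, and this is not Aquila's API.

```cpp
// Double-buffered out-of-core reader: overlap the read of chunk k+1 with
// the processing of chunk k using std::async.
#include <cstdio>
#include <future>
#include <vector>

constexpr size_t CHUNK = 1 << 20;   // 1 MiB chunks (assumed)

std::vector<char> read_chunk(FILE *f, size_t index) {
    std::vector<char> buf(CHUNK);
    fseek(f, (long)(index * CHUNK), SEEK_SET);
    size_t got = fread(buf.data(), 1, CHUNK, f);
    buf.resize(got);                // empty at end of file
    return buf;
}

long process(const std::vector<char> &chunk) {
    long sum = 0;                   // stand-in for real post-processing work
    for (char c : chunk) sum += c;
    return sum;
}

int main() {
    FILE *f = fopen("field.dat", "rb");   // hypothetical simulation output
    if (!f) return 1;
    long total = 0;
    auto next = std::async(std::launch::async, read_chunk, f, (size_t)0);
    for (size_t k = 0; ; k++) {
        std::vector<char> cur = next.get();            // wait for chunk k
        if (cur.empty()) break;
        next = std::async(std::launch::async, read_chunk, f, k + 1); // k+1
        total += process(cur);          // overlaps with the background read
    }
    printf("checksum %ld\n", total);
    fclose(f);
}
```

Only one read is ever in flight, and the `FILE*` is touched by exactly one task at a time (serialized by `get()`), so no extra locking is needed in this sketch.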
