Search Results (72)

Search Parameters:
Keywords = multicore CPU

20 pages, 2989 KB  
Article
ZernikeViewer: An Open-Source Framework for Fast Simulation and Real-Time Reconstruction of Phase, Fringe, and PSF Maps
by Ilya Galaktionov
Appl. Syst. Innov. 2026, 9(3), 51; https://doi.org/10.3390/asi9030051 - 26 Feb 2026
Viewed by 17
Abstract
Zernike polynomials constitute an essential mathematical basis for representing functions defined over the unit disk. They are widely used in a diverse range of scientific and engineering disciplines, including adaptive optics for characterizing atmospheric distortions, ophthalmology for quantifying ocular aberrations, microscopy for instrument characterization and aberration correction, and optical metrology for surface profiling. This paper introduces ZernikeViewer, a software framework developed for the rapid calculation and visualization of fringe, phase, and point spread function (PSF) maps from Zernike coefficients. The framework leverages CPU multicore and multithreading capabilities through the .NET Task Parallel Library (TPL), augmented by codebase optimizations and the preloading of precomputed Zernike polynomial matrices. These optimizations reduce computation time by a factor of 7 to 10 compared to a conventional approach; for instance, from 1 ms to 0.1 ms for a radial order of n = 10 and from 700 ms to 80 ms for n = 100. Numerical error analysis confirms the accuracy of the computation, with an average root-mean-square (RMS) error of 0.11 ms observed in the timing measurements. Furthermore, it is demonstrated that implementing Jacobi recursion relations could potentially reduce the numerical calculation error by up to 5 orders of magnitude.
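ZernikeViewer itself is a .NET application built on the Task Parallel Library, but the core of its speedup — precompute the Zernike basis over the grid once, so every later phase map is a single matrix product — is language-independent. A minimal NumPy sketch of that idea (the mode list and grid size are arbitrary choices, not the paper's):

```python
import numpy as np
from math import factorial

def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m(rho) via the explicit sum formula."""
    R = np.zeros_like(rho)
    for k in range((n - abs(m)) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k)
                * factorial((n + abs(m)) // 2 - k)
                * factorial((n - abs(m)) // 2 - k)))
        R += c * rho ** (n - 2 * k)
    return R

def build_basis(modes, rho, theta):
    """Evaluate every (n, m) mode on the grid once; reuse for all coefficient sets."""
    cols = []
    for n, m in modes:
        R = zernike_radial(n, m, rho)
        cols.append(R * (np.cos(m * theta) if m >= 0 else np.sin(-m * theta)))
    return np.stack(cols, axis=-1)                    # shape (H, W, n_modes)

N = 64
y, x = np.mgrid[-1:1:N * 1j, -1:1:N * 1j]             # grid over the unit square
rho, theta = np.hypot(x, y), np.arctan2(y, x)
modes = [(2, 0), (2, 2), (3, 1)]                      # defocus, astigmatism, coma
Z = build_basis(modes, rho, theta)                    # precomputed basis matrix
coeffs = np.array([0.5, -0.2, 0.1])
phase = Z @ coeffs                                    # one matvec per phase map
```

With the basis `Z` cached, changing the coefficients and recomputing `phase` costs one dense matrix–vector product, which is the effect the paper's precomputed-matrix optimization exploits.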

20 pages, 1035 KB  
Article
Multi-Level Parallel CPU Execution Method for Accelerated Portion-Based Variant Call Format Data Processing
by Lesia Mochurad, Ivan Tsmots, Vita Mostova and Karina Kystsiv
Computation 2026, 14(2), 48; https://doi.org/10.3390/computation14020048 - 8 Feb 2026
Viewed by 274
Abstract
This paper proposes and experimentally evaluates a multi-level CPU-oriented execution method for high-throughput portion-based processing of file-backed Variant Call Format (VCF) data and automated mutation classification. The approach is based on a formally defined local processing scheme and integrates three coordinated levels of parallelism: block-based partitioning of file-backed VCF portions read sequentially into localized fragments with data-level parallel processing; task-level decomposition of feature construction into independent transformations; and execution-level specialization via JIT compilation of numerical kernels. To prevent performance degradation caused by nested parallelism, a resource-control mechanism is introduced as an execution rule that bounds effective parallelism and mitigates oversubscription, improving throughput stability on a single multi-core CPU node. Experiments on a public chromosome-17 VCF dataset for BRCA1-region pathogenicity classification demonstrate that the proposed multi-level local CPU execution (parsing/filtering, feature construction, and JIT-specialized numeric kernels) reduces runtime from 291.25 s (sequential) to 73.82 s, yielding a 3.95× speedup. When combined with resource-coordinated parallel model training, the end-to-end runtime further decreases to 51.18 s, corresponding to a 5.69× speedup, while preserving classification quality (accuracy 0.8483, precision 0.8758, recall 0.8261, F1 0.8502). A stage-wise ablation analysis quantifies the contribution of each execution level and confirms consistent scaling under resource-bounded execution.
(This article belongs to the Section Computational Engineering)
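The paper's three-level execution method is specific to its VCF pipeline, but its resource-control rule — bound the outer worker count so that outer × inner threads never exceeds the core count — can be sketched generically. In this hypothetical Python sketch, `classify_chunk` and its threshold are placeholders for the real parse/feature/JIT stages, not the authors' model:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def classify_chunk(quals):
    # Placeholder for the real per-portion work (VCF parsing, feature
    # construction, JIT-compiled numeric kernel); the threshold is invented.
    return ["candidate" if q < 30 else "benign" for q in quals]

def process_portions(portions):
    ncpu = os.cpu_count() or 1
    outer = min(ncpu, len(portions))                  # bound outer parallelism
    # Resource-control rule: cap inner library threads (e.g. BLAS/OpenMP) so
    # that outer * inner <= ncpu, preventing oversubscription from nesting.
    os.environ["OMP_NUM_THREADS"] = str(max(1, ncpu // outer))
    with ThreadPoolExecutor(max_workers=outer) as pool:
        chunks = pool.map(classify_chunk, portions)   # one task per portion
    return [label for chunk in chunks for label in chunk]

portions = [[12, 45, 28], [60, 5], [31, 29, 55, 10]]  # toy quality scores
labels = process_portions(portions)
```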

35 pages, 451 KB  
Review
Reconfigurable SmartNICs: A Comprehensive Review of FPGA Shells and Heterogeneous Offloading Architectures
by Andrei-Alexandru Ulmămei and Călin Bîră
Appl. Sci. 2026, 16(3), 1476; https://doi.org/10.3390/app16031476 - 1 Feb 2026
Viewed by 325
Abstract
Smart Network Interface Cards (SmartNICs) represent a paradigm shift in system architecture by offloading packet processing and selected application logic from the host CPU to the network interface itself. This architectural evolution reduces end-to-end latency toward the physical limits of Ethernet while simultaneously decreasing CPU and memory bandwidth utilization. The current ecosystem comprises three principal categories of devices: (i) conventional fixed-function NICs augmented with limited offload capabilities; (ii) ASIC-based Data Processing Units (DPUs) that integrate multi-core processors and dedicated protocol accelerators; and (iii) FPGA-based SmartNIC shells—reconfigurable hardware frameworks that provide PCIe connectivity, DMA engines, Ethernet MAC interfaces, and control firmware, while exposing programmable logic regions for user-defined accelerators. This article provides a comparative survey of representative platforms from each category, with particular emphasis on open-source FPGA shells. It examines their architectural capabilities, programmability models, reconfiguration mechanisms, and support for GPU-centric peer-to-peer datapaths. Furthermore, it investigates the associated software stack, encompassing kernel drivers, user-space libraries, and control APIs. This study concludes by outlining open research challenges and future directions in RDMA-oriented data preprocessing and heterogeneous SmartNIC acceleration.
(This article belongs to the Special Issue Recent Applications of Field-Programmable Gate Arrays (FPGAs))

13 pages, 2210 KB  
Article
High-Throughput Control-Data Acquisition for Multicore MCU-Based Real-Time Control Systems Using Double Buffering over Ethernet
by Seung-Hun Lee, Duc M. Tran and Joon-Young Choi
Electronics 2026, 15(2), 469; https://doi.org/10.3390/electronics15020469 - 22 Jan 2026
Viewed by 257
Abstract
For the design, implementation, performance optimization, and predictive maintenance of high-speed real-time control systems with sub-millisecond control periods, the capability to acquire large volumes of high-rate control data in real time is required without interfering with normal control operation that is repeatedly executed in each extremely short control cycle. In this study, we propose a control-data acquisition method for high-speed real-time control systems with sub-millisecond control periods, in which control data are transferred to an external host device via Ethernet in real time. To enable the transmission of high-rate control data without disturbing the real-time control operation, a multicore microcontroller unit (MCU) is adopted, where the control task and the data transmission task are executed on separately assigned central processing unit (CPU) cores. Furthermore, by applying a double-buffering algorithm, continuous Ethernet communication without intermediate waiting time is achieved, resulting in a substantial improvement in transmission throughput. Using a control card based on TI’s multicore MCU TMS320F28388D, which consists of dual digital signal processor cores and one connectivity manager (CM) core, the proposed control-data acquisition method is implemented and an actual experimental environment is constructed. Experimental results show that the double-buffering transmission achieves a maximum throughput of 94.2 Mbps on a 100 Mbps Fast Ethernet link, providing a 38.5% improvement over the single-buffering case and verifying the high performance and efficiency of the proposed data acquisition method.
(This article belongs to the Section Industrial Electronics)
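The MCU implementation runs the control task and the transmit task on separate cores with dedicated buffers; a pure-software analogy of the same ping-pong scheme, with two Python lists standing in for the buffers and a function call standing in for the Ethernet send, looks like this (a sketch, not the TMS320F28388D firmware):

```python
import threading, queue

BUF_SIZE = 4
free_bufs = queue.Queue()    # buffers available for the control task to fill
full_bufs = queue.Queue()    # buffers ready for the transmit task to send
for _ in range(2):           # double buffering: exactly two buffers ping-pong
    free_bufs.put([])

sent = []

def control_task(n_cycles):
    buf = free_bufs.get()
    for cycle in range(n_cycles):
        buf.append(cycle)            # log one control sample per cycle
        if len(buf) == BUF_SIZE:
            full_bufs.put(buf)       # hand the full buffer to the sender...
            buf = free_bufs.get()    # ...and keep logging into the spare one
    full_bufs.put(buf)               # flush the partial buffer at shutdown
    full_bufs.put(None)              # sentinel: no more data

def transmit_task():
    while (buf := full_bufs.get()) is not None:
        sent.extend(buf)             # stand-in for an Ethernet send()
        buf.clear()
        free_bufs.put(buf)           # return the drained buffer

t1 = threading.Thread(target=control_task, args=(10,))
t2 = threading.Thread(target=transmit_task)
t1.start(); t2.start(); t1.join(); t2.join()
```

The key property is that the control task never waits for transmission: while one buffer drains over the link, the other is being filled, which is what removes the intermediate waiting time the abstract describes.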

20 pages, 4896 KB  
Article
GPU-Driven Acceleration of Wavelet-Based Autofocus for Practical Applications in Digital Imaging
by HyungTae Kim, Duk-Yeon Lee, Dongwoon Choi and Dong-Wook Lee
Appl. Sci. 2025, 15(19), 10455; https://doi.org/10.3390/app151910455 - 26 Sep 2025
Viewed by 764
Abstract
A parallel implementation of wavelet-based autofocus (WBA) is presented to accelerate recursive operations and reduce computational costs. WBA evaluates digital focus indices (DFIs) using first- or second-order moments of the wavelet coefficients in high-frequency subbands. WBA is generally accurate and reliable; however, its computational cost is high owing to biorthogonal decomposition. This study therefore parallelized the Daubechies-6 wavelet and the norms of the high-frequency subbands for the DFI. The kernels of the DFI computation were constructed with open-source platforms targeting multicore processors (MCPs) and graphics processing units (GPUs): standard C++, OpenCV, OpenMP, OpenCL, and CUDA were selected, considering hardware compatibility. The experiment was conducted using the MCP, peripheral GPUs, and CPU-resident GPUs on desktops for advanced users and on compact devices for industrial applications. The results demonstrated that GPUs provide sufficient performance for WBA even at the budget level, indicating that GPUs are advantageous for practical applications of WBA. This study also implies that budget GPUs, which often sit unused, can be valuable resources for wavelet-based processing.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
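As a concrete illustration of a wavelet-based focus index, the sketch below computes the mean absolute value of the diagonal detail subband after one decomposition level. It uses the Haar wavelet for brevity rather than the paper's Daubechies-6, and the test images are synthetic:

```python
import numpy as np

def haar_dfi(img):
    """Focus index: mean |HH| after one Haar level (Daubechies-6 in the paper)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0            # diagonal high-frequency subband
    return float(np.abs(hh).mean())       # first-order moment of the subband

rng = np.random.default_rng(7)
sharp = rng.normal(size=(64, 64))         # noise-like image: rich in high freq.
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
           + np.roll(np.roll(sharp, 1, 0), 1, 1)) / 4   # crude 2x2 box blur
```

A sharper image keeps more high-frequency energy, so its index is larger; autofocus then selects the lens position that maximizes it.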

43 pages, 2828 KB  
Article
Efficient Hybrid Parallel Scheme for Caputo Time-Fractional PDEs on Multicore Architectures
by Mudassir Shams and Bruno Carpentieri
Fractal Fract. 2025, 9(9), 607; https://doi.org/10.3390/fractalfract9090607 - 19 Sep 2025
Viewed by 1000
Abstract
We present a hybrid parallel scheme for efficiently solving Caputo time-fractional partial differential equations (CTFPDEs) with integer-order spatial derivatives on multicore CPU and GPU platforms. The approach combines a second-order spatial discretization with the L1 time-stepping scheme and employs MATLAB parfor parallelization to achieve significant reductions in runtime and memory usage. A theoretical third-order convergence rate is established under smooth-solution assumptions, and the analysis also accounts for the loss of accuracy near the initial time t=t0 caused by weak singularities inherent in time-fractional models. Unlike many existing approaches that rely on locally convergent strategies, the proposed method ensures global convergence even for distant or randomly chosen initial guesses. Benchmark problems from fractional biological models—including glucose–insulin regulation, tumor growth under chemotherapy, and drug diffusion in tissue—are used to validate the robustness and reliability of the scheme. Numerical experiments confirm near-linear speedup on up to four CPU cores and show that the method outperforms conventional techniques in terms of convergence rate, residual error, iteration count, and efficiency. These results demonstrate the method’s suitability for large-scale CTFPDE simulations in scientific and engineering applications.
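The L1 time-stepping scheme referenced in the abstract discretizes the Caputo derivative with the weights b_k = (k+1)^(1−α) − k^(1−α). A small NumPy sketch (independent of the paper's MATLAB/parfor code) that checks the formula against the exact Caputo derivative of u(t) = t:

```python
import math
import numpy as np

def caputo_l1(u, dt, alpha):
    """Caputo derivative of the samples u at the final time via the L1 scheme:
    weights b_k = (k+1)^(1-alpha) - k^(1-alpha), for 0 < alpha < 1."""
    n = len(u) - 1
    k = np.arange(n)
    b = (k + 1) ** (1 - alpha) - k ** (1 - alpha)
    du = np.diff(u)[::-1]                 # u_{n-k} - u_{n-k-1} for k = 0..n-1
    return (b @ du) * dt ** (-alpha) / math.gamma(2 - alpha)

# Check against the exact Caputo derivative of u(t) = t: t^(1-a) / Gamma(2-a).
alpha, T, N = 0.5, 1.0, 1000
t = np.linspace(0.0, T, N + 1)
approx = caputo_l1(t, T / N, alpha)
exact = T ** (1 - alpha) / math.gamma(2 - alpha)
```

For a linear u the L1 formula is exact up to floating-point error, which makes this a convenient correctness check before parallelizing the time loop.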

36 pages, 5771 KB  
Article
Improving K-Means Clustering: A Comparative Study of Parallelized Version of Modified K-Means Algorithm for Clustering of Satellite Images
by Yuv Raj Pant, Larry Leigh and Juliana Fajardo Rueda
Algorithms 2025, 18(8), 532; https://doi.org/10.3390/a18080532 - 21 Aug 2025
Cited by 1 | Viewed by 4105
Abstract
Efficient clustering of high-spatial-dimensional satellite image datasets remains a critical challenge, particularly due to the computational demands of spectral distance calculations, random centroid initialization, and sensitivity to outliers in conventional K-Means algorithms. This study presents a comprehensive comparative analysis of eight parallelized variants of the K-Means algorithm, designed to enhance clustering efficiency and reduce computational burden for large-scale satellite image analysis. The proposed parallelized implementations incorporate optimized centroid initialization for better starting-point selection, a dynamic K-Means Sharp method to detect outliers and improve cluster robustness, and a Nearest-Neighbor Iteration Calculation Reduction method to minimize redundant computations. These enhancements were applied to a test set of 114 global land cover data cubes, each comprising high-dimensional satellite images of size 3712 × 3712 × 16, and executed on a multi-core CPU architecture to leverage extensive parallel processing capabilities. Performance was evaluated across three criteria: convergence speed (iterations), computational efficiency (execution time), and clustering accuracy (RMSE). The Parallelized Enhanced K-Means (PEKM) method achieved the fastest convergence at 234 iterations and the lowest execution time of 4230 h, while maintaining consistent RMSE values (0.0136) across all algorithm variants. These results demonstrate that targeted algorithmic optimizations, combined with effective parallelization strategies, can improve the practicality of K-Means clustering for high-dimensional satellite image analysis. This work underscores the potential of improving K-Means clustering frameworks beyond hardware acceleration alone, offering scalable solutions for large-scale unsupervised image classification tasks.
(This article belongs to the Special Issue Algorithms in Multi-Sensor Imaging and Fusion)
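For orientation, the computational core that all eight variants parallelize is the Lloyd iteration: assign every pixel vector to its nearest centroid, then recompute the centroids. A vectorized NumPy sketch of one iteration on toy 2-D data (not the paper's 3712 × 3712 × 16 cubes or its enhanced variants):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: vectorized assign + update (the parallelizable core)."""
    # Pairwise squared distances via broadcasting: (n, 1, d) - (1, k, d) -> (n, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.vstack([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),    # two well-separated blobs
               rng.normal(5, 0.1, (50, 2))])
centroids = np.array([[0.5, 0.5], [4.5, 4.5]])
for _ in range(5):
    labels, centroids = kmeans_step(X, centroids)
```

Because the assignment step is independent per point, it shards naturally across cores, which is the property the parallelized variants exploit.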

19 pages, 8359 KB  
Article
A Generalized Optimization Scheme for Memory-Side Prefetching to Enhance System Performance
by Yuzhi Zhuang, Ming Zhang and Binghao Wang
Electronics 2025, 14(14), 2811; https://doi.org/10.3390/electronics14142811 - 12 Jul 2025
Cited by 1 | Viewed by 1617 | Correction
Abstract
In modern multi-core processors, memory request latency critically constrains overall performance. Prefetching, a promising technique, mitigates memory access latency by pre-loading data into faster cache structures. However, existing core-side prefetchers lack visibility into the DRAM state and may issue suboptimal requests, while conventional memory-side prefetchers often default to simple next-line policies that miss complex access patterns. We propose a comprehensive memory-side prefetch optimization scheme comprising a prefetcher that utilizes advanced prefetching algorithms and an optimization module. Our prefetcher detects more complex memory access patterns, improving both prefetch accuracy and coverage. In addition, considering the characteristics of DRAM memory access, the optimization module minimizes the negative impact of prefetch requests on DRAM by enhancing coordination with memory operations. Simulation results using Gem5 and SPEC CPU2017 workloads show that our approach delivers an average performance improvement of 10.5% and reduces memory access latency by 61%. Our prefetcher also operates in conjunction with core-side prefetchers to form a multi-level prefetching hierarchy, enabling further performance gains through coordinated and complementary prefetching strategies.
(This article belongs to the Special Issue Computer Architecture & Parallel and Distributed Computing)
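As a point of reference for "detecting access patterns", the sketch below is a classic confidence-gated stride detector — far simpler than the scheme proposed in the paper, but it shows the shape of the problem a memory-side prefetcher solves:

```python
class StridePrefetcher:
    """Tiny reference-model stride detector (single stream, confidence-gated)."""
    def __init__(self, threshold=2):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.threshold = threshold

    def access(self, addr):
        """Record a demand access; return a prefetch address once confident."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                self.confidence += 1     # pattern repeated: grow confidence
            else:
                self.confidence = 0      # pattern broken: start over
            self.last_stride = stride
            if self.confidence >= self.threshold:
                prefetch = addr + stride # predict the next line in the stream
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
issued = [pf.access(a) for a in [0x100, 0x140, 0x180, 0x1C0, 0x200]]
```

A memory-side placement additionally lets the prefetcher see DRAM row and bank state, so it can schedule these predicted requests when they are cheap — the coordination the paper's optimization module provides.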

19 pages, 11574 KB  
Article
Multiscale Eight Direction Descriptor-Based Improved SAR–SIFT Method for Along-Track and Cross-Track SAR Images
by Wei Wang, Jinyang Chen and Zhonghua Hong
Appl. Sci. 2025, 15(14), 7721; https://doi.org/10.3390/app15147721 - 10 Jul 2025
Cited by 2 | Viewed by 874
Abstract
Image matching between spaceborne synthetic aperture radar (SAR) images is frequently degraded by speckle noise, resulting in low matching accuracy, and the vast coverage of SAR images renders direct matching inefficient. To address this, the study puts forward a multi-scale adaptive improved SAR image block matching method (called STSU–SAR–SIFT). To improve accuracy, the method stabilizes the number of detected feature points across thresholds by using the SAR–Shi–Tomasi response function in a multi-scale space. The SUSAN function is then used to constrain the effect of coherent noise on the initial feature points, and a multi-scale, multi-directional GLOH descriptor construction approach is used to boost descriptor robustness. To improve efficiency, the method restricts matching to the overlapping area of the main and additional images to reduce the search range and uses multi-core CPU+GPU collaborative parallel computing to accelerate the SAR–SIFT algorithm by processing the overlapping area in blocks. The experimental results demonstrate that the STSU–SAR–SIFT approach achieves better accuracy and feature distribution, and efficiency improves markedly after acceleration.
(This article belongs to the Section Earth Sciences)

28 pages, 11862 KB  
Article
An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements
by Tresna Maulana Fahrudin, Nobuo Funabiki, Komang Candra Brata, Inzali Naing, Soe Thandar Aung, Amri Muhaimin and Dwi Arman Prasetya
Future Internet 2025, 17(5), 195; https://doi.org/10.3390/fi17050195 - 28 Apr 2025
Cited by 5 | Viewed by 4434
Abstract
Nowadays, accessibility to academic papers has improved significantly with electronic publication on the internet, where open access has become common. At the same time, this has increased the literature-survey workload for researchers, who usually download PDF files manually and check their contents. To address this, we previously proposed a reference paper collection system using web scraping and natural language models. However, that system often finds a limited number of relevant reference papers after a long runtime, since it relies on a single paper-search website and runs on a single thread of a multi-core CPU. In this paper, we present an improved reference paper collection system with three enhancements: (1) integrating the APIs of multiple paper-search websites, namely the bulk search endpoint of the Semantic Scholar API, the article search endpoint of the DOAJ API, and the search and fetch endpoints of the PubMed API, to retrieve article metadata; (2) running the program on multiple threads of a multi-core CPU; and (3) implementing Dynamic URL Redirection, Regex-based URL Parsing, and HTML Scraping with URL Extraction for fast checking of PDF file accessibility, along with sentence embedding to assess relevance based on semantic similarity. For evaluation, we compare the number of obtained reference papers and the response time between the proposal, our previous work, and common literature search tools on five reference paper queries. The results show that the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average compared to our previous work, while also outperforming common literature search tools in the number of reference papers found. Thus, the effectiveness of the proposed system has been demonstrated in our experiments.
(This article belongs to the Special Issue ICT and AI in Intelligent E-systems)
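Enhancement (2), multithreading the per-paper checks, maps naturally onto a thread pool because the work is I/O-bound. A hedged sketch with a stand-in probe function — `check_pdf_accessible` and the URLs below are illustrative, not the system's code:

```python
from concurrent.futures import ThreadPoolExecutor

def check_pdf_accessible(url):
    # Stand-in for the real HTTP probe (redirect following, HTML scraping);
    # here we fake the answer from the URL suffix so the sketch is runnable.
    return url, url.endswith(".pdf")

urls = [
    "https://example.org/paper1.pdf",
    "https://example.org/paper2.html",
    "https://example.org/paper3.pdf",
]
# I/O-bound probes overlap well under threads: while one request waits on the
# network, others make progress, so many papers are checked in flight at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    accessible = {u: ok for u, ok in pool.map(check_pdf_accessible, urls)}
```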

19 pages, 498 KB  
Article
Optimization of Direct Convolution Algorithms on ARM Processors for Deep Learning Inference
by Shang Li, Fei Yu, Shankou Zhang, Huige Yin and Hairong Lin
Mathematics 2025, 13(5), 787; https://doi.org/10.3390/math13050787 - 27 Feb 2025
Cited by 2 | Viewed by 2789
Abstract
In deep learning, convolutional layers typically bear the majority of the computational workload and are often the primary contributors to performance bottlenecks. The widely used convolution algorithm is based on the IM2COL transform to take advantage of the highly optimized GEMM (General Matrix Multiplication) kernel acceleration, using the highly optimized BLAS (Basic Linear Algebra Subroutine) library, which tends to incur additional memory overhead. Recent studies have indicated that direct convolution approaches can outperform traditional convolution implementations without additional memory overhead. In this paper, we propose a high-performance implementation of the direct convolution algorithm for inference that preserves the channel-first data layout of the convolutional layer inputs/outputs. We evaluate the performance of our proposed algorithm on a multi-core ARM CPU platform and compare it with state-of-the-art convolution optimization techniques. Experimental results demonstrate that our new algorithm performs better across the evaluated scenarios and platforms.
(This article belongs to the Special Issue Optimization Theory, Algorithms and Applications)
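For contrast with the IM2COL-plus-GEMM route, a naive direct convolution keeps the channel-first (NCHW) layout and contracts each window in place, with no auxiliary buffer. A NumPy sketch of that loop nest (the paper's optimized ARM implementation additionally blocks the loops and uses SIMD; this only shows the data layout and contraction):

```python
import numpy as np

def direct_conv2d_nchw(x, w):
    """Naive direct convolution, stride 1, no padding, NCHW layout, no IM2COL."""
    n, c_in, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    out = np.zeros((n, c_out, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]          # (n, c_in, kh, kw)
            # contract over input channels and the kernel window
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out

x = np.arange(2 * 3 * 5 * 5, dtype=float).reshape(2, 3, 5, 5)
w = np.ones((4, 3, 3, 3))
y = direct_conv2d_nchw(x, w)
```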

18 pages, 3376 KB  
Article
Heterogeneous Edge Computing for Molecular Property Prediction with Graph Convolutional Networks
by Mahdieh Grailoo and Jose Nunez-Yanez
Electronics 2025, 14(1), 101; https://doi.org/10.3390/electronics14010101 - 30 Dec 2024
Cited by 3 | Viewed by 1769
Abstract
Graph-based neural networks have proven to be useful in molecular property prediction, a critical component of computer-aided drug discovery. In this application, in response to the growing demand for improved computational efficiency and localized edge processing, this paper introduces a novel approach that leverages specialized accelerators on a heterogeneous edge computing platform. Our focus is on graph convolutional networks, a leading graph-based neural network variant that integrates graph convolution layers with multi-layer perceptrons. Molecular graphs are typically characterized by a low number of nodes, leading to low-dimensional dense matrix multiplications within multi-layer perceptrons—conditions that are particularly well-suited for Edge TPUs. These TPUs feature a systolic array of multiply–accumulate units optimized for dense matrix operations. Furthermore, the inherent sparsity in molecular graph adjacency matrices offers additional opportunities for computational optimization. To capitalize on this, we developed an FPGA GFADES accelerator, using high-level synthesis, specifically tailored to efficiently manage the sparsity in both the graph structure and node features. Our hardware/software co-designed GCN+MLP architecture delivers performance improvements, achieving up to 58× increased speed compared to conventional software implementations. This architecture is implemented using the Pynq framework and TensorFlow Lite Runtime, running on a multi-core ARM CPU within an AMD/Xilinx Zynq Ultrascale+ device, in combination with the Edge TPU and programmable logic.
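The GCN layer at the heart of this design is two dense matrix multiplications wrapped around a normalized adjacency — exactly the shape that maps onto a systolic array. A NumPy sketch of one layer on a toy 4-node molecular graph (illustrative, not the GFADES code; weights and graph are invented):

```python
import numpy as np

def gcn_layer(adj, x, w):
    """One graph-convolution layer: symmetric-normalized A, then dense matmuls."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)      # aggregate, transform, ReLU

# Toy 4-atom path graph, 3 input features per node, 2 output features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.random.default_rng(1).normal(size=(4, 3))
w = np.ones((3, 2))
h = gcn_layer(adj, x, w)
```

The `a_norm @ x` step is sparse in practice (molecular adjacency matrices are mostly zeros, handled by the FPGA accelerator), while `@ w` is the small dense multiply that suits the Edge TPU.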

17 pages, 1369 KB  
Article
Enabling Parallel Performance and Portability of Solid Mechanics Simulations Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Russell Marki, Robert Robey and Marko Knezevic
Information 2024, 15(11), 716; https://doi.org/10.3390/info15110716 - 7 Nov 2024
Cited by 3 | Viewed by 2141
Abstract
Efficiently simulating solid mechanics is vital across various engineering applications. As constitutive models grow more complex and simulations scale up in size, harnessing the capabilities of modern computer architectures has become essential for achieving timely results. This paper presents advancements in running parallel simulations of solid mechanics on multi-core CPUs and GPUs using a single-code implementation. This portability is made possible by the C++ matrix and array (MATAR) library, which interfaces with the C++ Kokkos library, enabling the selection of fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. MATAR simplifies the transition from Fortran to C++ and Kokkos, making it easier to modernize legacy solid mechanics codes. We applied this approach to modernize a suite of constitutive models and to demonstrate substantial performance improvements across different computer architectures. This paper includes comparative performance studies using multi-core CPUs along with AMD and NVIDIA GPUs. Results are presented using a hypoelastic–plastic model, a crystal plasticity model, and the viscoplastic self-consistent generalized material model (VPSC-GMM). The results underscore the potential of using the MATAR library and modern computer architectures to accelerate solid mechanics simulations.
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)

17 pages, 3630 KB  
Article
Parallel Simulations of the Sharp Wave-Ripples of the Hippocampus on Multicore CPUs and GPUs
by Emanuele Torti, Simone Migliazza, Elisa Marenzi, Giovanni Danese and Francesco Leporati
Appl. Sci. 2024, 14(21), 9967; https://doi.org/10.3390/app14219967 - 31 Oct 2024
Viewed by 1345
Abstract
The simulation of realistic systems plays a crucial role in modern science. Complex organs such as the brain can be described by mathematical models that reproduce biological behaviors. In the brain, the hippocampus is a critical region for memory and learning, and a model reproducing its memory consolidation mechanism has been proposed in the literature. This model exhibits a high degree of biological realism, accompanied, however, by a significant increase in computational complexity. This paper proposes parallel simulations targeting different devices, namely multicore CPUs and GPUs. The experiments show that biological realism is maintained, together with a significant decrease in processing times. Finally, the analysis highlights the GPU as one of the most suitable technologies for this kind of simulation.
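Neuron-population simulations of this kind parallelize well because every neuron applies the same state update each time step. The sketch below uses a generic leaky integrate-and-fire update in NumPy — a stand-in for the paper's far more realistic hippocampal model, with invented parameters — to show the data-parallel structure that both multicore CPUs and GPUs exploit:

```python
import numpy as np

def lif_step(v, i_syn, dt=0.1, tau=10.0, v_th=1.0, v_reset=0.0):
    """One vectorized leaky integrate-and-fire update for a whole population."""
    v = v + dt * (-v / tau + i_syn)       # identical update for every neuron
    spiked = v >= v_th
    return np.where(spiked, v_reset, v), spiked

rng = np.random.default_rng(42)
v = np.zeros(10_000)                      # membrane state, one entry per neuron
spike_count = 0
for _ in range(100):                      # 100 simulation steps
    v, spiked = lif_step(v, i_syn=rng.uniform(0.0, 0.5, size=v.size))
    spike_count += int(spiked.sum())
```

Because the update touches each neuron independently, the same loop body maps onto one GPU thread per neuron, which is why the GPU comes out ahead in the paper's comparison.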

24 pages, 830 KB  
Article
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Calvin Roth, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Marko Knezevic, Gavin Whetstone, Zachary Baker and Robert Robey
Information 2024, 15(11), 673; https://doi.org/10.3390/info15110673 - 28 Oct 2024
Cited by 4 | Viewed by 4339
Abstract
This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for programming productivity, (2) usable data structures for data-oriented programming (DOP) for performance, and (3) a simple interface to the open-source C++ Kokkos library for portability and memory management across CPUs and GPUs. The portability across architectures with a single code implementation is achieved by automatically switching between diverse fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. The MATAR library solves many longstanding challenges associated with easily writing software that can run in parallel on any computer architecture. This work benefits projects seeking to write new C++ codes while also addressing the challenges of quickly making existing Fortran codes performant and portable over modern computer architectures with minimal syntactical changes from Fortran to C++. We demonstrate the feasibility of readily writing new C++ codes and modernizing existing codes with MATAR to be performant, parallel, and portable across diverse computer architectures.
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
